2025-05-06

Title: Multi-party Collaborative Attention Control for Image Customization

Authors: Han Yang, Chuanguang Yang, Qiuli Wang, Zhulin An, Weilun Feng, Libo Huang, Yongjun Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01428
Pdf URL: https://arxiv.org/pdf/2505.01428
Copy Paste: [[2505.01428]] Multi-party Collaborative Attention Control for Image Customization(https://arxiv.org/abs/2505.01428)
Keywords: diffusion
Abstract: The rapid advancement of diffusion models has increased the need for customized image generation. However, current customization methods face several limitations: 1) typically accept either image or text conditions alone; 2) customization in complex visual scenarios often leads to subject leakage or confusion; 3) image-conditioned outputs tend to suffer from inconsistent backgrounds; and 4) high computational costs. To address these issues, this paper introduces Multi-party Collaborative Attention Control (MCA-Ctrl), a tuning-free method that enables high-quality image customization using both text and complex visual conditions. Specifically, MCA-Ctrl leverages two key operations within the self-attention layer to coordinate multiple parallel diffusion processes and guide the target image generation. This approach allows MCA-Ctrl to capture the content and appearance of specific subjects while maintaining semantic consistency with the conditional input. Additionally, to mitigate subject leakage and confusion issues common in complex visual scenarios, we introduce a Subject Localization Module that extracts precise subject and editable image layers based on user instructions. Extensive quantitative and human evaluation experiments show that MCA-Ctrl outperforms existing methods in zero-shot image customization, effectively resolving the mentioned issues.

Title: Explainable AI-Driven Detection of Human Monkeypox Using Deep Learning and Vision Transformers: A Comprehensive Analysis

Authors: Md. Zahid Hossain, Md. Rakibul Islam, Most. Sharmin Sultana Samu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01429
Pdf URL: https://arxiv.org/pdf/2505.01429
Copy Paste: [[2505.01429]] Explainable AI-Driven Detection of Human Monkeypox Using Deep Learning and Vision Transformers: A Comprehensive Analysis(https://arxiv.org/abs/2505.01429)
Keywords: transformer
Abstract: Mpox is a zoonotic viral illness that can spread from person to person and poses a significant public health concern. It is difficult to make an early clinical diagnosis because of how closely its symptoms match those of measles and chickenpox. Medical imaging combined with deep learning (DL) techniques has shown promise in improving disease detection by analyzing affected skin areas. Our study explores the feasibility of training deep learning and vision transformer-based models from scratch with a publicly available skin lesion image dataset. Our experimental results show that dataset limitations are a major drawback to building better classifier models trained from scratch. We used transfer learning with pre-trained models to get a better classifier. MobileNet-v2 outperformed other state-of-the-art pre-trained models with 93.15% accuracy and a 93.09% weighted average F1 score. ViT B16 and ResNet-50 also achieved satisfactory performance compared to existing studies, with accuracies of 92.12% and 86.21%, respectively. To further validate the performance of the models, we applied explainable AI techniques.
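
Note: the weighted average F1 score quoted in this abstract is conventionally the mean of per-class F1 scores, each weighted by the number of true samples in that class. The sketch below illustrates that convention only. The function name and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative helper)."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1  # weight each class by its share of samples
    return score


# Toy, made-up labels: three samples of class 0 and one of class 1.
print(weighted_f1([0, 0, 0, 1], [0, 0, 1, 1]))
```

Because the weights follow class support, a large class dominates the score, which is one reason it is usually reported alongside accuracy on imbalanced lesion datasets.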

Title: Deconstructing Bias: A Multifaceted Framework for Diagnosing Cultural and Compositional Inequities in Text-to-Image Generative Models

Authors: Muna Numan Said, Aarib Zaidi, Rabia Usman, Sonia Okon, Praneeth Medepalli, Kevin Zhu, Vasu Sharma, Sean O'Brien
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01430
Pdf URL: https://arxiv.org/pdf/2505.01430
Copy Paste: [[2505.01430]] Deconstructing Bias: A Multifaceted Framework for Diagnosing Cultural and Compositional Inequities in Text-to-Image Generative Models(https://arxiv.org/abs/2505.01430)
Keywords: fair, diffusion, generative
Abstract: The transformative potential of text-to-image (T2I) models hinges on their ability to synthesize culturally diverse, photorealistic images from textual prompts. However, these models often perpetuate cultural biases embedded within their training data, leading to systemic misrepresentations. This paper benchmarks the Component Inclusion Score (CIS), a metric designed to evaluate the fidelity of image generation across cultural contexts. Through extensive analysis involving 2,400 images, we quantify biases in terms of compositional fragility and contextual misalignment, revealing significant performance gaps between Western and non-Western cultural prompts. Our findings underscore the impact of data imbalance, attention entropy, and embedding superposition on model fairness. By benchmarking models like Stable Diffusion with CIS, we provide insights into architectural and data-centric interventions for enhancing cultural inclusivity in AI-generated imagery. This work advances the field by offering a comprehensive tool for diagnosing and mitigating biases in T2I generation, advocating for more equitable AI systems.

Title: ZS-VCOS: Zero-Shot Outperforms Supervised Video Camouflaged Object Segmentation

Authors: Wenqi Guo, Shan Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01431
Pdf URL: https://arxiv.org/pdf/2505.01431
Copy Paste: [[2505.01431]] ZS-VCOS: Zero-Shot Outperforms Supervised Video Camouflaged Object Segmentation(https://arxiv.org/abs/2505.01431)
Keywords: segmentation
Abstract: Camouflaged object segmentation presents unique challenges compared to traditional segmentation tasks, primarily due to the high similarity in patterns and colors between camouflaged objects and their backgrounds. Effective solutions to this problem have significant implications in critical areas such as pest control, defect detection, and lesion segmentation in medical imaging. Prior research has predominantly emphasized supervised or unsupervised pre-training methods, leaving zero-shot approaches significantly underdeveloped. Existing zero-shot techniques commonly utilize the Segment Anything Model (SAM) in automatic mode or rely on vision-language models to generate cues for segmentation; however, their performances remain unsatisfactory, likely due to the similarity of the camouflaged object and the background. Optical flow, commonly utilized for detecting moving objects, has demonstrated effectiveness even with camouflaged entities. Our method integrates optical flow, a vision-language model, and SAM 2 into a sequential pipeline. Evaluated on the MoCA-Mask dataset, our approach achieves outstanding performance improvements, significantly outperforming existing zero-shot methods by raising the F-measure ($F_\beta^w$) from 0.296 to 0.628. Remarkably, our approach also surpasses supervised methods, increasing the F-measure from 0.476 to 0.628. Additionally, evaluation on the MoCA-Filter dataset demonstrates an increase in the success rate from 0.628 to 0.697 when compared with FlowSAM, a supervised transfer method. A thorough ablation study further validates the individual contributions of each component. More details can be found on this https URL.

Title: Firewall Regulatory Networks for Autonomous Cyber Defense

Authors: Qi Duan, Ehab Al-Shaer
Subjects: cs.CR, eess.SY
Abstract URL: https://arxiv.org/abs/2505.01436
Pdf URL: https://arxiv.org/pdf/2505.01436
Copy Paste: [[2505.01436]] Firewall Regulatory Networks for Autonomous Cyber Defense(https://arxiv.org/abs/2505.01436)
Keywords: defense
Abstract: In this paper, we present the principles of designing a new self-organising and autonomous management protocol to govern the dynamics of a bio-inspired decentralized firewall architecture based on Biological Regularity Networks. The new architecture, called Firewall Regulatory Networks (FRN), exhibits the following features: (1) automatic rule policy configuration with a provable utility-risk appetite guarantee, (2) resilient response to changing risks or new service requirements, and (3) globally optimized access control policy reconciliation. We present the FRN protocol and formalize the constraints to synthesize the undetermined components in the protocol to produce interactions that can achieve these objectives. We illustrate the feasibility of the FRN architecture in multiple case studies.

Title: Enhancing IoT-Botnet Detection using Variational Auto-encoder and Cost-Sensitive Learning: A Deep Learning Approach for Imbalanced Datasets

Authors: Hassan Wasswa, Timothy Lynar, Hussein Abbass
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01437
Pdf URL: https://arxiv.org/pdf/2505.01437
Copy Paste: [[2505.01437]] Enhancing IoT-Botnet Detection using Variational Auto-encoder and Cost-Sensitive Learning: A Deep Learning Approach for Imbalanced Datasets(https://arxiv.org/abs/2505.01437)
Keywords: attack
Abstract: The Internet of Things (IoT) technology has rapidly gained popularity with applications widespread across a variety of industries. However, IoT devices have been recently serving as a porous layer for many malicious attacks to both personal and enterprise information systems with the most famous attacks being botnet-related attacks. The work in this study leveraged Variational Auto-encoder (VAE) and cost-sensitive learning to develop lightweight, yet effective, models for IoT-botnet detection. The aim is to enhance the detection of minority class attack traffic instances which are often missed by machine learning models. The proposed approach is evaluated on a multi-class problem setting for the detection of traffic categories on highly imbalanced datasets. The performance of two deep learning models including the standard feed forward deep neural network (DNN), and Bidirectional-LSTM (BLSTM) was evaluated and both recorded commendable results in terms of accuracy, precision, recall and F1-score for all traffic classes.
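
Note: cost-sensitive learning, as mentioned in this abstract, generally means penalising errors on rare classes more heavily. The abstract does not say which weighting scheme the authors use, so the sketch below assumes a common one: inverse-frequency class weights applied to a cross-entropy loss. All names and the toy traffic labels are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Class-weighted mean negative log-likelihood.

    probs is a list of dicts mapping class name to predicted probability.
    """
    loss, weight_sum = 0.0, 0.0
    for p, y in zip(probs, labels):
        w = weights[y]
        loss += -w * math.log(max(p[y], 1e-12))  # clamp to avoid log(0)
        weight_sum += w
    return loss / weight_sum


# Toy, made-up traffic: 8 benign flows and 2 attack flows.
labels = ["benign"] * 8 + ["attack"] * 2
weights = inverse_frequency_weights(labels)
# A classifier that always leans towards the majority class.
probs = [{"benign": 0.9, "attack": 0.1} for _ in labels]
print(weights)
print(weighted_cross_entropy(probs, labels, weights))
```

With these weights, a model that ignores the minority attack class incurs a noticeably larger loss than it would under uniform weighting, which is the effect cost-sensitive training relies on.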

Title: Global Stress Generation and Spatiotemporal Super-Resolution Physics-Informed Operator under Dynamic Loading for Two-Phase Random Materials

Authors: Tengfei Xing, Xiaodan Ren, Jie Li
Subjects: cs.LG, cond-mat.mtrl-sci, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01438
Pdf URL: https://arxiv.org/pdf/2505.01438
Copy Paste: [[2505.01438]] Global Stress Generation and Spatiotemporal Super-Resolution Physics-Informed Operator under Dynamic Loading for Two-Phase Random Materials(https://arxiv.org/abs/2505.01438)
Keywords: diffusion
Abstract: Material stress analysis is a critical aspect of material design and performance optimization. Under dynamic loading, the global stress evolution in materials exhibits complex spatiotemporal characteristics, especially in two-phase random materials (TRMs). Failure of such materials is often associated with stress concentration, and the phase boundaries are key locations where stress concentration occurs. In practical engineering applications, the spatiotemporal resolution of acquired microstructural data and its dynamic stress evolution is often limited. This poses challenges for deep learning methods in generating high-resolution spatiotemporal stress fields, particularly for accurately capturing stress concentration regions. In this study, we propose a framework for global stress generation and spatiotemporal super-resolution in TRMs under dynamic loading. First, we introduce a diffusion model-based approach, named Spatiotemporal Stress Diffusion (STS-diffusion), for generating global spatiotemporal stress data. This framework incorporates Space-Time U-Net (STU-net), and we systematically investigate the impact of different attention positions on model accuracy. Next, we develop a physics-informed network for spatiotemporal super-resolution, termed Spatiotemporal Super-Resolution Physics-Informed Operator (ST-SRPINN). The proposed ST-SRPINN is an unsupervised learning method. The influence of data-driven and physics-informed loss function weights on model accuracy is explored in detail. Benefiting from physics-based constraints, ST-SRPINN requires only low-resolution stress field data during training and can upscale the spatiotemporal resolution of stress fields to arbitrary magnifications.

Title: Explainable AI for Correct Root Cause Analysis of Product Quality in Injection Moulding

Authors: Muhammad Muaz, Sameed Sajid, Tobias Schulze, Chang Liu, Nils Klasen, Benny Drescher
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01445
Pdf URL: https://arxiv.org/pdf/2505.01445
Copy Paste: [[2505.01445]] Explainable AI for Correct Root Cause Analysis of Product Quality in Injection Moulding(https://arxiv.org/abs/2505.01445)
Keywords: explainability
Abstract: If a product deviates from its desired properties in the injection moulding process, its root cause analysis can be aided by models that relate the input machine settings with the output quality characteristics. The machine learning models tested in quality prediction are mostly black boxes; therefore, no direct explanation of their prognosis is given, which restricts their applicability in quality control. The previously attempted explainability methods are either restricted to tree-based algorithms only or do not emphasize the fact that some explainability methods can lead to wrong root cause identification of a product's deviation from its desired properties. This study first shows that interactions among the multiple input machine settings do exist in real experimental data collected as per a central composite design. Then, model-agnostic explainable AI methods are compared for the first time to show that different explainability methods indeed lead to different feature impact analyses in injection moulding. Moreover, it is shown that better feature attribution translates to correct cause identification and actionable insights for the injection moulding process. Being model agnostic, explanations on both random forest and multilayer perceptron are performed for the cause analysis, as both models have a mean absolute percentage error of less than 0.05% on the experimental dataset.

Title: OpenAVS: Training-Free Open-Vocabulary Audio Visual Segmentation with Foundational Models

Authors: Shengkai Chen, Yifang Yin, Jinming Cao, Shili Xiang, Zhenguang Liu, Roger Zimmermann
Subjects: cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2505.01448
Pdf URL: https://arxiv.org/pdf/2505.01448
Copy Paste: [[2505.01448]] OpenAVS: Training-Free Open-Vocabulary Audio Visual Segmentation with Foundational Models(https://arxiv.org/abs/2505.01448)
Keywords: segmentation
Abstract: Audio-visual segmentation aims to separate sounding objects from videos by predicting pixel-level masks based on audio signals. Existing methods primarily concentrate on closed-set scenarios and direct audio-visual alignment and fusion, which limits their capability to generalize to new, unseen situations. In this paper, we propose OpenAVS, a novel training-free language-based approach that, for the first time, effectively aligns audio and visual modalities using text as a proxy for open-vocabulary Audio-Visual Segmentation (AVS). Equipped with multimedia foundation models, OpenAVS directly infers masks through 1) audio-to-text prompt generation, 2) LLM-guided prompt translation, and 3) text-to-visual sounding object segmentation. The objective of OpenAVS is to establish a simple yet flexible architecture that relies on the most appropriate foundation models by fully leveraging their capabilities to enable more effective knowledge transfer to the downstream AVS task. Moreover, we present a model-agnostic framework OpenAVS-ST that enables the integration of OpenAVS with any advanced supervised AVS model via pseudo-label based self-training. This approach enhances performance by effectively utilizing large-scale unlabeled data when available. Comprehensive experiments on three benchmark datasets demonstrate the superior performance of OpenAVS. It surpasses existing unsupervised, zero-shot, and few-shot AVS methods by a significant margin, achieving absolute performance gains of approximately 9.4% and 10.9% in mIoU and F-score, respectively, in challenging scenarios.

Title: COSMOS: Predictable and Cost-Effective Adaptation of LLMs

Authors: Jiayu Wang, Aws Albarghouthi, Frederic Sala
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.01449
Pdf URL: https://arxiv.org/pdf/2505.01449
Copy Paste: [[2505.01449]] COSMOS: Predictable and Cost-Effective Adaptation of LLMs(https://arxiv.org/abs/2505.01449)
Keywords: large language model
Abstract: Large language models (LLMs) achieve remarkable performance across numerous tasks by using a diverse array of adaptation strategies. However, optimally selecting a model and adaptation strategy under resource constraints is challenging and often requires extensive experimentation. We investigate whether it is possible to accurately predict both performance and cost without expensive trials. We formalize the strategy selection problem for LLMs and introduce COSMOS, a unified prediction framework that efficiently estimates adaptation outcomes at minimal cost. We instantiate and study the capability of our framework via a pair of powerful predictors: embedding-augmented lightweight proxy models to predict fine-tuning performance, and low-sample scaling laws to forecast retrieval-augmented in-context learning. Extensive evaluation across eight representative benchmarks demonstrates that COSMOS achieves high prediction accuracy while reducing computational costs by 92.72% on average, and up to 98.71% in resource-intensive scenarios. Our results show that efficient prediction of adaptation outcomes is not only feasible but can substantially reduce the computational overhead of LLM deployment while maintaining performance standards.

Title: Towards Film-Making Production Dialogue, Narration, Monologue Adaptive Moving Dubbing Benchmarks

Authors: Chaoyi Wang, Junjie Zheng, Zihao Chen, Shiyu Xia, Chaofan Ding, Xiaohao Zhang, Xi Tao, Xiaoming He, Xinhan Di
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.01450
Pdf URL: https://arxiv.org/pdf/2505.01450
Copy Paste: [[2505.01450]] Towards Film-Making Production Dialogue, Narration, Monologue Adaptive Moving Dubbing Benchmarks(https://arxiv.org/abs/2505.01450)
Keywords: large language model
Abstract: Movie dubbing has advanced significantly, yet assessing the real-world effectiveness of these models remains challenging. A comprehensive evaluation benchmark is crucial for two key reasons: 1) Existing metrics fail to fully capture the complexities of dialogue, narration, monologue, and actor adaptability in movie dubbing. 2) A practical evaluation system should offer valuable insights to improve movie dubbing quality and advancement in film production. To this end, we introduce Talking Adaptive Dubbing Benchmarks (TA-Dubbing), designed to improve film production by adapting to dialogue, narration, monologue, and actors in movie dubbing. TA-Dubbing offers several key advantages: 1) Comprehensive Dimensions: TA-Dubbing covers a variety of dimensions of movie dubbing, incorporating metric evaluations for both movie understanding and speech generation. 2) Versatile Benchmarking: TA-Dubbing is designed to evaluate state-of-the-art movie dubbing models and advanced multi-modal large language models. 3) Full Open-Sourcing: We fully open-source TA-Dubbing at this https URL 0a/DeepDubber-V1, including all video suites, evaluation methods, and annotations. We also continuously integrate new movie dubbing models into the TA-Dubbing leaderboard at this https URL 0a/DeepDubber-V1 to drive forward the field of movie dubbing.

Title: Sparsification Under Siege: Defending Against Poisoning Attacks in Communication-Efficient Federated Learning

Authors: Zhiyong Jin, Runhua Xu, Chao Li, Yizhong Liu, Jianxin Li
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.01454
Pdf URL: https://arxiv.org/pdf/2505.01454
Copy Paste: [[2505.01454]] Sparsification Under Siege: Defending Against Poisoning Attacks in Communication-Efficient Federated Learning(https://arxiv.org/abs/2505.01454)
Keywords: security, privacy, defense, attack, federate
Abstract: Federated Learning (FL) enables collaborative model training across distributed clients while preserving data privacy, yet it faces significant challenges in communication efficiency and vulnerability to poisoning attacks. While sparsification techniques mitigate communication overhead by transmitting only critical model parameters, they inadvertently amplify security risks: adversarial clients can exploit sparse updates to evade detection and degrade model performance. Existing defense mechanisms, designed for standard FL communication scenarios, are ineffective in addressing these vulnerabilities within sparsified FL. To bridge this gap, we propose FLARE, a novel federated learning framework that integrates sparse index mask inspection and model update sign similarity analysis to detect and mitigate poisoning attacks in sparsified FL. Extensive experiments across multiple datasets and adversarial scenarios demonstrate that FLARE significantly outperforms existing defense strategies, effectively securing sparsified FL against poisoning attacks while maintaining communication efficiency.
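
Note: the abstract names two signals, sparse index masks and the sign similarity of model updates, without giving their definitions. The sketch below is one plausible reading and not FLARE itself: top-k sparsification of an update, Jaccard overlap of the kept indices, and the fraction of shared indices whose signs agree. Every function name and value here is an assumption made for illustration.

```python
def top_k_sparsify(update, k):
    """Keep the k largest-magnitude entries, returned as {index: value}."""
    order = sorted(range(len(update)), key=lambda i: abs(update[i]), reverse=True)
    return {i: update[i] for i in order[:k]}


def mask_overlap(a, b):
    """Jaccard similarity between the index sets of two sparse updates."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def sign_agreement(a, b):
    """Fraction of shared indices where both updates have the same sign."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    same = sum(1 for i in shared if (a[i] > 0) == (b[i] > 0))
    return same / len(shared)


# Toy, made-up updates: the second client sends a sign-flipped update.
honest = top_k_sparsify([0.9, -0.1, 0.5, 0.05], k=2)
flipped = top_k_sparsify([-0.9, 0.1, -0.5, -0.05], k=2)
print(mask_overlap(honest, flipped), sign_agreement(honest, flipped))
```

In this toy case the two clients select identical indices, so the mask alone cannot separate them, while the sign agreement drops to zero. That is the kind of complementary evidence the abstract suggests combining.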

Title: Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation

Authors: Vaidehi Patil, Yi-Lin Sung, Peter Hase, Jie Peng, Tianlong Chen, Mohit Bansal
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.01456
Pdf URL: https://arxiv.org/pdf/2505.01456
Copy Paste: [[2505.01456]] Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation(https://arxiv.org/abs/2505.01456)
Keywords: defense, attack, robust, interpretability
Abstract: LLMs trained on massive datasets may inadvertently acquire sensitive information such as personal details and potentially harmful content. This risk is further heightened in multimodal LLMs as they integrate information from multiple modalities (image and text). Adversaries can exploit this knowledge through multimodal prompts to extract sensitive details. Evaluating how effectively MLLMs can forget such information (targeted unlearning) necessitates the creation of high-quality, well-annotated image-text pairs. While prior work on unlearning has focused on text, multimodal unlearning remains underexplored. To address this gap, we first introduce a multimodal unlearning benchmark, UnLOK-VQA (Unlearning Outside Knowledge VQA), as well as an attack-and-defense framework to evaluate methods for deleting specific multimodal knowledge from MLLMs. We extend a visual question-answering dataset using an automated pipeline that generates varying-proximity samples for testing generalization and specificity, followed by manual filtering for maintaining high quality. We then evaluate six defense objectives against seven attacks (four whitebox, three blackbox), including a novel whitebox method leveraging interpretability of hidden states. Our results show multimodal attacks outperform text- or image-only ones, and that the most effective defense removes answer information from internal model states. Additionally, larger models exhibit greater post-editing robustness, suggesting that scale enhances safety. UnLOK-VQA provides a rigorous benchmark for advancing unlearning in MLLMs.

Title: MoxE: Mixture of xLSTM Experts with Entropy-Aware Routing for Efficient Language Modeling

Authors: Abdoul Majid O. Thiombiano, Brahim Hnich, Ali Ben Mrad, Mohamed Wiem Mkaouer
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.01459
Pdf URL: https://arxiv.org/pdf/2505.01459
Copy Paste: [[2505.01459]] MoxE: Mixture of xLSTM Experts with Entropy-Aware Routing for Efficient Language Modeling(https://arxiv.org/abs/2505.01459)
Keywords: robust, large language model
Abstract: This paper introduces MoxE, a novel architecture that synergistically combines the Extended Long Short-Term Memory (xLSTM) with the Mixture of Experts (MoE) framework to address critical scalability and efficiency challenges in large language models (LLMs). The proposed method effectively leverages xLSTM's innovative memory structures while strategically introducing sparsity through MoE to substantially reduce computational overhead. At the heart of our approach is a novel entropy-based routing mechanism, designed to dynamically route tokens to specialized experts, thereby ensuring efficient and balanced resource utilization. This entropy awareness enables the architecture to effectively manage both rare and common tokens, with mLSTM blocks being favored to handle rare tokens. To further enhance generalization, we introduce a suite of auxiliary losses, including entropy-based and group-wise balancing losses, ensuring robust performance and efficient training. Theoretical analysis and empirical evaluations rigorously demonstrate that MoxE achieves significant efficiency gains and enhanced effectiveness compared to existing approaches, marking a notable advancement in scalable LLM architectures.

Title: Development of an Adapter for Analyzing and Protecting Machine Learning Models from Competitive Activity in the Networks Services

Authors: Denis Parfenov, Anton Parfenov
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.01460
Pdf URL: https://arxiv.org/pdf/2505.01460
Copy Paste: [[2505.01460]] Development of an Adapter for Analyzing and Protecting Machine Learning Models from Competitive Activity in the Networks Services(https://arxiv.org/abs/2505.01460)
Keywords: protect, attack
Abstract: Due to the increasing number of tasks that are solved on remote servers, identifying and classifying traffic is an important task to reduce the load on the server. There are various methods for classifying traffic. This paper discusses machine learning models for solving this problem. However, such ML models are also subject to attacks that affect the classification result of network traffic. To protect models, we proposed a solution based on an autoencoder.

Title: Enhancing the Cloud Security through Topic Modelling

Authors: Sabbir M. Saleh, Nazim Madhavji, John Steinbacher
Subjects: cs.CR, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2505.01463
Pdf URL: https://arxiv.org/pdf/2505.01463
Copy Paste: [[2505.01463]] Enhancing the Cloud Security through Topic Modelling(https://arxiv.org/abs/2505.01463)
Keywords: security, protect, attack
Abstract: Protecting cloud applications is crucial in an age where security threats constantly endanger the digital world. The inevitable cyber-attacks throughout the CI/CD pipeline make cloud security innovations necessary. This research is motivated by applying Natural Language Processing (NLP) methodologies, such as Topic Modelling, to analyse cloud security data and predict future attacks. This research aims to use topic modelling, specifically Latent Dirichlet Allocation (LDA) and Probabilistic Latent Semantic Analysis (pLSA). Utilising LDA and pLSA, security-related text data, such as reports, logs, and other relevant documents, will be analysed and sorted into relevant topics (such as phishing or encryption). These algorithms can be applied in Python using the Gensim framework. The topics shall be utilised to detect vulnerabilities within relevant CI/CD pipeline records or log data. This application of Topic Modelling is expected to provide a new form of vulnerability detection, improving overall security throughout the CI/CD pipeline.

Title: SafeTab-P: Disclosure Avoidance for the 2020 Census Detailed Demographic and Housing Characteristics File A (Detailed DHC-A)

Authors: Sam Haney, Skye Berghel, Bayard Carlson, Ryan Cumings-Menon, Luke Hartman, Michael Hay, Ashwin Machanavajjhala, Gerome Miklau, Amritha Pai, Simran Rajpal, David Pujol, William Sexton, Ruchit Shrestha, Daniel Simmons-Marengo
Subjects: cs.CR, cs.CY
Abstract URL: https://arxiv.org/abs/2505.01472
Pdf URL: https://arxiv.org/pdf/2505.01472
Copy Paste: [[2505.01472]] SafeTab-P: Disclosure Avoidance for the 2020 Census Detailed Demographic and Housing Characteristics File A (Detailed DHC-A)(https://arxiv.org/abs/2505.01472)
Keywords: privacy, protect
Abstract: This article describes the disclosure avoidance algorithm that the U.S. Census Bureau used to protect the Detailed Demographic and Housing Characteristics File A (Detailed DHC-A) of the 2020 Census. The tabulations contain statistics (counts) of demographic characteristics of the entire population of the United States, crossed with detailed races and ethnicities at varying levels of geography. The article describes the SafeTab-P algorithm, which is based on adding noise drawn from a discrete Gaussian distribution to statistics of interest. A key innovation in SafeTab-P is the ability to adaptively choose how many statistics to release and at what granularity, depending on the size of a population group. We prove that the algorithm satisfies a well-studied variant of differential privacy, called zero-concentrated differential privacy (zCDP). We then describe how the algorithm was implemented on Tumult Analytics and briefly outline the parameterization and tuning of the algorithm.
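
Note: the core mechanism named in this abstract, adding discrete Gaussian noise to counts, can be sketched in a few lines. This is not the SafeTab-P implementation and does not use Tumult Analytics. The sampler truncates the support at a multiple of sigma, which is only an approximation (production systems use exact samplers), and all names and the toy counts are hypothetical. The privacy cost uses the standard result that discrete Gaussian noise with scale sigma on a query of sensitivity Delta satisfies rho-zCDP with rho = Delta^2 / (2 sigma^2).

```python
import math
import random


def discrete_gaussian_sample(sigma, rng, tail=12):
    """Draw an integer with probability proportional to exp(-x^2 / (2 sigma^2)).

    Approximation: the support is truncated to [-tail*sigma, tail*sigma].
    """
    bound = math.ceil(tail * sigma)
    support = list(range(-bound, bound + 1))
    weights = [math.exp(-(x * x) / (2 * sigma * sigma)) for x in support]
    return rng.choices(support, weights=weights, k=1)[0]


def zcdp_rho(sensitivity, sigma):
    """zCDP cost of one discrete Gaussian release: Delta^2 / (2 sigma^2)."""
    return sensitivity ** 2 / (2 * sigma ** 2)


def noisy_counts(counts, sigma, seed=0):
    """Add independent discrete Gaussian noise to each count."""
    rng = random.Random(seed)
    return {key: value + discrete_gaussian_sample(sigma, rng) for key, value in counts.items()}


# Toy, made-up counts for two hypothetical population groups.
counts = {"group_a": 120, "group_b": 7}
print(noisy_counts(counts, sigma=2.0))
print(zcdp_rho(sensitivity=1, sigma=2.0))
```

Because the noise is integer-valued, the released statistics stay integers, and a larger sigma lowers rho at the cost of accuracy, a trade-off that matters most for small population groups.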

Title: Watermark Overwriting Attack on StegaStamp algorithm

Authors: I.F.Serzhenko, L.A.Khaertdinova, M.A.Pautov, A.V.Antsiferova
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01474
Pdf URL: https://arxiv.org/pdf/2505.01474
Copy Paste: [[2505.01474]] Watermark Overwriting Attack on StegaStamp algorithm(https://arxiv.org/abs/2505.01474)
Keywords: attack, watermark
Abstract: This paper presents an attack method on the StegaStamp watermarking algorithm that completely removes watermarks from an image with minimal quality loss, developed as part of the NeurIPS "Erasing the invisible" competition.

Title: SymPlanner: Deliberate Planning in Language Models with Symbolic Representation

Authors: Siheng Xiong, Jieyu Zhou, Zhangding Liu, Yusen Su
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.01479
Pdf URL: https://arxiv.org/pdf/2505.01479
Copy Paste: [[2505.01479]] SymPlanner: Deliberate Planning in Language Models with Symbolic Representation(https://arxiv.org/abs/2505.01479)
Keywords: robust
Abstract: Planning remains a core challenge for language models (LMs), particularly in domains that require coherent multi-step action sequences grounded in external constraints. We introduce SymPlanner, a novel framework that equips LMs with structured planning capabilities by interfacing them with a symbolic environment that serves as an explicit world model. Rather than relying purely on natural language reasoning, SymPlanner grounds the planning process in a symbolic state space, where a policy model proposes actions and a symbolic environment deterministically executes and verifies their effects. To enhance exploration and improve robustness, we introduce Iterative Correction (IC), which refines previously proposed actions by leveraging feedback from the symbolic environment to eliminate invalid decisions and guide the model toward valid alternatives. Additionally, Contrastive Ranking (CR) enables fine-grained comparison of candidate plans by evaluating them jointly. We evaluate SymPlanner on PlanBench, demonstrating that it produces more coherent, diverse, and verifiable plans than pure natural language baselines.

Title: VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations for Synthetic Videos

Authors: Zongxia Li, Xiyang Wu, Yubin Qin, Guangyao Shi, Hongyang Du, Dinesh Manocha, Tianyi Zhou, Jordan Lee Boyd-Graber
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.01481
Pdf URL: https://arxiv.org/pdf/2505.01481
Copy Paste: [[2505.01481]] VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations for Synthetic Videos(https://arxiv.org/abs/2505.01481)
Keywords: interpretability, large language model
Abstract: Synthetic video generation with foundation models has gained attention for its realism and wide applications. While these models produce high-quality frames, they often fail to respect common sense and physical laws, resulting in abnormal content. Existing metrics like VideoScore emphasize general quality but ignore such violations and lack interpretability. A more insightful approach is using multi-modal large language models (MLLMs) as interpretable evaluators, as seen in FactScore. Yet, MLLMs' ability to detect abnormalities in synthetic videos remains underexplored. To address this, we introduce VideoHallu, a benchmark featuring synthetic videos from models like Veo2, Sora, and Kling, paired with expert-designed QA tasks solvable via human-level reasoning across various categories. We assess several SoTA MLLMs, including GPT-4o, Gemini-2.5-Pro, Qwen-2.5-VL, and newer models like Video-R1 and VideoChat-R1. Despite strong real-world performance on MVBench and MovieChat, these models still hallucinate on basic commonsense and physics tasks in synthetic settings, underscoring the challenge of hallucination. We further fine-tune SoTA MLLMs using Group Relative Policy Optimization (GRPO) on real and synthetic commonsense/physics data. Results show notable accuracy gains, especially with counterexample integration, advancing MLLMs' reasoning capabilities. Our data is available at this https URL.
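
For readers unfamiliar with GRPO, the group-relative advantage it is commonly described as using can be sketched as below. This assumes the usual formulation (each sampled answer's reward is normalized by the mean and standard deviation of its group); the abstract does not state the reward design or the normalization used in this paper, and the function name and `eps` constant are illustrative.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples drawn for the same prompt.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]
```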

Title: LLM Watermarking Using Mixtures and Statistical-to-Computational Gaps

Authors: Pedro Abdalla, Roman Vershynin
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.01484
Pdf URL: https://arxiv.org/pdf/2505.01484
Copy Paste: [[2505.01484]] LLM Watermarking Using Mixtures and Statistical-to-Computational Gaps(https://arxiv.org/abs/2505.01484)
Keywords: watermark, large language model
Abstract: Given a text, can we determine whether it was generated by a large language model (LLM) or by a human? A widely studied approach to this problem is watermarking. We propose an undetectable and elementary watermarking scheme in the closed setting. Also, in the harder open setting, where the adversary has access to most of the model, we propose an unremovable watermarking scheme.

Title: Explainable Machine Learning for Cyberattack Identification from Traffic Flows

Authors: Yujing Zhou, Marc L. Jacquet, Robel Dawit, Skyler Fabre, Dev Sarawat, Faheem Khan, Madison Newell, Yongxin Liu, Dahai Liu, Hongyun Chen, Jian Wang, Huihui Wang
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2505.01488
Pdf URL: https://arxiv.org/pdf/2505.01488
Copy Paste: [[2505.01488]] Explainable Machine Learning for Cyberattack Identification from Traffic Flows(https://arxiv.org/abs/2505.01488)
Keywords: security, defense, attack, steal, interpretability
Abstract: The increasing automation of traffic management systems has made them prime targets for cyberattacks, disrupting urban mobility and public safety. Traditional network-layer defenses are often inaccessible to transportation agencies, necessitating a machine learning-based approach that relies solely on traffic flow data. In this study, we simulate cyberattacks in a semi-realistic environment, using a virtualized traffic network to analyze disruption patterns. We develop a deep learning-based anomaly detection system, demonstrating that Longest Stop Duration and Total Jam Distance are key indicators of compromised signals. To enhance interpretability, we apply Explainable AI (XAI) techniques, identifying critical decision factors and diagnosing misclassification errors. Our analysis reveals two primary challenges: transitional data inconsistencies, where mislabeled recovery-phase traffic misleads the model, and model limitations, where stealth attacks in low-traffic conditions evade detection. This work enhances AI-driven traffic security, improving both detection accuracy and trustworthiness in smart transportation systems.

Title: Machine Learning for Cyber-Attack Identification from Traffic Flows

Authors: Yujing Zhou, Marc L. Jacquet, Robel Dawit, Skyler Fabre, Dev Sarawat, Faheem Khan, Madison Newell, Yongxin Liu, Dahai Liu, Hongyun Chen, Jian Wang, Huihui Wang
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2505.01489
Pdf URL: https://arxiv.org/pdf/2505.01489
Copy Paste: [[2505.01489]] Machine Learning for Cyber-Attack Identification from Traffic Flows(https://arxiv.org/abs/2505.01489)
Keywords: attack
Abstract: This paper presents our simulation of cyber-attacks and detection strategies on the traffic control system in Daytona Beach, FL, using Raspberry Pi virtual machines and the OPNSense firewall, along with traffic dynamics from SUMO and exploitation via the Metasploit framework. We try to answer the research question: can we identify cyber-attacks by analyzing only traffic flow patterns? In this research, we focus on attacks in which adversaries randomly turn the lights all green or all red at busy intersections. Despite challenges stemming from imbalanced data and overlapping traffic patterns, our best model shows 85% accuracy when detecting intrusions purely using traffic flow statistics. Key indicators for successful detection included occupancy, jam length, and halting durations.

Title: WorldGenBench: A World-Knowledge-Integrated Benchmark for Reasoning-Driven Text-to-Image Generation

Authors: Daoan Zhang, Che Jiang, Ruoshi Xu, Biaoxiang Chen, Zijian Jin, Yutian Lu, Jianguo Zhang, Liang Yong, Jiebo Luo, Shengda Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01490
Pdf URL: https://arxiv.org/pdf/2505.01490
Copy Paste: [[2505.01490]] WorldGenBench: A World-Knowledge-Integrated Benchmark for Reasoning-Driven Text-to-Image Generation(https://arxiv.org/abs/2505.01490)
Keywords: diffusion
Abstract: Recent advances in text-to-image (T2I) generation have achieved impressive results, yet existing models still struggle with prompts that require rich world knowledge and implicit reasoning, both of which are critical for producing semantically accurate, coherent, and contextually appropriate images in real-world scenarios. To address this gap, we introduce WorldGenBench, a benchmark designed to systematically evaluate T2I models' world knowledge grounding and implicit inferential capabilities, covering both the humanities and nature domains. We propose the Knowledge Checklist Score, a structured metric that measures how well generated images satisfy key semantic expectations. Experiments across 21 state-of-the-art models reveal that while diffusion models lead among open-source methods, proprietary auto-regressive models like GPT-4o exhibit significantly stronger reasoning and knowledge integration. Our findings highlight the need for deeper understanding and inference capabilities in next-generation T2I systems. Project Page: this https URL

Title: Securing the Future of IVR: AI-Driven Innovation with Agile Security, Data Regulation, and Ethical AI Integration

Authors: Khushbu Mehboob Shaikh, Georgios Giannakopoulos
Subjects: cs.CR, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2505.01514
Pdf URL: https://arxiv.org/pdf/2505.01514
Copy Paste: [[2505.01514]] Securing the Future of IVR: AI-Driven Innovation with Agile Security, Data Regulation, and Ethical AI Integration(https://arxiv.org/abs/2505.01514)
Keywords: secure, security, privacy
Abstract: The rapid digitalization of communication systems has elevated Interactive Voice Response (IVR) technologies to become critical interfaces for customer engagement. With Artificial Intelligence (AI) now driving these platforms, ensuring secure, compliant, and ethically designed development practices is more imperative than ever. AI-powered IVRs leverage Natural Language Processing (NLP) and Machine Learning (ML) to personalize interactions, automate service delivery, and optimize user experiences. However, these innovations expose systems to heightened risks, including data privacy breaches, AI decision opacity, and model security vulnerabilities. This paper analyzes the evolution of IVRs from static code-based designs to adaptive AI-driven systems, presenting a cybersecurity-centric perspective. We propose a practical governance framework that embeds agile security principles, compliance with global data legislation, and user-centric ethics. Emphasizing privacy-by-design, adaptive risk modeling, and transparency, the paper argues that ethical AI integration is not a feature but a strategic imperative. Through this multidimensional lens, we highlight how modern IVRs can transition from communication tools to intelligent, secure, and accountable digital frontlines, resilient against emerging threats and aligned with societal expectations.

Title: Rubber Mallet: A Study of High Frequency Localized Bit Flips and Their Impact on Security

Authors: Andrew Adiletta, Zane Weissman, Fatemeh Khojasteh Dana, Berk Sunar, Shahin Tajik
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.01518
Pdf URL: https://arxiv.org/pdf/2505.01518
Copy Paste: [[2505.01518]] Rubber Mallet: A Study of High Frequency Localized Bit Flips and Their Impact on Security(https://arxiv.org/abs/2505.01518)
Keywords: security, protect, defense, attack, large language model
Abstract: The increasing density of modern DRAM has heightened its vulnerability to Rowhammer attacks, which induce bit flips by repeatedly accessing specific memory rows. This paper presents an analysis of bit flip patterns generated by advanced Rowhammer techniques that bypass existing hardware defenses. First, we investigate the phenomenon of adjacent bit flips--where two or more physically neighboring bits are corrupted simultaneously--and demonstrate they occur with significantly higher frequency than previously documented. We also show that if multiple bits flip within a byte, they are more likely to be adjacent than randomly distributed: for example, if 4 bits flip within a byte, there is an 87% chance that they are all adjacent. We also demonstrate that bit flips within a row will naturally cluster together likely due to the underlying physics of the attack. We then investigate two fault injection attacks enabled by multiple adjacent or nearby bit flips. First, we show how these correlated flips enable efficient cryptographic signature correction attacks, successfully recovering ECDSA private keys from OpenSSL implementations where single-bit approaches would be unfeasible. Second, we introduce a targeted attack against large language models by exploiting Rowhammer-induced corruptions in tokenizer dictionaries of GGUF model files. This attack effectively rewrites safety instructions in system prompts by swapping safety-critical tokens with benign alternatives, circumventing model guardrails while maintaining normal functionality in other contexts. Our experimental results across multiple DRAM configurations reveal that current memory protection schemes are inadequate against these sophisticated attack vectors, which can achieve their objectives with precise, minimal modifications rather than random corruption.
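
To put the 87% figure in context, the short enumeration below computes how often k flipped bits in a byte would all be adjacent if flip positions were chosen uniformly at random. The uniform baseline, the reading of "adjacent" as one contiguous run with no wrap-around, and the function name are assumptions made for illustration; the abstract does not specify the paper's baseline.

```python
from itertools import combinations


def adjacent_fraction(k, width=8):
    """Fraction of k-bit subsets of a width-bit word that form one contiguous run."""
    total = 0
    contiguous = 0
    for positions in combinations(range(width), k):
        total += 1
        # combinations() yields sorted tuples, so a span of k - 1 means no gaps.
        if positions[-1] - positions[0] == k - 1:
            contiguous += 1
    return contiguous / total


for k in (2, 3, 4):
    print(k, round(adjacent_fraction(k), 4))
# 2 0.25
# 3 0.1071
# 4 0.0714
```

Under these assumptions, only 5 of the 70 possible 4-bit patterns in a byte are contiguous (about 7%), so the reported 87% is far above what independent, uniformly placed flips would produce.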

Title: Subset Selection for Fine-Tuning: A Utility-Diversity Balanced Approach for Mathematical Domain Adaptation

Authors: Madhav Kotecha, Vijendra Kumar Vaishya, Smita Gautam, Suraj Racha
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01523
Pdf URL: https://arxiv.org/pdf/2505.01523
Copy Paste: [[2505.01523]] Subset Selection for Fine-Tuning: A Utility-Diversity Balanced Approach for Mathematical Domain Adaptation(https://arxiv.org/abs/2505.01523)
Keywords: large language model
Abstract: We propose a refined approach to efficiently fine-tune large language models (LLMs) on specific domains like the mathematical domain by employing a budgeted subset selection method. Our approach combines utility and diversity metrics to select the most informative and representative training examples. The goal is to approach full-dataset performance with a carefully selected subset of the data while significantly reducing computational cost and training time. The utility metric incorporates both perplexity and Chain-of-Thought (CoT) loss to identify challenging examples that contribute most to model learning, while the diversity metric ensures broad coverage across mathematical subdomains. We evaluate our method on LLaMA-3 8B and Phi-3 models, comparing against several baseline approaches, including random selection, diversity-based sampling, and existing state-of-the-art subset selection techniques.
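
One generic way to balance utility and diversity under a budget is a greedy selection like the sketch below. It is not the authors' algorithm: the additive score, the nearest-selected-neighbor diversity term, the `trade_off` weight, and all names are assumptions, and the utility values stand in for whatever perplexity and CoT-loss combination the paper uses.

```python
import math


def select_subset(features, utility, budget, trade_off=1.0):
    """Greedily pick up to `budget` indices, scoring utility plus distance to picks."""
    selected = []
    remaining = set(range(len(features)))
    while remaining and len(selected) < budget:

        def gain(i):
            if not selected:
                return utility[i]
            # Diversity: distance to the closest example already selected.
            diversity = min(math.dist(features[i], features[j]) for j in selected)
            return utility[i] + trade_off * diversity

        best = max(sorted(remaining), key=gain)  # sorted() keeps ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected


# Two tight clusters; utilities slightly favor the first cluster.
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.1)]
utility = [1.0, 0.9, 0.8, 0.7]
print(select_subset(features, utility, budget=2))                 # [0, 2]
print(select_subset(features, utility, budget=2, trade_off=0.0))  # [0, 1]
```

With the diversity term switched off the selection collapses onto one cluster, which is the failure mode a diversity metric is meant to prevent.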

Title: The DCR Delusion: Measuring the Privacy Risk of Synthetic Data

Authors: Zexi Yao, Nataša Krčo, Georgi Ganev, Yves-Alexandre de Montjoye
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.01524
Pdf URL: https://arxiv.org/pdf/2505.01524
Copy Paste: [[2505.01524]] The DCR Delusion: Measuring the Privacy Risk of Synthetic Data(https://arxiv.org/abs/2505.01524)
Keywords: privacy, attack, membership infer, diffusion
Abstract: Synthetic data has become an increasingly popular way to share data without revealing sensitive information. Though Membership Inference Attacks (MIAs) are widely considered the gold standard for empirically assessing the privacy of a synthetic dataset, practitioners and researchers often rely on simpler proxy metrics such as Distance to Closest Record (DCR). These metrics estimate privacy by measuring the similarity between the training data and generated synthetic data. This similarity is also compared against that between the training data and a disjoint holdout set of real records to construct a binary privacy test. If the synthetic data is not more similar to the training data than the holdout set is, it passes the test and is considered private. In this work we show that, while computationally inexpensive, DCR and other distance-based metrics fail to identify privacy leakage. Across multiple datasets and both classical models such as Baynet and CTGAN and more recent diffusion models, we show that datasets deemed private by proxy metrics are highly vulnerable to MIAs. We similarly find both the binary privacy test and the continuous measure based on these metrics to be uninformative of actual membership inference risk. We further show that these failures are consistent across different metric hyperparameter settings and record selection methods. Finally, we argue DCR and other distance-based metrics to be flawed by design and show an example of a simple leakage they miss in practice. With this work, we hope to motivate practitioners to move away from proxy metrics to MIAs as the rigorous, comprehensive standard of evaluating privacy of synthetic data, in particular to make claims of datasets being legally anonymous.
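
The binary DCR test that this abstract critiques can be sketched as follows. The Euclidean distance, the use of the median as the aggregate, and the function names are assumptions; implementations differ in these choices, and the paper's point is that passing such a test does not rule out membership leakage.

```python
import math
from statistics import median


def dcr(record, reference):
    """Distance from one record to its closest record in `reference`."""
    return min(math.dist(record, r) for r in reference)


def passes_dcr_test(train, holdout, synthetic):
    """Pass if synthetic records are not closer to the training data than holdout records.

    Assumption: closeness is summarized by the median DCR of each set.
    """
    synthetic_dcr = median(dcr(s, train) for s in synthetic)
    holdout_dcr = median(dcr(h, train) for h in holdout)
    return synthetic_dcr >= holdout_dcr


train = [(0.0, 0.0), (10.0, 10.0)]
holdout = [(1.0, 1.0), (9.0, 9.0)]

print(passes_dcr_test(train, holdout, synthetic=[(0.0, 0.0), (10.0, 10.0)]))  # False
print(passes_dcr_test(train, holdout, synthetic=[(3.0, 3.0), (7.0, 7.0)]))    # True
```

The first synthetic set copies the training records and fails; the second sits farther from the training data than the holdout set and passes, regardless of what else it might reveal.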

Title: Automated Parsing of Engineering Drawings for Structured Information Extraction Using a Fine-tuned Document Understanding Transformer

Authors: Muhammad Tayyab Khan, Zane Yong, Lequn Chen, Jun Ming Tan, Wenhe Feng, Seung Ki Moon
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01530
Pdf URL: https://arxiv.org/pdf/2505.01530
Copy Paste: [[2505.01530]] Automated Parsing of Engineering Drawings for Structured Information Extraction Using a Fine-tuned Document Understanding Transformer(https://arxiv.org/abs/2505.01530)
Keywords: extraction, transformer
Abstract: Accurate extraction of key information from 2D engineering drawings is crucial for high-precision manufacturing. Manual extraction is time-consuming and error-prone, while traditional Optical Character Recognition (OCR) techniques often struggle with complex layouts and overlapping symbols, resulting in unstructured outputs. To address these challenges, this paper proposes a novel hybrid deep learning framework for structured information extraction by integrating an oriented bounding box (OBB) detection model with a transformer-based document parsing model (Donut). An in-house annotated dataset is used to train YOLOv11 for detecting nine key categories: Geometric Dimensioning and Tolerancing (GD&T), General Tolerances, Measures, Materials, Notes, Radii, Surface Roughness, Threads, and Title Blocks. Detected OBBs are cropped into images and labeled to fine-tune Donut for structured JSON output. Fine-tuning strategies include a single model trained across all categories and category-specific models. Results show that the single model consistently outperforms category-specific ones across all evaluation metrics, achieving higher precision (94.77% for GD&T), recall (100% for most), and F1 score (97.3%), while reducing hallucination (5.23%). The proposed framework improves accuracy, reduces manual effort, and supports scalable deployment in precision-driven industries.
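
The reported figures can be cross-checked with the standard F1 formula. The sketch below assumes that the 94.77% precision, 100% recall, 97.3% F1, and 5.23% hallucination rate all refer to the same category (GD&T), and that hallucination is measured as one minus precision; the abstract does not state either assumption explicitly.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


precision = 0.9477  # reported GD&T precision
recall = 1.0        # reported recall "for most" categories (assumed to include GD&T)

print(f"F1: {100 * f1_score(precision, recall):.1f}%")    # F1: 97.3%
print(f"1 - precision: {100 * (1 - precision):.2f}%")     # 1 - precision: 5.23%
```

Both values match the abstract, which is consistent with the assumptions above but does not confirm them.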

Title: Rethinking RGB-Event Semantic Segmentation with a Novel Bidirectional Motion-enhanced Event Representation

Authors: Zhen Yao, Xiaowen Ying, Mooi Choo Chuah
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01548
Pdf URL: https://arxiv.org/pdf/2505.01548
Copy Paste: [[2505.01548]] Rethinking RGB-Event Semantic Segmentation with a Novel Bidirectional Motion-enhanced Event Representation(https://arxiv.org/abs/2505.01548)
Keywords: segmentation
Abstract: Event cameras capture motion dynamics, offering a unique modality with great potential in various computer vision tasks. However, RGB-Event fusion faces three intrinsic misalignments: (i) temporal, (ii) spatial, and (iii) modal misalignment. Existing voxel grid representations neglect temporal correlations between consecutive event windows, and their formulation with simple accumulation of asynchronous and sparse events is incompatible with the synchronous and dense nature of RGB modality. To tackle these challenges, we propose a novel event representation, Motion-enhanced Event Tensor (MET), which transforms sparse event voxels into a dense and temporally coherent form by leveraging dense optical flows and event temporal features. In addition, we introduce a Frequency-aware Bidirectional Flow Aggregation Module (BFAM) and a Temporal Fusion Module (TFM). BFAM leverages the frequency domain and MET to mitigate modal misalignment, while bidirectional flow aggregation and temporal fusion mechanisms resolve spatiotemporal misalignment. Experimental results on two large-scale datasets demonstrate that our framework significantly outperforms state-of-the-art RGB-Event semantic segmentation approaches. Our code is available at: this https URL.

Title: A Sensor Agnostic Domain Generalization Framework for Leveraging Geospatial Foundation Models: Enhancing Semantic Segmentation via Synergistic Pseudo-Labeling and Generative Learning

Authors: Anan Yaghmour, Melba M. Crawford, Saurabh Prasad
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01558
Pdf URL: https://arxiv.org/pdf/2505.01558
Copy Paste: [[2505.01558]] A Sensor Agnostic Domain Generalization Framework for Leveraging Geospatial Foundation Models: Enhancing Semantic Segmentation via Synergistic Pseudo-Labeling and Generative Learning(https://arxiv.org/abs/2505.01558)
Keywords: generative, segmentation
Abstract: Remote sensing enables a wide range of critical applications such as land cover and land use mapping, crop yield prediction, and environmental monitoring. Advances in satellite technology have expanded remote sensing datasets, yet high-performance segmentation models remain dependent on extensive labeled data, challenged by annotation scarcity and variability across sensors, illumination, and geography. Domain adaptation offers a promising solution to improve model generalization. This paper introduces a domain generalization approach to leveraging emerging geospatial foundation models by combining soft-alignment pseudo-labeling with source-to-target generative pre-training. We further provide new mathematical insights into MAE-based generative learning for domain-invariant feature learning. Experiments with hyperspectral and multispectral remote sensing datasets confirm our method's effectiveness in enhancing adaptability and segmentation.

Title: On the effectiveness of Large Language Models in the mechanical design domain

Authors: Daniele Grandi, Fabian Riquelme
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.01559
Pdf URL: https://arxiv.org/pdf/2505.01559
Copy Paste: [[2505.01559]] On the effectiveness of Large Language Models in the mechanical design domain(https://arxiv.org/abs/2505.01559)
Keywords: large language model
Abstract: In this work, we seek to understand the performance of large language models in the mechanical engineering domain. We leverage the semantic data found in the ABC dataset, specifically the assembly names that designers assigned to the overall assemblies, and the individual semantic part names that were assigned to each part. After pre-processing the data, we developed two unsupervised tasks to evaluate how different model architectures perform on domain-specific data: a binary sentence-pair classification task and a zero-shot classification task. We achieved a 0.62 accuracy for the binary sentence-pair classification task with a fine-tuned model that focuses on fighting over-fitting by 1) modifying learning rates, 2) adjusting dropout values, 3) adjusting sequence length, and 4) adding a multi-head attention layer. Our model on the zero-shot classification task outperforms the baselines by a wide margin, and achieves a top-1 classification accuracy of 0.386. The results shed some light on the specific failure modes that arise when learning from language in this domain.

Title: AI agents may be worth the hype but not the resources (yet): An initial exploration of machine translation quality and costs in three language pairs in the legal and news domains

Authors: Vicent Briva Iglesias, Gokhan Dogru
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.01560
Pdf URL: https://arxiv.org/pdf/2505.01560
Copy Paste: [[2505.01560]] AI agents may be worth the hype but not the resources (yet): An initial exploration of machine translation quality and costs in three language pairs in the legal and news domains(https://arxiv.org/abs/2505.01560)
Keywords: large language model
Abstract: Large language models (LLMs) and multi-agent orchestration are touted as the next leap in machine translation (MT), but their benefits relative to conventional neural MT (NMT) remain unclear. This paper offers an empirical reality check. We benchmark five paradigms, Google Translate (strong NMT baseline), GPT-4o (general-purpose LLM), o1-preview (reasoning-enhanced LLM), and two GPT-4o-powered agentic workflows (sequential three-stage and iterative refinement), on test data drawn from a legal contract and news prose in three English-source pairs: Spanish, Catalan and Turkish. Automatic evaluation is performed with COMET, BLEU, chrF2 and TER; human evaluation is conducted with expert ratings of adequacy and fluency; efficiency with total input-plus-output token counts mapped to April 2025 pricing. Automatic scores still favour the mature NMT system, which ranks first in seven of twelve metric-language combinations; o1-preview ties or places second in most remaining cases, while both multi-agent workflows trail. Human evaluation reverses part of this narrative: o1-preview produces the most adequate and fluent output in five of six comparisons, and the iterative agent edges ahead once, indicating that reasoning layers capture semantic nuance undervalued by surface metrics. Yet these qualitative gains carry steep costs. The sequential agent consumes roughly five times, and the iterative agent fifteen times, the tokens used by NMT or single-pass LLMs. We advocate multidimensional, cost-aware evaluation protocols and highlight research directions that could tip the balance: leaner coordination strategies, selective agent activation, and hybrid pipelines combining single-pass LLMs with targeted agent intervention.

Title: PainFormer: a Vision Foundation Model for Automatic Pain Assessment

Authors: Stefanos Gkikas, Raul Fernandez Rojas, Manolis Tsiknakis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01571
Pdf URL: https://arxiv.org/pdf/2505.01571
Copy Paste: [[2505.01571]] PainFormer: a Vision Foundation Model for Automatic Pain Assessment(https://arxiv.org/abs/2505.01571)
Keywords: transformer
Abstract: Pain is a manifold condition that impacts a significant percentage of the population. Accurate and reliable pain evaluation for the people suffering is crucial to developing effective and advanced pain management protocols. Automatic pain assessment systems provide continuous monitoring and support decision-making processes, ultimately aiming to alleviate distress and prevent functionality decline. This study introduces PainFormer, a vision foundation model based on multi-task learning principles trained simultaneously on 14 tasks/datasets with a total of 10.9 million samples. Functioning as an embedding extractor for various input modalities, the foundation model provides feature representations to the Embedding-Mixer, a transformer-based module that performs the final pain assessment. Extensive experiments employing behavioral modalities (including RGB, synthetic thermal, and estimated depth videos) and physiological modalities such as ECG, EMG, GSR, and fNIRS revealed that PainFormer effectively extracts high-quality embeddings from diverse input modalities. The proposed framework is evaluated on two pain datasets, BioVid and AI4Pain, and directly compared to 73 different methodologies documented in the literature. Experiments conducted in unimodal and multimodal settings demonstrate state-of-the-art performances across modalities and pave the way toward general-purpose models for automatic pain assessment.

Title: TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action

Authors: Jen-Hao Cheng, Vivian Wang, Huayu Wang, Huapeng Zhou, Yi-Hao Peng, Hou-I Liu, Hsiang-Wei Huang, Kuang-Ming Chen, Cheng-Yen Yang, Wenhao Chai, Yi-Ling Chen, Vibhav Vineet, Qin Cai, Jenq-Neng Hwang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01583
Pdf URL: https://arxiv.org/pdf/2505.01583
Copy Paste: [[2505.01583]] TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action(https://arxiv.org/abs/2505.01583)
Keywords: segmentation
Abstract: Understanding causal event relationships and achieving fine-grained temporal grounding in videos remain challenging for vision-language models. Existing methods either compress video tokens to reduce temporal resolution, or treat videos as unsegmented streams, which obscures fine-grained event boundaries and limits the modeling of causal dependencies. We propose TEMPURA (Temporal Event Masked Prediction and Understanding for Reasoning in Action), a two-stage training framework that enhances video temporal understanding. TEMPURA first applies masked event prediction reasoning to reconstruct missing events and generate step-by-step causal explanations from dense event annotations, drawing inspiration from effective infilling techniques. TEMPURA then learns to perform video segmentation and dense captioning to decompose videos into non-overlapping events with detailed, timestamp-aligned descriptions. We train TEMPURA on VER, a large-scale dataset curated by us that comprises 1M training instances and 500K videos with temporally aligned event descriptions and structured reasoning steps. Experiments on temporal grounding and highlight detection benchmarks demonstrate that TEMPURA outperforms strong baseline models, confirming that integrating causal reasoning with fine-grained temporal segmentation leads to improved video understanding.

Title: Understanding and Exploiting Plasticity for Non-stationary Network Resource Adaptation

Authors: Zhiqiang He, Zhi Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01584
Pdf URL: https://arxiv.org/pdf/2505.01584
Copy Paste: [[2505.01584]] Understanding and Exploiting Plasticity for Non-stationary Network Resource Adaptation(https://arxiv.org/abs/2505.01584)
Keywords: robust
Abstract: Adapting to non-stationary network conditions presents significant challenges for resource adaptation. However, current solutions primarily rely on stationary assumptions. While data-driven reinforcement learning approaches offer promising solutions for handling network dynamics, our systematic investigation reveals a critical limitation: neural networks suffer from plasticity loss, significantly impeding their ability to adapt to evolving network conditions. Through theoretical analysis of neural propagation mechanisms, we demonstrate that existing dormant neuron metrics inadequately characterize neural plasticity loss. To address this limitation, we have developed the Silent Neuron theory, which provides a more comprehensive framework for understanding plasticity degradation. Based on these theoretical insights, we propose the Reset Silent Neuron (ReSiN), which preserves neural plasticity through strategic neuron resets guided by both forward and backward propagation states. In our implementation of an adaptive video streaming system, ReSiN has shown significant improvements over existing solutions, achieving up to 168% higher bitrate and 108% better quality of experience (QoE) while maintaining comparable smoothness. Furthermore, ReSiN consistently outperforms in stationary environments, demonstrating its robust adaptability across different network conditions.

Title: Machine Learning Fairness in House Price Prediction: A Case Study of America's Expanding Metropolises

Authors: Abdalwahab Almajed, Maryam Tabar, Peyman Najafirad
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.01591
Pdf URL: https://arxiv.org/pdf/2505.01591
Copy Paste: [[2505.01591]] Machine Learning Fairness in House Price Prediction: A Case Study of America's Expanding Metropolises(https://arxiv.org/abs/2505.01591)
Keywords: protect, fair
Abstract: As a basic human need, housing plays a key role in enhancing health, well-being, and educational outcomes in society, and the housing market is a major factor for promoting quality of life and ensuring social equity. To improve housing conditions, there has been extensive research on building Machine Learning (ML)-driven house price prediction solutions to accurately forecast future conditions and help inform actions and policies in the field. In spite of their success in developing high-accuracy models, there is a gap in our understanding of the extent to which various ML-driven house price prediction approaches show ethnic and/or racial bias, which in turn is essential for the responsible use of ML and for ensuring that ML-driven solutions do not exacerbate inequity. To fill this gap, this paper develops several ML models from a combination of structural and neighborhood-level attributes, and conducts comprehensive assessments of the fairness of ML models under various definitions of privileged groups. As a result, it finds that the ML-driven house price prediction models show various levels of bias towards protected attributes (i.e., race and ethnicity in this study). Then, it investigates the performance of different bias mitigation solutions, and the experimental results show their various levels of effectiveness on different ML-driven methods. However, in general, the in-processing bias mitigation approach tends to be more effective than the pre-processing one in this problem domain. Our code is available at this https URL.

Title: PIPA: A Unified Evaluation Protocol for Diagnosing Interactive Planning Agents

Authors: Takyoung Kim, Janvijay Singh, Shuhaib Mehri, Emre Can Acikgoz, Sagnik Mukherjee, Nimet Beyza Bozdag, Sumuk Shashidhar, Gokhan Tur, Dilek Hakkani-Tür
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01592
Pdf URL: https://arxiv.org/pdf/2505.01592
Copy Paste: [[2505.01592]] PIPA: A Unified Evaluation Protocol for Diagnosing Interactive Planning Agents(https://arxiv.org/abs/2505.01592)
Keywords: large language model
Abstract: The growing capabilities of large language models (LLMs) in instruction-following and context-understanding lead to the era of agents with numerous applications. Among these, task planning agents have become especially prominent in realistic scenarios involving complex internal pipelines, such as context understanding, tool management, and response generation. However, existing benchmarks predominantly evaluate agent performance based on task completion as a proxy for overall effectiveness. We hypothesize that merely improving task completion is misaligned with maximizing user satisfaction, as users interact with the entire agentic process and not only the end result. To address this gap, we propose PIPA, a unified evaluation protocol that conceptualizes the behavioral process of interactive task planning agents within a partially observable Markov Decision Process (POMDP) paradigm. The proposed protocol offers a comprehensive assessment of agent performance through a set of atomic evaluation criteria, allowing researchers and practitioners to diagnose specific strengths and weaknesses within the agent's decision-making pipeline. Our analyses show that agents excel in different behavioral stages, with user satisfaction shaped by both outcomes and intermediate behaviors. We also highlight future directions, including systems that leverage multiple agents and the limitations of user simulators in task planning.

Title: Always Tell Me The Odds: Fine-grained Conditional Probability Estimation

Authors: Liaoyaqi Wang, Zhengping Jiang, Anqi Liu, Benjamin Van Durme
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.01595
Pdf URL: https://arxiv.org/pdf/2505.01595
Copy Paste: [[2505.01595]] Always Tell Me The Odds: Fine-grained Conditional Probability Estimation(https://arxiv.org/abs/2505.01595)
Keywords: large language model
Abstract: We present a state-of-the-art model for fine-grained probability estimation of propositions conditioned on context. Recent advances in large language models (LLMs) have significantly enhanced their reasoning capabilities, particularly on well-defined tasks with complete information. However, LLMs continue to struggle with making accurate and well-calibrated probabilistic predictions under uncertainty or partial information. While incorporating uncertainty into model predictions often boosts performance, obtaining reliable estimates of that uncertainty remains understudied. In particular, LLM probability estimates tend to be coarse and biased towards more frequent numbers. Through a combination of human and synthetic data creation and assessment, scaling to larger models, and better supervision, we propose a set of strong and precise probability estimation models. We conduct systematic evaluations across tasks that rely on conditional probability estimation and show that our approach consistently outperforms existing fine-tuned and prompting-based methods by a large margin.

Title: Multimodal and Multiview Deep Fusion for Autonomous Marine Navigation

Authors: Dimitrios Dagdilelis, Panagiotis Grigoriadis, Roberto Galeazzi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01615
Pdf URL: https://arxiv.org/pdf/2505.01615
Copy Paste: [[2505.01615]] Multimodal and Multiview Deep Fusion for Autonomous Marine Navigation(https://arxiv.org/abs/2505.01615)
Keywords: robust, transformer
Abstract: We propose a cross-attention transformer-based method for multimodal sensor fusion to build a bird's-eye view of a vessel's surroundings, supporting safer autonomous marine navigation. The model deeply fuses multiview RGB and long-wave infrared images with sparse LiDAR point clouds. Training also integrates X-band radar and electronic chart data to inform predictions. The resulting view provides a detailed, reliable scene representation, improving navigational accuracy and robustness. Real-world sea trials confirm the method's effectiveness even in adverse weather and complex maritime settings.

Title: Don't be lazy: CompleteP enables compute-efficient deep transformers

Authors: Nolan Dey, Bin Claire Zhang, Lorenzo Noci, Mufan Li, Blake Bordelon, Shane Bergsma, Cengiz Pehlevan, Boris Hanin, Joel Hestness
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01618
Pdf URL: https://arxiv.org/pdf/2505.01618
Copy Paste: [[2505.01618]] Don't be lazy: CompleteP enables compute-efficient deep transformers(https://arxiv.org/abs/2505.01618)
Keywords: transformer
Abstract: We study compute efficiency of LLM training when using different parameterizations, i.e., rules for adjusting model and optimizer hyperparameters (HPs) as model size changes. Some parameterizations fail to transfer optimal base HPs (such as learning rate) across changes in model depth, requiring practitioners to either re-tune these HPs as they scale up (expensive), or accept sub-optimal training when re-tuning is prohibitive. Even when they achieve HP transfer, we develop theory to show parameterizations may still exist in the lazy learning regime where layers learn only features close to their linearization, preventing effective use of depth and nonlinearity. Finally, we identify and adopt the unique parameterization we call CompleteP that achieves both depth-wise HP transfer and non-lazy learning in all layers. CompleteP enables a wider range of model width/depth ratios to remain compute-efficient, unlocking shapes better suited for different hardware settings and operational contexts. Moreover, CompleteP enables 12-34% compute efficiency improvements over the prior state-of-the-art.

Title: A Domain Adaptation of Large Language Models for Classifying Mechanical Assembly Components

Authors: Fatemeh Elhambakhsh, Daniele Grandi, Hyunwoong Ko
Subjects: cs.LG, cs.CE
Abstract URL: https://arxiv.org/abs/2505.01627
Pdf URL: https://arxiv.org/pdf/2505.01627
Copy Paste: [[2505.01627]] A Domain Adaptation of Large Language Models for Classifying Mechanical Assembly Components(https://arxiv.org/abs/2505.01627)
Keywords: large language model
Abstract: The conceptual design phase represents a critical early stage in the product development process, where designers generate potential solutions that meet predefined design specifications based on functional requirements. Functional modeling, a foundational aspect of this phase, enables designers to reason about product functions before specific structural details are determined. A widely adopted approach to functional modeling is the Function-Behavior-Structure (FBS) framework, which supports the transformation of functional intent into behavioral and structural descriptions. However, the effectiveness of function-based design is often hindered by the lack of well-structured and comprehensive functional data. This scarcity can negatively impact early design decision-making and hinder the development of accurate behavioral models. Recent advances in Large Language Models (LLMs), such as those based on GPT architectures, offer a promising avenue to address this gap. LLMs have demonstrated significant capabilities in language understanding and natural language processing (NLP), making them suitable for automated classification tasks. This study proposes a novel LLM-based domain adaptation (DA) framework using fine-tuning for the automated classification of mechanical assembly parts' functions. By fine-tuning LLMs on domain-specific datasets, the traditionally manual and subjective process of function annotation can be improved in both accuracy and consistency. A case study demonstrates fine-tuning GPT-3.5 Turbo on data from the Oregon State Design Repository (OSDR), and evaluation on the A Big CAD (ABC) dataset shows that the domain-adapted LLM can generate high-quality functional data, enhancing the semantic representation of mechanical parts and supporting more effective design exploration in early-phase engineering.

Title: Toward Onboard AI-Enabled Solutions to Space Object Detection for Space Sustainability

Authors: Wenxuan Zhang, Peng Hu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2505.01650
Pdf URL: https://arxiv.org/pdf/2505.01650
Copy Paste: [[2505.01650]] Toward Onboard AI-Enabled Solutions to Space Object Detection for Space Sustainability(https://arxiv.org/abs/2505.01650)
Keywords: transformer
Abstract: The rapid expansion of advanced low-Earth orbit (LEO) satellites in large constellations is positioning space assets as key to the future, enabling global internet access and relay systems for deep space missions. A solution to the challenge is effective space object detection (SOD) for collision assessment and avoidance. In SOD, an LEO satellite must detect other satellites and objects with high precision and minimal delay. This paper investigates the feasibility and effectiveness of employing vision sensors for SOD tasks based on deep learning (DL) models. It introduces models based on the Squeeze-and-Excitation (SE) layer, Vision Transformer (ViT), and the Generalized Efficient Layer Aggregation Network (GELAN) and evaluates their performance under SOD scenarios. Experimental results show that the proposed models achieve mean average precision at an intersection over union threshold of 0.5 (mAP50) of up to 0.751 and mean average precision averaged over intersection over union thresholds from 0.5 to 0.95 (mAP50:95) of up to 0.280. Compared to the baseline GELAN-t model, the proposed GELAN-ViT-SE model increases the average mAP50 from 0.721 to 0.751, improves the mAP50:95 from 0.266 to 0.274, reduces giga floating point operations (GFLOPs) from 7.3 to 5.6, and lowers peak power consumption by 2.5%, from 2080.7 mW to 2028.7 mW.
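Note on the metric: mAP50 treats a predicted box as a correct detection when its intersection over union (IoU) with a ground-truth box is at least 0.5. The sketch below illustrates only that matching criterion, not the paper's evaluation code. It assumes axis-aligned boxes given as (x1, y1, x2, y2); the function names `iou` and `is_match_at_50` are illustrative and do not come from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match_at_50(pred_box, gt_box):
    """True when the prediction meets the IoU >= 0.5 criterion used by mAP50."""
    return iou(pred_box, gt_box) >= 0.5
```

mAP50:95 repeats the same matching at IoU thresholds from 0.5 to 0.95 and averages the resulting precision values, which is why it is always the stricter of the two numbers.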

Title: Causally Fair Node Classification on Non-IID Graph Data

Authors: Yucong Dai, Lu Zhang, Yaowei Hu, Susan Gauch, Yongkai Wu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01652
Pdf URL: https://arxiv.org/pdf/2505.01652
Copy Paste: [[2505.01652]] Causally Fair Node Classification on Non-IID Graph Data(https://arxiv.org/abs/2505.01652)
Keywords: fair
Abstract: Fair machine learning seeks to identify and mitigate biases in predictions against unfavorable populations characterized by demographic attributes, such as race and gender. Recently, a few works have extended fairness to graph data, such as social networks, but most of them neglect the causal relationships among data instances. This paper addresses the prevalent challenge in fairness-aware ML algorithms, which typically assume Independent and Identically Distributed (IID) data. We tackle the overlooked domain of non-IID, graph-based settings where data instances are interconnected, influencing the outcomes of fairness interventions. We base our research on the Network Structural Causal Model (NSCM) framework and posit two main assumptions: Decomposability and Graph Independence, which enable the computation of interventional distributions in non-IID settings using the $do$-calculus. Based on that, we develop the Message Passing Variational Autoencoder for Causal Inference (MPVA) to compute interventional distributions and facilitate causally fair node classification through estimated interventional distributions. Empirical evaluations on semi-synthetic and real-world datasets demonstrate that MPVA outperforms conventional methods by effectively approximating interventional distributions and mitigating bias. The implications of our findings underscore the potential of causality-based fairness in complex ML applications, setting the stage for further research into relaxing the initial assumptions to enhance model fairness.

Title: A Novel WaveInst-based Network for Tree Trunk Structure Extraction and Pattern Analysis in Forest Inventory

Authors: Chenyang Fan, Xujie Zhu, Taige Luo, Sheng Xu, Zhulin Chen, Hongxin Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01656
Pdf URL: https://arxiv.org/pdf/2505.01656
Copy Paste: [[2505.01656]] A Novel WaveInst-based Network for Tree Trunk Structure Extraction and Pattern Analysis in Forest Inventory(https://arxiv.org/abs/2505.01656)
Keywords: extraction, segmentation
Abstract: The pattern analysis of tree structure holds significant scientific value for genetic breeding and forestry management. Current trunk and branch extraction technologies are mainly LiDAR-based or UAV-based. The former obtain high-precision 3D data, but their equipment cost is high and the 3D data processing is complex. The latter efficiently capture canopy information, but they miss the 3D structure of trees. To extract branch information despite complex background interference and occlusion, this work proposes a novel WaveInst instance segmentation framework, involving a discrete wavelet transform, to enhance multi-scale edge information and improve the accuracy of tree structure extraction. Experimental results of the proposed model show superior performance on SynthTree43k, CaneTree100, Urban Street and our PoplarDataset. Moreover, we present a new phenotypic dataset, PoplarDataset, which is dedicated to tree structure extraction and pattern analysis in artificial forests. The proposed method achieves a mean average precision of 49.6 and 24.3 for the structure extraction of mature and juvenile trees, respectively, surpassing the existing state-of-the-art method by 9.9. Furthermore, by integrating the segmentation model with the regression model, we accurately obtain significant tree growth parameters, such as the location of trees, the diameter-at-breast-height of individual trees, and the plant height, directly from 2D images. This study provides a scientific basis and ample data for tree structure analysis in phenotype research, offering a platform for significant applications in precision forestry, ecological monitoring, and intelligent breeding.

Title: A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency

Authors: Sihyeong Park, Sungryeol Jeon, Chaelyn Lee, Seokhun Jeon, Byung-Soo Kim, Jemin Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.01658
Pdf URL: https://arxiv.org/pdf/2505.01658
Copy Paste: [[2505.01658]] A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency(https://arxiv.org/abs/2505.01658)
Keywords: security, large language model
Abstract: Large language models (LLMs) are widely applied in chatbots, code generators, and search engines. Workloads such as chain-of-thought, complex reasoning, and agent services significantly increase the inference cost by invoking the model repeatedly. Optimization methods such as parallelism, compression, and caching have been adopted to reduce costs, but the diverse service requirements make it hard to select the right method. Recently, specialized LLM inference engines have emerged as a key component for integrating the optimization methods into service-oriented infrastructures. However, a systematic study on inference engines is still lacking. This paper provides a comprehensive evaluation of 25 open-source and commercial inference engines. We examine each inference engine in terms of ease-of-use, ease-of-deployment, general-purpose support, scalability, and suitability for throughput- and latency-aware computation. Furthermore, we explore the design goals of each inference engine by investigating the optimization techniques it supports. In addition, we assess the ecosystem maturity of open source inference engines and handle the performance and cost policy of commercial solutions. We outline future research directions that include support for complex LLM-based services, support of various hardware, and enhanced security, offering practical guidance to researchers and developers in selecting and designing optimized LLM inference engines. We also provide a public repository to continually track developments in this fast-evolving field: this https URL

Title: Automated ARAT Scoring Using Multimodal Video Analysis, Multi-View Fusion, and Hierarchical Bayesian Models: A Clinician Study

Authors: Tamim Ahmed, Thanassis Rikakis
Subjects: cs.CV, cs.AI, cs.HC, math.PR
Abstract URL: https://arxiv.org/abs/2505.01680
Pdf URL: https://arxiv.org/pdf/2505.01680
Copy Paste: [[2505.01680]] Automated ARAT Scoring Using Multimodal Video Analysis, Multi-View Fusion, and Hierarchical Bayesian Models: A Clinician Study(https://arxiv.org/abs/2505.01680)
Keywords: interpretability, transformer
Abstract: Manual scoring of the Action Research Arm Test (ARAT) for upper extremity assessment in stroke rehabilitation is time-intensive and variable. We propose an automated ARAT scoring system integrating multimodal video analysis with SlowFast, I3D, and Transformer-based models using OpenPose keypoints and object locations. Our approach employs multi-view data (ipsilateral, contralateral, and top perspectives), applying early and late fusion to combine features across views and models. Hierarchical Bayesian Models (HBMs) infer movement quality components, enhancing interpretability. A clinician dashboard displays task scores, execution times, and quality assessments. We conducted a study with five clinicians who reviewed 500 video ratings generated by our system, providing feedback on its accuracy and usability. Evaluated on a stroke rehabilitation dataset, our framework achieves 89.0% validation accuracy with late fusion, with HBMs aligning closely with manual assessments. This work advances automated rehabilitation by offering a scalable, interpretable solution with clinical validation.

Title: High-Fidelity Pseudo-label Generation by Large Language Models for Training Robust Radiology Report Classifiers

Authors: Brian Wong, Kaito Tanaka
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.01693
Pdf URL: https://arxiv.org/pdf/2505.01693
Copy Paste: [[2505.01693]] High-Fidelity Pseudo-label Generation by Large Language Models for Training Robust Radiology Report Classifiers(https://arxiv.org/abs/2505.01693)
Keywords: robust, transformer, large language model
Abstract: Automated labeling of chest X-ray reports is essential for enabling downstream tasks such as training image-based diagnostic models, population health studies, and clinical decision support. However, the high variability, complexity, and prevalence of negation and uncertainty in these free-text reports pose significant challenges for traditional Natural Language Processing methods. While large language models (LLMs) demonstrate strong text understanding, their direct application for large-scale, efficient labeling is limited by computational cost and speed. This paper introduces DeBERTa-RAD, a novel two-stage framework that combines the power of state-of-the-art LLM pseudo-labeling with efficient DeBERTa-based knowledge distillation for accurate and fast chest X-ray report labeling. We leverage an advanced LLM to generate high-quality pseudo-labels, including certainty statuses, for a large corpus of reports. Subsequently, a DeBERTa-Base model is trained on this pseudo-labeled data using a tailored knowledge distillation strategy. Evaluated on the expert-annotated MIMIC-500 benchmark, DeBERTa-RAD achieves a state-of-the-art Macro F1 score of 0.9120, significantly outperforming established rule-based systems, fine-tuned transformer models, and direct LLM inference, while maintaining a practical inference speed suitable for high-throughput applications. Our analysis shows particular strength in handling uncertain findings. This work demonstrates a promising path to overcome data annotation bottlenecks and achieve high-performance medical text processing through the strategic combination of LLM capabilities and efficient student models trained via distillation.

Title: Component-Based Fairness in Face Attribute Classification with Bayesian Network-informed Meta Learning

Authors: Yifan Liu, Ruichen Yao, Yaokun Liu, Ruohan Zong, Zelin Li, Yang Zhang, Dong Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01699
Pdf URL: https://arxiv.org/pdf/2505.01699
Copy Paste: [[2505.01699]] Component-Based Fairness in Face Attribute Classification with Bayesian Network-informed Meta Learning(https://arxiv.org/abs/2505.01699)
Keywords: fair
Abstract: The widespread integration of face recognition technologies into various applications (e.g., access control and personalized advertising) necessitates a critical emphasis on fairness. While previous efforts have focused on demographic fairness, the fairness of individual biological face components remains unexplored. In this paper, we focus on face component fairness, a fairness notion defined by biological face features. To the best of our knowledge, ours is the first work to mitigate bias in face attribute prediction at the biological feature level. In this work, we identify two key challenges in optimizing face component fairness: attribute label scarcity and attribute inter-dependencies, both of which limit the effectiveness of bias mitigation from previous approaches. To address these issues, we propose Bayesian Network-informed Meta Reweighting (BNMR), which incorporates a Bayesian Network calibrator to guide an adaptive meta-learning-based sample reweighting process. During the training process of our approach, the Bayesian Network calibrator dynamically tracks model bias and encodes prior probabilities for face component attributes to overcome the above challenges. To demonstrate the efficacy of our approach, we conduct extensive experiments on a large-scale real-world human face dataset. Our results show that BNMR consistently outperforms recent face bias mitigation baselines. Moreover, our results suggest a positive impact of face component fairness on the commonly considered demographic fairness (e.g., gender). Our findings pave the way for new research avenues on face component fairness, suggesting that face component fairness could serve as a potential surrogate objective for demographic fairness. The code for our work is publicly available at this https URL.

Title: Knowledge-Augmented Language Models Interpreting Structured Chest X-Ray Findings

Authors: Alexander Davis, Rafael Souza, Jia-Hao Lim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01711
Pdf URL: https://arxiv.org/pdf/2505.01711
Copy Paste: [[2505.01711]] Knowledge-Augmented Language Models Interpreting Structured Chest X-Ray Findings(https://arxiv.org/abs/2505.01711)
Keywords: large language model
Abstract: Automated interpretation of chest X-rays (CXR) is a critical task with the potential to significantly improve clinical workflow and patient care. While recent advances in multimodal foundation models have shown promise, effectively leveraging the full power of large language models (LLMs) for this visual task remains an underexplored area. This paper introduces CXR-TextInter, a novel framework that repurposes powerful text-centric LLMs for CXR interpretation by operating solely on a rich, structured textual representation of the image content, generated by an upstream image analysis pipeline. We augment this LLM-centric approach with an integrated medical knowledge module to enhance clinical reasoning. To facilitate training and evaluation, we developed the MediInstruct-CXR dataset, containing structured image representations paired with diverse, clinically relevant instruction-response examples, and the CXR-ClinEval benchmark for comprehensive assessment across various interpretation tasks. Extensive experiments on CXR-ClinEval demonstrate that CXR-TextInter achieves state-of-the-art quantitative performance across pathology detection, report generation, and visual question answering, surpassing existing multimodal foundation models. Ablation studies confirm the critical contribution of the knowledge integration module. Furthermore, blinded human evaluation by board-certified radiologists shows a significant preference for the clinical quality of outputs generated by CXR-TextInter. Our work validates an alternative paradigm for medical image AI, showcasing the potential of harnessing advanced LLM capabilities when visual information is effectively structured and domain knowledge is integrated.

Title: Vision and Intention Boost Large Language Model in Long-Term Action Anticipation

Authors: Congqi Cao, Lanshu Hu, Yating Yu, Yanning Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01713
Pdf URL: https://arxiv.org/pdf/2505.01713
Copy Paste: [[2505.01713]] Vision and Intention Boost Large Language Model in Long-Term Action Anticipation(https://arxiv.org/abs/2505.01713)
Keywords: large language model
Abstract: Long-term action anticipation (LTA) aims to predict future actions over an extended period. Previous approaches primarily focus on learning exclusively from video data but lack prior knowledge. Recent studies leverage large language models (LLMs) by utilizing text-based inputs, which suffer severe information loss. To tackle the limitations that single-modality methods face, we propose a novel Intention-Conditioned Vision-Language (ICVL) model that fully leverages the rich semantic information of visual data and the powerful reasoning capabilities of LLMs. Considering intention as a high-level concept guiding the evolution of actions, we first propose to employ a vision-language model (VLM) to infer behavioral intentions as comprehensive textual features directly from video inputs. The inferred intentions are then fused with visual features through a multi-modality fusion strategy, resulting in intention-enhanced visual representations. These enhanced visual representations, along with textual prompts, are fed into an LLM for future action anticipation. Furthermore, we propose an effective example selection strategy that jointly considers visual and textual similarities, providing more relevant and informative examples for in-context learning. Extensive experiments with state-of-the-art performance on the Ego4D, EPIC-Kitchens-55, and EGTEA GAZE+ datasets demonstrate the effectiveness and superiority of the proposed method.

Title: Probabilistic Interactive 3D Segmentation with Hierarchical Neural Processes

Authors: Jie Liu, Pan Zhou, Zehao Xiao, Jiayi Shen, Wenzhe Yin, Jan-Jakob Sonke, Efstratios Gavves
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01726
Pdf URL: https://arxiv.org/pdf/2505.01726
Copy Paste: [[2505.01726]] Probabilistic Interactive 3D Segmentation with Hierarchical Neural Processes(https://arxiv.org/abs/2505.01726)
Keywords: segmentation
Abstract: Interactive 3D segmentation has emerged as a promising solution for generating accurate object masks in complex 3D scenes by incorporating user-provided clicks. However, two critical challenges remain underexplored: (1) effectively generalizing from sparse user clicks to produce accurate segmentation, and (2) quantifying predictive uncertainty to help users identify unreliable regions. In this work, we propose NPISeg3D, a novel probabilistic framework that builds upon Neural Processes (NPs) to address these challenges. Specifically, NPISeg3D introduces a hierarchical latent variable structure with scene-specific and object-specific latent variables to enhance few-shot generalization by capturing both global context and object-specific characteristics. Additionally, we design a probabilistic prototype modulator that adaptively modulates click prototypes with object-specific latent variables, improving the model's ability to capture object-aware context and quantify predictive uncertainty. Experiments on four 3D point cloud datasets demonstrate that NPISeg3D achieves superior segmentation performance with fewer clicks while providing reliable uncertainty estimations.

Title: PosePilot: Steering Camera Pose for Generative World Models with Self-supervised Depth

Authors: Bu Jin, Weize Li, Baihan Yang, Zhenxin Zhu, Junpeng Jiang, Huan-ang Gao, Haiyang Sun, Kun Zhan, Hengtong Hu, Xueyang Zhang, Peng Jia, Hao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01729
Pdf URL: https://arxiv.org/pdf/2505.01729
Copy Paste: [[2505.01729]] PosePilot: Steering Camera Pose for Generative World Models with Self-supervised Depth(https://arxiv.org/abs/2505.01729)
Keywords: robust, diffusion, generative
Abstract: Recent advancements in autonomous driving (AD) systems have highlighted the potential of world models in achieving robust and generalizable performance across both ordinary and challenging driving conditions. However, a key challenge remains: precise and flexible camera pose control, which is crucial for accurate viewpoint transformation and realistic simulation of scene dynamics. In this paper, we introduce PosePilot, a lightweight yet powerful framework that significantly enhances camera pose controllability in generative world models. Drawing inspiration from self-supervised depth estimation, PosePilot leverages structure-from-motion principles to establish a tight coupling between camera pose and video generation. Specifically, we incorporate self-supervised depth and pose readouts, allowing the model to infer depth and relative camera motion directly from video sequences. These outputs drive pose-aware frame warping, guided by a photometric warping loss that enforces geometric consistency across synthesized frames. To further refine camera pose estimation, we introduce a reverse warping step and a pose regression loss, improving viewpoint precision and adaptability. Extensive experiments on autonomous driving and general-domain video datasets demonstrate that PosePilot significantly enhances structural understanding and motion reasoning in both diffusion-based and auto-regressive world models. By steering camera pose with self-supervised depth, PosePilot sets a new benchmark for pose controllability, enabling physically consistent, reliable viewpoint synthesis in generative world models.

Title: Efficient Shapley Value-based Non-Uniform Pruning of Large Language Models

Authors: Chuan Sun, Han Yu, Lizhen Cui
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01731
Pdf URL: https://arxiv.org/pdf/2505.01731
Copy Paste: [[2505.01731]] Efficient Shapley Value-based Non-Uniform Pruning of Large Language Models(https://arxiv.org/abs/2505.01731)
Keywords: transformer, large language model
Abstract: Pruning large language models (LLMs) is a promising solution for reducing model sizes and computational complexity while preserving performance. Traditional layer-wise pruning methods often adopt a uniform sparsity approach across all layers, which leads to suboptimal performance because it does not account for the varying significance of individual transformer layers within the model. To this end, we propose the Shapley Value-based Non-Uniform Pruning method for LLMs. This approach quantifies the contribution of each transformer layer to the overall model performance, enabling the assignment of tailored pruning budgets to different layers to retain critical parameters. To further improve efficiency, we design the Sliding Window-based Shapley Value approximation method. It substantially reduces computational overhead compared to exact SV calculation methods. Extensive experiments on various LLMs including LLaMA-v1, LLaMA-v2 and OPT demonstrate the effectiveness of the proposed approach. The results reveal that non-uniform pruning significantly enhances the performance of pruned models. Notably, the proposed method achieves a reduction in perplexity (PPL) of 18.01% and 19.55% on LLaMA-7B and LLaMA-13B, respectively, compared to SparseGPT at 70% sparsity.
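Note on the underlying quantity: a Shapley value is a player's marginal contribution averaged over every order in which players can join a coalition. The sketch below computes exact Shapley values by enumerating permutations for a three-player toy game. It is not the paper's sliding-window approximation, which the abstract does not specify. The layer names and the `toy_value` scores are hypothetical stand-ins for "model performance with these layers kept".

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings.

    `value` maps a frozenset of players to a number. The cost grows
    factorially with the number of players, so this only suits tiny examples.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal gain from adding p to the players that came before it.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_value(coalition):
    """Hypothetical score: additive per-layer terms plus one pairwise synergy."""
    base = {"layer0": 3.0, "layer1": 1.0, "layer2": 2.0}
    score = sum(base[p] for p in coalition)
    if {"layer0", "layer2"} <= coalition:
        score += 2.0  # bonus only when both layers are present
    return score


if __name__ == "__main__":
    print(shapley_values(["layer0", "layer1", "layer2"], toy_value))
```

In this toy game the synergy bonus is split evenly between the two layers that create it, and the values sum to the score of the full coalition. The factorial cost of exact enumeration is the reason approximations are needed at the scale of a real transformer.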

Title: Co$^{3}$Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion

Authors: Xingqun Qi, Yatian Wang, Hengyuan Zhang, Jiahao Pan, Wei Xue, Shanghang Zhang, Wenhan Luo, Qifeng Liu, Yike Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01746
Pdf URL: https://arxiv.org/pdf/2505.01746
Copy Paste: [[2505.01746]] Co$^{3}$Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion(https://arxiv.org/abs/2505.01746)
Keywords: diffusion
Abstract: Generating gestures from human speech has made tremendous progress in animating virtual avatars. While existing methods can synthesize gestures for a single self-talking speaker, they overlook the practicality of concurrent gesture modeling in two-person interactive conversations. Moreover, the lack of high-quality datasets with concurrent co-speech gestures also limits progress on this problem. To address this, we first construct a large-scale concurrent co-speech gesture dataset that contains more than 7M frames for diverse two-person interactive posture sequences, dubbed GES-Inter. Additionally, we propose Co$^3$Gesture, a novel framework that enables coherent concurrent co-speech gesture synthesis including two-person interactive movements. Considering the asymmetric body dynamics of two speakers, our framework is built upon two cooperative generation branches conditioned on separated speaker audio. Specifically, to enhance the coordination of human postures with respect to the corresponding speaker audio while interacting with the conversational partner, we present a Temporal Interaction Module (TIM). TIM can effectively model the temporal association representation between two speakers' gesture sequences as interaction guidance and fuse it into the concurrent gesture generation. Then, we devise a mutual attention mechanism to further holistically boost the learning of dependencies between interacting concurrent motions, thereby enabling us to generate vivid and coherent gestures. Extensive experiments demonstrate that our method outperforms the state-of-the-art models on our newly collected GES-Inter dataset. The dataset and source code are publicly available at this https URL.

Title: Unified Steganography via Implicit Neural Representation

Authors: Qi Song, Ziyuan Luo, Xiufeng Huang, Sheng Li, Renjie Wan
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.01749
Pdf URL: https://arxiv.org/pdf/2505.01749
Copy Paste: [[2505.01749]] Unified Steganography via Implicit Neural Representation(https://arxiv.org/abs/2505.01749)
Keywords: security, privacy
Abstract: Digital steganography is the practice of concealing data for encrypted transmission. Typically, steganography methods embed secret data into cover data to create stega data that incorporates the hidden secret data. However, steganography techniques often require designing specific frameworks for each data type, which restricts their generalizability. In this paper, we present U-INR, a novel method for steganography via Implicit Neural Representation (INR). Rather than using a specific framework for each data format, we directly use the neurons of the INR network to represent the secret data and cover data across different data types. To realize this idea, a private key is shared between the data sender and receivers. Such a private key can be used to determine the position of secret data in INR networks. To effectively leverage this key, we further introduce a key-based selection strategy that can be used to determine the position within the INRs for data storage. Comprehensive experiments across multiple data types, including images, videos, audio, SDF, and NeRF, demonstrate the generalizability and effectiveness of U-INR, emphasizing its potential for improving data security and privacy in various applications.

Title: Same evaluation, more tokens: On the effect of input length for machine translation evaluation using Large Language Models

Authors: Tobias Domhan, Dawei Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.01761
Pdf URL: https://arxiv.org/pdf/2505.01761
Copy Paste: [[2505.01761]] Same evaluation, more tokens: On the effect of input length for machine translation evaluation using Large Language Models(https://arxiv.org/abs/2505.01761)
Keywords: large language model
Abstract: Accurately evaluating machine-translated text remains a long-standing challenge, particularly for long documents. Recent work has shown that large language models (LLMs) can serve as reliable and interpretable sentence-level translation evaluators via MQM error span annotations. With modern LLMs supporting larger context windows, a natural question arises: can we feed entire document translations into an LLM for quality assessment? Ideally, evaluation should be invariant to text length, producing consistent error spans regardless of input granularity. However, our analysis shows that text length significantly impacts evaluation: longer texts lead to fewer error spans and reduced system ranking accuracy. To address this limitation, we evaluate several strategies, including granularity-aligned prompting, Focus Sentence Prompting (FSP), and a fine-tuning approach to better align LLMs with the evaluation task. The latter two methods largely mitigate this length bias, making LLMs more reliable for long-form translation evaluation.

Title: Multimodal Graph Representation Learning for Robust Surgical Workflow Recognition with Adversarial Feature Disentanglement

Authors: Long Bai, Boyi Ma, Ruohan Wang, Guankun Wang, Beilei Cui, Zhongliang Jiang, Mobarakol Islam, Zhe Min, Jiewen Lai, Nassir Navab, Hongliang Ren
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2505.01766
Pdf URL: https://arxiv.org/pdf/2505.01766
Copy Paste: [[2505.01766]] Multimodal Graph Representation Learning for Robust Surgical Workflow Recognition with Adversarial Feature Disentanglement(https://arxiv.org/abs/2505.01766)
Keywords: robust
Abstract: Surgical workflow recognition is vital for automating tasks, supporting decision-making, and training novice surgeons, ultimately improving patient safety and standardizing procedures. However, data corruption can lead to performance degradation due to issues like occlusion from bleeding or smoke in surgical scenes and problems with data storage and transmission. In this case, we explore a robust graph-based multimodal approach to integrating vision and kinematic data to enhance accuracy and reliability. Vision data captures dynamic surgical scenes, while kinematic data provides precise movement information, overcoming limitations of visual recognition under adverse conditions. We propose a multimodal Graph Representation network with Adversarial feature Disentanglement (GRAD) for robust surgical workflow recognition in challenging scenarios with domain shifts or corrupted data. Specifically, we introduce a Multimodal Disentanglement Graph Network that captures fine-grained visual information while explicitly modeling the complex relationships between vision and kinematic embeddings through graph-based message modeling. To align feature spaces across modalities, we propose a Vision-Kinematic Adversarial framework that leverages adversarial training to reduce modality gaps and improve feature consistency. Furthermore, we design a Contextual Calibrated Decoder, incorporating temporal and contextual priors to enhance robustness against domain shifts and corrupted data. Extensive comparative and ablation experiments demonstrate the effectiveness of our model and proposed modules. Moreover, our robustness experiments show that our method effectively handles data corruption during storage and transmission, exhibiting excellent stability and robustness. Our approach aims to advance automated surgical workflow recognition, addressing the complexities and dynamism inherent in surgical procedures.

Title: Energy-Efficient NTT Sampler for Kyber Benchmarked on FPGA

Authors: Paresh Baidya, Rourab Paul, Vikas Srivastava, Sumit Kumar Debnath
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.01782
Pdf URL: https://arxiv.org/pdf/2505.01782
Copy Paste: [[2505.01782]] Energy-Efficient NTT Sampler for Kyber Benchmarked on FPGA(https://arxiv.org/abs/2505.01782)
Keywords: security
Abstract: Kyber is a lattice-based key encapsulation mechanism selected for standardization by the NIST Post-Quantum Cryptography (PQC) project. A critical component of Kyber's key generation process is the sampling of matrix elements from a uniform distribution over the ring Rq. This step is one of the most computationally intensive tasks in the scheme, significantly impacting performance in low-power embedded systems such as Internet of Things (IoT), wearable devices, wireless sensor networks (WSNs), smart cards, TPMs (Trusted Platform Modules), etc. Existing approaches to this sampling, notably conventional SampleNTT and Parse-SPDM3, rely on rejection sampling. Both algorithms require a large number of random bytes, which requires at least three SHAKE-128 squeezing steps per polynomial. As a result, they incur significant latency and energy consumption. In this work, we propose a novel and efficient sampling algorithm, namely Modified SampleNTT, which substantially reduces the average number of bits required from SHAKE-128 to generate elements in Rq, achieving approximately a 33% reduction compared to conventional SampleNTT. Modified SampleNTT achieves 99.16% success in generating a complete polynomial using only two SHAKE-128 squeezes, outperforming both state-of-the-art methods, which never succeed in two squeezes of SHAKE-128. Furthermore, our algorithm maintains the same average rejection rate as existing techniques and passes all standard statistical tests for randomness quality. FPGA implementation on Artix-7 demonstrates a 33.14% reduction in energy, 33.32% lower latency, and 0.28% fewer slices compared to SampleNTT. Our results confirm that Modified SampleNTT is an efficient and practical alternative for uniform polynomial sampling in PQC schemes such as Kyber, especially for low-power security processors.
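
Illustrative sketch (not from the paper): the baseline the abstract compares against, conventional SampleNTT, is a rejection sampler that reads three XOF bytes at a time, forms two 12-bit candidates, and keeps those below q = 3329. The code below follows that conventional scheme and does not implement the paper's Modified SampleNTT. The function name `sample_ntt` and the all-zero 34-byte seed are assumptions for illustration. Because `hashlib` has no incremental squeeze, the sketch re-requests a longer digest and counts 168-byte SHAKE-128 rate blocks as squeezes.

```python
import hashlib

Q = 3329     # Kyber modulus
N = 256      # coefficients per polynomial
RATE = 168   # SHAKE-128 output bytes per squeeze (rate block)


def sample_ntt(seed: bytes):
    """Conventional rejection sampling of 256 coefficients in [0, Q).

    Returns (coefficients, number_of_168_byte_blocks_consumed).
    """
    coeffs = []
    pos = 0
    blocks = 1
    stream = hashlib.shake_128(seed).digest(RATE * blocks)
    while len(coeffs) < N:
        if pos + 3 > len(stream):
            # Emulate one more squeeze: a longer digest extends the same stream.
            blocks += 1
            stream = hashlib.shake_128(seed).digest(RATE * blocks)
        b0, b1, b2 = stream[pos], stream[pos + 1], stream[pos + 2]
        pos += 3
        d1 = b0 + 256 * (b1 % 16)    # low 12 bits of the 24-bit chunk
        d2 = (b1 // 16) + 16 * b2    # high 12 bits of the 24-bit chunk
        if d1 < Q:
            coeffs.append(d1)
        if d2 < Q and len(coeffs) < N:
            coeffs.append(d2)
    return coeffs, blocks


coeffs, squeezes = sample_ntt(bytes(34))  # hypothetical all-zero seed
```

Two squeezes yield 336 bytes, or at most 224 twelve-bit candidates, so this baseline can never fill 256 coefficients in fewer than three squeezes. That matches the abstract's remark that existing methods never succeed in two.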

Title: Context-Aware Online Conformal Anomaly Detection with Prediction-Powered Data Acquisition

Authors: Amirmohammad Farzaneh, Osvaldo Simeone
Subjects: cs.LG, cs.IT, stat.ML
Abstract URL: https://arxiv.org/abs/2505.01783
Pdf URL: https://arxiv.org/pdf/2505.01783
Copy Paste: [[2505.01783]] Context-Aware Online Conformal Anomaly Detection with Prediction-Powered Data Acquisition(https://arxiv.org/abs/2505.01783)
Keywords: security
Abstract: Online anomaly detection is essential in fields such as cybersecurity, healthcare, and industrial monitoring, where promptly identifying deviations from expected behavior can avert critical failures or security breaches. While numerous anomaly scoring methods based on supervised or unsupervised learning have been proposed, current approaches typically rely on a continuous stream of real-world calibration data to provide assumption-free guarantees on the false discovery rate (FDR). To address the inherent challenges posed by limited real calibration data, we introduce context-aware prediction-powered conformal online anomaly detection (C-PP-COAD). Our framework strategically leverages synthetic calibration data to mitigate data scarcity, while adaptively integrating real data based on contextual cues. C-PP-COAD utilizes conformal p-values, active p-value statistics, and online FDR control mechanisms to maintain rigorous and reliable anomaly detection performance over time. Experiments conducted on both synthetic and real-world datasets demonstrate that C-PP-COAD significantly reduces dependency on real calibration data without compromising guaranteed FDR control.
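
Illustrative sketch (not from the paper): the abstract builds on conformal p-values. A standard conformal p-value compares a test point's anomaly score against scores from calibration data known to be normal. The helper below shows only that basic building block. The function name and the example scores are hypothetical, and the paper's synthetic-data correction, active p-values, and online FDR control are not reproduced.

```python
def conformal_p_value(calibration_scores, test_score):
    """Standard conformal p-value, where higher scores mean more anomalous.

    p = (1 + #{calibration scores >= test score}) / (n + 1)
    """
    n = len(calibration_scores)
    at_least_as_extreme = sum(1 for s in calibration_scores if s >= test_score)
    return (1 + at_least_as_extreme) / (n + 1)


# Hypothetical anomaly scores from normal calibration data.
calibration = [0.1, 0.2, 0.3, 0.4]

p_typical = conformal_p_value(calibration, 0.05)  # looks normal
p_unusual = conformal_p_value(calibration, 0.5)   # more extreme than all calibration points
```

A small p-value marks a point as a candidate anomaly. An online FDR procedure then decides, over time, which p-values are small enough to flag.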

Title: Privacy Preserving Machine Learning Model Personalization through Federated Personalized Learning

Authors: Md. Tanzib Hosain, Asif Zaman, Md. Shahriar Sajid, Shadman Sakeeb Khan, Shanjida Akter
Subjects: cs.LG, cs.CR, cs.DC
Abstract URL: https://arxiv.org/abs/2505.01788
Pdf URL: https://arxiv.org/pdf/2505.01788
Copy Paste: [[2505.01788]] Privacy Preserving Machine Learning Model Personalization through Federated Personalized Learning(https://arxiv.org/abs/2505.01788)
Keywords: privacy, federate
Abstract: The widespread adoption of Artificial Intelligence (AI) has been driven by significant advances in intelligent system research. However, this progress has raised concerns about data privacy, leading to a growing awareness of the need for privacy-preserving AI. In response, there has been a seismic shift in interest towards Federated Learning (FL), the leading paradigm for training Machine Learning (ML) models on decentralized data silos while maintaining data privacy. This research paper presents a comprehensive performance analysis of a cutting-edge approach to personalizing ML models while preserving privacy, achieved through Privacy Preserving Machine Learning with the innovative framework of Federated Personalized Learning (PPMLFPL). In light of the increasing concerns about data privacy, this study evaluates the effectiveness of PPMLFPL in addressing the critical balance between personalized model refinement and maintaining the confidentiality of individual user data. According to our analysis, Adaptive Personalized Cross-Silo Federated Learning with Differential Privacy (APPLE+DP) offers efficient execution, whereas overall, the use of the Adaptive Personalized Cross-Silo Federated Learning with Homomorphic Encryption (APPLE+HE) algorithm for privacy-preserving machine learning tasks in federated personalized learning settings is strongly suggested. The results offer valuable insights, making this a promising direction for future advancements in the field of privacy-conscious data-driven technologies.

Title: A Multimodal Framework for Explainable Evaluation of Soft Skills in Educational Environments

Authors: Jared D.T. Guerrero-Sosa, Francisco P. Romero, Víctor Hugo Menéndez-Domínguez, Jesus Serrano-Guerrero, Andres Montoro-Montarroso, Jose A. Olivas
Subjects: cs.CL, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2505.01794
Pdf URL: https://arxiv.org/pdf/2505.01794
Copy Paste: [[2505.01794]] A Multimodal Framework for Explainable Evaluation of Soft Skills in Educational Environments(https://arxiv.org/abs/2505.01794)
Keywords: interpretability
Abstract: In the rapidly evolving educational landscape, the unbiased assessment of soft skills is a significant challenge, particularly in higher education. This paper presents a fuzzy logic approach that employs a Granular Linguistic Model of Phenomena integrated with multimodal analysis to evaluate soft skills in undergraduate students. By leveraging computational perceptions, this approach enables a structured breakdown of complex soft skill expressions, capturing nuanced behaviours with high granularity and addressing their inherent uncertainties, thereby enhancing interpretability and reliability. Experiments were conducted with undergraduate students using a developed tool that assesses soft skills such as decision-making, communication, and creativity. This tool identifies and quantifies subtle aspects of human interaction, such as facial expressions and gesture recognition. The findings reveal that the framework effectively consolidates multiple data inputs to produce meaningful and consistent assessments of soft skills, showing that integrating multiple modalities into the evaluation process significantly improves the quality of soft skills scores, making the assessment work transparent and understandable to educational stakeholders.

Title: Distinguishing AI-Generated and Human-Written Text Through Psycholinguistic Analysis

Authors: Chidimma Opara
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.01800
Pdf URL: https://arxiv.org/pdf/2505.01800
Copy Paste: [[2505.01800]] Distinguishing AI-Generated and Human-Written Text Through Psycholinguistic Analysis(https://arxiv.org/abs/2505.01800)
Keywords: generative
Abstract: The increasing sophistication of AI-generated texts highlights the urgent need for accurate and transparent detection tools, especially in educational settings, where verifying authorship is essential. Existing literature has demonstrated that the application of stylometric features with machine learning classifiers can yield excellent results. Building on this foundation, this study proposes a comprehensive framework that integrates stylometric analysis with psycholinguistic theories, offering a clear and interpretable approach to distinguishing between AI-generated and human-written texts. This research specifically maps 31 distinct stylometric features to cognitive processes such as lexical retrieval, discourse planning, cognitive load management, and metacognitive self-monitoring. In doing so, it highlights the unique psycholinguistic patterns found in human writing. Through the intersection of computational linguistics and cognitive science, this framework contributes to the development of reliable tools aimed at preserving academic integrity in the era of generative AI.

Title: Not Every Tree Is a Forest: Benchmarking Forest Types from Satellite Remote Sensing

Authors: Yuchang Jiang, Maxim Neumann
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01805
Pdf URL: https://arxiv.org/pdf/2505.01805
Copy Paste: [[2505.01805]] Not Every Tree Is a Forest: Benchmarking Forest Types from Satellite Remote Sensing(https://arxiv.org/abs/2505.01805)
Keywords: transformer, segmentation
Abstract: Developing accurate and reliable models for forest types mapping is critical to support efforts for halting deforestation and for biodiversity conservation (such as European Union Deforestation Regulation (EUDR)). This work introduces ForTy, a benchmark for global-scale FORest TYpes mapping using multi-temporal satellite data. The benchmark comprises 200,000 time series of image patches, each consisting of Sentinel-2, Sentinel-1, climate, and elevation data. Each time series captures variations at monthly or seasonal cadence. Per-pixel annotations, including forest types and other land use classes, support image segmentation tasks. Unlike most existing land use products that often categorize all forest areas into a single class, our benchmark differentiates between three forest type classes: natural forest, planted forest, and tree crops. By leveraging multiple public data sources, we achieve global coverage with this benchmark. We evaluate the forest types dataset using several baseline models, including convolutional neural networks and transformer-based models. Additionally, we propose a novel transformer-based model specifically designed to handle multi-modal, multi-temporal satellite data for forest types mapping. Our experimental results demonstrate that the proposed model surpasses the baseline models in performance.

Title: Conformal Prediction for Indoor Positioning with Correctness Coverage Guarantees

Authors: Zhiyi Zhou, Hexin Peng, Hongyu Long
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.01810
Pdf URL: https://arxiv.org/pdf/2505.01810
Copy Paste: [[2505.01810]] Conformal Prediction for Indoor Positioning with Correctness Coverage Guarantees(https://arxiv.org/abs/2505.01810)
Keywords: interpretability
Abstract: With the advancement of Internet of Things (IoT) technologies, high-precision indoor positioning has become essential for Location-Based Services (LBS) in complex indoor environments. Fingerprint-based localization is popular, but traditional algorithms and deep learning-based methods face challenges such as poor generalization, overfitting, and lack of interpretability. This paper applies conformal prediction (CP) to deep learning-based indoor positioning. CP transforms the uncertainty of the model into a non-conformity score, constructs prediction sets to ensure correctness coverage, and provides statistical guarantees. We also introduce conformal risk control for path navigation tasks to manage the false discovery rate (FDR) and the false negative rate (FNR). The model achieved an accuracy of approximately 100% on the training dataset and 85% on the testing dataset, effectively demonstrating its performance and generalization capability. Furthermore, we also develop a conformal p-value framework to control the proportion of position-error points. Experiments on the UJIIndoorLoc dataset using lightweight models such as MobileNetV1, VGG19, MobileNetV2, ResNet50, and EfficientNet show that the conformal prediction technique can effectively approximate the target coverage, and different models have different performance in terms of prediction set size and uncertainty quantification.
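
Illustrative sketch (not from the paper): the abstract describes turning model uncertainty into a non-conformity score and building prediction sets with a coverage guarantee. A common way to do this is split conformal prediction with the score 1 - p(true class). Whether the paper uses exactly this score is an assumption, and the function names and toy probabilities below are hypothetical.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Score quantile giving at least 1 - alpha marginal coverage.

    Non-conformity score of a calibration sample: 1 - predicted prob of its true class.
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if k > n:
        return 1.0  # too few calibration points: include every class
    return scores[k - 1]


def prediction_set(probs, qhat):
    """All classes whose non-conformity score does not exceed the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]


# Toy calibration data: softmax outputs over two candidate locations.
cal_probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
cal_labels = [0, 0, 1, 0]

qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.5)
confident_set = prediction_set([0.75, 0.25], qhat)
```

A smaller `alpha` raises the threshold and enlarges the sets. With only four calibration points, `alpha=0.1` already forces every class into the set, which shows how calibration size limits the achievable coverage level.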

Title: Backdoor Attacks Against Patch-based Mixture of Experts

Authors: Cedric Chan, Jona te Lintelo, Stjepan Picek
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.01811
Pdf URL: https://arxiv.org/pdf/2505.01811
Copy Paste: [[2505.01811]] Backdoor Attacks Against Patch-based Mixture of Experts(https://arxiv.org/abs/2505.01811)
Keywords: security, defense, attack
Abstract: As Deep Neural Networks (DNNs) continue to require larger amounts of data and computational power, Mixture of Experts (MoE) models have become a popular choice to reduce computational complexity. This popularity increases the importance of considering the security of MoE architectures. Unfortunately, the security of models using a MoE architecture has not yet gained much attention compared to other DNN models. In this work, we investigate the vulnerability of patch-based MoE (pMoE) models for image classification against backdoor attacks. We examine multiple trigger generation methods and Fine-Pruning as a defense. To better understand a pMoE model's vulnerability to backdoor attacks, we investigate which factors affect the model's patch selection. Our work shows that pMoE models are highly susceptible to backdoor attacks. More precisely, we achieve high attack success rates of up to 100% with visible triggers and a 2% poisoning rate, whilst only having a clean accuracy drop of 1.0%. Additionally, we show that pruning itself is ineffective as a defense but that fine-tuning can remove the backdoor almost completely. Our results show that fine-tuning the model for five epochs reduces the attack success rate to 2.1% whilst sacrificing 1.4% accuracy.

Title: $\textit{New News}$: System-2 Fine-tuning for Robust Integration of New Knowledge

Authors: Core Francisco Park, Zechen Zhang, Hidenori Tanaka
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.01812
Pdf URL: https://arxiv.org/pdf/2505.01812
Copy Paste: [[2505.01812]] $\textit{New News}$: System-2 Fine-tuning for Robust Integration of New Knowledge(https://arxiv.org/abs/2505.01812)
Keywords: robust, large language model
Abstract: Humans and intelligent animals can effortlessly internalize new information ("news") and accurately extract the implications for performing downstream tasks. While large language models (LLMs) can achieve this through in-context learning (ICL) when the news is explicitly given as context, fine-tuning remains challenging for the models to consolidate learning in weights. In this paper, we introduce $\textit{New News}$, a dataset composed of hypothetical yet plausible news spanning multiple domains (mathematics, coding, discoveries, leaderboards, events), accompanied by downstream evaluation questions whose correct answers critically depend on understanding and internalizing the news. We first demonstrate a substantial gap between naive fine-tuning and in-context learning (FT-ICL gap) on our news dataset. To address this gap, we explore a suite of self-play data generation protocols -- paraphrases, implications and Self-QAs -- designed to distill the knowledge from the model with context into the weights of the model without the context, which we term $\textit{System-2 Fine-tuning}$ (Sys2-FT). We systematically evaluate ICL and Sys2-FT performance across data domains and model scales with the Qwen 2.5 family of models. Our results demonstrate that the self-QA protocol of Sys2-FT significantly improves models' in-weight learning of the news. Furthermore, we discover the $\textit{contextual shadowing effect}$, where training with the news $\textit{in context}$ followed by its rephrases or QAs degrades learning of the news. Finally, we show preliminary evidence of an emerging scaling law of Sys2-FT.

Title: Rogue Cell: Adversarial Attack and Defense in Untrusted O-RAN Setup Exploiting the Traffic Steering xApp

Authors: Eran Aizikovich, Dudu Mimran, Edita Grolman, Yuval Elovici, Asaf Shabtai
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.01816
Pdf URL: https://arxiv.org/pdf/2505.01816
Copy Paste: [[2505.01816]] Rogue Cell: Adversarial Attack and Defense in Untrusted O-RAN Setup Exploiting the Traffic Steering xApp(https://arxiv.org/abs/2505.01816)
Keywords: security, defense, attack, fair
Abstract: The Open Radio Access Network (O-RAN) architecture is revolutionizing cellular networks with its open, multi-vendor design and AI-driven management, aiming to enhance flexibility and reduce costs. Although it has many advantages, O-RAN is not threat-free. While previous studies have mainly examined vulnerabilities arising from O-RAN's intelligent components, this paper is the first to focus on the security challenges and vulnerabilities introduced by transitioning from single-operator to multi-operator RAN architectures. This shift increases the risk of untrusted third-party operators managing different parts of the network. To explore these vulnerabilities and their potential mitigation, we developed an open-access testbed environment that integrates a wireless network simulator with the official O-RAN Software Community (OSC) RAN intelligent component (RIC) cluster. This environment enables realistic, live data collection and serves as a platform for demonstrating APATE (adversarial perturbation against traffic efficiency), an evasion attack in which a malicious cell manipulates its reported key performance indicators (KPIs) and deceives the O-RAN traffic steering to gain unfair allocations of user equipment (UE). To ensure that O-RAN's legitimate activity continues, we introduce MARRS (monitoring adversarial RAN reports), a detection framework based on a long short-term memory (LSTM) autoencoder (AE) that learns contextual features across the network to monitor malicious telemetry (also demonstrated in our testbed). Our evaluation showed that by executing APATE, an attacker can obtain a 248.5% greater UE allocation than it would receive in a benign scenario. In addition, the MARRS detection method was also shown to successfully classify malicious cell activity, achieving an accuracy of 99.2% and an F1 score of 0.978.

Title: An LSTM-PINN Hybrid Method to the specific problem of population forecasting

Authors: Ze Tao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.01819
Pdf URL: https://arxiv.org/pdf/2505.01819
Copy Paste: [[2505.01819]] An LSTM-PINN Hybrid Method to the specific problem of population forecasting(https://arxiv.org/abs/2505.01819)
Keywords: robust
Abstract: Deep learning has emerged as a powerful tool in scientific modeling, particularly for complex dynamical systems; however, accurately capturing age-structured population dynamics under policy-driven fertility changes remains a significant challenge due to the lack of effective integration between domain knowledge and long-term temporal dependencies. To address this issue, we propose two physics-informed deep learning frameworks--PINN and LSTM-PINN--that incorporate policy-aware fertility functions into a transport-reaction partial differential equation to simulate population evolution from 2024 to 2054. The standard PINN model enforces the governing equation and boundary conditions via collocation-based training, enabling accurate learning of underlying population dynamics and ensuring stable convergence. Building on this, the LSTM-PINN framework integrates sequential memory mechanisms to effectively capture long-range dependencies in the age-time domain, achieving robust training performance across multiple loss components. Simulation results under three distinct fertility policy scenarios--the Three-child policy, the Universal two-child policy, and the Separate two-child policy--demonstrate the models' ability to reflect policy-sensitive demographic shifts and highlight the effectiveness of integrating domain knowledge into data-driven forecasting. This study provides a novel and extensible framework for modeling age-structured population dynamics under policy interventions, offering valuable insights for data-informed demographic forecasting and long-term policy planning in the face of emerging population challenges.
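
Illustrative note (not from the paper): the abstract does not state the governing equation. A common transport-reaction model for age-structured populations is the McKendrick-von Foerster equation, shown below as an assumed example of the kind of PDE a PINN residual could enforce. Here p(a,t) is the population density at age a and time t, \mu is a mortality rate, \beta is a fertility function (the term that a policy scenario would modify), and a_max is a maximum age. All of these symbols are assumptions introduced for illustration.

```latex
\frac{\partial p(a,t)}{\partial t} + \frac{\partial p(a,t)}{\partial a} = -\mu(a,t)\,p(a,t),
\qquad
p(0,t) = \int_{0}^{a_{\max}} \beta(a,t)\,p(a,t)\,\mathrm{d}a,
\qquad
p(a,0) = p_0(a).
```

In this form the left-hand side transports the population along the age axis as time advances, the right-hand side removes individuals through mortality, and the boundary condition at a = 0 supplies newborns.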

Title: Analytic Energy-Guided Policy Optimization for Offline Reinforcement Learning

Authors: Jifeng Hu, Sili Huang, Zhejian Yang, Shengchao Hu, Li Shen, Hechang Chen, Lichao Sun, Yi Chang, Dacheng Tao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01822
Pdf URL: https://arxiv.org/pdf/2505.01822
Copy Paste: [[2505.01822]] Analytic Energy-Guided Policy Optimization for Offline Reinforcement Learning(https://arxiv.org/abs/2505.01822)
Keywords: diffusion
Abstract: Conditional decision generation with diffusion models has shown powerful competitiveness in reinforcement learning (RL). Recent studies reveal the relation between energy-function-guidance diffusion models and constrained RL problems. The main challenge lies in estimating the intermediate energy, which is intractable due to the log-expectation formulation during the generation process. To address this issue, we propose the Analytic Energy-guided Policy Optimization (AEPO). Specifically, we first provide a theoretical analysis and the closed-form solution of the intermediate guidance when the diffusion model obeys the conditional Gaussian transformation. Then, we analyze the posterior Gaussian distribution in the log-expectation formulation and obtain the target estimation of the log-expectation under mild assumptions. Finally, we train an intermediate energy neural network to approach the target estimation of log-expectation formulation. We apply our method in 30+ offline RL tasks to demonstrate the effectiveness of our method. Extensive experiments illustrate that our method surpasses numerous representative baselines in D4RL offline reinforcement learning benchmarks.

Title: PhytoSynth: Leveraging Multi-modal Generative Models for Crop Disease Data Generation with Novel Benchmarking and Prompt Engineering Approach

Authors: Nitin Rai, Arnold W. Schumann, Nathan Boyd
Subjects: cs.CV, cs.AI, cs.ET
Abstract URL: https://arxiv.org/abs/2505.01823
Pdf URL: https://arxiv.org/pdf/2505.01823
Copy Paste: [[2505.01823]] PhytoSynth: Leveraging Multi-modal Generative Models for Crop Disease Data Generation with Novel Benchmarking and Prompt Engineering Approach(https://arxiv.org/abs/2505.01823)
Keywords: diffusion, generative
Abstract: Collecting large-scale crop disease images in the field is labor-intensive and time-consuming. Generative models (GMs) offer an alternative by creating synthetic samples that resemble real-world images. However, existing research primarily relies on Generative Adversarial Networks (GANs)-based image-to-image translation and lacks a comprehensive analysis of computational requirements in agriculture. Therefore, this research explores a multi-modal text-to-image approach for generating synthetic crop disease images and is the first to provide computational benchmarking in this context. We trained three Stable Diffusion (SD) variants, namely SDXL, SD3.5M (medium), and SD3.5L (large), and fine-tuned them using Dreambooth and Low-Rank Adaptation (LoRA) fine-tuning techniques to enhance generalization. SD3.5M outperformed the others, with an average memory usage of 18 GB, power consumption of 180 W, and total energy use of 1.02 kWh/500 images (0.002 kWh per image) during inference. Our results demonstrate SD3.5M's ability to generate 500 synthetic images from just 36 in-field samples in 1.5 hours. We recommend SD3.5M for efficient crop disease data generation.

Title: CVVNet: A Cross-Vertical-View Network for Gait Recognition

Authors: Xiangru Li, Wei Song, Yingda Huang, Wei Meng, Le Chang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01837
Pdf URL: https://arxiv.org/pdf/2505.01837
Copy Paste: [[2505.01837]] CVVNet: A Cross-Vertical-View Network for Gait Recognition(https://arxiv.org/abs/2505.01837)
Keywords: robust, extraction
Abstract: Gait recognition enables contact-free, long-range person identification that is robust to clothing variations and non-cooperative scenarios. While existing methods perform well in controlled indoor environments, they struggle with cross-vertical view scenarios, where surveillance angles vary significantly in elevation. Our experiments show up to 60% accuracy degradation in low-to-high vertical view settings due to severe deformations and self-occlusions of key anatomical features. Current CNN and self-attention-based methods fail to effectively handle these challenges, due to their reliance on single-scale convolutions or simplistic attention mechanisms that lack effective multi-frequency feature integration. To tackle this challenge, we propose CVVNet (Cross-Vertical-View Network), a frequency aggregation architecture specifically designed for robust cross-vertical-view gait recognition. CVVNet employs a High-Low Frequency Extraction module (HLFE) that adopts a parallel multi-scale convolution/max-pooling path and a self-attention path as high- and low-frequency mixers for effective multi-frequency feature extraction from input silhouettes. We also introduce the Dynamic Gated Aggregation (DGA) mechanism to adaptively adjust the fusion ratio of high- and low-frequency features. The integration of our core Multi-Scale Attention Gated Aggregation (MSAGA) module, HLFE and DGA enables CVVNet to effectively handle distortions from view changes, significantly improving the recognition robustness across different vertical views. Experimental results show that our CVVNet achieves state-of-the-art performance, with 8.6% improvement on DroneGait and 2% on Gait3D compared with the best existing methods.

Title: MVHumanNet++: A Large-scale Dataset of Multi-view Daily Dressing Human Captures with Richer Annotations for 3D Human Digitization

Authors: Chenghong Li, Hongjie Liao, Yihao Zhi, Xihe Yang, Zhengwentai Sun, Jiahao Chang, Shuguang Cui, Xiaoguang Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01838
Pdf URL: https://arxiv.org/pdf/2505.01838
Copy Paste: [[2505.01838]] MVHumanNet++: A Large-scale Dataset of Multi-view Daily Dressing Human Captures with Richer Annotations for 3D Human Digitization(https://arxiv.org/abs/2505.01838)
Keywords: large language model
Abstract: In this era, the success of large language models and text-to-image models can be attributed to the driving force of large-scale datasets. However, in the realm of 3D vision, while significant progress has been achieved in object-centric tasks through large-scale datasets like Objaverse and MVImgNet, human-centric tasks have seen limited advancement, largely due to the absence of a comparable large-scale human dataset. To bridge this gap, we present MVHumanNet++, a dataset that comprises multi-view human action sequences of 4,500 human identities. The primary focus of our work is on collecting human data that features a large number of diverse identities and everyday clothing using multi-view human capture systems, which facilitates easily scalable data collection. Our dataset contains 9,000 daily outfits, 60,000 motion sequences and 645 million frames with extensive annotations, including human masks, camera parameters, 2D and 3D keypoints, SMPL/SMPLX parameters, and corresponding textual descriptions. Additionally, the proposed MVHumanNet++ dataset is enhanced with newly processed normal maps and depth maps, significantly expanding its applicability and utility for advanced human-centric research. To explore the potential of our proposed MVHumanNet++ dataset in various 2D and 3D visual tasks, we conducted several pilot studies to demonstrate the performance improvements and effective applications enabled by the scale provided by MVHumanNet++. As the current largest-scale 3D human dataset, we hope that the release of MVHumanNet++ dataset with annotations will foster further innovations in the domain of 3D human-centric tasks at scale. MVHumanNet++ is publicly available at this https URL.

Title: Mitigating Group-Level Fairness Disparities in Federated Visual Language Models

Authors: Chaomeng Chen, Zitong Yu, Junhao Dong, Sen Su, Linlin Shen, Shutao Xia, Xiaochun Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01851
Pdf URL: https://arxiv.org/pdf/2505.01851
Copy Paste: [[2505.01851]] Mitigating Group-Level Fairness Disparities in Federated Visual Language Models(https://arxiv.org/abs/2505.01851)
Keywords: privacy, federate, fair
Abstract: Visual language models (VLMs) have shown remarkable capabilities in multimodal tasks but face challenges in maintaining fairness across demographic groups, particularly when deployed in federated learning (FL) environments. This paper addresses the critical issue of group fairness in federated VLMs by introducing FVL-FP, a novel framework that combines FL with fair prompt tuning techniques. We focus on mitigating demographic biases while preserving model performance through three innovative components: (1) Cross-Layer Demographic Fair Prompting (CDFP), which adjusts potentially biased embeddings through counterfactual regularization; (2) Demographic Subspace Orthogonal Projection (DSOP), which removes demographic bias in image representations by mapping fair prompt text to group subspaces; and (3) Fair-aware Prompt Fusion (FPF), which dynamically balances client contributions based on both performance and fairness metrics. Extensive evaluations across four benchmark datasets demonstrate that our approach reduces demographic disparity by an average of 45% compared to standard FL approaches, while maintaining task performance within 6% of state-of-the-art results. FVL-FP effectively addresses the challenges of non-IID data distributions in federated settings and introduces minimal computational overhead while providing significant fairness benefits. Our work presents a parameter-efficient solution to the critical challenge of ensuring equitable performance across demographic groups in privacy-preserving multimodal systems.

Title: Intra-Layer Recurrence in Transformers for Language Modeling

Authors: Anthony Nguyen, Wenjun Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01855
Pdf URL: https://arxiv.org/pdf/2505.01855
Copy Paste: [[2505.01855]] Intra-Layer Recurrence in Transformers for Language Modeling(https://arxiv.org/abs/2505.01855)
Keywords: transformer
Abstract: Transformer models have established new benchmarks in natural language processing; however, their increasing depth results in substantial growth in parameter counts. While existing recurrent transformer methods address this issue by reprocessing layers multiple times, they often apply recurrence indiscriminately across entire blocks of layers. In this work, we investigate Intra-Layer Recurrence (ILR), a more targeted approach that applies recurrence selectively to individual layers within a single forward pass. Our experiments show that allocating more iterations to earlier layers yields optimal results. These findings suggest that ILR offers a promising direction for optimizing recurrent structures in transformer architectures.
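
The idea of giving individual layers their own number of iterations within one forward pass can be sketched as follows. This is a minimal illustration, not the authors' implementation: the name `reuse_map`, the toy affine "layers", and the chosen iteration counts are all assumptions made for the example.

```python
from typing import Callable, List, Sequence

# A "layer" here is any function from a hidden state to a hidden state.
Layer = Callable[[List[float]], List[float]]


def forward_with_ilr(x: List[float], layers: Sequence[Layer],
                     reuse_map: Sequence[int]) -> List[float]:
    """Apply layer i exactly reuse_map[i] times before moving to layer i+1.

    `reuse_map` is a hypothetical name for the per-layer iteration counts.
    """
    if len(layers) != len(reuse_map):
        raise ValueError("need one iteration count per layer")
    for layer, n_iter in zip(layers, reuse_map):
        for _ in range(n_iter):
            x = layer(x)  # same weights reused on each iteration
    return x


def make_toy_layer(scale: float, shift: float) -> Layer:
    """Toy stand-in for a transformer block: an elementwise affine map."""
    def layer(h: List[float]) -> List[float]:
        return [scale * v + shift for v in h]
    return layer


layers = [make_toy_layer(0.5, 1.0), make_toy_layer(2.0, 0.0), make_toy_layer(1.0, -1.0)]
reuse_map = [2, 1, 1]  # assumed allocation: extra iteration on the earliest layer
out = forward_with_ilr([4.0], layers, reuse_map)
effective_depth = sum(reuse_map)  # layer applications without extra parameters
```

With three parameterized layers and `reuse_map = [2, 1, 1]`, the input passes through four layer applications, which is the sense in which recurrence adds depth without adding parameters.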

Title: DualDiff: Dual-branch Diffusion Model for Autonomous Driving with Semantic Fusion

Authors: Haoteng Li, Zhao Yang, Zezhong Qian, Gongpeng Zhao, Yuqi Huang, Jun Yu, Huazheng Zhou, Longjun Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01857
Pdf URL: https://arxiv.org/pdf/2505.01857
Copy Paste: [[2505.01857]] DualDiff: Dual-branch Diffusion Model for Autonomous Driving with Semantic Fusion(https://arxiv.org/abs/2505.01857)
Keywords: diffusion, segmentation
Abstract: Accurate and high-fidelity driving scene reconstruction relies on fully leveraging scene information as conditioning. However, existing approaches, which primarily use 3D bounding boxes and binary maps for foreground and background control, fall short in capturing the complexity of the scene and integrating multi-modal information. In this paper, we propose DualDiff, a dual-branch conditional diffusion model designed to enhance multi-view driving scene generation. We introduce Occupancy Ray Sampling (ORS), a semantic-rich 3D representation, alongside numerical driving scene representation, for comprehensive foreground and background control. To improve cross-modal information integration, we propose a Semantic Fusion Attention (SFA) mechanism that aligns and fuses features across modalities. Furthermore, we design a foreground-aware masked (FGM) loss to enhance the generation of tiny objects. DualDiff achieves state-of-the-art performance in FID score, as well as consistently better results in downstream BEV segmentation and 3D object detection tasks.

Title: PQS-BFL: A Post-Quantum Secure Blockchain-based Federated Learning Framework

Authors: Daniel Commey, Garth V. Crosby
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.01866
Pdf URL: https://arxiv.org/pdf/2505.01866
Copy Paste: [[2505.01866]] PQS-BFL: A Post-Quantum Secure Blockchain-based Federated Learning Framework(https://arxiv.org/abs/2505.01866)
Keywords: secure, security, privacy, attack, federate
Abstract: Federated Learning (FL) enables collaborative model training while preserving data privacy, but its classical cryptographic underpinnings are vulnerable to quantum attacks. This vulnerability is particularly critical in sensitive domains like healthcare. This paper introduces PQS-BFL (Post-Quantum Secure Blockchain-based Federated Learning), a framework integrating post-quantum cryptography (PQC) with blockchain verification to secure FL against quantum adversaries. We employ ML-DSA-65 (a FIPS 204 standard candidate, formerly Dilithium) signatures to authenticate model updates and leverage optimized smart contracts for decentralized validation. Extensive evaluations on diverse datasets (MNIST, SVHN, HAR) demonstrate that PQS-BFL achieves efficient cryptographic operations (average PQC sign time: 0.65 ms, verify time: 0.53 ms) with a fixed signature size of 3309 Bytes. Blockchain integration incurs a manageable overhead, with average transaction times around 4.8 s and gas usage per update averaging 1.72 x 10^6 units for PQC configurations. Crucially, the cryptographic overhead relative to transaction time remains minimal (around 0.01-0.02% for PQC with blockchain), confirming that PQC performance is not the bottleneck in blockchain-based FL. The system maintains competitive model accuracy (e.g., over 98.8% for MNIST with PQC) and scales effectively, with round times showing sublinear growth with increasing client numbers. Our open-source implementation and reproducible benchmarks validate the feasibility of deploying long-term, quantum-resistant security in practical FL systems.
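
The overhead figure can be sanity-checked from the averages quoted in the abstract. The sketch below assumes the ratio is simply operation time divided by average transaction time; the abstract does not say exactly how the reported 0.01-0.02% range was computed, so this is one plausible reading only.

```python
# Averages quoted in the abstract.
SIGN_MS = 0.65      # average PQC sign time, milliseconds
VERIFY_MS = 0.53    # average PQC verify time, milliseconds
TX_SECONDS = 4.8    # average blockchain transaction time, seconds


def overhead_percent(op_ms: float, tx_seconds: float) -> float:
    """Cryptographic time as a percentage of transaction time (assumed definition)."""
    return 100.0 * (op_ms / 1000.0) / tx_seconds


sign_pct = overhead_percent(SIGN_MS, TX_SECONDS)
verify_pct = overhead_percent(VERIFY_MS, TX_SECONDS)
combined_pct = overhead_percent(SIGN_MS + VERIFY_MS, TX_SECONDS)

print(f"sign: {sign_pct:.4f}%  verify: {verify_pct:.4f}%  combined: {combined_pct:.4f}%")
```

Under this reading each operation alone costs roughly 0.01% of a transaction, and signing plus verifying together stays below 0.03%, which is consistent with the claim that the cryptography is not the bottleneck.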

Title: Positional Attention for Efficient BERT-Based Named Entity Recognition

Authors: Mo Sun, Siheng Xiong, Yuankai Cai, Bowen Zuo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.01868
Pdf URL: https://arxiv.org/pdf/2505.01868
Copy Paste: [[2505.01868]] Positional Attention for Efficient BERT-Based Named Entity Recognition(https://arxiv.org/abs/2505.01868)
Keywords: transformer
Abstract: This paper presents a framework for Named Entity Recognition (NER) leveraging the Bidirectional Encoder Representations from Transformers (BERT) model in natural language processing (NLP). NER is a fundamental task in NLP with broad applicability across downstream applications. While BERT has established itself as a state-of-the-art model for entity recognition, fine-tuning it from scratch for each new application is computationally expensive and time-consuming. To address this, we propose a cost-efficient approach that integrates positional attention mechanisms into the entity recognition process and enables effective customization using pre-trained parameters. The framework is evaluated on a Kaggle dataset derived from the Groningen Meaning Bank corpus and achieves strong performance with fewer training epochs. This work contributes to the field by offering a practical solution for reducing the training cost of BERT-based NER systems while maintaining high accuracy.

Title: An Approach for Handling Missing Attribute Values in Attribute-Based Access Control Policy Mining

Authors: Thang Bui, Elliot Shabram, Anthony Matricia
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.01873
Pdf URL: https://arxiv.org/pdf/2505.01873
Copy Paste: [[2505.01873]] An Approach for Handling Missing Attribute Values in Attribute-Based Access Control Policy Mining(https://arxiv.org/abs/2505.01873)
Keywords: security
Abstract: Attribute-Based Access Control (ABAC) enables highly expressive and flexible access decisions by considering a wide range of contextual attributes. ABAC policies use logical expressions that combine these attributes, allowing for precise and context-aware control. Algorithms that mine ABAC policies from legacy access control systems can significantly reduce the costs associated with migrating to ABAC. However, a major challenge in this process is handling incomplete entity information, where some attribute values are missing. This paper introduces an approach that enhances the policy mining process by predicting or inferring missing attribute values. This is accomplished by employing a contextual clustering technique that groups entities according to their known attributes, which are then used to analyze and refine authorization decisions. By effectively managing incomplete data, our approach provides security administrators with a valuable tool to improve their attribute data and ensure a smoother, more efficient transition to ABAC.

Title: Towards Trustworthy Federated Learning with Untrusted Participants

Authors: Youssef Allouah, Rachid Guerraoui, John Stephan
Subjects: cs.LG, cs.CR, cs.DC
Abstract URL: https://arxiv.org/abs/2505.01874
Pdf URL: https://arxiv.org/pdf/2505.01874
Copy Paste: [[2505.01874]] Towards Trustworthy Federated Learning with Untrusted Participants(https://arxiv.org/abs/2505.01874)
Keywords: privacy, robust, federate
Abstract: Resilience against malicious parties and data privacy are essential for trustworthy distributed learning, yet achieving both with good utility typically requires the strong assumption of a trusted central server. This paper shows that a significantly weaker assumption suffices: each pair of workers shares a randomness seed unknown to others. In a setting where malicious workers may collude with an untrusted server, we propose CafCor, an algorithm that integrates robust gradient aggregation with correlated noise injection, leveraging shared randomness between workers. We prove that CafCor achieves strong privacy-utility trade-offs, significantly outperforming local differential privacy (DP) methods, which do not make any trust assumption, while approaching central DP utility, where the server is fully trusted. Empirical results on standard benchmarks validate CafCor's practicality, showing that privacy and robustness can coexist in distributed systems without sacrificing utility or trusting the server.
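
The role of a pairwise shared seed can be illustrated with a small sketch of correlated noise that cancels in the sum. This is not CafCor itself (the robust aggregation step and the actual noise calibration are not described in the abstract); the function names, seeds, and noise scale below are assumptions chosen for illustration.

```python
import random
from typing import Dict, List, Tuple


def pairwise_noise(seed: int, dim: int, sigma: float) -> List[float]:
    """Noise vector that both workers of a pair can derive from their shared seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(dim)]


def mask_gradient(worker: int, grad: List[float],
                  shared_seeds: Dict[Tuple[int, int], int],
                  sigma: float) -> List[float]:
    """Add +noise for pairs where `worker` is the first member, -noise otherwise."""
    masked = list(grad)
    for (i, j), seed in shared_seeds.items():
        if worker not in (i, j):
            continue
        noise = pairwise_noise(seed, len(grad), sigma)
        sign = 1.0 if worker == i else -1.0
        masked = [m + sign * z for m, z in zip(masked, noise)]
    return masked


# Hypothetical gradients and seeds for three workers.
grads = {0: [1.0, 2.0], 1: [0.5, -1.0], 2: [-0.5, 3.0]}
shared_seeds = {(0, 1): 11, (0, 2): 22, (1, 2): 33}

masked = {w: mask_gradient(w, g, shared_seeds, sigma=5.0) for w, g in grads.items()}
true_sum = [sum(g[k] for g in grads.values()) for k in range(2)]
masked_sum = [sum(m[k] for m in masked.values()) for k in range(2)]
```

Each individual masked gradient is heavily perturbed, yet the pairwise terms cancel when all contributions are summed, so the server sees an aggregate close to the true sum without seeing any single worker's gradient in the clear.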

Title: PhysNav-DG: A Novel Adaptive Framework for Robust VLM-Sensor Fusion in Navigation Applications

Authors: Trisanth Srinivasan, Santosh Patapati
Subjects: cs.CV, cs.AI, cs.LG, cs.MM, cs.RO
Abstract URL: https://arxiv.org/abs/2505.01881
Pdf URL: https://arxiv.org/pdf/2505.01881
Copy Paste: [[2505.01881]] PhysNav-DG: A Novel Adaptive Framework for Robust VLM-Sensor Fusion in Navigation Applications(https://arxiv.org/abs/2505.01881)
Keywords: robust
Abstract: Robust navigation in diverse environments and domains requires both accurate state estimation and transparent decision making. We present PhysNav-DG, a novel framework that integrates classical sensor fusion with the semantic power of vision-language models. Our dual-branch architecture predicts navigation actions from multi-sensor inputs while simultaneously generating detailed chain-of-thought explanations. A modified Adaptive Kalman Filter dynamically adjusts its noise parameters based on environmental context. It leverages several streams of raw sensor data along with semantic insights from models such as LLaMA 3.2 11B and BLIP-2. To evaluate our approach, we introduce the MD-NEX Benchmark, a novel multi-domain dataset that unifies indoor navigation, autonomous driving, and social navigation tasks with ground-truth actions and human-validated explanations. Extensive experiments and ablations show that PhysNav-DG improves navigation success rates by over 20% and achieves high efficiency, with explanations that are both highly grounded and clear. This work connects high-level semantic reasoning and geometric planning for safer and more trustworthy autonomous systems.

Title: CMAWRNet: Multiple Adverse Weather Removal via a Unified Quaternion Neural Architecture

Authors: Vladimir Frants, Sos Agaian, Karen Panetta, Peter Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01882
Pdf URL: https://arxiv.org/pdf/2505.01882
Copy Paste: [[2505.01882]] CMAWRNet: Multiple Adverse Weather Removal via a Unified Quaternion Neural Architecture(https://arxiv.org/abs/2505.01882)
Keywords: robust, transformer
Abstract: Images used in real-world applications such as image or video retrieval, outdoor surveillance, and autonomous driving suffer from poor weather conditions. When designing robust computer vision systems, removing adverse weather such as haze, rain, and snow is a significant problem. Recently, deep-learning methods offered a solution for a single type of degradation. Current state-of-the-art universal methods struggle with combinations of degradations, such as haze and rain-streak. Few algorithms have been developed that perform well when presented with images containing multiple adverse weather conditions. This work focuses on developing an efficient solution for multiple adverse weather removal using a unified quaternion neural architecture called CMAWRNet. It is based on a novel texture-structure decomposition block, a novel lightweight encoder-decoder quaternion transformer architecture, and an attentive fusion block with low-light correction. We also introduce a quaternion similarity loss function to preserve color information better. The quantitative and qualitative evaluation on current state-of-the-art benchmarking datasets and real-world images shows the performance advantages of the proposed CMAWRNet compared to other state-of-the-art weather removal approaches dealing with multiple weather artifacts. Extensive computer simulations validate that CMAWRNet improves the performance of downstream applications such as object detection. This is the first time the decomposition approach has been applied to the universal weather removal task.

Title: Automated Sentiment Classification and Topic Discovery in Large-Scale Social Media Streams

Authors: Yiwen Lu, Siheng Xiong, Zhaowei Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.01883
Pdf URL: https://arxiv.org/pdf/2505.01883
Copy Paste: [[2505.01883]] Automated Sentiment Classification and Topic Discovery in Large-Scale Social Media Streams(https://arxiv.org/abs/2505.01883)
Keywords: robust
Abstract: We present a framework for large-scale sentiment and topic analysis of Twitter discourse. Our pipeline begins with targeted data collection using conflict-specific keywords, followed by automated sentiment labeling via multiple pre-trained models to improve annotation robustness. We examine the relationship between sentiment and contextual features such as timestamp, geolocation, and lexical content. To identify latent themes, we apply Latent Dirichlet Allocation (LDA) on partitioned subsets grouped by sentiment and metadata attributes. Finally, we develop an interactive visualization interface to support exploration of sentiment trends and topic distributions across time and regions. This work contributes a scalable methodology for social media analysis in dynamic geopolitical contexts.

Title: Rethinking Score Distilling Sampling for 3D Editing and Generation

Authors: Xingyu Miao, Haoran Duan, Yang Long, Jungong Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01888
Pdf URL: https://arxiv.org/pdf/2505.01888
Copy Paste: [[2505.01888]] Rethinking Score Distilling Sampling for 3D Editing and Generation(https://arxiv.org/abs/2505.01888)
Keywords: diffusion
Abstract: Score Distillation Sampling (SDS) has emerged as a prominent method for text-to-3D generation by leveraging the strengths of 2D diffusion models. However, SDS is limited to generation tasks and lacks the capability to edit existing 3D assets. Conversely, variants of SDS that introduce editing capabilities often cannot generate new 3D assets effectively. In this work, we observe that the processes of generation and editing within SDS and its variants have unified underlying gradient terms. Building on this insight, we propose Unified Distillation Sampling (UDS), a method that seamlessly integrates both the generation and editing of 3D assets. Essentially, UDS refines the gradient terms used in vanilla SDS methods, unifying them to support both tasks. Extensive experiments demonstrate that UDS not only outperforms baseline methods in generating 3D assets with richer details but also excels in editing tasks, thereby bridging the gap between 3D generation and editing. The code is available on: this https URL.

Title: OODTE: A Differential Testing Engine for the ONNX Optimizer

Authors: Nikolaos Louloudakis, Ajitha Rajan
Subjects: cs.LG, cs.AI, cs.SE, eess.SY
Abstract URL: https://arxiv.org/abs/2505.01892
Pdf URL: https://arxiv.org/pdf/2505.01892
Copy Paste: [[2505.01892]] OODTE: A Differential Testing Engine for the ONNX Optimizer(https://arxiv.org/abs/2505.01892)
Keywords: segmentation
Abstract: With 700 stars on GitHub and a place in the official ONNX repository, the ONNX Optimizer is the standard method for applying graph-based optimizations to ONNX models. However, its ability to preserve model accuracy across optimizations has not been rigorously explored. We propose OODTE, a utility to automatically and thoroughly assess the correctness of the ONNX Optimizer. OODTE follows a simple yet effective differential testing and evaluation approach that can be easily adapted to other compiler optimizers. In particular, OODTE takes a number of ONNX models, optimizes them, and executes both the original and the optimized variants across a user-defined set of inputs, while automatically logging any issues with the optimization process. Finally, for successfully optimized models, OODTE compares the results and, if any accuracy deviations are observed, iteratively repeats the process for each pass of the ONNX Optimizer to localize the root cause of the differences observed. Using OODTE, we sourced 130 well-known models from the official ONNX Model Hub, used for a wide variety of tasks (classification, object detection, semantic segmentation, text summarization, question answering, sentiment analysis). We detected 15 issues, 14 of which were previously unknown, associated with optimizer crashes and accuracy deviations. We also observed 9.2% of all model instances presenting issues that led to a crash of the optimizer or the generation of an invalid model while using the primary optimizer strategies. In addition, 30% of the classification models presented accuracy differences between the original and the optimized model variants, while 16.6% of semantic segmentation and object detection models were also affected, at least to a limited extent.
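
The differential-testing loop described in this abstract can be sketched in a few lines. The sketch deliberately avoids any real ONNX API: models are plain Python functions, and the pass names, the buggy `lossy_pass`, and the tolerance are all hypothetical stand-ins used only to show the compare-then-localize structure.

```python
from typing import Callable, Dict, List

Model = Callable[[float], float]
Pass = Callable[[Model], Model]


def original_model(x: float) -> float:
    return 2.0 * x + 1.0


def identity_pass(model: Model) -> Model:
    """Stand-in for a semantics-preserving optimization pass."""
    return model


def lossy_pass(model: Model) -> Model:
    """Hypothetical buggy pass: rounds outputs, changing results on some inputs."""
    return lambda x: float(round(model(x)))


def apply_passes(model: Model, passes: List[Pass]) -> Model:
    for p in passes:
        model = p(model)
    return model


def deviates(a: Model, b: Model, inputs: List[float], tol: float = 1e-6) -> bool:
    """True if the two models disagree by more than tol on any input."""
    return any(abs(a(x) - b(x)) > tol for x in inputs)


def localize(model: Model, passes: Dict[str, Pass], inputs: List[float]) -> List[str]:
    """Compare original vs fully optimized; on deviation, re-test each pass alone."""
    optimized = apply_passes(model, list(passes.values()))
    if not deviates(model, optimized, inputs):
        return []
    return [name for name, p in passes.items() if deviates(model, p(model), inputs)]


passes = {"identity": identity_pass, "lossy": lossy_pass}
inputs = [0.0, 0.25, 1.0]
culprits = localize(original_model, passes, inputs)
```

Here only the input 0.25 exposes the deviation, which illustrates why the set of test inputs matters: a pass can look correct on most inputs and still change results on others.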

Title: CAMOUFLAGE: Exploiting Misinformation Detection Systems Through LLM-driven Adversarial Claim Transformation

Authors: Mazal Bethany, Nishant Vishwamitra, Cho-Yu Jason Chiang, Peyman Najafirad
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.01900
Pdf URL: https://arxiv.org/pdf/2505.01900
Copy Paste: [[2505.01900]] CAMOUFLAGE: Exploiting Misinformation Detection Systems Through LLM-driven Adversarial Claim Transformation(https://arxiv.org/abs/2505.01900)
Keywords: attack
Abstract: Automated evidence-based misinformation detection systems, which evaluate the veracity of short claims against evidence, lack comprehensive analysis of their adversarial vulnerabilities. Existing black-box text-based adversarial attacks are ill-suited for evidence-based misinformation detection systems, as these attacks primarily focus on token-level substitutions involving gradient or logit-based optimization strategies, which are incapable of fooling the multi-component nature of these detection systems. These systems incorporate both retrieval and claim-evidence comparison modules, which requires attacks to break the retrieval of evidence and/or the comparison module so that it draws incorrect inferences. We present CAMOUFLAGE, an iterative, LLM-driven approach that employs a two-agent system, a Prompt Optimization Agent and an Attacker Agent, to create adversarial claim rewritings that manipulate evidence retrieval and mislead claim-evidence comparison, effectively bypassing the system without altering the meaning of the claim. The Attacker Agent produces semantically equivalent rewrites that attempt to mislead detectors, while the Prompt Optimization Agent analyzes failed attack attempts and refines the prompt of the Attacker to guide subsequent rewrites. This enables larger structural and stylistic transformations of the text rather than token-level substitutions, adapting the magnitude of changes based on previous outcomes. Unlike existing approaches, CAMOUFLAGE optimizes its attack solely based on binary model decisions to guide its rewriting process, eliminating the need for classifier logits or extensive querying. We evaluate CAMOUFLAGE on four systems, including two recent academic systems and two real-world APIs, with an average attack success rate of 46.92% while preserving textual coherence and semantic equivalence to the original claims.

Title: From Players to Champions: A Generalizable Machine Learning Approach for Match Outcome Prediction with Insights from the FIFA World Cup

Authors: Ali Al-Bustami, Zaid Ghazal
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.01902
Pdf URL: https://arxiv.org/pdf/2505.01902
Copy Paste: [[2505.01902]] From Players to Champions: A Generalizable Machine Learning Approach for Match Outcome Prediction with Insights from the FIFA World Cup(https://arxiv.org/abs/2505.01902)
Keywords: robust
Abstract: Accurate prediction of FIFA World Cup match outcomes holds significant value for analysts, coaches, bettors, and fans. This paper presents a machine learning framework specifically designed to forecast match winners in the FIFA World Cup. By integrating both team-level historical data and player-specific performance metrics such as goals, assists, passing accuracy, and tackles, we capture nuanced interactions often overlooked by traditional aggregate models. Our methodology processes multi-year data to create year-specific team profiles that account for evolving rosters and player development. We employ classification techniques complemented by dimensionality reduction and hyperparameter optimization to yield robust predictive models. Experimental results on data from the FIFA 2022 World Cup demonstrate our approach's superior accuracy compared to a baseline method. Our findings highlight the importance of incorporating individual player attributes and team-level composition to enhance predictive performance, offering new insights into player synergy, strategic match-ups, and tournament progression scenarios. This work underscores the transformative potential of rich, player-centric data in sports analytics, setting a foundation for future exploration of advanced learning architectures such as graph neural networks to model complex team interactions.

Title: LookAlike: Consistent Distractor Generation in Math MCQs

Authors: Nisarg Parikh, Nigel Fernandez, Alexander Scarlatos, Simon Woodhead, Andrew Lan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01903
Pdf URL: https://arxiv.org/pdf/2505.01903
Copy Paste: [[2505.01903]] LookAlike: Consistent Distractor Generation in Math MCQs(https://arxiv.org/abs/2505.01903)
Keywords: large language model
Abstract: Large language models (LLMs) are increasingly used to generate distractors for multiple-choice questions (MCQs), especially in domains like math education. However, existing approaches are limited in ensuring that the generated distractors are consistent with common student errors. We propose LookAlike, a method that improves error-distractor consistency via preference optimization. Our two main innovations are: (a) mining synthetic preference pairs from model inconsistencies, and (b) alternating supervised fine-tuning (SFT) with Direct Preference Optimization (DPO) to stabilize training. Unlike prior work that relies on heuristics or manually annotated preference data, LookAlike uses its own generation inconsistencies as dispreferred samples, thus enabling scalable and stable training. Evaluated on a real-world dataset of 1,400+ math MCQs, LookAlike achieves 51.6% accuracy in distractor generation and 57.2% in error generation under LLM-as-a-judge evaluation, outperforming an existing state-of-the-art method (45.6% / 47.7%). These improvements highlight the effectiveness of preference-based regularization and inconsistency mining for generating consistent math MCQ distractors at scale.
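
For readers unfamiliar with DPO, the per-pair objective it optimizes can be written out directly. The sketch below is the standard DPO loss, not LookAlike's training code; the log-probability values and `beta` are made-up numbers, and in the paper's setting the "rejected" sample would be a generation the model produced inconsistently.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Arguments are sequence log-probabilities under the policy being trained
    and under the frozen reference model; beta scales the implicit reward.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


# Illustrative log-probabilities (assumed values, not from the paper).
loss_at_reference = dpo_loss(-5.0, -5.0, -5.0, -5.0)       # policy equals reference
loss_preferring_chosen = dpo_loss(-4.0, -7.0, -5.0, -5.0)  # policy favors chosen sample
```

When the policy matches the reference the loss equals log 2, and it decreases as the policy raises the likelihood of the preferred distractor relative to the dispreferred one.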

Title: BOOM: Benchmarking Out-Of-distribution Molecular Property Predictions of Machine Learning Models

Authors: Evan R. Antoniuk, Shehtab Zaman, Tal Ben-Nun, Peggy Li, James Diffenderfer, Busra Demirci, Obadiah Smolenski, Tim Hsu, Anna M. Hiszpanski, Kenneth Chiu, Bhavya Kailkhura, Brian Van Essen
Subjects: cs.LG, cond-mat.mtrl-sci, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01912
Pdf URL: https://arxiv.org/pdf/2505.01912
Copy Paste: [[2505.01912]] BOOM: Benchmarking Out-Of-distribution Molecular Property Predictions of Machine Learning Models(https://arxiv.org/abs/2505.01912)
Keywords: generative
Abstract: Advances in deep learning and generative modeling have driven interest in data-driven molecule discovery pipelines, whereby machine learning (ML) models are used to filter and design novel molecules without requiring prohibitively expensive first-principles simulations. Although the discovery of novel molecules that extend the boundaries of known chemistry requires accurate out-of-distribution (OOD) predictions, ML models often struggle to generalize OOD. Furthermore, there are currently no systematic benchmarks for molecular OOD prediction tasks. We present BOOM, Benchmarks for Out-Of-distribution Molecular property predictions -- a benchmark study of property-based out-of-distribution models for common molecular property prediction models. We evaluate more than 140 combinations of models and property prediction tasks to benchmark deep learning models on their OOD performance. Overall, we do not find any existing models that achieve strong OOD generalization across all tasks: even the top performing model exhibited an average OOD error 3x larger than in-distribution. We find that deep learning models with high inductive bias can perform well on OOD tasks with simple, specific properties. Although chemical foundation models with transfer and in-context learning offer a promising solution for limited training data scenarios, we find that current foundation models do not show strong OOD extrapolation capabilities. We perform extensive ablation experiments to highlight how OOD performance is impacted by data generation, pre-training, hyperparameter optimization, model architecture, and molecular representation. We propose that developing ML models with strong OOD generalization is a new frontier challenge in chemical ML model development. This open-source benchmark will be made available on GitHub.

Title: Unemployment Dynamics Forecasting with Machine Learning Regression Models

Authors: Kyungsu Kim
Subjects: cs.LG, econ.EM
Abstract URL: https://arxiv.org/abs/2505.01933
Pdf URL: https://arxiv.org/pdf/2505.01933
Copy Paste: [[2505.01933]] Unemployment Dynamics Forecasting with Machine Learning Regression Models(https://arxiv.org/abs/2505.01933)
Keywords: interpretability
Abstract: In this paper, I explored how a range of regression and machine learning techniques can be applied to monthly U.S. unemployment data to produce timely forecasts. I compared seven models: Linear Regression, SGDRegressor, Random Forest, XGBoost, CatBoost, Support Vector Regression, and an LSTM network, training each on a historical span of data and then evaluating on a later hold-out period. Input features include macro indicators (GDP growth, CPI), labor market measures (job openings, initial claims), financial variables (interest rates, equity indices), and consumer sentiment. I tuned model hyperparameters via cross-validation and assessed performance with standard error metrics and the ability to predict the correct unemployment direction. Across the board, tree-based ensembles (and CatBoost in particular) deliver noticeably better forecasts than simple linear approaches, while the LSTM captures underlying temporal patterns more effectively than other nonlinear methods. SVR and SGDRegressor yield modest gains over standard regression but don't match the consistency of the ensemble and deep-learning models. Interpretability tools (feature importance rankings and SHAP values) point to job openings and consumer sentiment as the most influential predictors across all methods. By directly comparing linear, ensemble, and deep-learning approaches on the same dataset, this study shows how modern machine-learning techniques can enhance real-time unemployment forecasting, offering economists and policymakers richer insights into labor market trends. In the comparative evaluation of the models, I employed a dataset comprising thirty distinct features over the period from January 2020 through December 2024.

Title: GauS-SLAM: Dense RGB-D SLAM with Gaussian Surfels

Authors: Yongxin Su, Lin Chen, Kaiting Zhang, Zhongliang Zhao, Chenfeng Hou, Ziping Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01934
Pdf URL: https://arxiv.org/pdf/2505.01934
Copy Paste: [[2505.01934]] GauS-SLAM: Dense RGB-D SLAM with Gaussian Surfels(https://arxiv.org/abs/2505.01934)
Keywords: robust
Abstract: We propose GauS-SLAM, a dense RGB-D SLAM system that leverages 2D Gaussian surfels to achieve robust tracking and high-fidelity mapping. Our investigations reveal that Gaussian-based scene representations exhibit geometry distortion under novel viewpoints, which significantly degrades the accuracy of Gaussian-based tracking methods. These geometry inconsistencies arise primarily from the depth modeling of Gaussian primitives and the mutual interference between surfaces during the depth blending. To address these, we propose a 2D Gaussian-based incremental reconstruction strategy coupled with a Surface-aware Depth Rendering mechanism, which significantly enhances geometry accuracy and multi-view consistency. Additionally, the proposed local map design dynamically isolates visible surfaces during tracking, mitigating misalignment caused by occluded regions in global maps while maintaining computational efficiency with increasing Gaussian density. Extensive experiments across multiple datasets demonstrate that GauS-SLAM outperforms comparable methods, delivering superior tracking precision and rendering fidelity. The project page will be made available at this https URL.

Title: UK Finfluencers: Exploring Content, Reach, and Responsibility

Authors: Essam Ghadafi, Panagiotis Andriotis
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.01941
Pdf URL: https://arxiv.org/pdf/2505.01941
Copy Paste: [[2505.01941]] UK Finfluencers: Exploring Content, Reach, and Responsibility(https://arxiv.org/abs/2505.01941)
Keywords: protect
Abstract: The rise of social media financial influencers (finfluencers) has significantly transformed the personal finance landscape, making financial advice and insights more accessible to a broader and younger audience. By leveraging digital platforms, these influencers have contributed to the democratization of financial literacy. However, the line between education and promotion is often blurred, as many finfluencers lack formal financial qualifications, raising concerns about the accuracy and reliability of the information they share. This study investigates the patterns and behaviours of finfluencers in the UK on TikTok, focusing not on individual actions but on broader trends and the interactions between influencers and their followers. The aim is to identify common engagement patterns and propose guidelines that can help protect the public from potential financial harm. Specifically, the paper contributes a detailed analysis of finfluencer content categorization, sentiment trends, and the prevalence and role of disclaimers, offering empirical insights that inform recommendations for safer and more transparent financial communication on social media.

Title: Multi-Scale Graph Learning for Anti-Sparse Downscaling

Authors: Yingda Fan, Runlong Yu, Janet R. Barclay, Alison P. Appling, Yiming Sun, Yiqun Xie, Xiaowei Jia
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01948
Pdf URL: https://arxiv.org/pdf/2505.01948
Copy Paste: [[2505.01948]] Multi-Scale Graph Learning for Anti-Sparse Downscaling(https://arxiv.org/abs/2505.01948)
Keywords: protect
Abstract: Water temperature can vary substantially even across short distances within the same sub-watershed. Accurate prediction of stream water temperature at fine spatial resolutions (i.e., fine scales, $\leq$ 1 km) enables precise interventions to maintain water quality and protect aquatic habitats. Although spatiotemporal models have made substantial progress in spatially coarse time series modeling, challenges persist in predicting at fine spatial scales due to the lack of data at that scale. To address the problem of insufficient fine-scale data, we propose a Multi-Scale Graph Learning (MSGL) method. This method employs a multi-task learning framework where coarse-scale graph learning, bolstered by larger datasets, simultaneously enhances fine-scale graph learning. Although existing multi-scale or multi-resolution methods integrate data from different spatial scales, they often overlook the spatial correspondences across graph structures at various scales. To address this, our MSGL introduces an additional learning task, cross-scale interpolation learning, which leverages the hydrological connectedness of stream locations across coarse- and fine-scale graphs to establish cross-scale connections, thereby enhancing overall model performance. Furthermore, we have broken free from the mindset that multi-scale learning is limited to synchronous training by proposing an Asynchronous Multi-Scale Graph Learning method (ASYNC-MSGL). Extensive experiments demonstrate the state-of-the-art performance of our method for anti-sparse downscaling of daily stream temperatures in the Delaware River Basin, USA, highlighting its potential utility for water resources monitoring and management.

Title: Segment Any RGB-Thermal Model with Language-aided Distillation

Authors: Dong Xing, Xianxun Zhu, Wei Zhou, Qika Lin, Hang Yang, Yuqing Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01950
Pdf URL: https://arxiv.org/pdf/2505.01950
Copy Paste: [[2505.01950]] Segment Any RGB-Thermal Model with Language-aided Distillation(https://arxiv.org/abs/2505.01950)
Keywords: robust, segmentation
Abstract: The recent Segment Anything Model (SAM) demonstrates strong instance segmentation performance across various downstream tasks. However, SAM is trained solely on RGB data, limiting its direct applicability to RGB-thermal (RGB-T) semantic segmentation. Given that RGB-T provides a robust solution for scene understanding in adverse weather and lighting conditions, such as low light and overexposure, we propose a novel framework, SARTM, which customizes the powerful SAM for RGB-T semantic segmentation. Our key idea is to unleash the potential of SAM while introducing semantic understanding modules for RGB-T data pairs. Specifically, our framework first involves fine-tuning the original SAM by adding extra LoRA layers, aiming at preserving SAM's strong generalization and segmentation capabilities for downstream tasks. Secondly, we introduce language information as guidance for training our SARTM. To address cross-modal inconsistencies, we introduce a Cross-Modal Knowledge Distillation (CMKD) module that effectively achieves modality adaptation while maintaining its generalization capabilities. This semantic module enables the minimization of modality gaps and alleviates semantic ambiguity, facilitating the combination of any modality under any visual conditions. Furthermore, we enhance the segmentation performance by adjusting the segmentation head of SAM and incorporating an auxiliary semantic segmentation head, which integrates multi-scale features for effective fusion. Extensive experiments are conducted across three multi-modal RGBT semantic segmentation benchmarks: MFNET, PST900, and FMB. Both quantitative and qualitative results consistently demonstrate that the proposed SARTM significantly outperforms state-of-the-art approaches across a variety of conditions.

Title: A Comprehensive Analysis for Visual Object Hallucination in Large Vision-Language Models

Authors: Liqiang Jing, Guiming Hardy Chen, Ehsan Aghazadeh, Xin Eric Wang, Xinya Du
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.01958
Pdf URL: https://arxiv.org/pdf/2505.01958
Copy Paste: [[2505.01958]] A Comprehensive Analysis for Visual Object Hallucination in Large Vision-Language Models(https://arxiv.org/abs/2505.01958)
Keywords: large language model
Abstract: Large Vision-Language Models (LVLMs) demonstrate remarkable capabilities in multimodal tasks, but visual object hallucination remains a persistent issue. It refers to scenarios where models generate inaccurate visual object-related information based on the query input, potentially leading to misinformation and concerns about safety and reliability. Previous works focus on the evaluation and mitigation of visual hallucinations, but the underlying causes have not been comprehensively investigated. In this paper, we analyze each component of LLaVA-like LVLMs -- the large language model, the vision backbone, and the projector -- to identify potential sources of error and their impact. Based on our observations, we propose methods to mitigate hallucination for each problematic component. Additionally, we developed two hallucination benchmarks: QA-VisualGenome, which emphasizes attribute and relation hallucinations, and QA-FB15k, which focuses on cognition-based hallucinations.

Title: EnsembleCI: Ensemble Learning for Carbon Intensity Forecasting

Authors: Leyi Yan, Linda Wang, Sihang Liu, Yi Ding
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.01959
Pdf URL: https://arxiv.org/pdf/2505.01959
Copy Paste: [[2505.01959]] EnsembleCI: Ensemble Learning for Carbon Intensity Forecasting(https://arxiv.org/abs/2505.01959)
Keywords: robust, interpretability
Abstract: Carbon intensity (CI) measures the average carbon emissions generated per unit of electricity, making it a crucial metric for quantifying and managing the environmental impact. Accurate CI predictions are vital for minimizing carbon footprints, yet the state-of-the-art method (CarbonCast) falls short due to its inability to address regional variability and lack of adaptability. To address these limitations, we introduce EnsembleCI, an adaptive, end-to-end ensemble learning-based approach for CI forecasting. EnsembleCI combines weighted predictions from multiple sublearners, offering enhanced flexibility and regional adaptability. In evaluations across 11 regional grids, EnsembleCI consistently surpasses CarbonCast, achieving the lowest mean absolute percentage error (MAPE) in almost all grids and improving prediction accuracy by an average of 19.58%. While performance still varies across grids due to inherent regional diversity, EnsembleCI reduces variability and exhibits greater robustness in long-term forecasting compared to CarbonCast and identifies region-specific key features, underscoring its interpretability and practical relevance. These findings position EnsembleCI as a more accurate and reliable solution for CI forecasting. EnsembleCI source code and data used in this paper are available at this https URL.
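The two quantities this abstract leans on, a weighted combination of sublearner forecasts and the mean absolute percentage error (MAPE), can be illustrated with a minimal Python sketch. The function names, the fixed weights, and the toy numbers below are assumptions made for illustration; the abstract does not say how EnsembleCI chooses its weights, and this is not the authors' implementation.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, expressed in percent."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def weighted_ensemble(predictions, weights):
    """Combine per-sublearner forecasts using weights normalized to sum to 1."""
    total = sum(weights)
    horizon = len(predictions[0])
    return [
        sum(w * preds[t] for w, preds in zip(weights, predictions)) / total
        for t in range(horizon)
    ]


# Hypothetical carbon-intensity values and two hypothetical sublearners.
actual = [100.0, 200.0, 400.0]
learner_a = [110.0, 180.0, 400.0]
learner_b = [90.0, 220.0, 360.0]

# Illustrative fixed weights; a real system would learn or adapt them.
combined = weighted_ensemble([learner_a, learner_b], weights=[3.0, 1.0])

print(round(mape(actual, learner_a), 2))
print(round(mape(actual, learner_b), 2))
print(round(mape(actual, combined), 2))
```

In this toy case the two sublearners err in opposite directions, so the weighted combination has a lower MAPE than either one alone, which is the usual motivation for ensembling.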

Title: Analyzing Cognitive Differences Among Large Language Models through the Lens of Social Worldview

Authors: Jiatao Li, Yanheng Li, Xiaojun Wan
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2505.01967
Pdf URL: https://arxiv.org/pdf/2505.01967
Copy Paste: [[2505.01967]] Analyzing Cognitive Differences Among Large Language Models through the Lens of Social Worldview(https://arxiv.org/abs/2505.01967)
Keywords: interpretability, large language model
Abstract: Large Language Models (LLMs) have become integral to daily life, widely adopted in communication, decision-making, and information retrieval, raising critical questions about how these systems implicitly form and express socio-cognitive attitudes or "worldviews". While existing research extensively addresses demographic and ethical biases, broader dimensions, such as attitudes toward authority, equality, autonomy, and fate, remain under-explored. In this paper, we introduce the Social Worldview Taxonomy (SWT), a structured framework grounded in Cultural Theory, operationalizing four canonical worldviews (Hierarchy, Egalitarianism, Individualism, Fatalism) into measurable sub-dimensions. Using SWT, we empirically identify distinct and interpretable cognitive profiles across 28 diverse LLMs. Further, inspired by Social Referencing Theory, we experimentally demonstrate that explicit social cues systematically shape these cognitive attitudes, revealing both general response patterns and nuanced model-specific variations. Our findings enhance the interpretability of LLMs by revealing implicit socio-cognitive biases and their responsiveness to social feedback, thus guiding the development of more transparent and socially responsible language technologies.

Title: MC3D-AD: A Unified Geometry-aware Reconstruction Model for Multi-category 3D Anomaly Detection

Authors: Jiayi Cheng, Can Gao, Jie Zhou, Jiajun Wen, Tao Dai, Jinbao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01969
Pdf URL: https://arxiv.org/pdf/2505.01969
Copy Paste: [[2505.01969]] MC3D-AD: A Unified Geometry-aware Reconstruction Model for Multi-category 3D Anomaly Detection(https://arxiv.org/abs/2505.01969)
Keywords: robust
Abstract: 3D Anomaly Detection (AD) is a promising means of controlling the quality of manufactured products. However, existing methods typically require carefully training a task-specific model for each category independently, leading to high cost, low efficiency, and weak generalization. Therefore, this paper presents a novel unified model for Multi-Category 3D Anomaly Detection (MC3D-AD) that aims to utilize both local and global geometry-aware information to reconstruct normal representations of all categories. First, to learn robust and generalized features of different categories, we propose an adaptive geometry-aware masked attention module that extracts geometry variation information to guide mask attention. Then, we introduce a local geometry-aware encoder reinforced by the improved mask attention to encode group-level feature tokens. Finally, we design a global query decoder that utilizes point cloud position embeddings to improve the decoding process and reconstruction ability. This leads to local and global geometry-aware reconstructed feature tokens for the AD task. MC3D-AD is evaluated on two publicly available datasets, Real3D-AD and Anomaly-ShapeNet, and exhibits significant superiority over current state-of-the-art single-category methods, achieving 3.1% and 9.3% improvement in object-level AUROC over Real3D-AD and Anomaly-ShapeNet, respectively. The source code will be released upon acceptance.

Title: Visual Dominance and Emerging Multimodal Approaches in Distracted Driving Detection: A Review of Machine Learning Techniques

Authors: Anthony Dontoh, Stephanie Ivey, Logan Sirbaugh, Andrews Danyo, Armstrong Aboah
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01973
Pdf URL: https://arxiv.org/pdf/2505.01973
Copy Paste: [[2505.01973]] Visual Dominance and Emerging Multimodal Approaches in Distracted Driving Detection: A Review of Machine Learning Techniques(https://arxiv.org/abs/2505.01973)
Keywords: privacy, robust
Abstract: Distracted driving continues to be a significant cause of road traffic injuries and fatalities worldwide, even with advancements in driver monitoring technologies. Recent developments in machine learning (ML) and deep learning (DL) have primarily focused on visual data to detect distraction, often neglecting the complex, multimodal nature of driver behavior. This systematic review assesses 74 peer-reviewed studies from 2019 to 2024 that utilize ML/DL techniques for distracted driving detection across visual, sensor-based, multimodal, and emerging modalities. The review highlights a significant prevalence of visual-only models, particularly convolutional neural networks (CNNs) and temporal architectures, which achieve high accuracy but show limited generalizability in real-world scenarios. Sensor-based and physiological models provide complementary strengths by capturing internal states and vehicle dynamics, while emerging techniques, such as auditory sensing and radio frequency (RF) methods, offer privacy-aware alternatives. Multimodal architectures consistently surpass unimodal baselines, demonstrating enhanced robustness, context awareness, and scalability by integrating diverse data streams. These findings emphasize the need to move beyond visual-only approaches and adopt multimodal systems that combine visual, physiological, and vehicular cues while keeping computational requirements in check. Future research should focus on developing lightweight, deployable multimodal frameworks, incorporating personalized baselines, and establishing cross-modality benchmarks to ensure real-world reliability in advanced driver assistance systems (ADAS) and road safety interventions.

Title: A Survey on Privacy Risks and Protection in Large Language Models

Authors: Kang Chen, Xiuze Zhou, Yuanguo Lin, Shibo Feng, Li Shen, Pengcheng Wu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.01976
Pdf URL: https://arxiv.org/pdf/2505.01976
Copy Paste: [[2505.01976]] A Survey on Privacy Risks and Protection in Large Language Models(https://arxiv.org/abs/2505.01976)
Keywords: secure, privacy, protect, attack, extraction, membership infer, federate, large language model
Abstract: Although Large Language Models (LLMs) have become increasingly integral to diverse applications, their capabilities raise significant privacy concerns. This survey offers a comprehensive overview of privacy risks associated with LLMs and examines current solutions to mitigate these challenges. First, we analyze privacy leakage and attacks in LLMs, focusing on how these models unintentionally expose sensitive information through techniques such as model inversion, training data extraction, and membership inference. We investigate the mechanisms of privacy leakage, including the unauthorized extraction of training data and the potential exploitation of these vulnerabilities by malicious actors. Next, we review existing privacy protection against such risks, such as inference detection, federated learning, backdoor mitigation, and confidential computing, and assess their effectiveness in preventing privacy leakage. Furthermore, we highlight key practical challenges and propose future research directions to develop secure and privacy-preserving LLMs, emphasizing privacy risk assessment, secure knowledge transfer between models, and interdisciplinary frameworks for privacy governance. Ultimately, this survey aims to establish a roadmap for addressing escalating privacy challenges in the LLMs domain.

Title: LLM-based Text Simplification and its Effect on User Comprehension and Cognitive Load

Authors: Theo Guidroz, Diego Ardila, Jimmy Li, Adam Mansour, Paul Jhun, Nina Gonzalez, Xiang Ji, Mike Sanchez, Sujay Kakarmath, Mathias MJ Bellaiche, Miguel Ángel Garrido, Faruk Ahmed, Divyansh Choudhary, Jay Hartford, Chenwei Xu, Henry Javier Serrano Echeverria, Yifan Wang, Jeff Shaffer, Eric (Yifan) Cao, Yossi Matias, Avinatan Hassidim, Dale R Webster, Yun Liu, Sho Fujiwara, Peggy Bui, Quang Duong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.01980
Pdf URL: https://arxiv.org/pdf/2505.01980
Copy Paste: [[2505.01980]] LLM-based Text Simplification and its Effect on User Comprehension and Cognitive Load(https://arxiv.org/abs/2505.01980)
Keywords: robust
Abstract: Information on the web, such as scientific publications and Wikipedia, often surpasses users' reading level. To help address this, we used a self-refinement approach to develop an LLM capability for minimally lossy text simplification. To validate our approach, we conducted a randomized study involving 4563 participants and 31 texts spanning 6 broad subject areas: PubMed (biomedical scientific articles), biology, law, finance, literature/philosophy, and aerospace/computer science. Participants were randomized to viewing original or simplified texts in a subject area, and answered multiple-choice questions (MCQs) that tested their comprehension of the text. The participants were also asked to provide qualitative feedback such as task difficulty. Our results indicate that participants who read the simplified text answered more MCQs correctly than their counterparts who read the original text (3.9% absolute increase, p<0.05). This gain was most striking with PubMed (14.6%), while more moderate gains were observed for the finance (5.5%), aerospace/computer science (3.8%), and legal (3.5%) domains. Notably, the results were robust to whether participants could refer back to the text while answering MCQs. The absolute accuracy decreased by up to ~9% for both original and simplified setups where participants could not refer back to the text, but the ~4% overall improvement persisted. Finally, participants' self-reported perceived ease based on a simplified NASA Task Load Index was greater for those who read the simplified text (absolute change of 0.33 on a 5-point scale, p<0.05). This randomized study, involving an order of magnitude more participants than prior works, demonstrates the potential of LLMs to make complex information easier to understand. Our work aims to enable a broader audience to better learn and make use of expert knowledge available on the web, improving information accessibility.

Title: Always Skip Attention

Authors: Yiping Ji, Hemanth Saratchandran, Peyman Moghaddam, Simon Lucey
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.01996
Pdf URL: https://arxiv.org/pdf/2505.01996
Copy Paste: [[2505.01996]] Always Skip Attention(https://arxiv.org/abs/2505.01996)
Keywords: transformer
Abstract: We highlight a curious empirical result within modern Vision Transformers (ViTs). Specifically, self-attention catastrophically fails to train unless it is used in conjunction with a skip connection. This is in contrast to other elements of a ViT that continue to exhibit good performance (albeit suboptimal) when skip connections are removed. Further, we show that this critical dependence on skip connections is a relatively new phenomenon, with previous deep architectures (e.g., CNNs) exhibiting good performance in their absence. In this paper, we theoretically characterize that the self-attention mechanism is fundamentally ill-conditioned and is, therefore, uniquely dependent on skip connections for regularization. Additionally, we propose Token Graying -- a simple yet effective complement (to skip connections) that further improves the condition of input tokens. We validate our approach in both supervised and self-supervised training methods.

Title: Restoring Calibration for Aligned Large Language Models: A Calibration-Aware Fine-Tuning Approach

Authors: Jiancong Xiao, Bojian Hou, Zhanliang Wang, Ruochen Jin, Qi Long, Weijie J. Su, Li Shen
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2505.01997
Pdf URL: https://arxiv.org/pdf/2505.01997
Copy Paste: [[2505.01997]] Restoring Calibration for Aligned Large Language Models: A Calibration-Aware Fine-Tuning Approach(https://arxiv.org/abs/2505.01997)
Keywords: large language model
Abstract: One of the key technologies for the success of Large Language Models (LLMs) is preference alignment. However, a notable side effect of preference alignment is poor calibration: while the pre-trained models are typically well-calibrated, LLMs tend to become poorly calibrated after alignment with human preferences. In this paper, we investigate why preference alignment affects calibration and how to address this issue. For the first question, we observe that the preference collapse issue in alignment undesirably generalizes to the calibration scenario, causing LLMs to exhibit overconfidence and poor calibration. To address this, we demonstrate the importance of fine-tuning with domain-specific knowledge to alleviate the overconfidence issue. To further analyze whether this affects the model's performance, we categorize models into two regimes: calibratable and non-calibratable, defined by bounds of Expected Calibration Error (ECE). In the calibratable regime, we propose a calibration-aware fine-tuning approach to achieve proper calibration without compromising LLMs' performance. However, as models are further fine-tuned for better performance, they enter the non-calibratable regime. For this case, we develop an EM-algorithm-based ECE regularization for the fine-tuning loss to maintain low calibration error. Extensive experiments validate the effectiveness of the proposed methods.
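For readers unfamiliar with the Expected Calibration Error (ECE) that this abstract uses to define its two regimes, the following Python sketch shows the common binned estimate: predictions are grouped by confidence, and the gap between average confidence and accuracy in each bin is averaged with weights proportional to bin size. The equal-width, half-open binning and the function name are assumptions for illustration; the abstract does not specify which ECE variant the authors use.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin size / n) * |accuracy - mean confidence|."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins [k/n_bins, (k+1)/n_bins); confidence 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Overconfident toy model: claims 90% confidence but is right half the time.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False])

# Calibrated toy model: claims 50% confidence and is right half the time.
calibrated = expected_calibration_error([0.5, 0.5], [True, False])

print(round(overconfident, 3), round(calibrated, 3))
```

The first case mirrors the overconfidence the abstract attributes to aligned models: stated confidence exceeds observed accuracy, so the ECE is large.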

Title: Triple-identity Authentication: The Future of Secure Access

Authors: Suyun Borjigin
Subjects: cs.CR, cs.ET, cs.HC, eess.SY
Abstract URL: https://arxiv.org/abs/2505.02004
Pdf URL: https://arxiv.org/pdf/2505.02004
Copy Paste: [[2505.02004]] Triple-identity Authentication: The Future of Secure Access(https://arxiv.org/abs/2505.02004)
Keywords: secure, security, robust
Abstract: In a typical authentication process, the local system verifies the user's identity using a stored hash value generated by a cross-system hash algorithm. This article shifts the research focus from traditional password encryption to the establishment of gatekeeping mechanisms for effective interactions between a system and the outside world. Here, we propose a triple-identity authentication system to achieve this goal. Specifically, this local system opens the inner structure of its hash algorithm to all user credentials, including the login name, login password, and authentication password. When a login credential is entered, the local system hashes it and then creates a unique identifier using intermediate hash elements randomly selected from the open algorithm. Importantly, this locally generated unique identifier (rather than the stored hash produced by the open algorithm) is utilized to verify the user's combined identity, which is generated by combining the entered credential with the International Mobile Equipment Identity and the International Mobile Subscriber Identity. The verification process is implemented at each interaction point: the login name field, the login password field, and the server's authentication point. Thus, within the context of this triple-identity authentication system, we establish a robust gatekeeping mechanism for system interactions, ultimately providing a level of security that is equivalent to multi-factor authentication.

Title: Efficient Noise Calculation in Deep Learning-based MRI Reconstructions

Authors: Onat Dalmaz, Arjun D. Desai, Reinhard Heckel, Tolga Çukur, Akshay S. Chaudhari, Brian A. Hargreaves
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02007
Pdf URL: https://arxiv.org/pdf/2505.02007
Copy Paste: [[2505.02007]] Efficient Noise Calculation in Deep Learning-based MRI Reconstructions(https://arxiv.org/abs/2505.02007)
Keywords: robust
Abstract: Accelerated MRI reconstruction involves solving an ill-posed inverse problem where noise in acquired data propagates to the reconstructed images. Noise analyses are central to MRI reconstruction for providing an explicit measure of solution fidelity and for guiding the design and deployment of novel reconstruction methods. However, deep learning (DL)-based reconstruction methods have often overlooked noise propagation due to inherent analytical and computational challenges, despite its critical importance. This work proposes a theoretically grounded, memory-efficient technique to calculate voxel-wise variance for quantifying uncertainty due to acquisition noise in accelerated MRI reconstructions. Our approach approximates noise covariance using the DL network's Jacobian, which is intractable to calculate. To circumvent this, we derive an unbiased estimator for the diagonal of this covariance matrix (voxel-wise variance) and introduce a Jacobian sketching technique to efficiently implement it. We evaluate our method on knee and brain MRI datasets for both data- and physics-driven networks trained in supervised and unsupervised manners. Compared to empirical references obtained via Monte Carlo simulations, our technique achieves near-equivalent performance while reducing computational and memory demands by an order of magnitude or more. Furthermore, our method is robust across varying input noise levels, acceleration factors, and diverse undersampling schemes, highlighting its broad applicability. Our work reintroduces accurate and efficient noise analysis as a central tenet of reconstruction algorithms, holding promise to reshape how we evaluate and deploy DL-based MRI. Our code will be made publicly available upon acceptance.
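The idea of estimating only the diagonal of a Jacobian-based covariance can be illustrated with a generic randomized probe estimator. The Python sketch below assumes unit-variance, uncorrelated input noise, so that the output covariance is J J^T, and uses random sign (Rademacher) probes v, for which the expectation of (J v)^2 taken elementwise equals diag(J J^T). This is a textbook Hutchinson-style estimator on a toy linear map, written here as an assumption-laden illustration; it is not the authors' estimator or their sketching technique, and all names are hypothetical.

```python
import random
from itertools import product


def sketch_variance(apply_jacobian, probes):
    """Estimate diag(J J^T) as the mean of elementwise (J v)^2 over probes v."""
    total = None
    count = 0
    for v in probes:
        out = apply_jacobian(v)
        if total is None:
            total = [0.0] * len(out)
        for i, y in enumerate(out):
            total[i] += y * y
        count += 1
    if count == 0:
        raise ValueError("need at least one probe vector")
    return [t / count for t in total]


def rademacher_probes(dim, n, seed=0):
    """Draw n random sign vectors of length dim (seeded for repeatability)."""
    rng = random.Random(seed)
    return [[rng.choice((-1.0, 1.0)) for _ in range(dim)] for _ in range(n)]


# Toy stand-in for a network Jacobian: 2 output voxels, 3 input samples.
J = [[1.0, 2.0, 0.0],
     [0.0, -1.0, 3.0]]


def apply_J(v):
    """Jacobian-vector product; a real network would use autodiff here."""
    return [sum(a * b for a, b in zip(row, v)) for row in J]


exact = [sum(a * a for a in row) for row in J]  # diagonal of J J^T
estimate = sketch_variance(apply_J, rademacher_probes(dim=3, n=2000))

print(exact, [round(e, 2) for e in estimate])
```

The point of the design is that only Jacobian-vector products are needed, so the full Jacobian is never formed; averaging over all eight sign vectors of length 3 recovers the exact diagonal in this toy case.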

Title: Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs

Authors: Sai Krishna Mendu, Harish Yenala, Aditi Gulati, Shanu Kumar, Parag Agrawal
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.02009
Pdf URL: https://arxiv.org/pdf/2505.02009
Copy Paste: [[2505.02009]] Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs(https://arxiv.org/abs/2505.02009)
Keywords: transformer, large language model
Abstract: Large language models (LLMs) have become integral to various real-world applications, leveraging massive, web-sourced datasets like Common Crawl, C4, and FineWeb for pretraining. While these datasets provide linguistic data essential for high-quality natural language generation, they often contain harmful content, such as hate speech, misinformation, and biased narratives. Training LLMs on such unfiltered data risks perpetuating toxic behaviors, spreading misinformation, and amplifying societal biases, which can undermine trust in LLM-driven applications and raise ethical concerns about their use. This paper presents a large-scale analysis of inappropriate content across these datasets, offering a comprehensive taxonomy that categorizes harmful webpages into Topical and Toxic based on their intent. We also introduce a prompt evaluation dataset, a high-accuracy Topical and Toxic Prompt (TTP), and a transformer-based model (HarmFormer) for content filtering. Additionally, we create a new multi-harm open-ended toxicity benchmark (HAVOC) and provide crucial insights into how models respond to adversarial toxic inputs. Upon publishing, we will also open-source our model signal on the entire C4 dataset. Our work offers insights into ensuring safer LLM pretraining and serves as a resource for Responsible AI (RAI) compliance.

Title: CASA: CNN Autoencoder-based Score Attention for Efficient Multivariate Long-term Time-series Forecasting

Authors: Minhyuk Lee, HyeKyung Yoon, MyungJoo Kang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02011
Pdf URL: https://arxiv.org/pdf/2505.02011
Copy Paste: [[2505.02011]] CASA: CNN Autoencoder-based Score Attention for Efficient Multivariate Long-term Time-series Forecasting(https://arxiv.org/abs/2505.02011)
Keywords: transformer
Abstract: Multivariate long-term time series forecasting is critical for applications such as weather prediction and traffic analysis. In addition, the implementation of Transformer variants has improved prediction accuracy. Following these variants, different input data processing approaches have also advanced the field, such as tokenization techniques including point-wise, channel-wise, and patch-wise tokenization. However, previous studies still have limitations in time complexity, computational resources, and cross-dimensional interactions. To address these limitations, we introduce a novel CNN Autoencoder-based Score Attention mechanism (CASA), which can be introduced in diverse Transformers in a model-agnostic manner, reducing memory usage and improving model performance. Experiments on eight real-world datasets validate that CASA decreases computational resources by up to 77.7%, accelerates inference by 44.0%, and achieves state-of-the-art performance, ranking first in 87.5% of evaluated metrics.

Title: MLLM-Enhanced Face Forgery Detection: A Vision-Language Fusion Solution

Authors: Siran Peng, Zipei Wang, Li Gao, Xiangyu Zhu, Tianshuo Zhang, Ajian Liu, Haoyuan Zhang, Zhen Lei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02013
Pdf URL: https://arxiv.org/pdf/2505.02013
Copy Paste: [[2505.02013]] MLLM-Enhanced Face Forgery Detection: A Vision-Language Fusion Solution(https://arxiv.org/abs/2505.02013)
Keywords: explainability, large language model
Abstract: Reliable face forgery detection algorithms are crucial for countering the growing threat of deepfake-driven disinformation. Previous research has demonstrated the potential of Multimodal Large Language Models (MLLMs) in identifying manipulated faces. However, existing methods typically depend on either the Large Language Model (LLM) alone or an external detector to generate classification results, which often leads to sub-optimal integration of visual and textual modalities. In this paper, we propose VLF-FFD, a novel Vision-Language Fusion solution for MLLM-enhanced Face Forgery Detection. Our key contributions are twofold. First, we present EFF++, a frame-level, explainability-driven extension of the widely used FaceForensics++ (FF++) dataset. In EFF++, each manipulated video frame is paired with a textual annotation that describes both the forgery artifacts and the specific manipulation technique applied, enabling more effective and informative MLLM training. Second, we design a Vision-Language Fusion Network (VLF-Net) that promotes bidirectional interaction between visual and textual features, supported by a three-stage training pipeline to fully leverage its potential. VLF-FFD achieves state-of-the-art (SOTA) performance in both cross-dataset and intra-dataset evaluations, underscoring its exceptional effectiveness in face forgery detection.

Title: Wide & Deep Learning for Node Classification

Authors: Yancheng Chen, Wenguo Yang, Zhipeng Jiang
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2505.02020
Pdf URL: https://arxiv.org/pdf/2505.02020
Copy Paste: [[2505.02020]] Wide & Deep Learning for Node Classification(https://arxiv.org/abs/2505.02020)
Keywords: large language model
Abstract: Wide & Deep, a simple yet effective learning architecture for recommendation systems developed by Google, has had a significant impact in both academia and industry due to its combination of the memorization ability of generalized linear models and the generalization ability of deep models. Graph convolutional networks (GCNs) remain dominant in node classification tasks; however, recent studies have highlighted issues such as heterophily and expressiveness, which focus on graph structure while seemingly neglecting the potential role of node features. In this paper, we propose a flexible framework, GCNIII, which leverages the Wide & Deep architecture and incorporates three techniques: Intersect memory, Initial residual and Identity mapping. We provide comprehensive empirical evidence showing that GCNIII can more effectively balance the trade-off between over-fitting and over-generalization on various semi- and full-supervised tasks. Additionally, we explore the use of large language models (LLMs) for node feature engineering to enhance the performance of GCNIII in cross-domain node classification tasks. Our implementation is available at this https URL.

Title: Secrets of GFlowNets' Learning Behavior: A Theoretical Study

Authors: Tianshu Yu
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.02035
Pdf URL: https://arxiv.org/pdf/2505.02035
Copy Paste: [[2505.02035]] Secrets of GFlowNets' Learning Behavior: A Theoretical Study(https://arxiv.org/abs/2505.02035)
Keywords: robust, generative
Abstract: Generative Flow Networks (GFlowNets) have emerged as a powerful paradigm for generating composite structures, demonstrating considerable promise across diverse applications. While substantial progress has been made in exploring their modeling validity and connections to other generative frameworks, the theoretical understanding of their learning behavior remains largely uncharted. In this work, we present a rigorous theoretical investigation of GFlowNets' learning behavior, focusing on four fundamental dimensions: convergence, sample complexity, implicit regularization, and robustness. By analyzing these aspects, we seek to elucidate the intricate mechanisms underlying GFlowNet's learning dynamics, shedding light on its strengths and limitations. Our findings contribute to a deeper understanding of the factors influencing GFlowNet performance and provide insights into principled guidelines for their effective design and deployment. This study not only bridges a critical gap in the theoretical landscape of GFlowNets but also lays the foundation for their evolution as a reliable and interpretable framework for generative modeling. Through this, we aspire to advance the theoretical frontiers of GFlowNets and catalyze their broader adoption in the AI community.

Title: Point2Primitive: CAD Reconstruction from Point Cloud by Direct Primitive Prediction

Authors: Cheng Wang, Xinzhu Ma, Bin Wang, Shixiang Tang, Yuan Meng, Ping Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02043
Pdf URL: https://arxiv.org/pdf/2505.02043
Copy Paste: [[2505.02043]] Point2Primitive: CAD Reconstruction from Point Cloud by Direct Primitive Prediction(https://arxiv.org/abs/2505.02043)
Keywords: transformer, segmentation
Abstract: Recovering CAD models from point clouds, especially the sketch-extrusion process, can be seen as the process of rebuilding the topology and extrusion primitives. Previous methods utilize implicit fields for sketch representation, leading to shape reconstruction of curved edges. In this paper, we propose a CAD reconstruction network that produces editable CAD models from input point clouds (Point2Primitive) by directly predicting every element of the extrusion primitives. Point2Primitive can directly detect and predict sketch curves (type and parameter) from point clouds based on an improved transformer. The sketch curve parameters are formulated as position queries and optimized in an autoregressive way, leading to high parameter accuracy. The topology is rebuilt by extrusion segmentation, and each extrusion parameter (sketch and extrusion operation) is recovered by combining the predicted curves and the computed extrusion operation. Extensive experiments demonstrate that our method is superior in primitive prediction accuracy and CAD reconstruction. The reconstructed shapes are of high geometrical fidelity.

Title: Regression is all you need for medical image translation

Authors: Sebastian Rassmann, David Kügler, Christian Ewert, Martin Reuter
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02048
Pdf URL: https://arxiv.org/pdf/2505.02048
Copy Paste: [[2505.02048]] Regression is all you need for medical image translation(https://arxiv.org/abs/2505.02048)
Keywords: diffusion, generative
Abstract: The acquisition of information-rich images within a limited time budget is crucial in medical imaging. Medical image translation (MIT) can help enhance and supplement existing datasets by generating synthetic images from acquired data. While Generative Adversarial Nets (GANs) and Diffusion Models (DMs) have achieved remarkable success in natural image generation, their benefits - creativity and image realism - do not necessarily transfer to medical applications where highly accurate anatomical information is required. In fact, the imitation of acquisition noise or content hallucination hinders clinical utility. Here, we introduce YODA (You Only Denoise once - or Average), a novel 2.5D diffusion-based framework for volumetric MIT. YODA unites diffusion and regression paradigms to produce realistic or noise-free outputs. Furthermore, we propose Expectation-Approximation (ExpA) DM sampling, which draws inspiration from MRI signal averaging. ExpA-sampling suppresses generated noise and thus prevents noise from biasing the evaluation of image quality. Through extensive experiments on four diverse multi-modal datasets - comprising multi-contrast brain MRI and pelvic MRI-CT - we show that diffusion and regression sampling yield similar results in practice. As such, the computational overhead of diffusion sampling does not provide systematic benefits in medical information translation. Building on these insights, we demonstrate that YODA outperforms several state-of-the-art GAN and DM methods. Notably, YODA-generated images are shown to be interchangeable with, or even superior to, physical acquisitions for several downstream tasks. Our findings challenge the presumed advantages of DMs in MIT and pave the way for the practical application of MIT in medical imaging.

Title: Transforming faces into video stories -- VideoFace2.0

Authors: Branko Brkljač, Vladimir Kalušev, Branislav Popović, Milan Sečujski
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02060
Pdf URL: https://arxiv.org/pdf/2505.02060
Copy Paste: [[2505.02060]] Transforming faces into video stories -- VideoFace2.0(https://arxiv.org/abs/2505.02060)
Keywords: robust
Abstract: Face detection and face recognition have been a focus of the vision community since its very beginnings. Inspired by the success of the original Videoface digitizer, a pioneering device that allowed users to capture video signals from any source, we have designed an advanced video analytics tool to efficiently create structured video stories, i.e. identity-based information catalogs. VideoFace2.0 is the name of the developed system for spatial and temporal localization of each unique face in the input video, i.e. face re-identification (ReID), which also allows their cataloging, characterization and creation of structured video outputs for later downstream tasks. The developed near real-time solution is primarily designed for application scenarios involving TV production and media analysis, and as an efficient tool for creating the large video datasets necessary for training machine learning (ML) models in challenging vision tasks such as lip reading and multimodal speech recognition. The conducted experiments confirm the applicability of the proposed face ReID algorithm, which combines the concepts of face detection, face recognition and passive tracking-by-detection in order to achieve robust and efficient face ReID. The system is envisioned as a compact and modular extension of existing video production equipment. We hope that the presented work and shared code will stimulate further interest in the development of similar, application-specific video analysis tools, and lower the entry barrier for production of high-quality multi-modal ML datasets in the future.

Title: RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video

Authors: Shuhang Xun, Sicheng Tao, Jungang Li, Yibo Shi, Zhixin Lin, Zhanhui Zhu, Yibo Yan, Hanqian Li, Linghao Zhang, Shikang Wang, Yixin Liu, Hanbo Zhang, Xuming Hu, Ying Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02064
Pdf URL: https://arxiv.org/pdf/2505.02064
Copy Paste: [[2505.02064]] RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video(https://arxiv.org/abs/2505.02064)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) increasingly excel at perception, understanding, and reasoning. However, current benchmarks inadequately evaluate their ability to perform these tasks continuously in dynamic, real-world environments. To bridge this gap, we introduce RTV-Bench, a fine-grained benchmark for MLLM real-time video analysis. RTV-Bench uses three key principles: (1) Multi-Timestamp Question Answering (MTQA), where answers evolve with scene changes; (2) Hierarchical Question Structure, combining basic and advanced queries; and (3) Multi-dimensional Evaluation, assessing the ability of continuous perception, understanding, and reasoning. RTV-Bench contains 552 diverse videos (167.2 hours) and 4,631 high-quality QA pairs. We evaluated leading MLLMs, including proprietary (GPT-4o, Gemini 2.0), open-source offline (Qwen2.5-VL, VideoLLaMA3), and open-source real-time (VITA-1.5, InternLM-XComposer2.5-OmniLive) models. Experimental results show that open-source real-time models largely outperform offline ones but still trail top proprietary models. Our analysis also reveals that larger model size or higher frame sampling rates do not significantly boost RTV-Bench performance, sometimes causing slight decreases. This underscores the need for better model architectures optimized for video stream processing and long sequences to advance real-time video analysis with MLLMs. Our benchmark toolkit is available at: this https URL.

Title: Hierarchical Compact Clustering Attention (COCA) for Unsupervised Object-Centric Learning

Authors: Can Küçüksözen, Yücel Yemez
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.02071
Pdf URL: https://arxiv.org/pdf/2505.02071
Copy Paste: [[2505.02071]] Hierarchical Compact Clustering Attention (COCA) for Unsupervised Object-Centric Learning(https://arxiv.org/abs/2505.02071)
Keywords: segmentation
Abstract: We propose the Compact Clustering Attention (COCA) layer, an effective building block that introduces a hierarchical strategy for object-centric representation learning, while solving the unsupervised object discovery task on single images. COCA is an attention-based clustering module capable of extracting object-centric representations from multi-object scenes, when cascaded into a bottom-up hierarchical network architecture, referred to as COCA-Net. At its core, COCA utilizes a novel clustering algorithm that leverages the physical concept of compactness, to highlight distinct object centroids in a scene, providing a spatial inductive bias. Thanks to this strategy, COCA-Net generates high-quality segmentation masks on both the decoder side and, notably, the encoder side of its pipeline. Additionally, COCA-Net is not bound by a predetermined number of object masks that it generates and handles the segmentation of background elements better than its competitors. We demonstrate COCA-Net's segmentation performance on six widely adopted datasets, achieving superior or competitive results against the state-of-the-art models across nine different evaluation metrics.

Title: Lightweight Defense Against Adversarial Attacks in Time Series Classification

Authors: Yi Han (Independent Researcher, Australia)
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02073
Pdf URL: https://arxiv.org/pdf/2505.02073
Copy Paste: [[2505.02073]] Lightweight Defense Against Adversarial Attacks in Time Series Classification(https://arxiv.org/abs/2505.02073)
Keywords: defense, attack, robust
Abstract: As time series classification (TSC) gains prominence, ensuring robust TSC models against adversarial attacks is crucial. While adversarial defense is well-studied in Computer Vision (CV), the TSC field has primarily relied on adversarial training (AT), which is computationally expensive. In this paper, five data augmentation-based defense methods tailored for time series are developed, with the most computationally intensive method among them increasing the computational resources by only 14.07% compared to the original TSC model. Moreover, the deployment process for these methods is straightforward. By leveraging these advantages of our methods, we create two combined methods. One of these methods is an ensemble of all the proposed techniques, which not only provides better defense performance than PGD-based AT but also enhances the generalization ability of TSC models. Moreover, the computational resources required for our ensemble are less than one-third of those required for PGD-based AT. These methods advance robust TSC in data mining. Furthermore, as foundation models are increasingly explored for time series feature learning, our work provides insights into integrating data augmentation-based adversarial defense with large-scale pre-trained models in future research.

Title: Learning Local Causal World Models with State Space Models and Attention

Authors: Francesco Petri, Luigi Asprino, Aldo Gangemi
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.02074
Pdf URL: https://arxiv.org/pdf/2505.02074
Copy Paste: [[2505.02074]] Learning Local Causal World Models with State Space Models and Attention(https://arxiv.org/abs/2505.02074)
Keywords: transformer
Abstract: World modelling, i.e. building a representation of the rules that govern the world so as to predict its evolution, is an essential ability for any agent interacting with the physical world. Despite their impressive performance, many solutions fail to learn a causal representation of the environment they are trying to model, which would be necessary to gain a deep enough understanding of the world to perform complex tasks. With this work, we aim to broaden the research at the intersection of causality theory and neural world modelling by assessing the potential for causal discovery of the State Space Model (SSM) architecture, which has been shown to have several advantages over the widespread Transformer. We show empirically that, compared to an equivalent Transformer, an SSM can model the dynamics of a simple environment and learn a causal model at the same time with equivalent or better performance, thus paving the way for further experiments that lean into the strength of SSMs and further enhance them with causal awareness.

Title: Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation

Authors: Volodymyr Havrylov, Haiwen Huang, Dan Zhang, Andreas Geiger
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.02075
Pdf URL: https://arxiv.org/pdf/2505.02075
Copy Paste: [[2505.02075]] Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation(https://arxiv.org/abs/2505.02075)
Keywords: segmentation
Abstract: Vision Foundation Models (VFMs) are large-scale, pre-trained models that serve as general-purpose backbones for various computer vision tasks. As VFMs' popularity grows, there is an increasing interest in understanding their effectiveness for dense prediction tasks. However, VFMs typically produce low-resolution features, limiting their direct applicability in this context. One way to tackle this limitation is by employing a task-agnostic feature upsampling module that refines the resolution of VFM features. To assess the effectiveness of this approach, we investigate Interactive Segmentation (IS) as a novel benchmark for evaluating feature upsampling methods on VFMs. Due to its inherent multimodal input, consisting of an image and a set of user-defined clicks, as well as its dense mask output, IS creates a challenging environment that demands comprehensive visual scene understanding. Our benchmarking experiments show that selecting appropriate upsampling strategies significantly improves the quality of VFM features. The code is released at this https URL

Title: Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents

Authors: Christian Schroeder de Witt
Subjects: cs.CR, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2505.02077
Pdf URL: https://arxiv.org/pdf/2505.02077
Copy Paste: [[2505.02077]] Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents(https://arxiv.org/abs/2505.02077)
Keywords: secure, security, privacy, defense, attack, steal
Abstract: Decentralized AI agents will soon interact across internet platforms, creating security challenges beyond traditional cybersecurity and AI safety frameworks. Free-form protocols are essential for AI's task generalization but enable new threats like secret collusion and coordinated swarm attacks. Network effects can rapidly spread privacy breaches, disinformation, jailbreaks, and data poisoning, while multi-agent dispersion and stealth optimization help adversaries evade oversight, creating novel persistent threats at a systemic level. Despite their critical importance, these security challenges remain understudied, with research fragmented across disparate fields including AI security, multi-agent learning, complex systems, cybersecurity, game theory, distributed systems, and technical AI governance. We introduce multi-agent security, a new field dedicated to securing networks of decentralized AI agents against threats that emerge or amplify through their interactions, whether direct or indirect via shared environments, with each other, humans, and institutions, and characterize fundamental security-performance trade-offs. Our preliminary work (1) taxonomizes the threat landscape arising from interacting AI agents, (2) surveys security-performance trade-offs in decentralized AI systems, and (3) proposes a unified research agenda addressing open challenges in designing secure agent systems and interaction environments. By identifying these gaps, we aim to guide research in this critical area to unlock the socioeconomic potential of large-scale agent deployment on the internet, foster public trust, and mitigate national security risks in critical infrastructure and defense contexts.

Title: LecEval: An Automated Metric for Multimodal Knowledge Acquisition in Multimedia Learning

Authors: Joy Lim Jia Yin, Daniel Zhang-Li, Jifan Yu, Haoxuan Li, Shangqing Tu, Yuanchun Wang, Zhiyuan Liu, Huiqin Liu, Lei Hou, Juanzi Li, Bin Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02078
Pdf URL: https://arxiv.org/pdf/2505.02078
Copy Paste: [[2505.02078]] LecEval: An Automated Metric for Multimodal Knowledge Acquisition in Multimedia Learning(https://arxiv.org/abs/2505.02078)
Keywords: large language model
Abstract: Evaluating the quality of slide-based multimedia instruction is challenging. Existing methods like manual assessment, reference-based metrics, and large language model evaluators face limitations in scalability, context capture, or bias. In this paper, we introduce LecEval, an automated metric grounded in Mayer's Cognitive Theory of Multimedia Learning, to evaluate multimodal knowledge acquisition in slide-based learning. LecEval assesses effectiveness using four rubrics: Content Relevance (CR), Expressive Clarity (EC), Logical Structure (LS), and Audience Engagement (AE). We curate a large-scale dataset of over 2,000 slides from more than 50 online course videos, annotated with fine-grained human ratings across these rubrics. A model trained on this dataset demonstrates superior accuracy and adaptability compared to existing metrics, bridging the gap between automated and human assessments. We release our dataset and toolkits at this https URL.

Title: LLM-OptiRA: LLM-Driven Optimization of Resource Allocation for Non-Convex Problems in Wireless Communications

Authors: Xinyue Peng, Yanming Liu, Yihan Cang, Chaoqun Cao, Ming Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.02091
Pdf URL: https://arxiv.org/pdf/2505.02091
Copy Paste: [[2505.02091]] LLM-OptiRA: LLM-Driven Optimization of Resource Allocation for Non-Convex Problems in Wireless Communications(https://arxiv.org/abs/2505.02091)
Keywords: robust, large language model
Abstract: Solving non-convex resource allocation problems poses significant challenges in wireless communication systems, often beyond the capability of traditional optimization techniques. To address this issue, we propose LLM-OptiRA, the first framework that leverages large language models (LLMs) to automatically detect and transform non-convex components into solvable forms, enabling fully automated resolution of non-convex resource allocation problems in wireless communication systems. LLM-OptiRA not only simplifies problem-solving by reducing reliance on expert knowledge, but also integrates error correction and feasibility validation mechanisms to ensure robustness. Experimental results show that LLM-OptiRA achieves an execution rate of 96% and a success rate of 80% on GPT-4, significantly outperforming baseline approaches in complex optimization tasks across diverse scenarios.

Title: SkillMimic-V2: Learning Robust and Generalizable Interaction Skills from Sparse and Noisy Demonstrations

Authors: Runyi Yu, Yinhuai Wang, Qihan Zhao, Hok Wai Tsui, Jingbo Wang, Ping Tan, Qifeng Chen
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.02094
Pdf URL: https://arxiv.org/pdf/2505.02094
Copy Paste: [[2505.02094]] SkillMimic-V2: Learning Robust and Generalizable Interaction Skills from Sparse and Noisy Demonstrations(https://arxiv.org/abs/2505.02094)
Keywords: robust
Abstract: We address a fundamental challenge in Reinforcement Learning from Interaction Demonstration (RLID): demonstration noise and coverage limitations. While existing data collection approaches provide valuable interaction demonstrations, they often yield sparse, disconnected, and noisy trajectories that fail to capture the full spectrum of possible skill variations and transitions. Our key insight is that despite noisy and sparse demonstrations, there exist infinitely many physically feasible trajectories that naturally bridge between demonstrated skills or emerge from their neighboring states, forming a continuous space of possible skill variations and transitions. Building upon this insight, we present two data augmentation techniques: a Stitched Trajectory Graph (STG) that discovers potential transitions between demonstration skills, and a State Transition Field (STF) that establishes unique connections for arbitrary states within the demonstration neighborhood. To enable effective RLID with augmented data, we develop an Adaptive Trajectory Sampling (ATS) strategy for dynamic curriculum generation and a historical encoding mechanism for memory-dependent skill learning. Our approach enables robust skill acquisition that significantly generalizes beyond the reference demonstrations. Extensive experiments across diverse interaction tasks demonstrate substantial improvements over state-of-the-art methods in terms of convergence stability, generalization capability, and recovery robustness.

Title: Deep Representation Learning for Electronic Design Automation

Authors: Pratik Shrestha, Saran Phatharodom, Alec Aversa, David Blankenship, Zhengfeng Wu, Ioannis Savidis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.02105
Pdf URL: https://arxiv.org/pdf/2505.02105
Copy Paste: [[2505.02105]] Deep Representation Learning for Electronic Design Automation(https://arxiv.org/abs/2505.02105)
Keywords: extraction
Abstract: Representation learning has become an effective technique utilized by electronic design automation (EDA) algorithms, which leverage the natural representation of workflow elements as images, grids, and graphs. By addressing challenges related to the increasing complexity of circuits and stringent power, performance, and area (PPA) requirements, representation learning facilitates the automatic extraction of meaningful features from complex data formats, including images, grids, and graphs. This paper examines the application of representation learning in EDA, covering foundational concepts and analyzing prior work and case studies on tasks that include timing prediction, routability analysis, and automated placement. Key techniques, including image-based methods, graph-based approaches, and hybrid multimodal solutions, are presented to illustrate the improvements provided in routing, timing, and parasitic prediction. The provided advancements demonstrate the potential of representation learning to enhance efficiency, accuracy, and scalability in current integrated circuit design flows.

Title: GRAIL: Graph Edit Distance and Node Alignment Using LLM-Generated Code

Authors: Samidha Verma, Arushi Goyal, Ananya Mathur, Ankit Anand, Sayan Ranu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.02124
Pdf URL: https://arxiv.org/pdf/2505.02124
Copy Paste: [[2505.02124]] GRAIL: Graph Edit Distance and Node Alignment Using LLM-Generated Code(https://arxiv.org/abs/2505.02124)
Keywords: robust, interpretability, large language model
Abstract: Graph Edit Distance (GED) is a widely used metric for measuring similarity between two graphs. Computing the optimal GED is NP-hard, leading to the development of various neural and non-neural heuristics. While neural methods have achieved improved approximation quality compared to non-neural approaches, they face significant challenges: (1) They require large amounts of ground truth data, which is itself NP-hard to compute. (2) They operate as black boxes, offering limited interpretability. (3) They lack cross-domain generalization, necessitating expensive retraining for each new dataset. We address these limitations with GRAIL, introducing a paradigm shift in this domain. Instead of training a neural model to predict GED, GRAIL employs a novel combination of large language models (LLMs) and automated prompt tuning to generate a program that is used to compute GED. This shift from predicting GED to generating programs imparts various advantages, including end-to-end interpretability and an autonomous self-evolutionary learning mechanism without ground-truth supervision. Extensive experiments on seven datasets confirm that GRAIL not only surpasses state-of-the-art GED approximation methods in prediction quality but also achieves robust cross-domain generalization across diverse graph distributions.

Title: Efficient Multivariate Time Series Forecasting via Calibrated Language Models with Privileged Knowledge Distillation

Authors: Chenxi Liu, Shaowen Zhou, Hao Miao, Qianxiong Xu, Cheng Long, Ziyue Li, Rui Zhao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.02138
Pdf URL: https://arxiv.org/pdf/2505.02138
Copy Paste: [[2505.02138]] Efficient Multivariate Time Series Forecasting via Calibrated Language Models with Privileged Knowledge Distillation(https://arxiv.org/abs/2505.02138)
Keywords: large language model
Abstract: Multivariate time series forecasting (MTSF) endeavors to predict future observations given historical data, playing a crucial role in time series data management systems. With advancements in large language models (LLMs), recent studies employ textual prompt tuning to infuse the knowledge of LLMs into MTSF. However, the deployment of LLMs often suffers from low efficiency during the inference phase. To address this problem, we introduce TimeKD, an efficient MTSF framework that leverages calibrated language models and privileged knowledge distillation. TimeKD aims to generate high-quality future representations from the proposed cross-modality teacher model and cultivate an effective student model. The cross-modality teacher model adopts calibrated language models (CLMs) with ground truth prompts, motivated by the paradigm of Learning Under Privileged Information (LUPI). In addition, we design a subtractive cross attention (SCA) mechanism to refine these representations. To cultivate an effective student model, we propose an innovative privileged knowledge distillation (PKD) mechanism including correlation and feature distillation. PKD enables the student to replicate the teacher's behavior while minimizing their output discrepancy. Extensive experiments on real data offer insight into the effectiveness, efficiency, and scalability of the proposed TimeKD.

Title: Exploring the Potential of Offline RL for Reasoning in LLMs: A Preliminary Study

Authors: Xiaoyu Tian, Sitong Zhao, Haotian Wang, Shuaiting Chen, Yiping Peng, Yunjie Ji, Han Zhao, Xiangang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.02142
Pdf URL: https://arxiv.org/pdf/2505.02142
Copy Paste: [[2505.02142]] Exploring the Potential of Offline RL for Reasoning in LLMs: A Preliminary Study(https://arxiv.org/abs/2505.02142)
Keywords: large language model
Abstract: Despite significant advances in long-context reasoning by large language models (LLMs), primarily through Online Reinforcement Learning (RL) methods, these approaches incur substantial computational costs and complexity. In contrast, simpler and more economical Offline RL methods remain underexplored. To address this gap, we investigate the effectiveness of Offline RL methods, specifically Direct Preference Optimization (DPO) and its length-desensitized variant LD-DPO, in enhancing the reasoning capabilities of LLMs. Extensive experiments across multiple reasoning benchmarks demonstrate that these simpler Offline RL methods substantially improve model performance, achieving an average enhancement of 3.3%, with a particularly notable increase of 10.1% on the challenging Arena-Hard benchmark. Furthermore, we analyze DPO's sensitivity to output length, emphasizing that increasing reasoning length should align with semantic richness, as indiscriminate lengthening may adversely affect model performance. We provide comprehensive descriptions of our data processing and training methodologies, offering empirical evidence and practical insights for developing more cost-effective Offline RL approaches.
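
As a reminder of what the DPO objective named above computes, here is a minimal sketch of the standard per-pair DPO loss for one preference pair. The function name, argument names, and the default `beta=0.1` are assumptions for illustration; the paper's actual training setup and the LD-DPO variant are not shown here.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    the policy or the frozen reference model. beta=0.1 is an assumed default.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); branch for stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A policy identical to the reference gives a margin of zero.
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
```

Because the log-probabilities are summed over tokens, longer responses contribute larger magnitudes to the margin, which is one way to see why output length can influence this loss.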

Title: QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach

Authors: Shouyang Dong, Yuanbo Wen, Jun Bi, Di Huang, Jiaming Guo, Jianxing Xu, Ruibai Xu, Xinkai Song, Yifan Hao, Xuehai Zhou, Tianshi Chen, Qi Guo, Yunji Chen
Subjects: cs.CL, cs.LG, cs.PL
Abstract URL: https://arxiv.org/abs/2505.02146
Pdf URL: https://arxiv.org/pdf/2505.02146
Copy Paste: [[2505.02146]] QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach(https://arxiv.org/abs/2505.02146)
Keywords: large language model
Abstract: Heterogeneous deep learning systems (DLS) such as GPUs and ASICs have been widely deployed in industrial data centers, which requires developing multiple low-level tensor programs for different platforms. An attractive solution to relieve the programming burden is to transcompile the legacy code of one platform to others. However, current transcompilation techniques struggle with either tremendous manual efforts or functional incorrectness, rendering "Write Once, Run Anywhere" of tensor programs an open question. We propose a novel transcompiler, i.e., QiMeng-Xpiler, for automatically translating tensor programs across DLS via both large language models (LLMs) and symbolic program synthesis, i.e., neural-symbolic synthesis. The key insight is leveraging the powerful code generation ability of LLMs to make costly search-based symbolic synthesis computationally tractable. Concretely, we propose multiple LLM-assisted compilation passes via pre-defined meta-prompts for program transformation. During each program transformation, efficient symbolic program synthesis is employed to repair incorrect code snippets with a limited scale. To attain high performance, we propose a hierarchical auto-tuning approach to systematically explore both the parameters and sequences of transformation passes. Experiments on 4 DLS with distinct programming interfaces, i.e., Intel DL Boost with VNNI, NVIDIA GPU with CUDA, AMD MI with HIP, and Cambricon MLU with BANG, demonstrate that QiMeng-Xpiler correctly translates different tensor programs with 95% accuracy on average, and the performance of translated programs achieves up to 2.0x over vendor-provided manually-optimized libraries. As a result, the programming productivity of DLS is improved by up to 96.0x via transcompiling legacy tensor programs.

Title: Local Herb Identification Using Transfer Learning: A CNN-Powered Mobile Application for Nepalese Flora

Authors: Prajwal Thapa, Mridul Sharma, Jinu Nyachhyon, Yagya Raj Pandeya
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.02147
Pdf URL: https://arxiv.org/pdf/2505.02147
Copy Paste: [[2505.02147]] Local Herb Identification Using Transfer Learning: A CNN-Powered Mobile Application for Nepalese Flora(https://arxiv.org/abs/2505.02147)
Keywords: robust, transformer
Abstract: Herb classification presents a critical challenge in botanical research, particularly in regions with rich biodiversity such as Nepal. This study introduces a novel deep learning approach for classifying 60 different herb species using Convolutional Neural Networks (CNNs) and transfer learning techniques. Using a manually curated dataset of 12,000 herb images, we developed a robust machine learning model that addresses existing limitations in herb recognition methodologies. Our research employed multiple model architectures, including DenseNet121, 50-layer Residual Network (ResNet50), 16-layer Visual Geometry Group Network (VGG16), InceptionV3, EfficientNetV2, and Vision Transformer (ViT), with DenseNet121 ultimately demonstrating superior performance. Data augmentation and regularization techniques were applied to mitigate overfitting and enhance the generalizability of the model. This work advances herb classification techniques, preserving traditional botanical knowledge and promoting sustainable herb utilization.

Title: Spotting the Unexpected (STU): A 3D LiDAR Dataset for Anomaly Segmentation in Autonomous Driving

Authors: Alexey Nekrasov, Malcolm Burdorf, Stewart Worrall, Bastian Leibe, Julie Stephany Berrio Perez
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02148
Pdf URL: https://arxiv.org/pdf/2505.02148
Copy Paste: [[2505.02148]] Spotting the Unexpected (STU): A 3D LiDAR Dataset for Anomaly Segmentation in Autonomous Driving(https://arxiv.org/abs/2505.02148)
Keywords: segmentation
Abstract: To operate safely, autonomous vehicles (AVs) need to detect and handle unexpected objects or anomalies on the road. While significant research exists for anomaly detection and segmentation in 2D, research in 3D remains underexplored. Existing datasets lack the high-quality multimodal data typically found in AVs. This paper presents a novel dataset for anomaly segmentation in driving scenarios. To the best of our knowledge, it is the first publicly available dataset focused on road anomaly segmentation with dense 3D semantic labeling, incorporating both LiDAR and camera data, as well as sequential information to enable anomaly detection across various ranges. This capability is critical for the safe navigation of autonomous vehicles. We adapted and evaluated several baseline models for 3D segmentation, highlighting the challenges of 3D anomaly detection in driving environments. Our dataset and evaluation code will be openly available, facilitating the testing and performance comparison of different approaches.

Title: Small Clips, Big Gains: Learning Long-Range Refocused Temporal Information for Video Super-Resolution

Authors: Xingyu Zhou, Wei Long, Jingbo Lu, Shiyin Jiang, Weiyi You, Haifeng Wu, Shuhang Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02159
Pdf URL: https://arxiv.org/pdf/2505.02159
Copy Paste: [[2505.02159]] Small Clips, Big Gains: Learning Long-Range Refocused Temporal Information for Video Super-Resolution(https://arxiv.org/abs/2505.02159)
Keywords: transformer
Abstract: Video super-resolution (VSR) can achieve better performance compared to single image super-resolution by additionally leveraging temporal information. In particular, the recurrent-based VSR model exploits long-range temporal information during inference and achieves superior detail restoration. However, effectively learning these long-term dependencies within long videos remains a key challenge. To address this, we propose LRTI-VSR, a novel training framework for recurrent VSR that efficiently leverages Long-Range Refocused Temporal Information. Our framework includes a generic training strategy that utilizes temporal propagation features from long video clips while training on shorter video clips. Additionally, we introduce a refocused intra&inter-frame transformer block which allows the VSR model to selectively prioritize useful temporal information through its attention module while further improving inter-frame information utilization in the FFN module. We evaluate LRTI-VSR on both CNN and transformer-based VSR architectures, conducting extensive ablation studies to validate the contribution of each component. Experiments on long-video test sets demonstrate that LRTI-VSR achieves state-of-the-art performance while maintaining training and computational efficiency.

Title: Focus What Matters: Matchability-Based Reweighting for Local Feature Matching

Authors: Dongyue Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02161
Pdf URL: https://arxiv.org/pdf/2505.02161
Copy Paste: [[2505.02161]] Focus What Matters: Matchability-Based Reweighting for Local Feature Matching(https://arxiv.org/abs/2505.02161)
Keywords: transformer
Abstract: Since the rise of Transformers, many semi-dense matching methods have adopted attention mechanisms to extract feature descriptors. However, the attention weights, which capture dependencies between pixels or keypoints, are often learned from scratch. This approach can introduce redundancy and noisy interactions from irrelevant regions, as it treats all pixels or keypoints equally. Drawing inspiration from keypoint selection processes, we propose to first classify all pixels into two categories: matchable and non-matchable. Matchable pixels are expected to receive higher attention weights, while non-matchable ones are down-weighted. In this work, we propose a novel attention reweighting mechanism that simultaneously incorporates a learnable bias term into the attention logits and applies a matchability-informed rescaling to the input value features. The bias term, injected prior to the softmax operation, selectively adjusts attention scores based on the confidence of query-key interactions. Concurrently, the feature rescaling acts post-attention by modulating the influence of each value vector in the final output. This dual design allows the attention mechanism to dynamically adjust both its internal weighting scheme and the magnitude of its output representations. Extensive experiments conducted on three benchmark datasets validate the effectiveness of our method, consistently outperforming existing state-of-the-art approaches.
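
The dual design described above (a bias added to the attention logits before the softmax, plus a matchability-based rescaling of the value features) can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the bias is passed in as a fixed matrix rather than learned, matchability is a per-key score in [0, 1], and the names `reweighted_attention`, `bias`, and `matchability` are hypothetical. The paper's actual parameterization is not given in the abstract.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def reweighted_attention(queries, keys, values, bias, matchability):
    """Single-head attention with a pre-softmax bias and value rescaling.

    bias[i][j] is added to the logit of query i and key j (learned in the
    paper; a plain input in this sketch). matchability[j] scales value j.
    """
    dim = len(keys[0])
    # Rescale every value vector by its matchability score.
    scaled_values = [[m * x for x in v] for v, m in zip(values, matchability)]
    outputs = []
    for i, q in enumerate(queries):
        # Scaled dot-product logits plus the additive bias term.
        logits = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) + bias[i][j]
            for j, k in enumerate(keys)
        ]
        weights = softmax(logits)
        # Weighted sum of the rescaled value vectors.
        out = [
            sum(w * v[c] for w, v in zip(weights, scaled_values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0]]
# Key 1 is treated as non-matchable: its value contributes nothing.
print(reweighted_attention(q, k, v, [[0.0, 0.0]], [1.0, 0.0]))
```

With a zero bias and all matchability scores equal to 1, the function reduces to ordinary scaled dot-product attention, which makes the two adjustments easy to inspect in isolation.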

Title: Incorporating Legal Structure in Retrieval-Augmented Generation: A Case Study on Copyright Fair Use

Authors: Justin Ho, Alexandra Colby, William Fisher
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.02164
Pdf URL: https://arxiv.org/pdf/2505.02164
Copy Paste: [[2505.02164]] Incorporating Legal Structure in Retrieval-Augmented Generation: A Case Study on Copyright Fair Use(https://arxiv.org/abs/2505.02164)
Keywords: fair
Abstract: This paper presents a domain-specific implementation of Retrieval-Augmented Generation (RAG) tailored to the Fair Use Doctrine in U.S. copyright law. Motivated by the increasing prevalence of DMCA takedowns and the lack of accessible legal support for content creators, we propose a structured approach that combines semantic search with legal knowledge graphs and court citation networks to improve retrieval quality and reasoning reliability. Our prototype models legal precedents at the statutory factor level (e.g., purpose, nature, amount, market effect) and incorporates citation-weighted graph representations to prioritize doctrinally authoritative sources. We use Chain-of-Thought reasoning and interleaved retrieval steps to better emulate legal reasoning. Preliminary testing suggests this method improves doctrinal relevance in the retrieval process, laying groundwork for future evaluation and deployment of LLM-based legal assistance tools.

Title: A New HOPE: Domain-agnostic Automatic Evaluation of Text Chunking

Authors: Henrik Brådland, Morten Goodwin, Per-Arne Andersen, Alexander S. Nossum, Aditya Gupta
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02171
Pdf URL: https://arxiv.org/pdf/2505.02171
Copy Paste: [[2505.02171]] A New HOPE: Domain-agnostic Automatic Evaluation of Text Chunking(https://arxiv.org/abs/2505.02171)
Keywords: large language model
Abstract: Document chunking fundamentally impacts Retrieval-Augmented Generation (RAG) by determining how source materials are segmented before indexing. Despite evidence that Large Language Models (LLMs) are sensitive to the layout and structure of retrieved data, there is currently no framework to analyze the impact of different chunking methods. In this paper, we introduce a novel methodology that defines essential characteristics of the chunking process at three levels: intrinsic passage properties, extrinsic passage properties, and passages-document coherence. We propose HOPE (Holistic Passage Evaluation), a domain-agnostic, automatic evaluation metric that quantifies and aggregates these characteristics. Our empirical evaluations across seven domains demonstrate that the HOPE metric correlates significantly (p > 0.13) with various RAG performance indicators, revealing contrasts between the importance of extrinsic and intrinsic properties of passages. Semantic independence between passages proves essential for system performance with a performance gain of up to 56.2% in factual correctness and 21.1% in answer correctness. On the contrary, traditional assumptions about maintaining concept unity within passages show minimal impact. These findings provide actionable insights for optimizing chunking strategies, thus improving RAG system design to produce more factually correct responses.

Title: Identifying Legal Holdings with LLMs: A Systematic Study of Performance, Scale, and Memorization

Authors: Chuck Arvin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.02172
Pdf URL: https://arxiv.org/pdf/2505.02172
Copy Paste: [[2505.02172]] Identifying Legal Holdings with LLMs: A Systematic Study of Performance, Scale, and Memorization(https://arxiv.org/abs/2505.02172)
Keywords: large language model
Abstract: As large language models (LLMs) continue to advance in capabilities, it is essential to assess how they perform on established benchmarks. In this study, we present a suite of experiments to assess the performance of modern LLMs (ranging from 3B to 90B+ parameters) on CaseHOLD, a legal benchmark dataset for identifying case holdings. Our experiments demonstrate "scaling effects" - performance on this task improves with model size, with more capable models like GPT4o and AmazonNovaPro achieving macro F1 scores of 0.744 and 0.720 respectively. These scores are competitive with the best published results on this dataset, and do not require any technically sophisticated model training, fine-tuning or few-shot prompting. To ensure that these strong results are not due to memorization of judicial opinions contained in the training data, we develop and utilize a novel citation anonymization test that preserves semantic meaning while ensuring case names and citations are fictitious. Models maintain strong performance under these conditions (macro F1 of 0.728), suggesting the performance is not due to rote memorization. These findings demonstrate both the promise and current limitations of LLMs for legal tasks with important implications for the development and measurement of automated legal analytics and legal benchmarks.
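
Since the results above are reported as macro F1, here is a minimal sketch of how that metric is computed: per-class F1 scores are averaged with equal weight, regardless of class frequency. The function name `macro_f1` and the toy labels are illustrative assumptions and are unrelated to the CaseHOLD data or the paper's evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); the guard covers an empty denominator.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: class 0 has F1 = 2/3, class 1 has F1 = 4/5.
print(macro_f1([0, 0, 1, 1], [0, 1, 1, 1]))
```

Because every class counts equally, a model that neglects a rare class is penalized more heavily than it would be under plain accuracy.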

Title: Saliency-Guided Training for Fingerprint Presentation Attack Detection

Authors: Samuel Webster, Adam Czajka
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02176
Pdf URL: https://arxiv.org/pdf/2505.02176
Copy Paste: [[2505.02176]] Saliency-Guided Training for Fingerprint Presentation Attack Detection(https://arxiv.org/abs/2505.02176)
Keywords: attack, biometric
Abstract: Saliency-guided training, which directs model learning to important regions of images, has demonstrated generalization improvements across various biometric presentation attack detection (PAD) tasks. This paper presents its first application to fingerprint PAD. We conducted a 50-participant study to create a dataset of 800 human-annotated fingerprint perceptually-important maps, explored alongside algorithmically-generated "pseudosaliency," including minutiae-based, image quality-based, and autoencoder-based saliency maps. Evaluating on the 2021 Fingerprint Liveness Detection Competition testing set, we explore various configurations within five distinct training scenarios to assess the impact of saliency-guided training on accuracy and generalization. Our findings demonstrate the effectiveness of saliency-guided training for fingerprint PAD in both limited and large data contexts, and we present a configuration capable of earning the first place on the LivDet-2021 benchmark. Our results highlight saliency-guided training's promise for increased model generalization capabilities, its effectiveness when data is limited, and its potential to scale to larger datasets in fingerprint PAD. All collected saliency data and trained models are released with the paper to support reproducible research.

Title: Measuring Hong Kong Massive Multi-Task Language Understanding

Authors: Chuxue Cao, Zhenghao Zhu, Junqi Zhu, Guoying Lu, Siyu Peng, Juntao Dai, Weijie Shi, Sirui Han, Yike Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.02177
Pdf URL: https://arxiv.org/pdf/2505.02177
Copy Paste: [[2505.02177]] Measuring Hong Kong Massive Multi-Task Language Understanding(https://arxiv.org/abs/2505.02177)
Keywords: large language model
Abstract: Multilingual understanding is crucial for the cross-cultural applicability of Large Language Models (LLMs). However, evaluation benchmarks designed for Hong Kong's unique linguistic landscape, which combines Traditional Chinese script with Cantonese as the spoken form and its cultural context, remain underdeveloped. To address this gap, we introduce HKMMLU, a multi-task language understanding benchmark that evaluates Hong Kong's linguistic competence and socio-cultural knowledge. The HKMMLU includes 26,698 multi-choice questions across 66 subjects, organized into four categories: Science, Technology, Engineering, and Mathematics (STEM), Social Sciences, Humanities, and Other. To evaluate the multilingual understanding ability of LLMs, 90,550 Mandarin-Cantonese translation tasks were additionally included. We conduct comprehensive experiments on GPT-4o, Claude 3.7 Sonnet, and 18 open-source LLMs of varying sizes on HKMMLU. The results show that the best-performing model, DeepSeek-V3, struggles to achieve an accuracy of 75%, significantly lower than that of MMLU and CMMLU. This performance gap highlights the need to improve LLMs' capabilities in Hong Kong-specific language and knowledge domains. Furthermore, we investigate how question language, model size, prompting strategies, and question and reasoning token lengths affect model performance. We anticipate that HKMMLU will significantly advance the development of LLMs in multilingual and cross-cultural contexts, thereby enabling broader and more impactful applications.

Title: ProDisc-VAD: An Efficient System for Weakly-Supervised Anomaly Detection in Video Surveillance Applications

Authors: Tao Zhu, Qi Yu, Xinru Dong, Shiyu Li, Yue Liu, Jinlong Jiang, Lei Shu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02179
Pdf URL: https://arxiv.org/pdf/2505.02179
Copy Paste: [[2505.02179]] ProDisc-VAD: An Efficient System for Weakly-Supervised Anomaly Detection in Video Surveillance Applications(https://arxiv.org/abs/2505.02179)
Keywords: robust
Abstract: Weakly-supervised video anomaly detection (WS-VAD) using Multiple Instance Learning (MIL) suffers from label ambiguity, hindering discriminative feature learning. We propose ProDisc-VAD, an efficient framework tackling this via two synergistic components. The Prototype Interaction Layer (PIL) provides controlled normality modeling using a small set of learnable prototypes, establishing a robust baseline without being overwhelmed by dominant normal data. The Pseudo-Instance Discriminative Enhancement (PIDE) loss boosts separability by applying targeted contrastive learning exclusively to the most reliable extreme-scoring instances (highest/lowest scores). ProDisc-VAD achieves strong AUCs (97.98% ShanghaiTech, 87.12% UCF-Crime) using only 0.4M parameters, over 800x fewer than recent ViT-based methods like VadCLIP, demonstrating exceptional efficiency alongside state-of-the-art performance. Code is available at this https URL.

Title: Robust AI-Generated Face Detection with Imbalanced Data

Authors: Yamini Sri Krubha, Aryana Hou, Braden Vester, Web Walker, Xin Wang, Li Lin, Shu Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02182
Pdf URL: https://arxiv.org/pdf/2505.02182
Copy Paste: [[2505.02182]] Robust AI-Generated Face Detection with Imbalanced Data(https://arxiv.org/abs/2505.02182)
Keywords: robust, transformer, generative
Abstract: Deepfakes, created using advanced AI techniques such as Variational Autoencoder and Generative Adversarial Networks, have evolved from research and entertainment applications into tools for malicious activities, posing significant threats to digital trust. Current deepfake detection techniques have evolved from CNN-based methods focused on local artifacts to more advanced approaches using vision transformers and multimodal models like CLIP, which capture global anomalies and improve cross-domain generalization. Despite recent progress, state-of-the-art deepfake detectors still face major challenges in handling distribution shifts from emerging generative models and addressing severe class imbalance between authentic and fake samples in deepfake datasets, which limits their robustness and detection accuracy. To address these challenges, we propose a framework that combines dynamic loss reweighting and ranking-based optimization, which achieves superior generalization and performance under imbalanced dataset conditions. The code is available at this https URL.

Title: DualReal: Adaptive Joint Training for Lossless Identity-Motion Fusion in Video Customization

Authors: Wenchuan Wang, Mengqi Huang, Yijing Tu, Zhendong Mao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02192
Pdf URL: https://arxiv.org/pdf/2505.02192
Copy Paste: [[2505.02192]] DualReal: Adaptive Joint Training for Lossless Identity-Motion Fusion in Video Customization(https://arxiv.org/abs/2505.02192)
Keywords: diffusion, transformer
Abstract: Customized text-to-video generation with pre-trained large-scale models has recently garnered significant attention through focusing on identity and motion consistency. Existing works typically follow the isolated customized paradigm, where the subject identity or motion dynamics are customized exclusively. However, this paradigm completely ignores the intrinsic mutual constraints and synergistic interdependencies between identity and motion, resulting in identity-motion conflicts that systematically degrade the generation process. To address this, we introduce DualReal, a novel framework that employs adaptive joint training to collaboratively construct interdependencies between dimensions. Specifically, DualReal is composed of two units: (1) Dual-aware Adaptation dynamically selects a training phase (i.e., identity or motion), learns the current information guided by the frozen dimension prior, and employs a regularization strategy to avoid knowledge leakage; (2) StageBlender Controller leverages the denoising stages and Diffusion Transformer depths to guide different dimensions with adaptive granularity, avoiding conflicts at various stages and ultimately achieving lossless fusion of identity and motion patterns. We constructed a more comprehensive benchmark than existing methods. The experimental results show that DualReal improves CLIP-I and DINO-I metrics by 21.7% and 31.8% on average, and achieves top performance on nearly all motion quality metrics.

Title: DNAZEN: Enhanced Gene Sequence Representations via Mixed Granularities of Coding Units

Authors: Lei Mao, Yuanhe Tian, Yan Song
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.02206
Pdf URL: https://arxiv.org/pdf/2505.02206
Copy Paste: [[2505.02206]] DNAZEN: Enhanced Gene Sequence Representations via Mixed Granularities of Coding Units(https://arxiv.org/abs/2505.02206)
Keywords: transformer
Abstract: Genome modeling conventionally treats gene sequence as a language, reflecting its structured motifs and long-range dependencies analogous to linguistic units and organization principles such as words and syntax. Recent studies utilize advanced neural networks, ranging from convolutional and recurrent models to Transformer-based models, to capture contextual information of gene sequence, with the primary goal of obtaining effective gene sequence representations and thus enhancing the models' understanding of various running gene samples. However, these approaches often directly apply language modeling techniques to gene sequences and do not fully consider the intrinsic information organization in them, in particular how units at different granularities contribute to representation. In this paper, we propose DNAZEN, an enhanced genomic representation framework designed to learn from various granularities in gene sequences, including small polymers and G-grams that are combinations of several contiguous polymers. Specifically, we extract the G-grams from large-scale genomic corpora through an unsupervised approach to construct the G-gram vocabulary, which is used to provide G-grams in the learning process of DNA sequences through dynamically matching from running gene samples. A Transformer-based G-gram encoder is also proposed and the matched G-grams are fed into it to compute their representations and integrated into the encoder for basic unit (E4BU), which is responsible for encoding small units and maintaining the learning and inference process. To further enhance the learning process, we propose whole G-gram masking to train DNAZEN, where the model largely favors the selection of each entire G-gram to mask rather than an ordinary masking mechanism performed on basic units. Experiments on benchmark datasets demonstrate the effectiveness of DNAZEN on various downstream tasks.

Title: An Empirical Study of Qwen3 Quantization

Authors: Xingyu Zheng, Yuye Li, Haoran Chu, Yue Feng, Xudong Ma, Jie Luo, Jinyang Guo, Haotong Qin, Michele Magno, Xianglong Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.02214
Pdf URL: https://arxiv.org/pdf/2505.02214
Copy Paste: [[2505.02214]] An Empirical Study of Qwen3 Quantization(https://arxiv.org/abs/2505.02214)
Keywords: robust, large language model
Abstract: The Qwen series has emerged as a leading family of open-source Large Language Models (LLMs), demonstrating remarkable capabilities in natural language understanding tasks. With the recent release of Qwen3, which exhibits superior performance across diverse benchmarks, there is growing interest in deploying these models efficiently in resource-constrained environments. Low-bit quantization presents a promising solution, yet its impact on Qwen3's performance remains underexplored. This study conducts a systematic evaluation of Qwen3's robustness under various quantization settings, aiming to uncover both opportunities and challenges in compressing this state-of-the-art model. We rigorously assess 5 existing classic post-training quantization techniques applied to Qwen3, spanning bit-widths from 1 to 8 bits, and evaluate their effectiveness across multiple datasets. Our findings reveal that while Qwen3 maintains competitive performance at moderate bit-widths, it experiences notable degradation in linguistic tasks under ultra-low precision, underscoring the persistent hurdles in LLM compression. These results emphasize the need for further research to mitigate performance loss in extreme quantization scenarios. We anticipate that this empirical analysis will provide actionable insights for advancing quantization methods tailored to Qwen3 and future LLMs, ultimately enhancing their practicality without compromising accuracy. Our project is released on this https URL and this https URL.

Title: Enhanced Outsourced and Secure Inference for Tall Sparse Decision Trees

Authors: Andrew Quijano, Spyros T. Halkidis, Kevin Gallagher, Kemal Akkaya, Nikolaos Samaras
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.02224
Pdf URL: https://arxiv.org/pdf/2505.02224
Copy Paste: [[2505.02224]] Enhanced Outsourced and Secure Inference for Tall Sparse Decision Trees(https://arxiv.org/abs/2505.02224)
Keywords: secure, security, privacy, attack, steal
Abstract: A decision tree is an easy-to-understand tool that has been widely used for classification tasks. On the one hand, due to privacy concerns, there has been an urgent need to create privacy-preserving classifiers that conceal the user's input from the classifier. On the other hand, with the rise of cloud computing, data owners are keen to reduce risk by outsourcing their model, but want security guarantees that third parties cannot steal their decision tree model. To address these issues, Joye and Salehi introduced a theoretical protocol that efficiently evaluates decision trees while maintaining privacy by leveraging their comparison protocol that is resistant to timing attacks. However, their approach was not only inefficient but also prone to side-channel attacks. Therefore, in this paper, we propose a new decision tree inference protocol in which the model is shared and evaluated among multiple entities. We partition our decision tree model by each level to be stored in a new entity we refer to as a "level-site." Utilizing this approach, we were able to gain improved average run time for classifier evaluation for a non-complete tree, while also having strong mitigations against side-channel attacks.

Title: Risk Assessment and Threat Modeling for safe autonomous driving technology

Authors: Ian Alexis Wong Paz, Anuvinda Balan, Sebastian Campos, Ehud Orenstain, Sudip Dhakal
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.02231
Pdf URL: https://arxiv.org/pdf/2505.02231
Copy Paste: [[2505.02231]] Risk Assessment and Threat Modeling for safe autonomous driving technology(https://arxiv.org/abs/2505.02231)
Keywords: security
Abstract: This research paper delves into the field of autonomous vehicle technology, examining the vulnerabilities inherent in each component of these transformative vehicles. Autonomous vehicles (AVs) are revolutionizing transportation by seamlessly integrating advanced functionalities such as sensing, perception, planning, decision-making, and control. However, their reliance on interconnected systems and external communication interfaces renders them susceptible to cybersecurity threats. This research endeavors to develop a comprehensive threat model for AV systems, employing OWASP Threat Dragon and the STRIDE framework. This model categorizes threats into Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service (DoS), and Elevation of Privilege. A systematic risk assessment is conducted to evaluate vulnerabilities across various AV components, including perception modules, planning systems, control units, and communication interfaces.

Title: SEval-Ex: A Statement-Level Framework for Explainable Summarization Evaluation

Authors: Tanguy Herserant, Vincent Guigue
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02235
Pdf URL: https://arxiv.org/pdf/2505.02235
Copy Paste: [[2505.02235]] SEval-Ex: A Statement-Level Framework for Explainable Summarization Evaluation(https://arxiv.org/abs/2505.02235)
Keywords: robust, interpretability, explainability
Abstract: Evaluating text summarization quality remains a critical challenge in Natural Language Processing. Current approaches face a trade-off between performance and interpretability. We present SEval-Ex, a framework that bridges this gap by decomposing summarization evaluation into atomic statements, enabling both high performance and explainability. SEval-Ex employs a two-stage pipeline: first extracting atomic statements from the source text and the summary using an LLM, then matching the generated statements. Unlike existing approaches that provide only summary-level scores, our method generates detailed evidence for its decisions through statement-level alignments. Experiments on the SummEval benchmark demonstrate that SEval-Ex achieves state-of-the-art performance, reaching a 0.580 correlation with human consistency judgments and surpassing GPT-4 based evaluators (0.521) while maintaining interpretability. Finally, our framework shows robustness against hallucination.

Title: Improving Physical Object State Representation in Text-to-Image Generative Systems

Authors: Tianle Chen, Chaitanya Chakka, Deepti Ghadiyaram
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02236
Pdf URL: https://arxiv.org/pdf/2505.02236
Copy Paste: [[2505.02236]] Improving Physical Object State Representation in Text-to-Image Generative Systems(https://arxiv.org/abs/2505.02236)
Keywords: generative
Abstract: Current text-to-image generative models struggle to accurately represent object states (e.g., "a table without a bottle," "an empty tumbler"). In this work, we first design a fully-automatic pipeline to generate high-quality synthetic data that accurately captures objects in varied states. Next, we fine-tune several open-source text-to-image models on this synthetic data. We evaluate the performance of the fine-tuned models by quantifying the alignment of the generated images to their prompts using GPT4o-mini, and achieve an average absolute improvement of 8+% across four models on the public GenAI-Bench dataset. We also curate a collection of 200 prompts with a specific focus on common objects in various physical states. We demonstrate a significant improvement of an average of 24+% over the baseline on this dataset. We release all evaluation prompts and code.

Title: Federated Causal Inference in Healthcare: Methods, Challenges, and Applications

Authors: Haoyang Li, Jie Xu, Kyra Gan, Fei Wang, Chengxi Zang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.02238
Pdf URL: https://arxiv.org/pdf/2505.02238
Copy Paste: [[2505.02238]] Federated Causal Inference in Healthcare: Methods, Challenges, and Applications(https://arxiv.org/abs/2505.02238)
Keywords: privacy, federate, fair
Abstract: Federated causal inference enables multi-site treatment effect estimation without sharing individual-level data, offering a privacy-preserving solution for real-world evidence generation. However, data heterogeneity across sites, manifested in differences in covariate, treatment, and outcome, poses significant challenges for unbiased and efficient estimation. In this paper, we present a comprehensive review and theoretical analysis of federated causal effect estimation across both binary/continuous and time-to-event outcomes. We classify existing methods into weight-based strategies and optimization-based frameworks and further discuss extensions including personalized models, peer-to-peer communication, and model decomposition. For time-to-event outcomes, we examine federated Cox and Aalen-Johansen models, deriving asymptotic bias and variance under heterogeneity. Our analysis reveals that FedProx-style regularization achieves near-optimal bias-variance trade-offs compared to naive averaging and meta-analysis. We review related software tools and conclude by outlining opportunities, challenges, and future directions for scalable, fair, and trustworthy federated causal inference in distributed healthcare systems.

Title: Performance Analysis and Deployment Considerations of Post-Quantum Cryptography for Consumer Electronics

Authors: Daniel Commey, Benjamin Appiah, Griffith S. Klogo, Winful Bagyl-Bac, James D. Gadze
Subjects: cs.CR, cs.PF
Abstract URL: https://arxiv.org/abs/2505.02239
Pdf URL: https://arxiv.org/pdf/2505.02239
Copy Paste: [[2505.02239]] Performance Analysis and Deployment Considerations of Post-Quantum Cryptography for Consumer Electronics(https://arxiv.org/abs/2505.02239)
Keywords: security
Abstract: Quantum computing threatens the security foundations of consumer electronics (CE). Preparing the diverse CE ecosystem, particularly resource-constrained devices, for the post-quantum era requires quantitative understanding of post-quantum cryptography (PQC) performance. This paper presents a comprehensive cross-platform performance analysis of leading PQC Key Encapsulation Mechanisms (KEMs) and digital signatures (NIST standards/candidates) compared against classical RSA/ECC. We evaluated execution time, communication costs (key/signature sizes), and memory footprint indicators on high-performance (macOS/M4, Ubuntu/x86) and constrained platforms (Raspberry Pi 4/ARM). Our quantitative results reveal lattice-based schemes, notably NIST standards ML-KEM (Kyber) and ML-DSA (Dilithium), provide a strong balance of computational efficiency and moderate communication/storage overhead, making them highly suitable for many CE applications. In contrast, code-based Classic McEliece imposes significant key size challenges, while hash-based SPHINCS+ offers high security assurance but demands large signature sizes impacting bandwidth and storage. Based on empirical data across platforms and security levels, we provide specific deployment recommendations tailored to different CE scenarios (e.g., wearables, smart home hubs, mobile devices), offering guidance for manufacturers navigating the PQC transition.

Title: Quantizing Diffusion Models from a Sampling-Aware Perspective

Authors: Qian Zeng, Jie Song, Yuanyu Wan, Huiqiong Wang, Mingli Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02242
Pdf URL: https://arxiv.org/pdf/2505.02242
Copy Paste: [[2505.02242]] Quantizing Diffusion Models from a Sampling-Aware Perspective(https://arxiv.org/abs/2505.02242)
Keywords: diffusion
Abstract: Diffusion models have recently emerged as the dominant approach in visual generation tasks. However, the lengthy denoising chains and the computationally intensive noise estimation networks hinder their applicability in low-latency and resource-limited environments. Previous research has endeavored to address these limitations in a decoupled manner, utilizing either advanced samplers or efficient model quantization techniques. In this study, we uncover that quantization-induced noise disrupts directional estimation at each sampling step, further distorting the precise directional estimations of higher-order samplers when solving the sampling equations through discretized numerical methods, thereby altering the optimal sampling trajectory. To attain dual acceleration with high fidelity, we propose a sampling-aware quantization strategy, wherein a Mixed-Order Trajectory Alignment technique is devised to impose a more stringent constraint on the error bounds at each sampling step, facilitating a more linear probability flow. Extensive experiments on sparse-step fast sampling across multiple datasets demonstrate that our approach preserves the rapid convergence characteristics of high-speed samplers while maintaining superior generation quality. Code will be made publicly available soon.

Title: RISE: Radius of Influence based Subgraph Extraction for 3D Molecular Graph Explanation

Authors: Jingxiang Qu, Wenhan Gao, Jiaxing Zhang, Xufeng Liu, Hua Wei, Haibin Ling, Yi Liu
Subjects: cs.LG, cs.AI, q-bio.QM
Abstract URL: https://arxiv.org/abs/2505.02247
Pdf URL: https://arxiv.org/pdf/2505.02247
Copy Paste: [[2505.02247]] RISE: Radius of Influence based Subgraph Extraction for 3D Molecular Graph Explanation(https://arxiv.org/abs/2505.02247)
Keywords: extraction, interpretability
Abstract: 3D Geometric Graph Neural Networks (GNNs) have emerged as transformative tools for modeling molecular data. Despite their predictive power, these models often suffer from limited interpretability, raising concerns for scientific applications that require reliable and transparent insights. While existing methods have primarily focused on explaining molecular substructures in 2D GNNs, the transition to 3D GNNs introduces unique challenges, such as handling the implicit dense edge structures created by a cut-off radius. To tackle this, we introduce a novel explanation method specifically designed for 3D GNNs, which localizes the explanation to the immediate neighborhood of each node within the 3D space. Each node is assigned a radius of influence, defining the localized region within which message passing captures spatial and structural interactions crucial for the model's predictions. This method leverages the spatial and geometric characteristics inherent in 3D graphs. By constraining the subgraph to a localized radius of influence, the approach not only enhances interpretability but also aligns with the physical and structural dependencies typical of 3D graph applications, such as molecular learning.

Title: Personalisation or Prejudice? Addressing Geographic Bias in Hate Speech Detection using Debias Tuning in Large Language Models

Authors: Paloma Piot, Patricia Martín-Rodilla, Javier Parapar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.02252
Pdf URL: https://arxiv.org/pdf/2505.02252
Copy Paste: [[2505.02252]] Personalisation or Prejudice? Addressing Geographic Bias in Hate Speech Detection using Debias Tuning in Large Language Models(https://arxiv.org/abs/2505.02252)
Keywords: large language model
Abstract: Commercial Large Language Models (LLMs) have recently incorporated memory features to deliver personalised responses. This memory retains details such as user demographics and individual characteristics, allowing LLMs to adjust their behaviour based on personal information. However, the impact of integrating personalised information into the context has not been thoroughly assessed, leading to questions about its influence on LLM behaviour. Personalisation can be challenging, particularly with sensitive topics. In this paper, we examine various state-of-the-art LLMs to understand their behaviour in different personalisation scenarios, specifically focusing on hate speech. We prompt the models to assume country-specific personas and use different languages for hate speech detection. Our findings reveal that context personalisation significantly influences LLMs' responses in this sensitive area. To mitigate these unwanted biases, we fine-tune the LLMs by penalising inconsistent hate speech classifications made with and without country or language-specific context. The refined models demonstrate improved performance in both personalised contexts and when no context is provided.

Title: Enhancing AI Face Realism: Cost-Efficient Quality Improvement in Distilled Diffusion Models with a Fully Synthetic Dataset

Authors: Jakub Wąsala, Bartłomiej Wrzalski, Kornelia Noculak, Yuliia Tarasenko, Oliwer Krupa, Jan Kocoń, Grzegorz Chodak
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02255
Pdf URL: https://arxiv.org/pdf/2505.02255
Copy Paste: [[2505.02255]] Enhancing AI Face Realism: Cost-Efficient Quality Improvement in Distilled Diffusion Models with a Fully Synthetic Dataset(https://arxiv.org/abs/2505.02255)
Keywords: diffusion, generative
Abstract: This study presents a novel approach to enhance the cost-to-quality ratio of image generation with diffusion models. We hypothesize that differences between distilled (e.g. FLUX.1-schnell) and baseline (e.g. FLUX.1-dev) models are consistent and, therefore, learnable within a specialized domain, like portrait generation. We generate a synthetic paired dataset and train a fast image-to-image translation head. Using two sets of low- and high-quality synthetic images, our model is trained to refine the output of a distilled generator (e.g., FLUX.1-schnell) to a level comparable to a baseline model like FLUX.1-dev, which is more computationally intensive. Our results show that the pipeline, which combines a distilled version of a large generative model with our enhancement layer, delivers similar photorealistic portraits to the baseline version with up to an 82% decrease in computational cost compared to FLUX.1-dev. This study demonstrates the potential for improving the efficiency of AI solutions involving large-scale image generation.

Title: Parameter-Efficient Transformer Embeddings

Authors: Henry Ndubuaku, Mouad Talhi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.02266
Pdf URL: https://arxiv.org/pdf/2505.02266
Copy Paste: [[2505.02266]] Parameter-Efficient Transformer Embeddings(https://arxiv.org/abs/2505.02266)
Keywords: transformer
Abstract: Embedding layers in transformer-based NLP models typically account for the largest share of model parameters, scaling with vocabulary size but not yielding performance gains proportional to scale. We propose an alternative approach in which token embedding vectors are first generated deterministically, directly from the token IDs using a Fourier expansion of their normalized values, followed by a lightweight multilayer perceptron (MLP) that captures higher-order interactions. We train standard transformers and our architecture on natural language inference tasks (SNLI and MNLI), and evaluate zero-shot performance on semantic textual similarity (STS-B). Our results demonstrate that the proposed method achieves competitive performance using significantly fewer parameters, trains faster, and operates effectively without the need for dropout. This proof-of-concept study highlights the potential for scalable, memory-efficient language models and motivates further large-scale experimentation based on our findings.
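
Illustrative sketch (not from the paper): the abstract describes embeddings computed deterministically from token IDs via a Fourier expansion of their normalized values, followed by a small MLP. The Python below shows one way that idea could look. The normalization range [-1, 1], the integer frequencies, the sin/cos feature layout, the ReLU MLP, and the names `fourier_embedding` and `mlp` are all assumptions made for illustration; the authors' actual construction may differ.

```python
import math


def fourier_embedding(token_id, vocab_size, dim):
    """Return a deterministic embedding built from Fourier features of the ID."""
    if dim % 2 != 0:
        raise ValueError("dim must be even so features come in sin/cos pairs")
    if vocab_size < 2:
        raise ValueError("vocab_size must be at least 2")
    # Map the token ID to [-1, 1] (this normalization range is an assumption).
    x = 2.0 * token_id / (vocab_size - 1) - 1.0
    features = []
    for k in range(1, dim // 2 + 1):
        # Integer frequencies k = 1, 2, ... are an assumption of this sketch.
        features.append(math.sin(k * math.pi * x))
        features.append(math.cos(k * math.pi * x))
    return features


def mlp(vector, w1, b1, w2, b2):
    """Tiny two-layer perceptron with ReLU, standing in for the learned part."""
    hidden = [
        max(0.0, sum(w * v for w, v in zip(row, vector)) + b)
        for row, b in zip(w1, b1)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]


# Usage with hypothetical sizes: 8-dimensional embedding, 4 hidden units.
emb = fourier_embedding(token_id=5, vocab_size=100, dim=8)
w1 = [[0.1] * 8 for _ in range(4)]
b1 = [0.0] * 4
w2 = [[0.5] * 4 for _ in range(8)]
b2 = [0.0] * 8
out = mlp(emb, w1, b1, w2, b2)
```

In this sketch a lookup table would store vocab_size x dim learned values, whereas here only the MLP weights are learned, so the learned parameter count does not grow with the vocabulary.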

Title: Demystifying optimized prompts in language models

Authors: Rimon Melamed, Lucas H. McCabe, H. Howie Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.02273
Pdf URL: https://arxiv.org/pdf/2505.02273
Copy Paste: [[2505.02273]] Demystifying optimized prompts in language models(https://arxiv.org/abs/2505.02273)
Keywords: robust
Abstract: Modern language models (LMs) are not robust to out-of-distribution inputs. Machine-generated ("optimized") prompts can be used to modulate LM outputs and induce specific behaviors while appearing completely uninterpretable. In this work, we investigate the composition of optimized prompts, as well as the mechanisms by which LMs parse and build predictions from optimized prompts. We find that optimized prompts primarily consist of punctuation and noun tokens that are rarer in the training data. Internally, optimized prompts are clearly distinguishable from natural language counterparts based on sparse subsets of the model's activations. Across various families of instruction-tuned models, optimized prompts follow a similar path in how their representations form through the network.

Title: Epistemic Wrapping for Uncertainty Quantification

Authors: Maryam Sultana, Neil Yorke-Smith, Kaizheng Wang, Shireen Kudukkil Manchingal, Muhammad Mubashar, Fabio Cuzzolin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.02277
Pdf URL: https://arxiv.org/pdf/2505.02277
Copy Paste: [[2505.02277]] Epistemic Wrapping for Uncertainty Quantification(https://arxiv.org/abs/2505.02277)
Keywords: robust
Abstract: Uncertainty estimation is pivotal in machine learning, especially for classification tasks, as it improves the robustness and reliability of models. We introduce a novel 'Epistemic Wrapping' methodology aimed at improving uncertainty estimation in classification. Our approach uses Bayesian Neural Networks (BNNs) as a baseline and transforms their outputs into belief function posteriors, effectively capturing epistemic uncertainty and offering an efficient and general methodology for uncertainty quantification. Comprehensive experiments employing a Bayesian Neural Network (BNN) baseline and an Interval Neural Network for inference on the MNIST, Fashion-MNIST, CIFAR-10 and CIFAR-100 datasets demonstrate that our Epistemic Wrapper significantly enhances generalisation and uncertainty quantification.

Title: Entropy-Guided Sampling of Flat Modes in Discrete Spaces

Authors: Pinaki Mohanty, Riddhiman Bhattacharya, Ruqi Zhang
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.02296
Pdf URL: https://arxiv.org/pdf/2505.02296
Copy Paste: [[2505.02296]] Entropy-Guided Sampling of Flat Modes in Discrete Spaces(https://arxiv.org/abs/2505.02296)
Keywords: robust, generative
Abstract: Sampling from flat modes in discrete spaces is a crucial yet underexplored problem. Flat modes represent robust solutions and have broad applications in combinatorial optimization and discrete generative modeling. However, existing sampling algorithms often overlook the mode volume and struggle to capture flat modes effectively. To address this limitation, we propose Entropic Discrete Langevin Proposal (EDLP), which incorporates local entropy into the sampling process through a continuous auxiliary variable under a joint distribution. The local entropy term guides the discrete sampler toward flat modes with a small overhead. We provide non-asymptotic convergence guarantees for EDLP in locally log-concave discrete distributions. Empirically, our method consistently outperforms traditional approaches across tasks that require sampling from flat basins, including Bernoulli distributions, restricted Boltzmann machines, combinatorial optimization, and binary neural networks.

Title: Adaptive Scoring and Thresholding with Human Feedback for Robust Out-of-Distribution Detection

Authors: Daisuke Yamada, Harit Vishwakarma, Ramya Korlakai Vinayak
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02299
Pdf URL: https://arxiv.org/pdf/2505.02299
Copy Paste: [[2505.02299]] Adaptive Scoring and Thresholding with Human Feedback for Robust Out-of-Distribution Detection(https://arxiv.org/abs/2505.02299)
Keywords: robust
Abstract: Machine Learning (ML) models are trained on in-distribution (ID) data but often encounter out-of-distribution (OOD) inputs during deployment -- posing serious risks in safety-critical domains. Recent works have focused on designing scoring functions to quantify OOD uncertainty, with score thresholds typically set based solely on ID data to achieve a target true positive rate (TPR), since OOD data is limited before deployment. However, these TPR-based thresholds leave false positive rates (FPR) uncontrolled, often resulting in high FPRs where OOD points are misclassified as ID. Moreover, fixed scoring functions and thresholds lack the adaptivity needed to handle newly observed, evolving OOD inputs, leading to sub-optimal performance. To address these challenges, we propose a human-in-the-loop framework that safely updates both scoring functions and thresholds on the fly based on real-world OOD inputs. Our method maximizes TPR while strictly controlling FPR at all times, even as the system adapts over time. We provide theoretical guarantees for FPR control under stationary conditions and present extensive empirical evaluations on OpenOOD benchmarks to demonstrate that our approach outperforms existing methods by achieving higher TPRs while maintaining FPR control.
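
Illustrative sketch (not from the paper): the abstract notes that a threshold chosen on ID data alone to hit a target TPR leaves the FPR uncontrolled. The Python below shows that baseline behaviour on made-up scores. It assumes higher scores mean "more ID-like" and that "positive" means "predicted ID"; the function names and all score values are hypothetical, and this is the fixed-threshold baseline the abstract criticizes, not the proposed adaptive method.

```python
import math


def threshold_for_tpr(id_scores, target_tpr):
    """Largest threshold at which at least target_tpr of ID scores are accepted."""
    ranked = sorted(id_scores, reverse=True)
    # Number of ID points that must score at or above the threshold.
    # The small epsilon guards against floating-point round-up in the product.
    k = math.ceil(target_tpr * len(ranked) - 1e-9)
    k = max(1, min(k, len(ranked)))
    return ranked[k - 1]


def accepted_rate(scores, threshold):
    """Fraction of inputs predicted as ID (score >= threshold)."""
    return sum(s >= threshold for s in scores) / len(scores)


# Hypothetical scores; higher means "more ID-like" in this sketch.
id_scores = [0.9, 0.8, 0.85, 0.7, 0.95, 0.6, 0.75, 0.88, 0.92, 0.65]
ood_scores = [0.3, 0.7, 0.66, 0.5, 0.9]

threshold = threshold_for_tpr(id_scores, target_tpr=0.9)
tpr = accepted_rate(id_scores, threshold)   # ID inputs correctly kept
fpr = accepted_rate(ood_scores, threshold)  # OOD inputs mistaken for ID
print(threshold, tpr, fpr)
```

With these made-up numbers the threshold meets the 90% TPR target, yet most of the OOD points also pass it, which is the failure mode that motivates controlling FPR directly.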

Title: Generative Sign-description Prompts with Multi-positive Contrastive Learning for Sign Language Recognition

Authors: Siyu Liang, Yunan Li, Wentian Xin, Huizhou Chen, Xujie Liu, Kang Liu, Qiguang Miao
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.02304
Pdf URL: https://arxiv.org/pdf/2505.02304
Copy Paste: [[2505.02304]] Generative Sign-description Prompts with Multi-positive Contrastive Learning for Sign Language Recognition(https://arxiv.org/abs/2505.02304)
Keywords: robust, generative, large language model
Abstract: Sign language recognition (SLR) faces fundamental challenges in creating accurate annotations due to the inherent complexity of simultaneous manual and non-manual signals. To the best of our knowledge, this is the first work to integrate generative large language models (LLMs) into SLR tasks. We propose a novel Generative Sign-description Prompts Multi-positive Contrastive learning (GSP-MC) method that leverages retrieval-augmented generation (RAG) with domain-specific LLMs, incorporating multi-step prompt engineering and expert-validated sign language corpora to produce precise multipart descriptions. The GSP-MC method also employs a dual-encoder architecture to bidirectionally align hierarchical skeleton features with multiple text descriptions (global, synonym, and part level) through probabilistic matching. Our approach combines global and part-level losses, optimizing KL divergence to ensure robust alignment across all relevant text-skeleton pairs while capturing both sign-level semantics and detailed part dynamics. Experiments demonstrate state-of-the-art performance against existing methods on the Chinese SLR500 (reaching 97.1%) and Turkish AUTSL datasets (97.07% accuracy). The method's cross-lingual effectiveness highlights its potential for developing inclusive communication technologies.

Title: Optimizing LLMs for Resource-Constrained Environments: A Survey of Model Compression Techniques

Authors: Sanjay Surendranath Girija, Shashank Kapoor, Lakshit Arora, Dipen Pradhan, Aman Raj, Ankit Shetgaonkar
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.02309
Pdf URL: https://arxiv.org/pdf/2505.02309
Copy Paste: [[2505.02309]] Optimizing LLMs for Resource-Constrained Environments: A Survey of Model Compression Techniques(https://arxiv.org/abs/2505.02309)
Keywords: large language model
Abstract: Large Language Models (LLMs) have revolutionized many areas of artificial intelligence (AI), but their substantial resource requirements limit their deployment on mobile and edge devices. This survey paper provides a comprehensive overview of techniques for compressing LLMs to enable efficient inference in resource-constrained environments. We examine three primary approaches: Knowledge Distillation, Model Quantization, and Model Pruning. For each technique, we discuss the underlying principles, present different variants, and provide examples of successful applications. We also briefly discuss complementary techniques such as mixture-of-experts and early-exit strategies. Finally, we highlight promising future directions, aiming to provide a valuable resource for both researchers and practitioners seeking to optimize LLMs for edge deployment.

Title: Invoke Interfaces Only When Needed: Adaptive Invocation for Large Language Models in Question Answering

Authors: Jihao Zhao, Chunlai Zhou, Biao Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.02311
Pdf URL: https://arxiv.org/pdf/2505.02311
Copy Paste: [[2505.02311]] Invoke Interfaces Only When Needed: Adaptive Invocation for Large Language Models in Question Answering(https://arxiv.org/abs/2505.02311)
Keywords: transformer, large language model
Abstract: The collaborative paradigm of large and small language models (LMs) effectively balances performance and cost, yet its pivotal challenge lies in precisely pinpointing the moment of invocation when hallucinations arise in small LMs. Previous optimization efforts primarily focused on post-processing techniques, which were separate from the reasoning process of LMs, resulting in high computational costs and limited effectiveness. In this paper, we propose a practical invocation evaluation metric called AttenHScore, which calculates the accumulation and propagation of hallucinations during the generation process of small LMs, continuously amplifying potential reasoning errors. By dynamically adjusting the detection threshold, we achieve more accurate real-time invocation of large LMs. Additionally, considering the limited reasoning capacity of small LMs, we leverage uncertainty-aware knowledge reorganization to help them better capture critical information from different text chunks. Extensive experiments reveal that our AttenHScore outperforms most baselines in enhancing real-time hallucination detection capabilities across multiple QA datasets, especially when addressing complex queries. Moreover, our strategies eliminate the need for additional model training and display flexibility in adapting to various transformer-based LMs.

Title: VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection

Authors: Hao Cheng, Zhiwei Zhao, Yichao He, Zhenzhen Hu, Jia Li, Meng Wang, Richang Hong
Subjects: cs.CV, cs.SD
Abstract URL: https://arxiv.org/abs/2505.02331
Pdf URL: https://arxiv.org/pdf/2505.02331
Copy Paste: [[2505.02331]] VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection(https://arxiv.org/abs/2505.02331)
Keywords: large language model
Abstract: Audiovisual emotion recognition (AVER) aims to infer human emotions from nonverbal visual-audio (VA) cues, offering modality-complementary and language-agnostic advantages. However, AVER remains challenging due to the inherent ambiguity of emotional expressions, cross-modal expressive disparities, and the scarcity of reliably annotated data. Recent self-supervised AVER approaches have introduced strong multimodal representations, yet they predominantly rely on modality-specific encoders and coarse content-level alignment, limiting fine-grained emotional semantic modeling. To address these issues, we propose VAEmo, an efficient two-stage framework for emotion-centric joint VA representation learning with external knowledge injection. In Stage 1, a unified and lightweight representation network is pre-trained on large-scale speaker-centric VA corpora via masked reconstruction and contrastive objectives, mitigating the modality gap and learning expressive, complementary representations without emotion labels. In Stage 2, multimodal large language models automatically generate detailed affective descriptions according to our well-designed chain-of-thought prompting for only a small subset of VA samples; these rich textual semantics are then injected by aligning their corresponding embeddings with VA representations through dual-path contrastive learning, further bridging the emotion gap. Extensive experiments on multiple downstream AVER benchmarks show that VAEmo achieves state-of-the-art performance with a compact design, highlighting the benefit of unified cross-modal encoding and emotion-aware semantic guidance for efficient, generalizable VA emotion representations.

Title: 6D Pose Estimation on Spoons and Hands

Authors: Kevin Tan, Fan Yang, Yuhao Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02335
Pdf URL: https://arxiv.org/pdf/2505.02335
Copy Paste: [[2505.02335]] 6D Pose Estimation on Spoons and Hands(https://arxiv.org/abs/2505.02335)
Keywords: segmentation
Abstract: Accurate dietary monitoring is essential for promoting healthier eating habits. A key area of research is how people interact with and consume food using utensils and hands. By tracking their position and orientation, it is possible to estimate the volume of food being consumed or to monitor eating behaviours, yielding highly useful insights into nutritional intake that can be more reliable than popular methods such as self-reporting. Hence, this paper implements a system that analyzes a stationary video feed of people eating, using 6D pose estimation to track hand and spoon movements to capture spatial position and orientation. In doing so, we examine the performance of two state-of-the-art (SOTA) video object segmentation (VOS) models, both quantitatively and qualitatively, and identify the main sources of error within the system.

Title: An End-to-End Model For Logits Based Large Language Models Watermarking

Authors: Kahim Wong, Jicheng Zhou, Jiantao Zhou, Yain-Whar Si
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.02344
Pdf URL: https://arxiv.org/pdf/2505.02344
Copy Paste: [[2505.02344]] An End-to-End Model For Logits Based Large Language Models Watermarking(https://arxiv.org/abs/2505.02344)
Keywords: protect, robust, watermark, large language model
Abstract: The rise of LLMs has increased concerns over source tracing and copyright protection for AIGC, highlighting the need for advanced detection technologies. Passive detection methods usually face high false positives, while active watermarking techniques using logits or sampling manipulation offer more effective protection. Existing LLM watermarking methods, though effective on unaltered content, suffer significant performance drops when the text is modified and could introduce biases that degrade LLM performance in downstream tasks. These methods fail to achieve an optimal tradeoff between text quality and robustness, particularly due to the lack of end-to-end optimization of the encoder and decoder. In this paper, we introduce a novel end-to-end logits perturbation method for watermarking LLM-generated text. Through joint optimization, our approach achieves a better balance between quality and robustness. To address non-differentiable operations in the end-to-end training pipeline, we introduce an online prompting technique that leverages the on-the-fly LLM as a differentiable surrogate. Our method achieves superior robustness, outperforming distortion-free methods by 37-39% under paraphrasing and 17.2% on average, while maintaining text quality on par with these distortion-free methods in terms of text perplexity and downstream tasks. Our method can be easily generalized to different LLMs.

Title: Catastrophic Overfitting, Entropy Gap and Participation Ratio: A Noiseless $l^p$ Norm Solution for Fast Adversarial Training

Authors: Fares B. Mehouachi, Saif Eddin Jabari
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02360
Pdf URL: https://arxiv.org/pdf/2505.02360
Copy Paste: [[2505.02360]] Catastrophic Overfitting, Entropy Gap and Participation Ratio: A Noiseless $l^p$ Norm Solution for Fast Adversarial Training(https://arxiv.org/abs/2505.02360)
Keywords: attack, robust
Abstract: Adversarial training is a cornerstone of robust deep learning, but fast methods like the Fast Gradient Sign Method (FGSM) often suffer from Catastrophic Overfitting (CO), where models become robust to single-step attacks but fail against multi-step variants. While existing solutions rely on noise injection, regularization, or gradient clipping, we propose a novel solution that purely controls the $l^p$ training norm to mitigate CO. Our study is motivated by the empirical observation that CO is more prevalent under the $l^{\infty}$ norm than the $l^2$ norm. Leveraging this insight, we develop a framework for generalized $l^p$ attack as a fixed point problem and craft $l^p$-FGSM attacks to understand the transition mechanics from $l^2$ to $l^{\infty}$. This leads to our core insight: CO emerges when highly concentrated gradients, where information localizes in few dimensions, interact with aggressive norm constraints. By quantifying gradient concentration through Participation Ratio and entropy measures, we develop an adaptive $l^p$-FGSM that automatically tunes the training norm based on gradient information. Extensive experiments demonstrate that this approach achieves strong robustness without requiring additional regularization or noise injection, providing a novel and theoretically principled pathway to mitigate the CO problem.
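
Illustrative sketch (not from the paper): the abstract quantifies gradient concentration with a Participation Ratio and entropy measures. The Python below uses one common definition of each, which is an assumption here since the abstract does not give formulas: the participation ratio is (sum of g_i^2)^2 / (sum of g_i^4), and the entropy is the Shannon entropy of the normalized absolute gradient entries. The function names are hypothetical, and the paper's exact definitions may differ.

```python
import math


def participation_ratio(grad):
    """(sum g_i^2)^2 / sum g_i^4: near 1 if concentrated, near len(grad) if spread."""
    squares = [g * g for g in grad]
    total = sum(squares)
    if total == 0.0:
        raise ValueError("gradient must have at least one non-zero entry")
    return total * total / sum(s * s for s in squares)


def gradient_entropy(grad):
    """Shannon entropy (in nats) of |g_i| normalized to sum to one."""
    magnitudes = [abs(g) for g in grad]
    total = sum(magnitudes)
    if total == 0.0:
        raise ValueError("gradient must have at least one non-zero entry")
    probs = [m / total for m in magnitudes if m > 0.0]
    return -sum(p * math.log(p) for p in probs)


# A gradient concentrated in one dimension versus one spread over all four.
concentrated = [1.0, 0.0, 0.0, 0.0]
spread = [1.0, 1.0, 1.0, 1.0]
print(participation_ratio(concentrated), gradient_entropy(concentrated))
print(participation_ratio(spread), gradient_entropy(spread))
```

Under these definitions both quantities are small for a gradient that lives in a few dimensions and large for one spread evenly, which is the kind of signal an adaptive choice of norm could key on.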

Title: Advancing Email Spam Detection: Leveraging Zero-Shot Learning and Large Language Models

Authors: Ghazaleh SHirvani, Saeid Ghasemshirazi
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.02362
Pdf URL: https://arxiv.org/pdf/2505.02362
Copy Paste: [[2505.02362]] Advancing Email Spam Detection: Leveraging Zero-Shot Learning and Large Language Models(https://arxiv.org/abs/2505.02362)
Keywords: security, robust, large language model
Abstract: Email spam detection is a critical task in modern communication systems, essential for maintaining productivity, security, and user experience. Traditional machine learning and deep learning approaches, while effective in static settings, face significant limitations in adapting to evolving spam tactics, addressing class imbalance, and managing data scarcity. These challenges necessitate innovative approaches that reduce dependency on extensive labeled datasets and frequent retraining. This study investigates the effectiveness of Zero-Shot Learning using FLAN-T5, combined with advanced Natural Language Processing (NLP) techniques such as BERT, for email spam detection. By employing BERT to preprocess and extract critical information from email content, and FLAN-T5 to classify emails in a Zero-Shot framework, the proposed approach aims to address the limitations of traditional spam detection systems. The integration of FLAN-T5 and BERT enables robust spam detection without relying on extensive labeled datasets or frequent retraining, making it highly adaptable to unseen spam patterns and adversarial environments. This research highlights the potential of leveraging zero-shot learning and NLP techniques for scalable and efficient spam detection, providing insights into their capability to address the dynamic and challenging nature of spam detection tasks.

Title: Sharpness-Aware Minimization with Z-Score Gradient Filtering for Neural Networks

Authors: Juyoung Yun
Subjects: cs.LG, cs.AI, cs.CV, cs.IT, cs.NE
Abstract URL: https://arxiv.org/abs/2505.02369
Pdf URL: https://arxiv.org/pdf/2505.02369
Copy Paste: [[2505.02369]] Sharpness-Aware Minimization with Z-Score Gradient Filtering for Neural Networks(https://arxiv.org/abs/2505.02369)
Keywords: robust, transformer
Abstract: Generalizing well in deep neural networks remains a core challenge, particularly due to their tendency to converge to sharp minima that degrade robustness. Sharpness-Aware Minimization (SAM) mitigates this by seeking flatter minima but perturbs parameters using the full gradient, which can include statistically insignificant directions. We propose ZSharp, a simple yet effective extension to SAM that applies layer-wise Z-score normalization followed by percentile-based filtering to retain only statistically significant gradient components. This selective perturbation aligns updates with curvature-sensitive directions, enhancing generalization without requiring architectural changes. ZSharp introduces only one additional hyperparameter, the percentile threshold, and remains fully compatible with existing SAM variants. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet using ResNet, VGG, and Vision Transformers show that ZSharp consistently outperforms SAM and its variants in test accuracy, particularly on deeper and transformer-based models. These results demonstrate that ZSharp is a principled and lightweight improvement for sharpness-aware optimization.
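
The filtering step described above (layer-wise Z-score normalization, then a percentile cut) can be sketched as follows. This is an illustration only: the function name, the nearest-rank percentile rule, and the choice to keep raw gradient values rather than normalized ones are assumptions, not details from the paper.

```python
import math
import statistics


def zscore_filter(layer_grad, percentile=70.0):
    """Zero out gradient components whose |z-score| falls below a percentile.

    layer_grad: list of floats for one layer (hypothetical flat layout).
    percentile: the single extra hyperparameter; higher keeps fewer entries.
    """
    mean = statistics.fmean(layer_grad)
    std = statistics.pstdev(layer_grad)
    if std == 0:
        return list(layer_grad)  # nothing to rank, keep everything
    z = [(g - mean) / std for g in layer_grad]
    ranked = sorted(abs(v) for v in z)
    # Nearest-rank percentile of |z| (assumed rule).
    k = max(0, min(len(ranked) - 1, math.ceil(percentile / 100 * len(ranked)) - 1))
    threshold = ranked[k]
    return [g if abs(v) >= threshold else 0.0 for g, v in zip(layer_grad, z)]


grad = [0.1, -0.2, 0.05, 3.0, -2.5, 0.0]
filtered = zscore_filter(grad, percentile=60.0)
```

In a SAM-style optimizer the filtered vector would take the place of the full gradient when computing the parameter perturbation; that wiring is omitted here.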

Title: EntroLLM: Entropy Encoded Weight Compression for Efficient Large Language Model Inference on Edge Devices

Authors: Arnab Sanyal, Prithwish Mukherjee, Gourav Datta, Sandeep P. Chinchali
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.02380
Pdf URL: https://arxiv.org/pdf/2505.02380
Copy Paste: [[2505.02380]] EntroLLM: Entropy Encoded Weight Compression for Efficient Large Language Model Inference on Edge Devices(https://arxiv.org/abs/2505.02380)
Keywords: large language model
Abstract: Large Language Models (LLMs) demonstrate exceptional performance across various tasks, but their large storage and computational requirements constrain their deployment on edge devices. To address this, we propose EntroLLM, a novel compression framework that integrates mixed quantization with entropy coding to reduce storage overhead while maintaining model accuracy. Our method applies a layer-wise mixed quantization scheme, choosing between symmetric and asymmetric quantization based on individual layer weight distributions, to optimize compressibility. We then employ Huffman encoding for lossless compression of the quantized weights, significantly reducing memory bandwidth requirements. Furthermore, we introduce parallel Huffman decoding, which enables efficient retrieval of encoded weights during inference, ensuring minimal latency impact. Our experiments on edge-compatible LLMs, including smolLM-1.7B-Instruct, phi3-mini-4k-Instruct, and mistral-7B-Instruct, demonstrate that EntroLLM achieves up to 30% storage reduction compared to uint8 models and up to 65% storage reduction compared to uint4 models, while preserving perplexity and accuracy on language benchmark tasks. We further show that our method enables 31.9% to 146.6% faster inference throughput on memory-bandwidth-limited edge devices, such as NVIDIA Jetson P3450, by reducing the required data movement. The proposed approach requires no additional re-training and is fully compatible with existing post-training quantization methods, making it a practical solution for edge LLMs.
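
To illustrate why entropy coding helps after quantization, here is a minimal textbook Huffman coder applied to a made-up list of 4-bit weight codes. It is a serial sketch under assumed inputs; it does not reproduce the paper's mixed quantization or its parallel decoder.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Build a prefix code {symbol: bitstring} from symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (count, unique tie-breaker, partial code table).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, t1 = heapq.heappop(heap)
        c2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in t1.items()}
        merged.update({s: "1" + code for s, code in t2.items()})
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]


def encode(symbols, code):
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    inverse = {v: k for k, v in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out


# Hypothetical quantized weights clustered around the mid code 8.
weights = [8] * 50 + [7] * 20 + [9] * 20 + [6] * 5 + [10] * 5
code = huffman_code(weights)
bits = encode(weights, code)
restored = decode(bits, code)
fixed_width_bits = 4 * len(weights)
```

Because quantized weights tend to cluster around a few values, frequent codes get short bitstrings and the encoded stream is shorter than the fixed-width layout while remaining lossless.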

Title: Connecting Thompson Sampling and UCB: Towards More Efficient Trade-offs Between Privacy and Regret

Authors: Bingshan Hu, Zhiming Huang, Tianyue H. Zhang, Mathias Lécuyer, Nidhi Hegde
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.02383
Pdf URL: https://arxiv.org/pdf/2505.02383
Copy Paste: [[2505.02383]] Connecting Thompson Sampling and UCB: Towards More Efficient Trade-offs Between Privacy and Regret(https://arxiv.org/abs/2505.02383)
Keywords: privacy
Abstract: We address differentially private stochastic bandit problems by exploring the deep connections among Thompson Sampling with Gaussian priors, Gaussian mechanisms, and Gaussian differential privacy (GDP). We propose DP-TS-UCB, a novel parametrized private bandit algorithm that enables a trade-off between privacy and regret. DP-TS-UCB satisfies $\tilde{O}\left(T^{0.25(1-\alpha)}\right)$-GDP and enjoys an $O\left(K\ln^{\alpha+1}(T)/\Delta\right)$ regret bound, where $\alpha \in [0,1]$ controls the trade-off between privacy and regret. Theoretically, our DP-TS-UCB relies on anti-concentration bounds of Gaussian distributions and links exploration mechanisms in Thompson Sampling-based algorithms and Upper Confidence Bound-based algorithms, which may be of independent interest.
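
Substituting the two endpoints of $\alpha$ into the stated bounds makes the trade-off concrete. This is plain substitution into the formulas quoted in the abstract, not an additional result:

```latex
\begin{aligned}
\alpha = 0:&\quad \tilde{O}\left(T^{0.25}\right)\text{-GDP},
  &\quad \text{regret } O\left(K \ln(T) / \Delta\right), \\
\alpha = 1:&\quad \tilde{O}\left(T^{0}\right) = \tilde{O}(1)\text{-GDP},
  &\quad \text{regret } O\left(K \ln^{2}(T) / \Delta\right).
\end{aligned}
```

Since a smaller GDP parameter means a stronger privacy guarantee, moving $\alpha$ from 0 to 1 strengthens privacy at the cost of one extra $\ln(T)$ factor in the regret bound.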

Title: RM-R1: Reward Modeling as Reasoning

Authors: Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, Heng Ji
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.02387
Pdf URL: https://arxiv.org/pdf/2505.02387
Copy Paste: [[2505.02387]] RM-R1: Reward Modeling as Reasoning(https://arxiv.org/abs/2505.02387)
Keywords: interpretability, generative, large language model
Abstract: Reward modeling is essential for aligning large language models (LLMs) with human preferences, especially through reinforcement learning from human feedback (RLHF). To provide accurate reward signals, a reward model (RM) should stimulate deep thinking and conduct interpretable reasoning before assigning a score or a judgment. However, existing RMs either produce opaque scalar scores or directly generate the prediction of a preferred answer, making them struggle to integrate natural language critiques, thus lacking interpretability. Inspired by recent advances of long chain-of-thought (CoT) on reasoning-intensive tasks, we hypothesize and validate that integrating reasoning capabilities into reward modeling significantly enhances RM's interpretability and performance. In this work, we introduce a new class of generative reward models -- Reasoning Reward Models (ReasRMs) -- which formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. The training consists of two key stages: (1) distillation of high-quality reasoning chains and (2) reinforcement learning with verifiable rewards. RM-R1 improves LLM rollouts by self-generating reasoning traces or chat-specific rubrics and evaluating candidate responses against them. Empirically, our models achieve state-of-the-art or near state-of-the-art performance of generative RMs across multiple comprehensive reward model benchmarks, outperforming much larger open-weight models (e.g., Llama3.1-405B) and proprietary ones (e.g., GPT-4o) by up to 13.8%. Beyond final performance, we perform thorough empirical analysis to understand the key ingredients of successful ReasRM training. To facilitate future research, we release six ReasRM models along with code and data at this https URL.

Title: MetaScenes: Towards Automated Replica Creation for Real-world 3D Scans

Authors: Huangyue Yu, Baoxiong Jia, Yixin Chen, Yandan Yang, Puhao Li, Rongpeng Su, Jiaxin Li, Qing Li, Wei Liang, Song-Chun Zhu, Tengyu Liu, Siyuan Huang
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2505.02388
Pdf URL: https://arxiv.org/pdf/2505.02388
Copy Paste: [[2505.02388]] MetaScenes: Towards Automated Replica Creation for Real-world 3D Scans(https://arxiv.org/abs/2505.02388)
Keywords: robust
Abstract: Embodied AI (EAI) research requires high-quality, diverse 3D scenes to effectively support skill acquisition, sim-to-real transfer, and generalization. Achieving these quality standards, however, necessitates the precise replication of real-world object diversity. Existing datasets demonstrate that this process heavily relies on artist-driven designs, which demand substantial human effort and present significant scalability challenges. To scalably produce realistic and interactive 3D scenes, we first present MetaScenes, a large-scale, simulatable 3D scene dataset constructed from real-world scans, which includes 15366 objects spanning 831 fine-grained categories. Then, we introduce Scan2Sim, a robust multi-modal alignment model, which enables the automated, high-quality replacement of assets, thereby eliminating the reliance on artist-driven designs for scaling 3D scenes. We further propose two benchmarks to evaluate MetaScenes: a detailed scene synthesis task focused on small item layouts for robotic manipulation and a domain transfer task in vision-and-language navigation (VLN) to validate cross-domain transfer. Results confirm MetaScenes' potential to enhance EAI by supporting more generalizable agent learning and sim-to-real applications, introducing new possibilities for EAI research. Project website: this https URL.

Title: Quantitative Analysis of Performance Drop in DeepSeek Model Quantization

Authors: Enbo Zhao, Yi Shen, Shuming Shi, Jieyun Huang, Zhihao Chen, Ning Wang, Siqi Xiao, Jian Zhang, Kai Wang, Shiguo Lian
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02390
Pdf URL: https://arxiv.org/pdf/2505.02390
Copy Paste: [[2505.02390]] Quantitative Analysis of Performance Drop in DeepSeek Model Quantization(https://arxiv.org/abs/2505.02390)
Keywords: privacy
Abstract: Recently, there has been high demand for deploying DeepSeek-R1 and V3 locally, possibly because the official service is often busy and some organizations have data privacy concerns. While single-machine deployment offers infrastructure simplicity, the models' 671B FP8 parameter configuration exceeds the practical memory limits of a standard 8-GPU machine. Quantization is a widely used technique that helps reduce model memory consumption. However, it is unclear how DeepSeek-R1 and V3 perform after being quantized. This technical report presents the first quantitative evaluation of multi-bitwidth quantization across the complete DeepSeek model spectrum. Key findings reveal that 4-bit quantization incurs little performance degradation versus FP8 while enabling single-machine deployment on standard NVIDIA GPU devices. We further propose DQ3_K_M, a dynamic 3-bit quantization method that significantly outperforms the traditional Q3_K_M variant on various benchmarks and is comparable with the 4-bit quantization (Q4_K_M) approach in most tasks. Moreover, DQ3_K_M supports single-machine deployment configurations for both NVIDIA H100/A100 and Huawei 910B. Our implementation of DQ3_K_M is released at this https URL, containing optimized 3-bit quantized variants of both DeepSeek-R1 and DeepSeek-V3.
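
The memory argument can be checked with back-of-the-envelope arithmetic. The sketch below counts weight storage only at nominal bit widths; the 80 GB per-GPU capacity is an assumption (the abstract says only "standard 8-GPU machine"), and real quantized formats, KV cache, and activations all add overhead that is ignored here.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Weight storage in GB (1 GB = 1e9 bytes) at a nominal bit width."""
    return num_params * bits_per_weight / 8 / 1e9


def fits(memory_gb, num_gpus=8, gb_per_gpu=80):
    """True if the weights alone fit in the pooled GPU memory (assumed sizes)."""
    return memory_gb <= num_gpus * gb_per_gpu


PARAMS = 671e9  # parameter count stated in the abstract

fp8_gb = weight_memory_gb(PARAMS, 8)
q4_gb = weight_memory_gb(PARAMS, 4)
q3_gb = weight_memory_gb(PARAMS, 3)

fp8_fits = fits(fp8_gb)
q4_fits = fits(q4_gb)
q3_fits = fits(q3_gb)
```

Under these assumptions the FP8 weights (about 671 GB) exceed the 640 GB pool before any runtime buffers are counted, while nominal 4-bit and 3-bit weights leave headroom.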

Title: Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL

Authors: Jiarui Yao, Yifan Hao, Hanning Zhang, Hanze Dong, Wei Xiong, Nan Jiang, Tong Zhang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.02391
Pdf URL: https://arxiv.org/pdf/2505.02391
Copy Paste: [[2505.02391]] Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL(https://arxiv.org/abs/2505.02391)
Keywords: large language model
Abstract: Chain-of-thought (CoT) reasoning in large language models (LLMs) can be formalized as a latent variable problem, where the model needs to generate intermediate reasoning steps. While prior approaches such as iterative reward-ranked fine-tuning (RAFT) have relied on such formulations, they typically apply uniform inference budgets across prompts, which fails to account for variability in difficulty and convergence behavior. This work identifies the main bottleneck in CoT training as inefficient stochastic gradient estimation due to static sampling strategies. We propose GVM-RAFT, a prompt-specific Dynamic Sample Allocation Strategy designed to minimize stochastic gradient variance under a computational budget constraint. The method dynamically allocates computational resources by monitoring prompt acceptance rates and stochastic gradient norms, ensuring that the resulting gradient variance is minimized. Our theoretical analysis shows that the proposed dynamic sampling strategy leads to accelerated convergence guarantees under suitable conditions. Experiments on mathematical reasoning show that GVM-RAFT achieves a 2-4x speedup and considerable accuracy improvements over vanilla RAFT. The proposed dynamic sampling strategy is general and can be incorporated into other reinforcement learning algorithms, such as GRPO, leading to similar improvements in convergence and test accuracy. Our code is available at this https URL.

Title: Monero's Decentralized P2P Exchanges: Functionality, Adoption, and Privacy Risks

Authors: Yannik Kopyciok, Friedhelm Victor, Stefan Schmid
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.02392
Pdf URL: https://arxiv.org/pdf/2505.02392
Copy Paste: [[2505.02392]] Monero's Decentralized P2P Exchanges: Functionality, Adoption, and Privacy Risks(https://arxiv.org/abs/2505.02392)
Keywords: secure, privacy
Abstract: Privacy-focused cryptocurrencies like Monero remain popular, despite increasing regulatory scrutiny that has led to their delisting from major centralized exchanges. The latter also explains the recent popularity of decentralized exchanges (DEXs) with no centralized ownership structures. These platforms typically leverage peer-to-peer (P2P) networks, promising secure and anonymous asset trading. However, questions of liability remain, and the academic literature lacks comprehensive insights into the functionality, trading activity, and privacy claims of these P2P platforms. In this paper, we provide an early systematization of the current landscape of decentralized peer-to-peer exchanges within the Monero ecosystem. We examine several recently developed DEX platforms, analyzing their popularity, functionality, architectural choices, and potential weaknesses. We further identify and report on a privacy vulnerability in the recently popularized Haveno exchange, demonstrating that certain Haveno trades could be detected, allowing transactions to be linked across the Monero and Bitcoin blockchains. We hope that our findings can nourish the discussion in the research community about more secure designs, and provide insights for regulators.

Title: Uncertainty-Weighted Image-Event Multimodal Fusion for Video Anomaly Detection

Authors: Sungheon Jeong, Jihong Park, Mohsen Imani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02393
Pdf URL: https://arxiv.org/pdf/2505.02393
Copy Paste: [[2505.02393]] Uncertainty-Weighted Image-Event Multimodal Fusion for Video Anomaly Detection(https://arxiv.org/abs/2505.02393)
Keywords: robust
Abstract: Most existing video anomaly detectors rely solely on RGB frames, which lack the temporal resolution needed to capture abrupt or transient motion cues, key indicators of anomalous events. To address this limitation, we propose Image-Event Fusion for Video Anomaly Detection (IEF-VAD), a framework that synthesizes event representations directly from RGB videos and fuses them with image features through a principled, uncertainty-aware process. The system (i) models heavy-tailed sensor noise with a Student's-t likelihood, deriving value-level inverse-variance weights via a Laplace approximation; (ii) applies Kalman-style frame-wise updates to balance modalities over time; and (iii) iteratively refines the fused latent state to erase residual cross-modal noise. Without any dedicated event sensor or frame-level labels, IEF-VAD sets a new state of the art across multiple real-world anomaly detection benchmarks. These findings highlight the utility of synthetic event representations in emphasizing motion cues that are often underrepresented in RGB frames, enabling accurate and robust video understanding across diverse applications without requiring dedicated event sensors. Code and models are available at this https URL.
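
The idea of value-level inverse-variance weighting can be sketched in isolation. In this illustration the per-value variances are simply given as inputs; the Student's-t likelihood, the Laplace approximation that would produce them, and the Kalman-style temporal updates are all omitted, and the function name is hypothetical.

```python
def inverse_variance_fuse(image_feat, image_var, event_feat, event_var):
    """Fuse two feature vectors element-wise, weighting each by 1/variance.

    Returns (fused values, fused variances). The less certain modality
    (larger variance) contributes less to each fused value.
    """
    fused, fused_var = [], []
    for xi, vi, xe, ve in zip(image_feat, image_var, event_feat, event_var):
        wi, we = 1.0 / vi, 1.0 / ve
        fused.append((wi * xi + we * xe) / (wi + we))
        fused_var.append(1.0 / (wi + we))
    return fused, fused_var


# Equal confidence: the fused value is the midpoint.
equal_vals, equal_vars = inverse_variance_fuse([1.0], [1.0], [3.0], [1.0])

# Event feature is 9x noisier: the result stays near the image feature.
skewed_vals, skewed_vars = inverse_variance_fuse([0.0], [1.0], [10.0], [9.0])
```

The fused variance is never larger than either input variance, which is why this weighting is a common building block in uncertainty-aware fusion.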

Title: Token Coordinated Prompt Attention is Needed for Visual Prompting

Authors: Zichen Liu, Xu Zou, Gang Hua, Jiahuan Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02406
Pdf URL: https://arxiv.org/pdf/2505.02406
Copy Paste: [[2505.02406]] Token Coordinated Prompt Attention is Needed for Visual Prompting(https://arxiv.org/abs/2505.02406)
Keywords: extraction, transformer
Abstract: Visual prompting techniques are widely used to efficiently fine-tune pretrained Vision Transformers (ViT) by learning a small set of shared prompts for all tokens. However, existing methods overlook the unique roles of different tokens in conveying discriminative information and interact with all tokens using the same prompts, thereby limiting the representational capacity of ViT. This often leads to indistinguishable and biased prompt-extracted features, hindering performance. To address this issue, we propose a plug-and-play Token Coordinated Prompt Attention (TCPA) module, which assigns specific coordinated prompts to different tokens for attention-based interactions. First, recognizing the distinct functions of CLS and image tokens (global information aggregation and local feature extraction, respectively), we disentangle the prompts into CLS Prompts and Image Prompts, which interact exclusively with CLS tokens and image tokens, respectively, through attention mechanisms. This enhances their respective discriminative abilities. Furthermore, as different image tokens correspond to distinct image patches and contain diverse information, we employ a matching function to automatically assign coordinated prompts to individual tokens. This enables more precise attention interactions, improving the diversity and representational capacity of the extracted features. Extensive experiments across various benchmarks demonstrate that TCPA significantly enhances the diversity and discriminative power of the extracted features. The code is available at this https URL.

Title: Encrypted Federated Search Using Homomorphic Encryption

Authors: Om Rathod, Aastha Baid, Aswani Kumar Cherukuri
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.02409
Pdf URL: https://arxiv.org/pdf/2505.02409
Copy Paste: [[2505.02409]] Encrypted Federated Search Using Homomorphic Encryption(https://arxiv.org/abs/2505.02409)
Keywords: secure, security, privacy, federate
Abstract: The sharing of information between agencies is effective in dealing with cross-jurisdictional criminal activities; however, such sharing is often restricted due to concerns about data privacy, ownership, and compliance. To this end, this work introduces a privacy-preserving federated search system that allows law enforcement agencies to conduct queries on encrypted criminal databases by utilizing Homomorphic Encryption (HE). The key innovation here is the ability to execute encrypted queries across distributed databases without decrypting the data, thus preserving end-to-end confidentiality. In essence, this approach meets stringent privacy requirements in the interests of national security and regulatory compliance. The system incorporates the CKKS and BFV schemes embedded within TenSEAL, with each agency holding its key pair in a centralized key management table. In this federated search, encrypted queries are computed on the server side, and only authorized clients can decrypt the computed results. The matching of agencies is flexible enough to work in real time while remaining secure and scalable and preserving control over data and the integrity of the process. Experimental results demonstrate the model. This paper also provides the implementation code and other details.

Title: T2S: High-resolution Time Series Generation with Text-to-Series Diffusion Models

Authors: Yunfeng Ge, Jiawei Li, Yiji Zhao, Haomin Wen, Zhao Li, Meikang Qiu, Hongyan Li, Ming Jin, Shirui Pan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02417
Pdf URL: https://arxiv.org/pdf/2505.02417
Copy Paste: [[2505.02417]] T2S: High-resolution Time Series Generation with Text-to-Series Diffusion Models(https://arxiv.org/abs/2505.02417)
Keywords: diffusion, transformer
Abstract: Text-to-Time Series generation holds significant potential to address challenges such as data sparsity, imbalance, and limited availability of multimodal time series datasets across domains. While diffusion models have achieved remarkable success in Text-to-X (e.g., vision and audio data) generation, their use in time series generation remains in its nascent stages. Existing approaches face two critical limitations: (1) the lack of systematic exploration of general-proposed time series captions, which are often domain-specific and struggle with generalization; and (2) the inability to generate time series of arbitrary lengths, limiting their applicability to real-world scenarios. In this work, we first categorize time series captions into three levels: point-level, fragment-level, and instance-level. Additionally, we introduce a new fragment-level dataset containing over 600,000 high-resolution time series-text pairs. Second, we propose Text-to-Series (T2S), a diffusion-based framework that bridges the gap between natural language and time series in a domain-agnostic manner. T2S employs a length-adaptive variational autoencoder to encode time series of varying lengths into consistent latent embeddings. On top of that, T2S effectively aligns textual representations with latent embeddings by utilizing Flow Matching and employing Diffusion Transformer as the denoiser. We train T2S in an interleaved paradigm across multiple lengths, allowing it to generate sequences of any desired length. Extensive evaluations demonstrate that T2S achieves state-of-the-art performance across 13 datasets spanning 12 domains.

Title: Towards One-shot Federated Learning: Advances, Challenges, and Future Directions

Authors: Flora Amato, Lingyu Qiu, Mohammad Tanveer, Salvatore Cuomo, Fabio Giampaolo, Francesco Piccialli
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2505.02426
Pdf URL: https://arxiv.org/pdf/2505.02426
Copy Paste: [[2505.02426]] Towards One-shot Federated Learning: Advances, Challenges, and Future Directions(https://arxiv.org/abs/2505.02426)
Keywords: privacy, federate
Abstract: One-shot FL enables collaborative training in a single round, eliminating the need for iterative communication, making it particularly suitable for use in resource-constrained and privacy-sensitive applications. This survey offers a thorough examination of One-shot FL, highlighting its distinct operational framework compared to traditional federated approaches. One-shot FL supports resource-limited devices by enabling single-round model aggregation while maintaining data locality. The survey systematically categorizes existing methodologies, emphasizing advancements in client model initialization, aggregation techniques, and strategies for managing heterogeneous data distributions. Furthermore, we analyze the limitations of current approaches, particularly in terms of scalability and generalization in non-IID settings. By analyzing cutting-edge techniques and outlining open challenges, this survey aspires to provide a comprehensive reference for researchers and practitioners aiming to design and implement One-shot FL systems, advancing the development and adoption of One-shot FL solutions in real-world, resource-constrained scenarios.
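
As a point of reference for single-round aggregation, here is the simplest possible variant: each client trains locally, uploads its parameters once, and the server takes a sample-weighted average. This is only a baseline sketch with hypothetical names and flat parameter lists; the survey covers a range of more elaborate aggregation techniques.

```python
def one_shot_aggregate(client_params, client_sizes):
    """Weighted average of client parameter vectors in a single round.

    client_params: list of equal-length lists of floats (one per client).
    client_sizes: number of local training samples per client.
    """
    total = sum(client_sizes)
    dim = len(client_params[0])
    aggregated = [0.0] * dim
    for params, size in zip(client_params, client_sizes):
        weight = size / total
        for j in range(dim):
            aggregated[j] += weight * params[j]
    return aggregated


# Two hypothetical clients; the second holds three times as much data.
global_params = one_shot_aggregate([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

Raw data never leaves the clients, and the server communicates with each client exactly once, which is what distinguishes this setting from iterative federated averaging.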

Title: FairPO: Robust Preference Optimization for Fair Multi-Label Learning

Authors: Soumen Kumar Mondal, Akshit Varmora, Prateek Chanda, Ganesh Ramakrishnan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02433
Pdf URL: https://arxiv.org/pdf/2505.02433
Copy Paste: [[2505.02433]] FairPO: Robust Preference Optimization for Fair Multi-Label Learning(https://arxiv.org/abs/2505.02433)
Keywords: robust, fair
Abstract: We propose FairPO, a novel framework designed to promote fairness in multi-label classification by directly optimizing preference signals with a group robustness perspective. In our framework, the set of labels is partitioned into privileged and non-privileged groups, and a preference-based loss inspired by Direct Preference Optimization (DPO) is employed to more effectively differentiate true positive labels from confusing negatives within the privileged group, while preserving baseline classification performance for non-privileged labels. By framing the learning problem as a robust optimization over groups, our approach dynamically adjusts the training emphasis toward groups with poorer performance, thereby mitigating bias and ensuring a fairer treatment across diverse label categories. In addition, we outline plans to extend this approach by investigating alternative loss formulations such as Simple Preference Optimisation (SimPO) and Contrastive Preference Optimization (CPO) to exploit reference-free reward formulations and contrastive training signals. Furthermore, we plan to extend FairPO with multilabel generation capabilities, enabling the model to dynamically generate diverse and coherent label sets for ambiguous inputs.

Title: A New Approach to Backtracking Counterfactual Explanations: A Causal Framework for Efficient Model Interpretability

Authors: Pouria Fatemi, Ehsan Sharifian, Mohammad Hossein Yassaee
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2505.02435
Pdf URL: https://arxiv.org/pdf/2505.02435
Copy Paste: [[2505.02435]] A New Approach to Backtracking Counterfactual Explanations: A Causal Framework for Efficient Model Interpretability(https://arxiv.org/abs/2505.02435)
Keywords: interpretability
Abstract: Counterfactual explanations enhance interpretability by identifying alternative inputs that produce different outputs, offering localized insights into model decisions. However, traditional methods often neglect causal relationships, leading to unrealistic examples. While newer approaches integrate causality, they are computationally expensive. To address these challenges, we propose an efficient method based on backtracking counterfactuals that incorporates causal reasoning to generate actionable explanations. We first examine the limitations of existing methods and then introduce our novel approach and its features. We also explore the relationship between our method and previous techniques, demonstrating that it generalizes them in specific scenarios. Finally, experiments show that our method provides deeper insights into model outputs.

Title: Colombian Waitresses y Jueces canadienses: Gender and Country Biases in Occupation Recommendations from LLMs

Authors: Elisa Forcada Rodríguez, Olatz Perez-de-Viñaspre, Jon Ander Campos, Dietrich Klakow, Vagrant Gautam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.02456
Pdf URL: https://arxiv.org/pdf/2505.02456
Copy Paste: [[2505.02456]] Colombian Waitresses y Jueces canadienses: Gender and Country Biases in Occupation Recommendations from LLMs(https://arxiv.org/abs/2505.02456)
Keywords: fair, large language model
Abstract: One of the goals of fairness research in NLP is to measure and mitigate stereotypical biases that are propagated by NLP systems. However, such work tends to focus on single axes of bias (most often gender) and the English language. Addressing these limitations, we contribute the first study of multilingual intersecting country and gender biases, with a focus on occupation recommendations generated by large language models. We construct a benchmark of prompts in English, Spanish and German, where we systematically vary country and gender, using 25 countries and four pronoun sets. Then, we evaluate a suite of 5 Llama-based models on this benchmark, finding that LLMs encode significant gender and country biases. Notably, we find that even when models show parity for gender or country individually, intersectional occupational biases based on both country and gender persist. We also show that the prompting language significantly affects bias, and instruction-tuned models consistently demonstrate the lowest and most stable levels of bias. Our findings highlight the need for fairness researchers to use intersectional and multilingual lenses in their work.

Title: Targeted Fuzzing for Unsafe Rust Code: Leveraging Selective Instrumentation

Authors: David Paaßen, Jens-Rene Giesen, Lucas Davi
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.02464
Pdf URL: https://arxiv.org/pdf/2505.02464
Copy Paste: [[2505.02464]] Targeted Fuzzing for Unsafe Rust Code: Leveraging Selective Instrumentation(https://arxiv.org/abs/2505.02464)
Keywords: security
Abstract: Rust is a promising programming language that focuses on concurrency, usability, and security. It is used in production code by major industry players and has been recommended by government bodies. Rust provides strong security guarantees by design through the concepts of ownership and borrowing. However, Rust allows programmers to write unsafe code, which is not subject to the strict Rust security policy. Empirical studies show that security issues in practice always involve code written in unsafe Rust. In this paper, we present the first approach that utilizes selective code coverage feedback to focus the fuzzing efforts on unsafe Rust code. Our approach significantly improves the efficiency of fuzzing Rust programs and does not require additional computational resources while fuzz testing the target. To quantify the impact of partial code instrumentation, we implement our approach by extending the capabilities of the Rust compiler toolchain. We present an automated approach to detect unsafe and safe code components to decide which parts of the program a fuzzer should focus on when running a fuzzing campaign to find vulnerabilities in Rust programs. Our approach is fully compatible with existing fuzzing implementations and does not require complex manual work, thus retaining the existing high usability standard. Focusing on unsafe code, our implementation allows us to generate inputs that trigger more unsafe code locations with statistical significance and is therefore able to detect potential vulnerabilities in a shorter time span while imposing no performance overhead during fuzzing itself.

Title: Timing Is Everything: Finding the Optimal Fusion Points in Multimodal Medical Imaging

Authors: Valerio Guarrasi, Klara Mogensen, Sara Tassinari, Sara Qvarlander, Paolo Soda
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02467
Pdf URL: https://arxiv.org/pdf/2505.02467
Copy Paste: [[2505.02467]] Timing Is Everything: Finding the Optimal Fusion Points in Multimodal Medical Imaging(https://arxiv.org/abs/2505.02467)
Keywords: robust
Abstract: Multimodal deep learning harnesses diverse imaging modalities, such as MRI sequences, to enhance diagnostic accuracy in medical imaging. A key challenge is determining the optimal timing for integrating these modalities, specifically identifying the network layers where fusion modules should be inserted. Current approaches often rely on manual tuning or exhaustive search, which are computationally expensive without any guarantee of converging to optimal results. We propose a sequential forward search algorithm that incrementally activates and evaluates candidate fusion modules at different layers of a multimodal network. At each step, the algorithm retrains from previously learned weights and compares validation loss to identify the best-performing configuration. This process systematically reduces the search space, enabling efficient identification of the optimal fusion timing without exhaustively testing all possible module placements. The approach is validated on two multimodal MRI datasets, each addressing different classification tasks. Our algorithm consistently identified configurations that outperformed unimodal baselines, late fusion, and a brute-force ensemble of all potential fusion placements. These architectures demonstrated superior accuracy, F-score, and specificity while maintaining competitive or improved AUC values. Furthermore, the sequential nature of the search significantly reduced computational overhead, making the optimization process more practical. By systematically determining the optimal timing to fuse imaging modalities, our method advances multimodal deep learning for medical imaging. It provides an efficient and robust framework for fusion optimization, paving the way for improved clinical decision-making and more adaptable, scalable architectures in medical AI applications.
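
A minimal sketch of a greedy sequential forward search over candidate fusion layers, in the spirit of the abstract above. It is not the authors' implementation: the `evaluate` callback stands in for "retrain from previously learned weights and return validation loss", the stopping rule (stop when no candidate lowers the loss) is an assumption, and `toy_loss` is a made-up function used only to make the example runnable.

```python
from typing import Callable, FrozenSet, List, Tuple


def sequential_forward_search(
    candidates: List[int],
    evaluate: Callable[[FrozenSet[int]], float],
) -> Tuple[FrozenSet[int], float]:
    """Greedily activate one fusion layer at a time while validation loss drops."""
    active: FrozenSet[int] = frozenset()
    best_loss = evaluate(active)  # baseline: no fusion module active
    remaining = set(candidates)
    while remaining:
        # Try each remaining layer on top of the current configuration.
        trials = {layer: evaluate(active | {layer}) for layer in sorted(remaining)}
        layer, loss = min(trials.items(), key=lambda kv: kv[1])
        if loss >= best_loss:
            break  # assumed stopping rule: no candidate improves the loss
        active, best_loss = active | {layer}, loss
        remaining.remove(layer)
    return active, best_loss


def toy_loss(config: FrozenSet[int]) -> float:
    """Hypothetical validation loss: layers 2 and 4 help, layers 1 and 3 hurt."""
    effect = {1: 0.1, 2: -0.3, 3: 0.05, 4: -0.2}
    return 1.0 + sum(effect[layer] for layer in config)


best_config, best_value = sequential_forward_search([1, 2, 3, 4], toy_loss)
print(sorted(best_config), round(best_value, 3))
```

With four candidate layers this run calls `evaluate` ten times (1 + 4 + 3 + 2), whereas an exhaustive search over all subsets would need 16; the gap grows quickly with the number of candidates.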

Title: Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction

Authors: Biao Gong, Cheng Zou, Dandan Zheng, Hu Yu, Jingdong Chen, Jianxin Sun, Junbo Zhao, Jun Zhou, Kaixiang Ji, Lixiang Ru, Libin Wang, Qingpei Guo, Rui Liu, Weilong Chai, Xinyu Xiao, Ziyuan Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02471
Pdf URL: https://arxiv.org/pdf/2505.02471
Copy Paste: [[2505.02471]] Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction(https://arxiv.org/abs/2505.02471)
Keywords: diffusion
Abstract: We introduce Ming-Lite-Uni, an open-source multimodal framework featuring a newly designed unified visual generator and a native multimodal autoregressive model tailored for unifying vision and language. Specifically, this project provides an open-source implementation of the integrated MetaQueries and M2-omni framework, while introducing the novel multi-scale learnable tokens and multi-scale representation alignment strategy. By leveraging a fixed MLLM and a learnable diffusion model, Ming-Lite-Uni enables native multimodal AR models to perform both text-to-image generation and instruction-based image editing tasks, expanding their capabilities beyond pure visual understanding. Our experimental results demonstrate the strong performance of Ming-Lite-Uni and illustrate the impressive fluid nature of its interactive process. All code and model weights are open-sourced to foster further exploration within the community. Notably, this work aligns with concurrent multimodal AI milestones - such as ChatGPT-4o with native image generation updated on March 25, 2025 - underscoring the broader significance of unified models like Ming-Lite-Uni on the path toward AGI. Ming-Lite-Uni is in alpha stage and will soon be further refined.

Title: Finger Pose Estimation for Under-screen Fingerprint Sensor

Authors: Xiongjun Guan, Zhiyu Pan, Jianjiang Feng, Jie Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02481
Pdf URL: https://arxiv.org/pdf/2505.02481
Copy Paste: [[2505.02481]] Finger Pose Estimation for Under-screen Fingerprint Sensor(https://arxiv.org/abs/2505.02481)
Keywords: extraction
Abstract: Two-dimensional pose estimation plays a crucial role in fingerprint recognition by facilitating global alignment and reducing pose-induced variations. However, existing methods are still unsatisfactory when handling large-angle or small-area inputs. These limitations are particularly pronounced on fingerprints captured by under-screen fingerprint sensors in smartphones. In this paper, we present a novel dual-modal input based network for under-screen fingerprint pose estimation. Our approach effectively integrates two distinct yet complementary modalities: texture details extracted from ridge patches through the under-screen fingerprint sensor, and rough contours derived from capacitive images obtained via the touch screen. This collaborative integration endows our network with more comprehensive and discriminative information, substantially improving the accuracy and stability of pose estimation. A decoupled probability distribution prediction task is designed, instead of the traditional supervised forms of numerical regression or heatmap voting, to facilitate the training process. Additionally, we incorporate a Mixture of Experts (MoE) based feature fusion mechanism and a relationship driven cross-domain knowledge transfer strategy to further strengthen feature extraction and fusion capabilities. Extensive experiments are conducted on several public datasets and two private datasets. The results indicate that our method is significantly superior to previous state-of-the-art (SOTA) methods and remarkably boosts the recognition ability of fingerprint recognition algorithms. Our code is available at this https URL.

Title: SEFE: Superficial and Essential Forgetting Eliminator for Multimodal Continual Instruction Tuning

Authors: Jinpeng Chen, Runmin Cong, Yuzhi Zhao, Hongzheng Yang, Guangneng Hu, Horace Ho Shing Ip, Sam Kwong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02486
Pdf URL: https://arxiv.org/pdf/2505.02486
Copy Paste: [[2505.02486]] SEFE: Superficial and Essential Forgetting Eliminator for Multimodal Continual Instruction Tuning(https://arxiv.org/abs/2505.02486)
Keywords: large language model
Abstract: Multimodal Continual Instruction Tuning (MCIT) aims to enable Multimodal Large Language Models (MLLMs) to incrementally learn new tasks without catastrophic forgetting. In this paper, we explore forgetting in this context, categorizing it into superficial forgetting and essential forgetting. Superficial forgetting refers to cases where the model's knowledge may not be genuinely lost, but its responses to previous tasks deviate from expected formats due to the influence of subsequent tasks' answer styles, making the results unusable. By contrast, essential forgetting refers to situations where the model provides correctly formatted but factually inaccurate answers, indicating a true loss of knowledge. Assessing essential forgetting necessitates addressing superficial forgetting first, as severe superficial forgetting can obscure the model's knowledge state. Hence, we first introduce the Answer Style Diversification (ASD) paradigm, which defines a standardized process for transforming data styles across different tasks, unifying their training sets into similarly diversified styles to prevent superficial forgetting caused by style shifts. Building on this, we propose RegLoRA to mitigate essential forgetting. RegLoRA stabilizes key parameters where prior knowledge is primarily stored by applying regularization, enabling the model to retain existing competencies. Experimental results demonstrate that our overall method, SEFE, achieves state-of-the-art performance.

Title: Bayesian Robust Aggregation for Federated Learning

Authors: Aleksandr Karakulev (1), Usama Zafar (1), Salman Toor (1 and 2), Prashant Singh (1 and 3) ((1) Uppsala University, (2) Scaleout Systems, (3) Science for Life Laboratory, Sweden)
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.02490
Pdf URL: https://arxiv.org/pdf/2505.02490
Copy Paste: [[2505.02490]] Bayesian Robust Aggregation for Federated Learning(https://arxiv.org/abs/2505.02490)
Keywords: attack, robust, federate
Abstract: Federated Learning enables collaborative training of machine learning models on decentralized data. This scheme, however, is vulnerable to adversarial attacks in which some of the clients submit corrupted model updates. In real-world scenarios, the total number of compromised clients is typically unknown, with the extent of attacks potentially varying over time. To address these challenges, we propose an adaptive approach for robust aggregation of model updates based on Bayesian inference. The mean update is defined by the maximum of the likelihood marginalized over the probabilities of each client being 'honest'. As a result, the method shares the simplicity of the classical average estimators (e.g., sample mean or geometric median), being independent of the number of compromised clients. At the same time, it is as effective against attacks as methods specifically tailored to Federated Learning, such as Krum. We compare our approach with other aggregation schemes in a federated setting on three benchmark image classification data sets. The proposed method consistently achieves state-of-the-art performance across various attack types with static and varying numbers of malicious clients.
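
For context, the two classical estimators the abstract names as points of comparison (sample mean and geometric median) can be sketched as follows. This is not the paper's Bayesian aggregation rule; it only illustrates why a plain mean is fragile when one client submits a corrupted update. The geometric median is computed with Weiszfeld's iteration, and the client updates below are invented for illustration.

```python
import math
from typing import List, Sequence


def sample_mean(updates: Sequence[Sequence[float]]) -> List[float]:
    """Coordinate-wise average of client updates."""
    n = len(updates)
    return [sum(column) / n for column in zip(*updates)]


def geometric_median(
    updates: Sequence[Sequence[float]], iters: int = 200, eps: float = 1e-9
) -> List[float]:
    """Weiszfeld's iteration: re-weight each update by inverse distance to the estimate."""
    point = sample_mean(updates)
    for _ in range(iters):
        # Clamp distances at eps so an estimate sitting on a data point does not divide by zero.
        weights = [1.0 / max(math.dist(point, u), eps) for u in updates]
        total = sum(weights)
        point = [
            sum(w * u[i] for w, u in zip(weights, updates)) / total
            for i in range(len(point))
        ]
    return point


# Four honest (hypothetical) updates near (1, 1) and one corrupted update.
client_updates = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.0), (100.0, 100.0)]
mean_update = sample_mean(client_updates)
median_update = geometric_median(client_updates)
print([round(v, 2) for v in mean_update], [round(v, 2) for v in median_update])
```

The single outlier drags the mean far from the honest cluster, while the geometric median stays close to it; neither estimator needs to be told how many clients are compromised.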

Title: Dynamic Graph-based Fingerprinting of In-browser Cryptomining

Authors: Tanapoom Sermchaiwong, Jiasi Shen
Subjects: cs.CR, cs.PL
Abstract URL: https://arxiv.org/abs/2505.02493
Pdf URL: https://arxiv.org/pdf/2505.02493
Copy Paste: [[2505.02493]] Dynamic Graph-based Fingerprinting of In-browser Cryptomining(https://arxiv.org/abs/2505.02493)
Keywords: attack, robust
Abstract: The decentralized and unregulated nature of cryptocurrencies, combined with their monetary value, has made them a vehicle for various illicit activities. One such activity is cryptojacking, an attack that uses stolen computing resources to mine cryptocurrencies without consent for profit. In-browser cryptojacking malware exploits high-performance web technologies like WebAssembly to mine cryptocurrencies directly within the browser without file downloads. Although existing methods for cryptomining detection report high accuracy and low overhead, they are often susceptible to various forms of obfuscation, and due to the limited variety of cryptomining scripts in the wild, standard code obfuscation methods present a natural and appealing solution to avoid detection. To address these limitations, we propose using instruction-level data-flow graphs to detect cryptomining behavior. Data-flow graphs offer detailed structural insights into a program's computations, making them suitable for characterizing proof-of-work algorithms, but they can be difficult to analyze due to their large size and susceptibility to noise and fragmentation under obfuscation. We present two techniques to simplify and compare data-flow graphs: (1) a graph simplification algorithm to reduce the computational burden of processing large and granular data-flow graphs while preserving local substructures; and (2) a subgraph similarity measure, the n-fragment inclusion score, based on fragment inclusion that is robust against noise and obfuscation. Using data-flow graphs as computation fingerprints, our detection framework PoT (Proof-of-Theft) was able to achieve high detection accuracy against standard obfuscations, outperforming existing detection methods. Moreover, PoT uses generic data-flow properties that can be applied to other platforms more susceptible to cryptojacking such as servers and data centers.

Title: An Efficient Hybrid Key Exchange Mechanism

Authors: Benjamin D. Kim, Vipindev Adat Vasudevan, Alejandro Cohen, Rafael G. L. D'Oliveira, Thomas Stahlbuhk, Muriel Médard
Subjects: cs.CR, cs.IT
Abstract URL: https://arxiv.org/abs/2505.02499
Pdf URL: https://arxiv.org/pdf/2505.02499
Copy Paste: [[2505.02499]] An Efficient Hybrid Key Exchange Mechanism(https://arxiv.org/abs/2505.02499)
Keywords: secure, security
Abstract: We present CHOKE, a novel code-based hybrid key-encapsulation mechanism (KEM) designed to securely and efficiently transmit multiple session keys simultaneously. By encoding $n$ independent session keys with an individually secure linear code and encapsulating each resulting coded symbol using a separate KEM, CHOKE achieves computational individual security: each key remains secure as long as at least one underlying KEM remains unbroken. Compared to traditional serial or combiner-based hybrid schemes, CHOKE reduces computational and communication costs by an $n$-fold factor. Furthermore, we show that the communication cost of our construction is optimal under the requirement that each KEM must be used at least once.

Title: Unveiling the Landscape of LLM Deployment in the Wild: An Empirical Study

Authors: Xinyi Hou, Jiahao Han, Yanjie Zhao, Haoyu Wang
Subjects: cs.CR, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2505.02502
Pdf URL: https://arxiv.org/pdf/2505.02502
Copy Paste: [[2505.02502]] Unveiling the Landscape of LLM Deployment in the Wild: An Empirical Study(https://arxiv.org/abs/2505.02502)
Keywords: secure, security, large language model
Abstract: Background: Large language models (LLMs) are increasingly deployed via open-source and commercial frameworks, enabling individuals and organizations to self-host advanced AI capabilities. However, insecure defaults and misconfigurations often expose LLM services to the public Internet, posing significant security and system engineering risks. Aims: This study aims to unveil the current landscape of public-facing LLM deployments in the wild through a large-scale empirical study, focusing on service prevalence, exposure characteristics, systemic vulnerabilities, and associated risks. Method: We conducted an Internet-wide measurement to identify public-facing LLM deployments across 15 frameworks, discovering 320,102 services. We extracted 158 unique API endpoints, grouped into 12 functional categories based on capabilities and security risks. We further analyzed configurations, authentication practices, and geographic distributions, revealing deployment trends and systemic issues in real-world LLM system engineering. Results: Our study shows that public LLM deployments are rapidly growing but often insecure. Among all endpoints, we observe widespread use of insecure protocols, poor TLS configurations, and unauthenticated access to critical operations. Security risks, including model disclosure, system leakage, and unauthorized access, are pervasive, highlighting the need for secure-by-default frameworks and stronger deployment practices. Conclusions: Public-facing LLM deployments suffer from widespread security and configuration flaws, exposing services to misuse, model theft, resource hijacking, and remote exploitation. Strengthening default security, deployment practices, and operational standards is critical for the growing self-hosted LLM ecosystem.

Title: Exploring Design Choices for Autoregressive Deep Learning Climate Models

Authors: Florian Gallusser, Simon Hentschel, Anna Krause, Andreas Hotho
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.02506
Pdf URL: https://arxiv.org/pdf/2505.02506
Copy Paste: [[2505.02506]] Exploring Design Choices for Autoregressive Deep Learning Climate Models(https://arxiv.org/abs/2505.02506)
Keywords: robust
Abstract: Deep Learning models have achieved state-of-the-art performance in medium-range weather prediction but often fail to maintain physically consistent rollouts beyond 14 days. In contrast, a few atmospheric models demonstrate stability over decades, though the key design choices enabling this remain unclear. This study quantitatively compares the long-term stability of three prominent DL-MWP architectures - FourCastNet, SFNO, and ClimaX - trained on ERA5 reanalysis data at 5.625° resolution. We systematically assess the impact of autoregressive training steps, model capacity, and choice of prognostic variables, identifying configurations that enable stable 10-year rollouts while preserving the statistical properties of the reference dataset. Notably, rollouts with SFNO exhibit the greatest robustness to hyperparameter choices, yet all models can experience instability depending on the random seed and the set of prognostic variables.

Title: FedSDAF: Leveraging Source Domain Awareness for Enhanced Federated Domain Generalization

Authors: Hongze Li, Zesheng Zhou, Zhenbiao Cao, Xinhui Li, Wei Chen, Xiaojin Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.02515
Pdf URL: https://arxiv.org/pdf/2505.02515
Copy Paste: [[2505.02515]] FedSDAF: Leveraging Source Domain Awareness for Enhanced Federated Domain Generalization(https://arxiv.org/abs/2505.02515)
Keywords: privacy, federate
Abstract: Traditional domain generalization approaches predominantly focus on leveraging target domain-aware features while overlooking the critical role of source domain-specific characteristics, particularly in federated settings with inherent data isolation. To address this gap, we propose the Federated Source Domain Awareness Framework (FedSDAF), the first method to systematically exploit source domain-aware features for enhanced federated domain generalization (FedDG). The FedSDAF framework consists of two synergistic components: the Domain-Invariant Adapter, which preserves critical domain-invariant features, and the Domain-Aware Adapter, which extracts and integrates source domain-specific knowledge using a Multihead Self-Attention mechanism (MHSA). Furthermore, we introduce a bidirectional knowledge distillation mechanism that fosters knowledge sharing among clients while safeguarding privacy. Our approach represents the first systematic exploitation of source domain-aware features, resulting in significant advancements in model generalization. Experiments on four standard benchmarks (OfficeHome, PACS, VLCS, and DomainNet) show that our method consistently surpasses state-of-the-art federated domain generalization approaches, with accuracy gains of 5.2-13.8%. The source code is available at this https URL.

Title: Attestable builds: compiling verifiable binaries on untrusted systems using trusted execution environments

Authors: Daniel Hugenroth, Mario Lins, René Mayrhofer, Alastair Beresford
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.02521
Pdf URL: https://arxiv.org/pdf/2505.02521
Copy Paste: [[2505.02521]] Attestable builds: compiling verifiable binaries on untrusted systems using trusted execution environments(https://arxiv.org/abs/2505.02521)
Keywords: security
Abstract: In this paper we present attestable builds, a new paradigm to provide strong source-to-binary correspondence in software artifacts. We tackle the challenge of opaque build pipelines that disconnect the trust between source code, which can be understood and audited, and the final binary artifact, which is difficult to inspect. Our system uses modern trusted execution environments (TEEs) and sandboxed build containers to provide strong guarantees that a given artifact was correctly built from a specific source code snapshot. As such it complements existing approaches like reproducible builds which typically require time-intensive modifications to existing build configurations and dependencies, and require independent parties to continuously build and verify artifacts. In comparison, an attestable build requires only minimal changes to an existing project, and offers nearly instantaneous verification of the correspondence between a given binary and the source code and build pipeline used to construct it. We evaluate it by building open-source software libraries - focusing on projects which are important to the trust chain and those which have proven difficult to be built deterministically. Overall, the overhead (42 seconds start-up latency and 14% increase in build duration) is small in comparison to the overall build time. Importantly, our prototype builds even complex projects such as LLVM Clang without requiring any modifications to their source code and build scripts. Finally, we formally model and verify the attestable build design to demonstrate its security against well-resourced adversaries.

Title: Text to Image Generation and Editing: A Survey

Authors: Pengfei Yang, Ngai-Man Cheung, Xinda Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02527
Pdf URL: https://arxiv.org/pdf/2505.02527
Copy Paste: [[2505.02527]] Text to Image Generation and Editing: A Survey(https://arxiv.org/abs/2505.02527)
Keywords: diffusion
Abstract: Text-to-image generation (T2I) refers to the text-guided generation of high-quality images. In the past few years, T2I has attracted widespread attention and numerous works have emerged. In this survey, we comprehensively review 141 works conducted from 2021 to 2024. First, we introduce four foundation model architectures of T2I (autoregression, non-autoregression, GAN and diffusion) and the commonly used key technologies (autoencoder, attention and classifier-free guidance). Secondly, we systematically compare the methods of these studies in two directions, T2I generation and T2I editing, including the encoders and the key technologies they use. In addition, we compare the performance of these studies side by side in terms of datasets, evaluation metrics, training resources, and inference speed. Beyond the four foundation models, we survey other works on T2I, such as energy-based models and recent Mamba and multimodality. We also investigate the potential social impact of T2I and provide some solutions. Finally, we propose unique insights into improving the performance of T2I models and possible future development directions. In summary, this survey is the first systematic and comprehensive overview of T2I, aiming to provide a valuable guide for future researchers and stimulate continued progress in this field.

Title: RobSurv: Vector Quantization-Based Multi-Modal Learning for Robust Cancer Survival Prediction

Authors: Aiman Farooq, Azad Singh, Deepak Mishra, Santanu Chaudhury
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02529
Pdf URL: https://arxiv.org/pdf/2505.02529
Copy Paste: [[2505.02529]] RobSurv: Vector Quantization-Based Multi-Modal Learning for Robust Cancer Survival Prediction(https://arxiv.org/abs/2505.02529)
Keywords: robust, transformer
Abstract: Cancer survival prediction using multi-modal medical imaging presents a critical challenge in oncology, mainly due to the vulnerability of deep learning models to noise and protocol variations across imaging centers. Current approaches struggle to extract consistent features from heterogeneous CT and PET images, limiting their clinical applicability. We address these challenges by introducing RobSurv, a robust deep-learning framework that leverages vector quantization for resilient multi-modal feature learning. The key innovation of our approach lies in its dual-path architecture: one path maps continuous imaging features to learned discrete codebooks for noise-resistant representation, while the parallel path preserves fine-grained details through continuous feature processing. This dual representation is integrated through a novel patch-wise fusion mechanism that maintains local spatial relationships while capturing global context via Transformer-based processing. In extensive evaluations across three diverse datasets (HECKTOR, H&N1, and NSCLC Radiogenomics), RobSurv demonstrates superior performance, achieving concordance indices of 0.771, 0.742, and 0.734 respectively - significantly outperforming existing methods. Most notably, our model maintains robust performance even under severe noise conditions, with performance degradation of only 3.8-4.5% compared to 8-12% in baseline methods. These results, combined with strong generalization across different cancer types and imaging protocols, establish RobSurv as a promising solution for reliable clinical prognosis that can enhance treatment planning and patient care.
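
As a reading aid for the concordance index reported above, here is a minimal sketch of Harrell's C-index with right-censoring. Whether the paper uses exactly this variant is an assumption, and the survival times, event flags, and risk scores below are invented for illustration.

```python
from typing import Sequence


def concordance_index(
    times: Sequence[float], events: Sequence[int], risks: Sequence[float]
) -> float:
    """Fraction of comparable patient pairs whose predicted risks are ordered correctly.

    A pair (i, j) is comparable when patient i had an observed event (events[i] == 1)
    strictly before time j. It is concordant when i was given the higher risk score;
    ties in risk count as half.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical cohort: the third patient is censored at month 12.
c_index = concordance_index(
    times=[5, 8, 12, 20],
    events=[1, 1, 0, 1],
    risks=[0.9, 0.4, 0.6, 0.1],
)
print(c_index)
```

In this toy cohort five pairs are comparable and four are ordered correctly, giving 0.8. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.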

Title: Marker-Based Extrinsic Calibration Method for Accurate Multi-Camera 3D Reconstruction

Authors: Nahuel Garcia-D'Urso, Bernabe Sanchez-Sos, Jorge Azorin-Lopez, Andres Fuster-Guillo, Antonio Macia-Lillo, Higinio Mora-Mora
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2505.02539
Pdf URL: https://arxiv.org/pdf/2505.02539
Copy Paste: [[2505.02539]] Marker-Based Extrinsic Calibration Method for Accurate Multi-Camera 3D Reconstruction(https://arxiv.org/abs/2505.02539)
Keywords: robust
Abstract: Accurate 3D reconstruction using multi-camera RGB-D systems critically depends on precise extrinsic calibration to achieve proper alignment between captured views. In this paper, we introduce an iterative extrinsic calibration method that leverages the geometric constraints provided by a three-dimensional marker to significantly improve calibration accuracy. Our proposed approach systematically segments and refines marker planes through clustering, regression analysis, and iterative reassignment techniques, ensuring robust geometric correspondence across camera views. We validate our method comprehensively in both controlled environments and practical real-world settings within the Tech4Diet project, aimed at modeling the physical progression of patients undergoing nutritional treatments. Experimental results demonstrate substantial reductions in alignment errors, facilitating accurate and reliable 3D reconstructions.

Title: Lazy But Effective: Collaborative Personalized Federated Learning with Heterogeneous Data

Authors: Ljubomir Rokvic, Panayiotis Danassis, Boi Faltings
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02540
Pdf URL: https://arxiv.org/pdf/2505.02540
Copy Paste: [[2505.02540]] Lazy But Effective: Collaborative Personalized Federated Learning with Heterogeneous Data(https://arxiv.org/abs/2505.02540)
Keywords: federate
Abstract: In Federated Learning, heterogeneity in client data distributions often means that a single global model does not have the best performance for individual clients. Consider for example training a next-word prediction model for keyboards: user-specific language patterns due to demographics (dialect, age, etc.), language proficiency, and writing style result in a highly non-IID dataset across clients. Other examples are medical images taken with different machines, or driving data from different vehicle types. To address this, we propose a simple yet effective personalized federated learning framework (pFedLIA) that utilizes a computationally efficient influence approximation, called 'Lazy Influence', to cluster clients in a distributed manner before model aggregation. Within each cluster, data owners collaborate to jointly train a model that captures the specific data patterns of the clients. Our method has been shown to successfully recover the global model's performance drop due to the non-IID-ness in various synthetic and real-world settings, specifically a next-word prediction task on the Nordic languages as well as several benchmark tasks. It matches the performance of a hypothetical Oracle clustering, and significantly improves on existing baselines, e.g., an improvement of 17% on CIFAR100.
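
The "cluster first, then aggregate within each cluster" idea can be sketched as a per-cluster weighted average of client updates. This is only an illustration under stated assumptions: the cluster labels are taken as given (the abstract's 'Lazy Influence' clustering step is not reproduced here), weighting by local dataset size follows the common FedAvg convention rather than anything stated in the abstract, and all values are made up.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def aggregate_per_cluster(
    updates: Sequence[Sequence[float]],
    sizes: Sequence[float],
    labels: Sequence[Hashable],
) -> Dict[Hashable, List[float]]:
    """Average client updates separately inside each cluster, weighted by dataset size."""
    sums: Dict[Hashable, List[float]] = {}
    totals: Dict[Hashable, float] = defaultdict(float)
    for update, size, label in zip(updates, sizes, labels):
        acc = sums.setdefault(label, [0.0] * len(update))
        for i, value in enumerate(update):
            acc[i] += size * value  # accumulate the size-weighted update
        totals[label] += size
    return {label: [v / totals[label] for v in acc] for label, acc in sums.items()}


# Three hypothetical clients: two in cluster "a", one in cluster "b".
cluster_models = aggregate_per_cluster(
    updates=[[1.0, 0.0], [3.0, 0.0], [0.0, 10.0]],
    sizes=[1, 3, 2],
    labels=["a", "a", "b"],
)
print(cluster_models)
```

Each cluster ends up with its own model, so a client in cluster "b" is not pulled toward the updates of the clients in cluster "a".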

Title: Robust Duality Learning for Unsupervised Visible-Infrared Person Re-Identification

Authors: Yongxiang Li, Yuan Sun, Yang Qin, Dezhong Peng, Xi Peng, Peng Hu
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2505.02549
Pdf URL: https://arxiv.org/pdf/2505.02549
Copy Paste: [[2505.02549]] Robust Duality Learning for Unsupervised Visible-Infrared Person Re-Identification(https://arxiv.org/abs/2505.02549)
Keywords: robust
Abstract: Unsupervised visible-infrared person re-identification (UVI-ReID) aims to retrieve pedestrian images across different modalities without costly annotations, but faces challenges due to the modality gap and lack of supervision. Existing methods often adopt self-training with clustering-generated pseudo-labels but implicitly assume these labels are always correct. In practice, however, this assumption fails due to inevitable pseudo-label noise, which hinders model learning. To address this, we introduce a new learning paradigm that explicitly considers Pseudo-Label Noise (PLN), characterized by three key challenges: noise overfitting, error accumulation, and noisy cluster correspondence. To this end, we propose a novel Robust Duality Learning framework (RoDE) for UVI-ReID to mitigate the effects of noisy pseudo-labels. First, to combat noise overfitting, a Robust Adaptive Learning mechanism (RAL) is proposed to dynamically emphasize clean samples while down-weighting noisy ones. Second, to alleviate error accumulation, where the model reinforces its own mistakes, RoDE employs dual distinct models that are alternately trained using pseudo-labels from each other, encouraging diversity and preventing collapse. However, this dual-model strategy introduces misalignment between clusters across models and modalities, creating noisy cluster correspondence. To resolve this, we introduce Cluster Consistency Matching (CCM), which aligns clusters across models and modalities by measuring cross-cluster similarity. Extensive experiments on three benchmarks demonstrate the effectiveness of RoDE.

Title: Bielik v3 Small: Technical Report

Authors: Krzysztof Ociepa, Łukasz Flis, Remigiusz Kinas, Krzysztof Wróbel, Adrian Gwoździej
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.02550
Pdf URL: https://arxiv.org/pdf/2505.02550
Copy Paste: [[2505.02550]] Bielik v3 Small: Technical Report(https://arxiv.org/abs/2505.02550)
Keywords: generative
Abstract: We introduce Bielik v3, a series of parameter-efficient generative text models (1.5B and 4.5B) optimized for Polish language processing. These models demonstrate that smaller, well-optimized architectures can achieve performance comparable to much larger counterparts while requiring substantially fewer computational resources. Our approach incorporates several key innovations: a custom Polish tokenizer (APT4) that significantly improves token efficiency, Weighted Instruction Cross-Entropy Loss to balance learning across instruction types, and Adaptive Learning Rate that dynamically adjusts based on training progress. Trained on a meticulously curated corpus of 292 billion tokens spanning 303 million documents, these models excel across multiple benchmarks, including the Open PL LLM Leaderboard, Complex Polish Text Understanding Benchmark, Polish EQ-Bench, and Polish Medical Leaderboard. The 4.5B parameter model achieves results competitive with models 2-3 times its size, while the 1.5B model delivers strong performance despite its extremely compact profile. These advances establish new benchmarks for parameter-efficient language modeling in less-represented languages, making high-quality Polish language AI more accessible for resource-constrained applications.

Title: Robustness questions the interpretability of graph neural networks: what to do?

Authors: Kirill Lukyanov (1 and 2 and 3), Georgii Sazonov (2 and 4), Serafim Boyarsky (6), Ilya Makarov (1 and 5) ((1) ISP RAS Research Center for Trusted Artificial Intelligence, (2) Ivannikov Institute for System Programming of the Russian Academy of Sciences, (3) Moscow Institute of Physics and Technology (National Research University), (4) Lomonosov Moscow State University, (5) AIRI, (6) Yandex School of Data Analysis)
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02566
Pdf URL: https://arxiv.org/pdf/2505.02566
Copy Paste: [[2505.02566]] Robustness questions the interpretability of graph neural networks: what to do?(https://arxiv.org/abs/2505.02566)
Keywords: defense, attack, robust, interpretability
Abstract: Graph Neural Networks (GNNs) have become a cornerstone in graph-based data analysis, with applications in diverse domains such as bioinformatics, social networks, and recommendation systems. However, the interplay between model interpretability and robustness remains poorly understood, especially under adversarial scenarios like poisoning and evasion attacks. This paper presents a comprehensive benchmark to systematically analyze the impact of various factors on the interpretability of GNNs, including the influence of robustness-enhancing defense mechanisms. We evaluate six GNN architectures based on GCN, SAGE, GIN, and GAT across five datasets from two distinct domains, employing four interpretability metrics: Fidelity, Stability, Consistency, and Sparsity. Our study examines how defenses against poisoning and evasion attacks, applied before and during model training, affect interpretability and highlights critical trade-offs between robustness and interpretability. The framework will be published as open source. The results reveal significant variations in interpretability depending on the chosen defense methods and model architecture characteristics. By establishing a standardized benchmark, this work provides a foundation for developing GNNs that are both robust to adversarial threats and interpretable, facilitating trust in their deployment in sensitive applications.

Title: Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities

Authors: Xinjie Zhang, Jintao Guo, Shanshan Zhao, Minghao Fu, Lunhao Duan, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02567
Pdf URL: https://arxiv.org/pdf/2505.02567
Copy Paste: [[2505.02567]] Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities(https://arxiv.org/abs/2505.02567)
Keywords: diffusion
Abstract: Recent years have seen remarkable progress in both multimodal understanding models and image generation models. Despite their respective successes, these two domains have evolved independently, leading to distinct architectural paradigms: While autoregressive-based architectures have dominated multimodal understanding, diffusion-based models have become the cornerstone of image generation. Recently, there has been growing interest in developing unified frameworks that integrate these tasks. The emergence of GPT-4o's new capabilities exemplifies this trend, highlighting the potential for unification. However, the architectural differences between the two domains pose significant challenges. To provide a clear overview of current efforts toward unification, we present a comprehensive survey aimed at guiding future research. First, we introduce the foundational concepts and recent advancements in multimodal understanding and text-to-image generation models. Next, we review existing unified models, categorizing them into three main architectural paradigms: diffusion-based, autoregressive-based, and hybrid approaches that fuse autoregressive and diffusion mechanisms. For each category, we analyze the structural designs and innovations introduced by related works. Additionally, we compile datasets and benchmarks tailored for unified models, offering resources for future exploration. Finally, we discuss the key challenges facing this nascent field, including tokenization strategy, cross-modal attention, and data. As this area is still in its early stages, we anticipate rapid advancements and will regularly update this survey. Our goal is to inspire further research and provide a valuable reference for the community. The references associated with this survey will be available on GitHub soon.

Title: Rethinking Federated Graph Learning: A Data Condensation Perspective

Authors: Hao Zhang, Xunkai Li, Yinlin Zhu, Lianglin Hu
Subjects: cs.LG, cs.AI, cs.DB, cs.SI
Abstract URL: https://arxiv.org/abs/2505.02573
Pdf URL: https://arxiv.org/pdf/2505.02573
Copy Paste: [[2505.02573]] Rethinking Federated Graph Learning: A Data Condensation Perspective(https://arxiv.org/abs/2505.02573)
Keywords: privacy, federate
Abstract: Federated graph learning is a widely recognized technique that promotes collaborative training of graph neural networks (GNNs) by multiple clients. However, existing approaches heavily rely on the communication of model parameters or gradients for federated optimization and fail to adequately address the data heterogeneity introduced by intricate and diverse graph distributions. Although some methods attempt to share additional messages among the server and clients to improve federated convergence during communication, they introduce significant privacy risks and increase communication overhead. To address these issues, we introduce the concept of a condensed graph as a novel optimization carrier to address FGL data heterogeneity and propose a new FGL paradigm called FedGM. Specifically, we utilize a generalized condensation graph consensus to aggregate comprehensive knowledge from distributed graphs, while minimizing communication costs and privacy risks through a single transmission of the condensed data. Extensive experiments on six public datasets consistently demonstrate the superiority of FedGM over state-of-the-art baselines, highlighting its potential for a novel FGL paradigm.

Title: EMORL: Ensemble Multi-Objective Reinforcement Learning for Efficient and Flexible LLM Fine-Tuning

Authors: Lingxiao Kong (1), Cong Yang (2), Susanne Neufang (3), Oya Deniz Beyan (1,3), Zeyd Boukhers (1,3) ((1) Fraunhofer Institute for Applied Information Technology FIT, (2) Soochow University, (3) University Hospital of Cologne)
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.02579
Pdf URL: https://arxiv.org/pdf/2505.02579
Copy Paste: [[2505.02579]] EMORL: Ensemble Multi-Objective Reinforcement Learning for Efficient and Flexible LLM Fine-Tuning(https://arxiv.org/abs/2505.02579)
Keywords: explainability, large language model
Abstract: Recent advances in reinforcement learning (RL) for large language model (LLM) fine-tuning show promise in addressing multi-objective tasks but still face significant challenges, including complex objective balancing, low training efficiency, poor scalability, and limited explainability. Leveraging ensemble learning principles, we introduce an Ensemble Multi-Objective RL (EMORL) framework that fine-tunes multiple models with individual objectives while optimizing their aggregation after the training to improve efficiency and flexibility. Our method is the first to aggregate the last hidden states of individual models, incorporating contextual information from multiple objectives. This approach is supported by a hierarchical grid search algorithm that identifies optimal weighted combinations. We evaluate EMORL on counselor reflection generation tasks, using text-scoring LLMs to evaluate the generations and provide rewards during RL fine-tuning. Through comprehensive experiments on the PAIR and Psych8k datasets, we demonstrate the advantages of EMORL against existing baselines: significantly lower and more stable training consumption ($17,529\pm 1,650$ data points and $6,573\pm 147.43$ seconds), improved scalability and explainability, and comparable performance across multiple objectives.
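The aggregation step described in this abstract can be illustrated with a minimal sketch. It assumes each per-objective model yields a last hidden state as a plain vector and that a user-supplied `score_fn` rates an aggregated vector. The function names are hypothetical, and the search shown is a plain exhaustive grid search, not the paper's hierarchical variant.

```python
from itertools import product


def aggregate_hidden_states(hidden_states, weights):
    """Weighted sum of per-model last hidden state vectors (illustrative only)."""
    if len(hidden_states) != len(weights):
        raise ValueError("need exactly one weight per model")
    dim = len(hidden_states[0])
    return [
        sum(w * h[i] for h, w in zip(hidden_states, weights))
        for i in range(dim)
    ]


def grid_search_weights(hidden_states, score_fn, steps=5):
    """Flat grid search over weights in [0, 1]; returns (best_score, best_weights).

    Assumption: a higher score is better. The paper's search is hierarchical;
    this exhaustive version only shows the idea of post-training aggregation.
    """
    grid = [i / (steps - 1) for i in range(steps)]
    best = None
    for weights in product(grid, repeat=len(hidden_states)):
        score = score_fn(aggregate_hidden_states(hidden_states, weights))
        if best is None or score > best[0]:
            best = (score, weights)
    return best


# Toy usage: two "models" with 2-dimensional hidden states and a made-up target.
states = [[1.0, 0.0], [0.0, 1.0]]
target = [1.0, 0.5]
toy_score = lambda v: -sum((a - b) ** 2 for a, b in zip(v, target))
best_score, best_weights = grid_search_weights(states, toy_score)
```

Because the weights are searched after training, the individual models never need to be retrained when the desired balance between objectives changes.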

Title: Towards Cross-Modality Modeling for Time Series Analytics: A Survey in the LLM Era

Authors: Chenxi Liu, Shaowen Zhou, Qianxiong Xu, Hao Miao, Cheng Long, Ziyue Li, Rui Zhao
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.02583
Pdf URL: https://arxiv.org/pdf/2505.02583
Copy Paste: [[2505.02583]] Towards Cross-Modality Modeling for Time Series Analytics: A Survey in the LLM Era(https://arxiv.org/abs/2505.02583)
Keywords: large language model
Abstract: The proliferation of edge devices has generated an unprecedented volume of time series data across different domains, motivating various well-customized methods. Recently, Large Language Models (LLMs) have emerged as a new paradigm for time series analytics by leveraging the shared sequential nature of textual data and time series. However, a fundamental cross-modality gap between time series and LLMs exists, as LLMs are pre-trained on textual corpora and are not inherently optimized for time series. Many recent proposals are designed to address this issue. In this survey, we provide an up-to-date overview of LLMs-based cross-modality modeling for time series analytics. We first introduce a taxonomy that classifies existing approaches into four groups based on the type of textual data employed for time series modeling. We then summarize key cross-modality strategies, e.g., alignment and fusion, and discuss their applications across a range of downstream tasks. Furthermore, we conduct experiments on multimodal datasets from different application domains to investigate effective combinations of textual data and cross-modality strategies for enhancing time series analytics. Finally, we suggest several promising directions for future research. This survey is designed for a range of professionals, researchers, and practitioners interested in LLM-based time series modeling.

Title: RGBX-DiffusionDet: A Framework for Multi-Modal RGB-X Object Detection Using DiffusionDet

Authors: Eliraz Orfaig, Inna Stainvas, Igal Bilik
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02586
Pdf URL: https://arxiv.org/pdf/2505.02586
Copy Paste: [[2505.02586]] RGBX-DiffusionDet: A Framework for Multi-Modal RGB-X Object Detection Using DiffusionDet(https://arxiv.org/abs/2505.02586)
Keywords: diffusion
Abstract: This work introduces RGBX-DiffusionDet, an object detection framework extending the DiffusionDet model to fuse the heterogeneous 2D data (X) with RGB imagery via an adaptive multimodal encoder. To enable cross-modal interaction, we design the dynamic channel reduction within a convolutional block attention module (DCR-CBAM), which facilitates cross-talk between subnetworks by dynamically highlighting salient channel features. Furthermore, the dynamic multi-level aggregation block (DMLAB) is proposed to refine spatial feature representations through adaptive multiscale fusion. Finally, novel regularization losses that enforce channel saliency and spatial selectivity are introduced, leading to compact and discriminative feature embeddings. Extensive experiments using RGB-Depth (KITTI), a novel annotated RGB-Polarimetric dataset, and RGB-Infrared (M$^3$FD) benchmark dataset were conducted. We demonstrate consistent superiority of the proposed approach over the baseline RGB-only DiffusionDet. The modular architecture maintains the original decoding complexity, ensuring efficiency. These results establish the proposed RGBX-DiffusionDet as a flexible multimodal object detection approach, providing new insights into integrating diverse 2D sensing modalities into diffusion-based detection pipelines.

Title: DELTA: Dense Depth from Events and LiDAR using Transformer's Attention

Authors: Vincent Brebion, Julien Moreau, Franck Davoine
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02593
Pdf URL: https://arxiv.org/pdf/2505.02593
Copy Paste: [[2505.02593]] DELTA: Dense Depth from Events and LiDAR using Transformer's Attention(https://arxiv.org/abs/2505.02593)
Keywords: transformer
Abstract: Event cameras and LiDARs provide complementary yet distinct data: respectively, asynchronous detections of changes in lighting versus sparse but accurate depth information at a fixed rate. To this day, few works have explored the combination of these two modalities. In this article, we propose a novel neural-network-based method for fusing event and LiDAR data in order to estimate dense depth maps. Our architecture, DELTA, exploits the concepts of self- and cross-attention to model the spatial and temporal relations within and between the event and LiDAR data. Following a thorough evaluation, we demonstrate that DELTA sets a new state of the art in the event-based depth estimation problem, and that it is able to reduce the errors up to four times for close ranges compared to the previous SOTA.

Title: Low-Loss Space in Neural Networks is Continuous and Fully Connected

Authors: Yongding Tian, Zaid Al-Ars, Maksim Kitsak, Peter Hofstee
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.02604
Pdf URL: https://arxiv.org/pdf/2505.02604
Copy Paste: [[2505.02604]] Low-Loss Space in Neural Networks is Continuous and Fully Connected(https://arxiv.org/abs/2505.02604)
Keywords: transformer
Abstract: Visualizations of the loss landscape in neural networks suggest that minima are isolated points. However, both theoretical and empirical studies indicate that it is possible to connect two different minima with a path consisting of intermediate points that also have low loss. In this study, we propose a new algorithm which investigates low-loss paths in the full parameter space, not only between two minima. Our experiments on LeNet5, ResNet18, and Compact Convolutional Transformer architectures consistently demonstrate the existence of such continuous paths in the parameter space. These results suggest that the low-loss region is a fully connected and continuous space in the parameter space. Our findings provide theoretical insight into neural network over-parameterization, highlighting that parameters collectively define a high-dimensional low-loss space, implying parameter redundancy exists only within individual models and not throughout the entire low-loss space. Additionally, our work also provides new visualization methods and opportunities to improve model generalization by exploring the low-loss space that is closer to the origin.

Title: Automatic Proficiency Assessment in L2 English Learners

Authors: Armita Mohammadi, Alessandro Lameiras Koerich, Laureano Moro-Velazquez, Patrick Cardinal
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.02615
Pdf URL: https://arxiv.org/pdf/2505.02615
Copy Paste: [[2505.02615]] Automatic Proficiency Assessment in L2 English Learners(https://arxiv.org/abs/2505.02615)
Keywords: robust
Abstract: Second language proficiency (L2) in English is usually perceptually evaluated by English teachers or expert evaluators, with the inherent intra- and inter-rater variability. This paper explores deep learning techniques for comprehensive L2 proficiency assessment, addressing both the speech signal and its corresponding transcription. We analyze spoken proficiency classification prediction using diverse architectures, including 2D CNN, frequency-based CNN, ResNet, and a pretrained wav2vec 2.0 model. Additionally, we examine text-based proficiency assessment by fine-tuning a BERT language model within resource constraints. Finally, we tackle the complex task of spontaneous dialogue assessment, managing long-form audio and speaker interactions through separate applications of wav2vec 2.0 and BERT models. Results from experiments on EFCamDat and ANGLISH datasets and a private dataset highlight the potential of deep learning, especially the pretrained wav2vec 2.0 model, for robust automated L2 proficiency evaluation.

Title: Mirror Mean-Field Langevin Dynamics

Authors: Anming Gu, Juno Kim
Subjects: cs.LG, math.OC, stat.ML
Abstract URL: https://arxiv.org/abs/2505.02621
Pdf URL: https://arxiv.org/pdf/2505.02621
Copy Paste: [[2505.02621]] Mirror Mean-Field Langevin Dynamics(https://arxiv.org/abs/2505.02621)
Keywords: diffusion
Abstract: The mean-field Langevin dynamics (MFLD) minimizes an entropy-regularized nonlinear convex functional on the Wasserstein space over $\mathbb{R}^d$, and has gained attention recently as a model for the gradient descent dynamics of interacting particle systems such as infinite-width two-layer neural networks. However, many problems of interest have constrained domains, which are not solved by existing mean-field algorithms due to the global diffusion term. We study the optimization of probability measures constrained to a convex subset of $\mathbb{R}^d$ by proposing the \emph{mirror mean-field Langevin dynamics} (MMFLD), an extension of MFLD to the mirror Langevin framework. We obtain linear convergence guarantees for the continuous MMFLD via a uniform log-Sobolev inequality, and uniform-in-time propagation of chaos results for its time- and particle-discretized counterpart.

Title: LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis

Authors: Qingkai Fang, Yan Zhou, Shoutao Guo, Shaolei Zhang, Yang Feng
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.02625
Pdf URL: https://arxiv.org/pdf/2505.02625
Copy Paste: [[2505.02625]] LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis(https://arxiv.org/abs/2505.02625)
Keywords: large language model
Abstract: Real-time, intelligent, and natural speech interaction is an essential part of the next-generation human-computer interaction. Recent advancements have showcased the potential of building intelligent spoken chatbots based on large language models (LLMs). In this paper, we introduce LLaMA-Omni 2, a series of speech language models (SpeechLMs) ranging from 0.5B to 14B parameters, capable of achieving high-quality real-time speech interaction. LLaMA-Omni 2 is built upon the Qwen2.5 series models, integrating a speech encoder and an autoregressive streaming speech decoder. Despite being trained on only 200K multi-turn speech dialogue samples, LLaMA-Omni 2 demonstrates strong performance on several spoken question answering and speech instruction following benchmarks, surpassing previous state-of-the-art SpeechLMs like GLM-4-Voice, which was trained on millions of hours of speech data.

Title: Detect, Classify, Act: Categorizing Industrial Anomalies with Multi-Modal Large Language Models

Authors: Sassan Mokhtar, Arian Mousakhan, Silvio Galesso, Jawad Tayyub, Thomas Brox
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02626
Pdf URL: https://arxiv.org/pdf/2505.02626
Copy Paste: [[2505.02626]] Detect, Classify, Act: Categorizing Industrial Anomalies with Multi-Modal Large Language Models(https://arxiv.org/abs/2505.02626)
Keywords: large language model
Abstract: Recent advances in visual industrial anomaly detection have demonstrated exceptional performance in identifying and segmenting anomalous regions while maintaining fast inference speeds. However, anomaly classification (distinguishing different types of anomalies) remains largely unexplored despite its critical importance in real-world inspection tasks. To address this gap, we propose VELM, a novel LLM-based pipeline for anomaly classification. Given the critical importance of inference speed, we first apply an unsupervised anomaly detection method as a vision expert to assess the normality of an observation. If an anomaly is detected, the LLM then classifies its type. A key challenge in developing and evaluating anomaly classification models is the lack of precise annotations of anomaly classes in existing datasets. To address this limitation, we introduce MVTec-AC and VisA-AC, refined versions of the widely used MVTec-AD and VisA datasets, which include accurate anomaly class labels for rigorous evaluation. Our approach achieves a state-of-the-art anomaly classification accuracy of 80.4% on MVTec-AD, exceeding the prior baselines by 5%, and 84% on MVTec-AC, demonstrating the effectiveness of VELM in understanding and categorizing anomalies. We hope our methodology and benchmark inspire further research in anomaly classification, helping bridge the gap between detection and comprehensive anomaly characterization.
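The detect-then-classify control flow in this abstract can be sketched as follows. This is a hedged illustration rather than the authors' implementation: `anomaly_score_fn`, `classify_fn`, and the `threshold` value are hypothetical stand-ins for the unsupervised detector, the LLM classifier, and a decision rule that the abstract does not specify.

```python
def detect_then_classify(observation, anomaly_score_fn, classify_fn, threshold=0.5):
    """Run the cheap detector first and call the expensive classifier only on anomalies.

    Assumption: anomaly_score_fn returns a score where higher means more anomalous.
    """
    score = anomaly_score_fn(observation)
    if score < threshold:
        return "normal"
    return classify_fn(observation)


# Toy stand-ins so the sketch runs end to end (not real models).
def toy_score(observation):
    return observation["defect_area"]


def toy_classifier(observation):
    return "scratch" if observation["elongated"] else "contamination"


label_ok = detect_then_classify({"defect_area": 0.1, "elongated": False}, toy_score, toy_classifier)
label_bad = detect_then_classify({"defect_area": 0.9, "elongated": True}, toy_score, toy_classifier)
```

Gating the classifier behind the detector keeps the common case (a normal part) fast, which matches the abstract's emphasis on inference speed.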

Title: Enhancing Chemical Reaction and Retrosynthesis Prediction with Large Language Model and Dual-task Learning

Authors: Xuan Lin, Qingrui Liu, Hongxin Xiang, Daojian Zeng, Xiangxiang Zeng
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.02639
Pdf URL: https://arxiv.org/pdf/2505.02639
Copy Paste: [[2505.02639]] Enhancing Chemical Reaction and Retrosynthesis Prediction with Large Language Model and Dual-task Learning(https://arxiv.org/abs/2505.02639)
Keywords: large language model
Abstract: Chemical reaction and retrosynthesis prediction are fundamental tasks in drug discovery. Recently, large language models (LLMs) have shown potential in many domains. However, directly applying LLMs to these tasks faces two major challenges: (i) lacking a large-scale chemical synthesis-related instruction dataset; (ii) ignoring the close correlation between reaction and retrosynthesis prediction for the existing fine-tuning strategies. To address these challenges, we propose ChemDual, a novel LLM framework for accurate chemical synthesis. Specifically, considering the high cost of data acquisition for reaction and retrosynthesis, ChemDual regards the reaction-and-retrosynthesis of molecules as a related recombination-and-fragmentation process and constructs a large-scale instruction dataset with 4.4 million entries. Furthermore, ChemDual introduces an enhanced LLaMA, equipped with a multi-scale tokenizer and dual-task learning strategy, to jointly optimize the process of recombination and fragmentation as well as the tasks between reaction and retrosynthesis prediction. Extensive experiments on Mol-Instruction and USPTO-50K datasets demonstrate that ChemDual achieves state-of-the-art performance in both predictions of reaction and retrosynthesis, outperforming the existing conventional single-task approaches and the general open-source LLMs. Through molecular docking analysis, ChemDual generates compounds with diverse and strong protein binding affinity, further highlighting its strong potential in drug design.

Title: MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation

Authors: Mingcheng Li, Xiaolu Hou, Ziyang Liu, Dingkang Yang, Ziyun Qian, Jiawei Chen, Jinjie Wei, Yue Jiang, Qingyao Xu, Lihua Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02648
Pdf URL: https://arxiv.org/pdf/2505.02648
Copy Paste: [[2505.02648]] MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation(https://arxiv.org/abs/2505.02648)
Keywords: diffusion
Abstract: Diffusion models have shown excellent performance in text-to-image generation. Nevertheless, existing methods often suffer from performance bottlenecks when handling complex prompts that involve multiple objects, characteristics, and relations. Therefore, we propose a Multi-agent Collaboration-based Compositional Diffusion (MCCD) for text-to-image generation for complex scenes. Specifically, we design a multi-agent collaboration-based scene parsing module that generates an agent system comprising multiple agents with distinct tasks, utilizing MLLMs to extract various scene elements effectively. In addition, Hierarchical Compositional diffusion utilizes a Gaussian mask and filtering to refine bounding box regions and enhance objects through region enhancement, resulting in the accurate and high-fidelity generation of complex scenes. Comprehensive experiments demonstrate that our MCCD significantly improves the performance of the baseline models in a training-free manner, providing a substantial advantage in complex scene generation.

Title: Sim2Real in endoscopy segmentation with a novel structure aware image translation

Authors: Clara Tomasini, Luis Riazuelo, Ana C. Murillo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02654
Pdf URL: https://arxiv.org/pdf/2505.02654
Copy Paste: [[2505.02654]] Sim2Real in endoscopy segmentation with a novel structure aware image translation(https://arxiv.org/abs/2505.02654)
Keywords: generative, segmentation
Abstract: Automatic segmentation of anatomical landmarks in endoscopic images can provide assistance to doctors and surgeons for diagnosis, treatments or medical training. However, obtaining the annotations required to train commonly used supervised learning methods is a tedious and difficult task, in particular for real images. While ground truth annotations are easier to obtain for synthetic data, models trained on such data often do not generalize well to real data. Generative approaches can add realistic texture to it, but have difficulty maintaining the structure of the original scene. The main contribution in this work is a novel image translation model that adds realistic texture to simulated endoscopic images while keeping the key scene layout information. Our approach produces realistic images in different endoscopy scenarios. We demonstrate these images can be used effectively to train a model for a challenging end task without any real labeled data. In particular, we demonstrate our approach for the task of fold segmentation in colonoscopy images. Folds are key anatomical landmarks that can occlude parts of the colon mucosa and possible polyps. Our approach generates realistic images maintaining the shape and location of the original folds, after the image-style-translation, better than existing methods. We run experiments both on a novel simulated dataset for fold segmentation, and real data from the EndoMapper (EM) dataset. All our new generated data and new EM metadata are being released to facilitate further research, as no public benchmark is currently available for the task of fold segmentation.

Title: SCFormer: Structured Channel-wise Transformer with Cumulative Historical State for Multivariate Time Series Forecasting

Authors: Shiwei Guo, Ziang Chen, Yupeng Ma, Yunfei Han, Yi Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02655
Pdf URL: https://arxiv.org/pdf/2505.02655
Copy Paste: [[2505.02655]] SCFormer: Structured Channel-wise Transformer with Cumulative Historical State for Multivariate Time Series Forecasting(https://arxiv.org/abs/2505.02655)
Keywords: transformer
Abstract: The Transformer model has shown strong performance in multivariate time series forecasting by leveraging channel-wise self-attention. However, this approach lacks temporal constraints when computing temporal features and does not utilize cumulative historical series. To address these limitations, we propose the Structured Channel-wise Transformer with Cumulative Historical state (SCFormer). SCFormer introduces temporal constraints to all linear transformations, including the query, key, and value matrices, as well as the fully connected layers within the Transformer. Additionally, SCFormer employs High-order Polynomial Projection Operators (HiPPO) to deal with cumulative historical time series, allowing the model to incorporate information beyond the look-back window during prediction. Extensive experiments on multiple real-world datasets demonstrate that SCFormer significantly outperforms mainstream baselines, highlighting its effectiveness in enhancing time series forecasting. The code is publicly available at this https URL

Title: A Note on Statistically Accurate Tabular Data Generation Using Large Language Models

Authors: Andrey Sidorenko
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02659
Pdf URL: https://arxiv.org/pdf/2505.02659
Copy Paste: [[2505.02659]] A Note on Statistically Accurate Tabular Data Generation Using Large Language Models(https://arxiv.org/abs/2505.02659)
Keywords: large language model
Abstract: Large language models (LLMs) have shown promise in synthetic tabular data generation, yet existing methods struggle to preserve complex feature dependencies, particularly among categorical variables. This work introduces a probability-driven prompting approach that leverages LLMs to estimate conditional distributions, enabling more accurate and scalable data synthesis. The results highlight the potential of prompting probability distributions to enhance the statistical fidelity of LLM-generated tabular data.

Title: A Survey on Progress in LLM Alignment from the Perspective of Reward Design

Authors: Miaomiao Ji, Yanqiu Wu, Zhibin Wu, Shoujin Wang, Jian Yang, Mark Dras, Usman Naseem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.02666
Pdf URL: https://arxiv.org/pdf/2505.02666
Copy Paste: [[2505.02666]] A Survey on Progress in LLM Alignment from the Perspective of Reward Design(https://arxiv.org/abs/2505.02666)
Keywords: large language model
Abstract: The alignment of large language models (LLMs) with human values and intentions represents a core challenge in current AI research, where reward mechanism design has become a critical factor in shaping model behavior. This study conducts a comprehensive investigation of reward mechanisms in LLM alignment through a systematic theoretical framework, categorizing their development into three key phases: (1) feedback (diagnosis), (2) reward design (prescription), and (3) optimization (treatment). Through a four-dimensional analysis encompassing construction basis, format, expression, and granularity, this research establishes a systematic classification framework that reveals evolutionary trends in reward modeling. The field of LLM alignment faces several persistent challenges, while recent advances in reward design are driving significant paradigm shifts. Notable developments include the transition from reinforcement learning-based frameworks to novel optimization paradigms, as well as enhanced capabilities to address complex alignment scenarios involving multimodal integration and concurrent task coordination. Finally, this survey outlines promising future research directions for LLM alignment through innovative reward design strategies.

Title: Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models

Authors: Xiaobao Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.02686
Pdf URL: https://arxiv.org/pdf/2505.02686
Copy Paste: [[2505.02686]] Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models(https://arxiv.org/abs/2505.02686)
Keywords: large language model
Abstract: Recent developments in Large Language Models (LLMs) have shifted from pre-training scaling to post-training and test-time scaling. Across these developments, a key unified paradigm has arisen: Learning from Rewards, where reward signals act as the guiding stars to steer LLM behavior. It has underpinned a wide range of prevalent techniques, such as reinforcement learning (in RLHF, DPO, and GRPO), reward-guided decoding, and post-hoc correction. Crucially, this paradigm enables the transition from passive learning from static data to active learning from dynamic feedback. This endows LLMs with aligned preferences and deep reasoning capabilities. In this survey, we present a comprehensive overview of the paradigm of learning from rewards. We categorize and analyze the strategies under this paradigm across training, inference, and post-inference stages. We further discuss the benchmarks for reward models and the primary applications. Finally, we highlight the challenges and future directions. We maintain a paper collection at this https URL.

Title: Visually-Guided Linguistic Disambiguation for Monocular Depth Scale Recovery

Authors: Bojin Wu, Jing Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02704
Pdf URL: https://arxiv.org/pdf/2505.02704
Copy Paste: [[2505.02704]] Visually-Guided Linguistic Disambiguation for Monocular Depth Scale Recovery(https://arxiv.org/abs/2505.02704)
Keywords: robust
Abstract: We propose a robust method for monocular depth scale recovery. Monocular depth estimation can be divided into two main directions: (1) relative depth estimation, which provides normalized or inverse depth without scale information, and (2) metric depth estimation, which involves recovering depth with absolute scale. To obtain absolute scale information for practical downstream tasks, utilizing textual information to recover the scale of a relative depth map is a highly promising approach. However, since a single image can have multiple descriptions from different perspectives or with varying styles, it has been shown that different textual descriptions can significantly affect the scale recovery process. To address this issue, our method, VGLD, stabilizes the influence of textual information by incorporating high-level semantic information from the corresponding image alongside the textual description. This approach resolves textual ambiguities and robustly outputs a set of linear transformation parameters (scalars) that can be globally applied to the relative depth map, ultimately generating depth predictions with metric-scale accuracy. We validate our method across several popular relative depth models (MiDas, DepthAnything), using both indoor scenes (NYUv2) and outdoor scenes (KITTI). Our results demonstrate that VGLD functions as a universal alignment module when trained on multiple datasets, achieving strong performance even in zero-shot scenarios. Code is available at: this https URL.
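The final step described in this abstract, applying globally predicted linear parameters to a relative depth map, can be sketched as below. This assumes the parameters are a single scale and a single shift; the function name and the toy values are illustrative and are not taken from the paper.

```python
def apply_global_linear_transform(relative_depth, scale, shift):
    """Map a relative depth map to metric depth with one global scale and shift.

    relative_depth is a 2D list of floats; the same (scale, shift) pair is
    applied to every pixel, which is what "globally applied" means here.
    """
    return [[scale * d + shift for d in row] for row in relative_depth]


# Toy 2x2 relative depth map and made-up predicted parameters.
relative = [[0.0, 0.5], [1.0, 0.25]]
metric = apply_global_linear_transform(relative, scale=8.0, shift=2.0)
```

Only two scalars need to be predicted per image under this assumption, so the relative depth model itself can stay frozen.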

Title: SoK: Stealing Cars Since Remote Keyless Entry Introduction and How to Defend From It

Authors: Tommaso Bianchi, Alessandro Brighente, Mauro Conti, Edoardo Pavan
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.02713
Pdf URL: https://arxiv.org/pdf/2505.02713
Copy Paste: [[2505.02713]] SoK: Stealing Cars Since Remote Keyless Entry Introduction and How to Defend From It(https://arxiv.org/abs/2505.02713)
Keywords: security, protect, defense, attack, steal
Abstract: Remote Keyless Entry (RKE) systems have been the target of thieves since their introduction in the automotive industry. Robberies targeting vehicles and their remote entry systems are booming again without a significant advancement from the industrial sector being able to protect against them. Researchers and attackers continuously play cat and mouse to implement new methodologies to exploit weaknesses and defense strategies for RKEs. In this fragmented landscape, different attacks and defenses have been discussed in research and industry without proper bridging. In this paper, we provide a Systematization Of Knowledge (SOK) on RKE and Passive Keyless Entry and Start (PKES), focusing on their history and current situation, ranging from legacy systems to modern web-based ones. We provide insight into vehicle manufacturers' technologies and attacks and defense mechanisms involving them. To the best of our knowledge, this is the first comprehensive SOK on RKE systems, and we address specific research questions to understand the evolution and security status of such systems. By identifying the weaknesses RKE still faces, we provide future directions for security researchers and companies to find viable solutions to address old attacks, such as Relay and RollJam, as well as new ones, like API vulnerabilities.

Title: Less is More: Efficient Weight Farcasting with 1-Layer Neural Network

Authors: Xiao Shou, Debarun Bhattacharjya, Yanna Ding, Chen Zhao, Rui Li, Jianxi Gao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.02714
Pdf URL: https://arxiv.org/pdf/2505.02714
Copy Paste: [[2505.02714]] Less is More: Efficient Weight Farcasting with 1-Layer Neural Network(https://arxiv.org/abs/2505.02714)
Keywords: large language model
Abstract: Addressing the computational challenges inherent in training large-scale deep neural networks remains a critical endeavor in contemporary machine learning research. While previous efforts have focused on enhancing training efficiency through techniques such as gradient descent with momentum, learning rate scheduling, and weight regularization, the demand for further innovation continues to burgeon as model sizes keep expanding. In this study, we introduce a novel framework which diverges from conventional approaches by leveraging long-term time series forecasting techniques. Our method capitalizes solely on initial and final weight values, offering a streamlined alternative for complex model architectures. We also introduce a novel regularizer that is tailored to enhance the forecasting performance of our approach. Empirical evaluations conducted on synthetic weight sequences and real-world deep learning architectures, including the prominent large language model DistilBERT, demonstrate the superiority of our method in terms of forecasting accuracy and computational efficiency. Notably, our framework showcases improved performance while requiring minimal additional computational overhead, thus presenting a promising avenue for accelerating the training process across diverse tasks and architectures.

Title: Acoustic Side-Channel Attacks on a Computer Mouse

Authors: Mauro Conti, Marin Duroyon, Gabriele Orazi, Gene Tsudik
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.02725
Pdf URL: https://arxiv.org/pdf/2505.02725
Copy Paste: [[2505.02725]] Acoustic Side-Channel Attacks on a Computer Mouse(https://arxiv.org/abs/2505.02725)
Keywords: security, attack
Abstract: Acoustic Side-Channel Attacks (ASCAs) extract sensitive information by using audio emitted from computing devices and their peripherals. Attacks targeting keyboards are popular and have been explored in the literature. However, similar attacks targeting other human interface peripherals, such as computer mice, are under-explored. To this end, this paper considers security leakage via acoustic signals emanating from normal mouse usage. We first confirm feasibility of such attacks by showing a proof-of-concept attack that classifies four mouse movements with 97% accuracy in a controlled environment. We then evolve the attack towards discerning twelve unique mouse movements using a smartphone to record the experiment. Using Machine Learning (ML) techniques, the model is trained on an experiment with six participants to be generalizable and discern among twelve movements with 94% accuracy. In addition, we experiment with an attack that detects a user action of closing a full-screen window on a laptop. Achieving an accuracy of 91%, this experiment highlights exploiting audio leakage from computer mouse movements in a realistic scenario.

Title: Knowledge Graphs for Enhancing Large Language Models in Entity Disambiguation

Authors: Pons Gerard, Bilalli Besim, Queralt Anna
Subjects: cs.LG, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2505.02737
Pdf URL: https://arxiv.org/pdf/2505.02737
Copy Paste: [[2505.02737]] Knowledge Graphs for Enhancing Large Language Models in Entity Disambiguation(https://arxiv.org/abs/2505.02737)
Keywords: large language model
Abstract: Recent advances in Large Language Models (LLMs) have positioned them as a prominent solution for Natural Language Processing tasks. Notably, they can approach these problems in a zero or few-shot manner, thereby eliminating the need for training or fine-tuning task-specific models. However, LLMs face some challenges, including hallucination and the presence of outdated knowledge or missing information from specific domains in the training data. These problems cannot be easily solved by retraining the models with new data as it is a time-consuming and expensive process. To mitigate these issues, Knowledge Graphs (KGs) have been proposed as a structured external source of information to enrich LLMs. With this idea, in this work we use KGs to enhance LLMs for zero-shot Entity Disambiguation (ED). For that purpose, we leverage the hierarchical representation of the entities' classes in a KG to gradually prune the candidate space as well as the entities' descriptions to enrich the input prompt with additional factual knowledge. Our evaluation on popular ED datasets shows that the proposed method outperforms non-enhanced and description-only enhanced LLMs, and has a higher degree of adaptability than task-specific models. Furthermore, we conduct an error analysis and discuss the impact of the leveraged KG's semantic expressivity on the ED performance.

Title: Cooperative Bayesian and variance networks disentangle aleatoric and epistemic uncertainties

Authors: Jiaxiang Yi, Miguel A. Bessa
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.02743
Pdf URL: https://arxiv.org/pdf/2505.02743
Copy Paste: [[2505.02743]] Cooperative Bayesian and variance networks disentangle aleatoric and epistemic uncertainties(https://arxiv.org/abs/2505.02743)
Keywords: robust
Abstract: Real-world data contains aleatoric uncertainty - irreducible noise arising from imperfect measurements or from incomplete knowledge about the data generation process. Mean variance estimation (MVE) networks can learn this type of uncertainty but require ad-hoc regularization strategies to avoid overfitting and are unable to predict epistemic uncertainty (model uncertainty). Conversely, Bayesian neural networks predict epistemic uncertainty but are notoriously difficult to train due to the approximate nature of Bayesian inference. We propose to cooperatively train a variance network with a Bayesian neural network and demonstrate that the resulting model disentangles aleatoric and epistemic uncertainties while improving the mean estimation. We demonstrate the effectiveness and scalability of this method across a diverse range of datasets, including a time-dependent heteroscedastic regression dataset we created where the aleatoric uncertainty is known. The proposed method is straightforward to implement, robust, and adaptable to various model architectures.
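
Illustrative sketch (not from the paper): one common way to separate the two uncertainty types at prediction time is to treat the spread of the mean predictions across posterior samples as epistemic variance and the output of a variance network as aleatoric variance. The snippet below shows that generic decomposition for a single input. The function name and all numbers are hypothetical, and the authors' cooperative training procedure is not reproduced here.

```python
from statistics import fmean, pvariance
from typing import List, Tuple


def decompose_uncertainty(mean_samples: List[float],
                          aleatoric_var: float) -> Tuple[float, float, float]:
    """Return (aleatoric, epistemic, total) variance for one input.

    mean_samples: mean predictions from several posterior samples of a
        Bayesian network (hypothetical values in this sketch).
    aleatoric_var: noise variance predicted by a separate variance network
        (also hypothetical here).
    """
    epistemic_var = pvariance(mean_samples)  # disagreement between samples
    return aleatoric_var, epistemic_var, aleatoric_var + epistemic_var


# Two posterior samples predict means of 1.0 and 3.0 for the same input.
aleatoric, epistemic, total = decompose_uncertainty([1.0, 3.0], aleatoric_var=0.5)
print(fmean([1.0, 3.0]), aleatoric, epistemic, total)
```

The additive form assumes the two sources are combined through the law of total variance, which is a standard modeling choice rather than something stated in the abstract.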

Title: Using Knowledge Graphs to harvest datasets for efficient CLIP model training

Authors: Simon Ging, Sebastian Walter, Jelena Bratulić, Johannes Dienert, Hannah Bast, Thomas Brox
Subjects: cs.CV, cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.02746
Pdf URL: https://arxiv.org/pdf/2505.02746
Copy Paste: [[2505.02746]] Using Knowledge Graphs to harvest datasets for efficient CLIP model training(https://arxiv.org/abs/2505.02746)
Keywords: robust
Abstract: Training high-quality CLIP models typically requires enormous datasets, which limits the development of domain-specific models -- especially in areas that even the largest CLIP models do not cover well -- and drives up training costs. This poses challenges for scientific research that needs fine-grained control over the training procedure of CLIP models. In this work, we show that by employing smart web search strategies enhanced with knowledge graphs, a robust CLIP model can be trained from scratch with considerably less data. Specifically, we demonstrate that an expert foundation model for living organisms can be built using just 10M images. Moreover, we introduce EntityNet, a dataset comprising 33M images paired with 46M text descriptions, which enables the training of a generic CLIP model in significantly reduced time.

Title: Platelet enumeration in dense aggregates

Authors: H. Martin Gillis, Yogeshwar Shendye, Paul Hollensen, Alan Fine, Thomas Trappenberg
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02751
Pdf URL: https://arxiv.org/pdf/2505.02751
Copy Paste: [[2505.02751]] Platelet enumeration in dense aggregates(https://arxiv.org/abs/2505.02751)
Keywords: segmentation
Abstract: Identifying and counting blood components such as red blood cells, various types of white blood cells, and platelets is a critical task for healthcare practitioners. Deep learning approaches, particularly convolutional neural networks (CNNs) using supervised learning strategies, have shown considerable success for such tasks. However, CNN-based architectures such as U-Net often struggle to accurately identify platelets due to their size and the high variability of their features. To address these challenges, researchers have commonly employed strategies such as class-weighted loss functions, which have demonstrated some success. However, this does not address the more significant challenge of platelet variability in size and the tendency of platelets to form aggregates and associations with other blood components. In this study, we explored an alternative approach by investigating the role of convolutional kernels in mitigating these issues. We also assigned separate classes to singular platelets and platelet aggregates and performed semantic segmentation using various U-Net architectures for identifying platelets. We then evaluated and compared two common methods (pixel area method and connected component analysis) for counting platelets and proposed an alternative approach specialized for single platelets and platelet aggregates. Our experiments showed significant improvements in the identification of platelets, highlighting the importance of optimizing convolutional operations and class designations. We show that the common practice of pixel area-based counting often overestimates platelet counts, whereas the method proposed in this work offers significant improvements. We discuss in detail how these methods count platelets from segmentation masks.
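
Illustrative sketch (not from the paper): the two common counting methods named in the abstract can be contrasted on a toy binary segmentation mask. The pixel area method divides the foreground pixel count by an assumed mean platelet area, while connected component analysis counts separate foreground regions. The mask, the mean area of 2 pixels, and the function names are hypothetical, and the authors' proposed alternative is not shown.

```python
from collections import deque
from typing import List

Mask = List[List[int]]


def count_by_pixel_area(mask: Mask, mean_area: float) -> int:
    """Estimate the object count as total foreground area / assumed mean area."""
    total = sum(sum(row) for row in mask)
    return round(total / mean_area)


def count_components(mask: Mask) -> int:
    """Count 4-connected foreground regions with a breadth-first flood fill."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                count += 1
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


# Toy mask: one 4-pixel region and two 2-pixel regions.
mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 1, 0, 0],
]

area_count = count_by_pixel_area(mask, mean_area=2.0)
component_count = count_components(mask)
print(area_count, component_count)
```

On this toy mask the two estimates differ because the largest region has twice the assumed mean area. Neither number says which method is right for real aggregates, where a single connected region may contain several platelets.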

Title: Advancing Generalizable Tumor Segmentation with Anomaly-Aware Open-Vocabulary Attention Maps and Frozen Foundation Diffusion Models

Authors: Yankai Jiang, Peng Zhang, Donglin Yang, Yuan Tian, Hai Lin, Xiaosong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02753
Pdf URL: https://arxiv.org/pdf/2505.02753
Copy Paste: [[2505.02753]] Advancing Generalizable Tumor Segmentation with Anomaly-Aware Open-Vocabulary Attention Maps and Frozen Foundation Diffusion Models(https://arxiv.org/abs/2505.02753)
Keywords: diffusion, segmentation
Abstract: We explore Generalizable Tumor Segmentation, aiming to train a single model for zero-shot tumor segmentation across diverse anatomical regions. Existing methods face limitations related to segmentation quality, scalability, and the range of applicable imaging modalities. In this paper, we uncover the potential of the internal representations within frozen medical foundation diffusion models as highly efficient zero-shot learners for tumor segmentation by introducing a novel framework named DiffuGTS. DiffuGTS creates anomaly-aware open-vocabulary attention maps based on text prompts to enable generalizable anomaly segmentation without being restricted by a predefined training category list. To further improve and refine anomaly segmentation masks, DiffuGTS leverages the diffusion model, transforming pathological regions into high-quality pseudo-healthy counterparts through latent space inpainting, and applies a novel pixel-level and feature-level residual learning approach, resulting in segmentation masks with significantly enhanced quality and generalization. Comprehensive experiments on four datasets and seven tumor categories demonstrate the superior performance of our method, surpassing current state-of-the-art models across multiple zero-shot settings. Codes are available at this https URL.

Title: Bye-bye, Bluebook? Automating Legal Procedure with Large Language Models

Authors: Matthew Dahl
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2505.02763
Pdf URL: https://arxiv.org/pdf/2505.02763
Copy Paste: [[2505.02763]] Bye-bye, Bluebook? Automating Legal Procedure with Large Language Models(https://arxiv.org/abs/2505.02763)
Keywords: large language model
Abstract: Legal practice requires careful adherence to procedural rules. In the United States, few are more complex than those found in The Bluebook: A Uniform System of Citation. Compliance with this system's 500+ pages of byzantine formatting instructions is the raison d'etre of thousands of student law review editors and the bete noire of lawyers everywhere. To evaluate whether large language models (LLMs) are able to adhere to the procedures of such a complicated system, we construct an original dataset of 866 Bluebook tasks and test flagship LLMs from OpenAI, Anthropic, Google, Meta, and DeepSeek. We show (1) that these models produce fully compliant Bluebook citations only 69%-74% of the time and (2) that in-context learning on the Bluebook's underlying system of rules raises accuracy only to 77%. These results caution against using off-the-shelf LLMs to automate aspects of the law where fidelity to procedure is paramount.

Title: Advances in Automated Fetal Brain MRI Segmentation and Biometry: Insights from the FeTA 2024 Challenge

Authors: Vladyslav Zalevskyi, Thomas Sanchez, Misha Kaandorp, Margaux Roulet, Diego Fajardo-Rojas, Liu Li, Jana Hutter, Hongwei Bran Li, Matthew Barkovich, Hui Ji, Luca Wilhelmi, Aline Dändliker, Céline Steger, Mériam Koob, Yvan Gomez, Anton Jakovčić, Melita Klaić, Ana Adžić, Pavel Marković, Gracia Grabarić, Milan Rados, Jordina Aviles Verdera, Gregor Kasprian, Gregor Dovjak, Raphael Gaubert-Rachmühl, Maurice Aschwanden, Qi Zeng, Davood Karimi, Denis Peruzzo, Tommaso Ciceri, Giorgio Longari, Rachika E. Hamadache, Amina Bouzid, Xavier Lladó, Simone Chiarella, Gerard Martí-Juan, Miguel Ángel González Ballester, Marco Castellaro, Marco Pinamonti, Valentina Visani, Robin Cremese, Keïn Sam, Fleur Gaudfernau, Param Ahir, Mehul Parikh, Maximilian Zenk, Michael Baumgartner, Klaus Maier-Hein, Li Tianhong, Yang Hong, Zhao Longfei, Domen Preloznik, Žiga Špiclin, Jae Won Choi, Muyang Li, Jia Fu, Guotai Wang, Jingwen Jiang, Lyuyang Tong, Bo Du, Andrea Gondova, Sungmin You, Kiho Im, Abdul Qayyum, Moona Mazher, Steven A Niederer, Maya Yanko, Bella Specktor-Fadida, Dafna Ben Bashat, Andras Jakab, Roxane Licandro, Kelly Payette, Meritxell Bach Cuadra
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02784
Pdf URL: https://arxiv.org/pdf/2505.02784
Copy Paste: [[2505.02784]] Advances in Automated Fetal Brain MRI Segmentation and Biometry: Insights from the FeTA 2024 Challenge(https://arxiv.org/abs/2505.02784)
Keywords: robust, biometric, segmentation
Abstract: Accurate fetal brain tissue segmentation and biometric analysis are essential for studying brain development in utero. The FeTA Challenge 2024 advanced automated fetal brain MRI analysis by introducing biometry prediction as a new task alongside tissue segmentation. For the first time, our diverse multi-centric test set included data from a new low-field (0.55T) MRI dataset. Evaluation metrics were also expanded to include the topology-specific Euler characteristic difference (ED). Sixteen teams submitted segmentation methods, most of which performed consistently across both high- and low-field scans. However, longitudinal trends indicate that segmentation accuracy may be reaching a plateau, with results now approaching inter-rater variability. The ED metric uncovered topological differences that were missed by conventional metrics, while the low-field dataset achieved the highest segmentation scores, highlighting the potential of affordable imaging systems when paired with high-quality reconstruction. Seven teams participated in the biometry task, but most methods failed to outperform a simple baseline that predicted measurements based solely on gestational age, underscoring the challenge of extracting reliable biometric estimates from image data alone. Domain shift analysis identified image quality as the most significant factor affecting model generalization, with super-resolution pipelines also playing a substantial role. Other factors, such as gestational age, pathology, and acquisition site, had smaller, though still measurable, effects. Overall, FeTA 2024 offers a comprehensive benchmark for multi-class segmentation and biometry estimation in fetal brain MRI, underscoring the need for data-centric approaches, improved topological evaluation, and greater dataset diversity to enable clinically robust and generalizable AI tools.

Title: HSplitLoRA: A Heterogeneous Split Parameter-Efficient Fine-Tuning Framework for Large Language Models

Authors: Zheng Lin, Yuxin Zhang, Zhe Chen, Zihan Fang, Xianhao Chen, Praneeth Vepakomma, Wei Ni, Jun Luo, Yue Gao
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2505.02795
Pdf URL: https://arxiv.org/pdf/2505.02795
Copy Paste: [[2505.02795]] HSplitLoRA: A Heterogeneous Split Parameter-Efficient Fine-Tuning Framework for Large Language Models(https://arxiv.org/abs/2505.02795)
Keywords: federate, large language model
Abstract: Recently, large language models (LLMs) have achieved remarkable breakthroughs, revolutionizing the natural language processing domain and beyond. Due to immense parameter sizes, fine-tuning these models with private data for diverse downstream tasks has become mainstream. Though federated learning (FL) offers a promising solution for fine-tuning LLMs without sharing raw data, substantial computing costs hinder its democratization. Moreover, in real-world scenarios, private client devices often possess heterogeneous computing resources, further complicating LLM fine-tuning. To combat these challenges, we propose HSplitLoRA, a heterogeneous parameter-efficient fine-tuning (PEFT) framework built on split learning (SL) and low-rank adaptation (LoRA) fine-tuning, for efficiently fine-tuning LLMs on heterogeneous client devices. HSplitLoRA first identifies important weights based on their contributions to LLM training. It then dynamically configures the decomposition ranks of LoRA adapters for selected weights and determines the model split point according to varying computing budgets of client devices. Finally, a noise-free adapter aggregation mechanism is devised to support heterogeneous adapter aggregation without introducing noise. Extensive experiments demonstrate that HSplitLoRA outperforms state-of-the-art benchmarks in training accuracy and convergence speed.

Title: Towards Quantifying the Hessian Structure of Neural Networks

Authors: Zhaorui Dong, Yushun Zhang, Zhi-Quan Luo, Jianfeng Yao, Ruoyu Sun
Subjects: cs.LG, math.OC, stat.ML
Abstract URL: https://arxiv.org/abs/2505.02809
Pdf URL: https://arxiv.org/pdf/2505.02809
Copy Paste: [[2505.02809]] Towards Quantifying the Hessian Structure of Neural Networks(https://arxiv.org/abs/2505.02809)
Keywords: large language model
Abstract: Empirical studies reported that the Hessian matrix of neural networks (NNs) exhibits a near-block-diagonal structure, yet its theoretical foundation remains unclear. In this work, we reveal two forces that shape the Hessian structure: a ``static force'' rooted in the architecture design, and a ``dynamic force'' arising from training. We then provide a rigorous theoretical analysis of ``static force'' at random initialization. We study linear models and 1-hidden-layer networks with the mean-square (MSE) loss and the Cross-Entropy (CE) loss for classification tasks. By leveraging random matrix theory, we compare the limit distributions of the diagonal and off-diagonal Hessian blocks and find that the block-diagonal structure arises as $C \rightarrow \infty$, where $C$ denotes the number of classes. Our findings reveal that $C$ is a primary driver of the near-block-diagonal structure. These results may shed new light on the Hessian structure of large language models (LLMs), which typically operate with a large $C$ exceeding $10^4$ or $10^5$.

Title: Database-Agnostic Gait Enrollment using SetTransformers

Authors: Nicoleta Basoc, Adrian Cosma, Andy Cǎtrunǎ, Emilian Rǎdoi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02815
Pdf URL: https://arxiv.org/pdf/2505.02815
Copy Paste: [[2505.02815]] Database-Agnostic Gait Enrollment using SetTransformers(https://arxiv.org/abs/2505.02815)
Keywords: transformer
Abstract: Gait recognition has emerged as a powerful tool for unobtrusive and long-range identity analysis, with growing relevance in surveillance and monitoring applications. Although recent advances in deep learning and large-scale datasets have enabled highly accurate recognition under closed-set conditions, real-world deployment demands open-set gait enrollment, which means determining whether a new gait sample corresponds to a known identity or represents a previously unseen individual. In this work, we introduce a transformer-based framework for open-set gait enrollment that is both dataset-agnostic and recognition-architecture-agnostic. Our method leverages a SetTransformer to make enrollment decisions based on the embedding of a probe sample and a context set drawn from the gallery, without requiring task-specific thresholds or retraining for new environments. By decoupling enrollment from the main recognition pipeline, our model generalizes across different datasets, gallery sizes, and identity distributions. We propose an evaluation protocol that uses existing datasets in different ratios of identities and walks per identity. We instantiate our method using skeleton-based gait representations and evaluate it on two benchmark datasets (CASIA-B and PsyMo), using embeddings from three state-of-the-art recognition models (GaitGraph, GaitFormer, and GaitPT). We show that our method is flexible, is able to accurately perform enrollment in different scenarios, and scales better with data compared to traditional approaches. We will make the code and dataset scenarios publicly available.

Title: ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations

Authors: Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis, Nikos Komodakis, Sergey Zagoruyko
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.02819
Pdf URL: https://arxiv.org/pdf/2505.02819
Copy Paste: [[2505.02819]] ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations(https://arxiv.org/abs/2505.02819)
Keywords: transformer, large language model
Abstract: We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional pruning approaches that require additional training or fine-tuning, our approach requires only a small calibration dataset that is used to estimate a linear transformation to approximate the pruned blocks. This estimated linear mapping can be seamlessly merged with the remaining transformer blocks, eliminating the need for any additional network parameters. Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining/fine-tuning and architectural modifications. Applied to several large language models (LLMs), ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model's performance on open benchmarks - without any training or healing steps, resulting in minimal computational overhead (see Fig.1). We provide an open-source library implementing ReplaceMe alongside several state-of-the-art depth pruning techniques, available at this repository.
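
Illustrative sketch (not from the paper): the abstract says a linear transformation is estimated from a small calibration dataset to approximate the pruned blocks, but it does not state the objective. One natural formulation, assumed here purely for illustration, is ordinary least squares. Let $X \in \mathbb{R}^{n \times d}$ collect the hidden states entering the pruned blocks for $n$ calibration tokens and $Y \in \mathbb{R}^{n \times d}$ the hidden states leaving them; the symbols $X$, $Y$, and $T$ are hypothetical notation.

```latex
T^{*} \;=\; \arg\min_{T \in \mathbb{R}^{d \times d}} \; \lVert X T - Y \rVert_F^{2}
\qquad\Longrightarrow\qquad
T^{*} \;=\; (X^{\top} X)^{-1} X^{\top} Y
\quad \text{(assuming } X^{\top} X \text{ is invertible)}.
```

Because $T^{*}$ is a single $d \times d$ matrix, it can in principle be folded into an adjacent linear layer of the remaining network, which matches the abstract's remark that no additional parameters are needed.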

Title: MUSAR: Exploring Multi-Subject Customization from Single-Subject Dataset via Attention Routing

Authors: Zinan Guo, Pengze Zhang, Yanze Wu, Chong Mou, Songtao Zhao, Qian He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02823
Pdf URL: https://arxiv.org/pdf/2505.02823
Copy Paste: [[2505.02823]] MUSAR: Exploring Multi-Subject Customization from Single-Subject Dataset via Attention Routing(https://arxiv.org/abs/2505.02823)
Keywords: robust
Abstract: Current multi-subject customization approaches encounter two critical challenges: the difficulty in acquiring diverse multi-subject training data, and attribute entanglement across different subjects. To bridge these gaps, we propose MUSAR - a simple yet effective framework to achieve robust multi-subject customization while requiring only single-subject training data. Firstly, to break the data limitation, we introduce debiased diptych learning. It constructs diptych training pairs from single-subject images to facilitate multi-subject learning, while actively correcting the distribution bias introduced by diptych construction via static attention routing and dual-branch LoRA. Secondly, to eliminate cross-subject entanglement, we introduce a dynamic attention routing mechanism, which adaptively establishes bijective mappings between generated images and conditional subjects. This design not only achieves decoupling of multi-subject representations but also maintains scalable generalization performance with increasing reference subjects. Comprehensive experiments demonstrate that our MUSAR outperforms existing methods - even those trained on multi-subject datasets - in image quality, subject consistency, and interaction naturalness, despite requiring only a single-subject dataset.

Title: Towards Dataset Copyright Evasion Attack against Personalized Text-to-Image Diffusion Models

Authors: Kuofeng Gao, Yufei Zhu, Yiming Li, Jiawang Bai, Yong Yang, Zhifeng Li, Shu-Tao Xia
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2505.02824
Pdf URL: https://arxiv.org/pdf/2505.02824
Copy Paste: [[2505.02824]] Towards Dataset Copyright Evasion Attack against Personalized Text-to-Image Diffusion Models(https://arxiv.org/abs/2505.02824)
Keywords: attack, robust, watermark, diffusion
Abstract: Text-to-image (T2I) diffusion models have rapidly advanced, enabling high-quality image generation conditioned on textual prompts. However, the growing trend of fine-tuning pre-trained models for personalization raises serious concerns about unauthorized dataset usage. To combat this, dataset ownership verification (DOV) has emerged as a solution, embedding watermarks into the fine-tuning datasets using backdoor techniques. These watermarks remain inactive under benign samples but produce owner-specified outputs when triggered. Despite the promise of DOV for T2I diffusion models, its robustness against copyright evasion attacks (CEA) remains unexplored. In this paper, we explore how attackers can bypass these mechanisms through CEA, allowing models to circumvent watermarks even when trained on watermarked datasets. We propose the first copyright evasion attack (i.e., CEAT2I) specifically designed to undermine DOV in T2I diffusion models. Concretely, our CEAT2I comprises three stages: watermarked sample detection, trigger identification, and efficient watermark mitigation. A key insight driving our approach is that T2I models exhibit faster convergence on watermarked samples during fine-tuning, evident through intermediate feature deviation. Leveraging this, CEAT2I can reliably detect the watermarked samples. Then, we iteratively ablate tokens from the prompts of detected watermarked samples and monitor shifts in intermediate features to pinpoint the exact trigger tokens. Finally, we adopt a closed-form concept erasure method to remove the injected watermark. Extensive experiments show that our CEAT2I effectively evades DOV mechanisms while preserving model performance.

Title: AOR: Anatomical Ontology-Guided Reasoning for Medical Large Multimodal Model in Chest X-Ray Interpretation

Authors: Qingqiu Li, Zihang Cui, Seongsu Bae, Jilan Xu, Runtian Yuan, Yuejie Zhang, Rui Feng, Quanli Shen, Xiaobo Zhang, Junjun He, Shujun Wang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.02830
Pdf URL: https://arxiv.org/pdf/2505.02830
Copy Paste: [[2505.02830]] AOR: Anatomical Ontology-Guided Reasoning for Medical Large Multimodal Model in Chest X-Ray Interpretation(https://arxiv.org/abs/2505.02830)
Keywords: interpretability, explainability
Abstract: Chest X-rays (CXRs) are the most frequently performed imaging examinations in clinical settings. Recent advancements in Large Multimodal Models (LMMs) have enabled automated CXR interpretation, enhancing diagnostic accuracy and efficiency. However, despite their strong visual understanding, current Medical LMMs (MLMMs) still face two major challenges: (1) Insufficient region-level understanding and interaction, and (2) Limited accuracy and interpretability due to single-step reasoning. In this paper, we empower MLMMs with anatomy-centric reasoning capabilities to enhance their interactivity and explainability. Specifically, we first propose an Anatomical Ontology-Guided Reasoning (AOR) framework, which centers on cross-modal region-level information to facilitate multi-step reasoning. Next, under the guidance of expert physicians, we develop AOR-Instruction, a large instruction dataset for MLMMs training. Our experiments demonstrate AOR's superior performance in both VQA and report generation tasks.

Title: No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves

Authors: Dengyang Jiang, Mengmeng Wang, Liuzhuozheng Li, Lei Zhang, Haoyu Wang, Wei Wei, Guang Dai, Yanning Zhang, Jingdong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02831
Pdf URL: https://arxiv.org/pdf/2505.02831
Copy Paste: [[2505.02831]] No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves(https://arxiv.org/abs/2505.02831)
Keywords: diffusion, transformer, generative
Abstract: Recent studies have demonstrated that learning a meaningful internal representation can both accelerate generative training and enhance generation quality of the diffusion transformers. However, existing approaches need to either introduce an additional and complex representation training framework or rely on a large-scale, pre-trained representation foundation model to provide representation guidance during the original generative training process. In this study, we posit that the unique discriminative process inherent to diffusion transformers enables them to offer such guidance without requiring external representation components. We therefore propose Self-Representation Alignment (SRA), a simple yet straightforward method that obtains representation guidance in a self-distillation manner. Specifically, SRA aligns the output latent representation of the diffusion transformer in an earlier layer with higher noise to that in a later layer with lower noise to progressively enhance the overall representation learning during the generative training process alone. Experimental results indicate that applying SRA to DiTs and SiTs yields consistent performance improvements. Moreover, SRA not only significantly outperforms approaches relying on auxiliary, complex representation training frameworks but also achieves performance comparable to methods that depend heavily on powerful external representation priors.

Title: R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning

Authors: Yi-Fan Zhang, Xingyu Lu, Xiao Hu, Chaoyou Fu, Bin Wen, Tianke Zhang, Changyi Liu, Kaiyu Jiang, Kaibing Chen, Kaiyu Tang, Haojie Ding, Jiankang Chen, Fan Yang, Zhang Zhang, Tingting Gao, Liang Wang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.02835
Pdf URL: https://arxiv.org/pdf/2505.02835
Copy Paste: [[2505.02835]] R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning(https://arxiv.org/abs/2505.02835)
Keywords: large language model
Abstract: Multimodal Reward Models (MRMs) play a crucial role in enhancing the performance of Multimodal Large Language Models (MLLMs). While recent advancements have primarily focused on improving the model structure and training data of MRMs, there has been limited exploration into the effectiveness of long-term reasoning capabilities for reward modeling and how to activate these capabilities in MRMs. In this paper, we explore how Reinforcement Learning (RL) can be used to improve reward modeling. Specifically, we reformulate the reward modeling problem as a rule-based RL task. However, we observe that directly applying existing RL algorithms, such as Reinforce++, to reward modeling often leads to training instability or even collapse due to the inherent limitations of these algorithms. To address this issue, we propose the StableReinforce algorithm, which refines the training loss, advantage estimation strategy, and reward design of existing RL methods. These refinements result in more stable training dynamics and superior performance. To facilitate MRM training, we collect 200K preference data from diverse datasets. Our reward model, R1-Reward, trained using the StableReinforce algorithm on this dataset, significantly improves performance on multimodal reward modeling benchmarks. Compared to previous SOTA models, R1-Reward achieves an $8.4\%$ improvement on the VL Reward-Bench and a $14.3\%$ improvement on the Multimodal Reward Bench. Moreover, with more inference compute, R1-Reward's performance is further enhanced, highlighting the potential of RL algorithms in optimizing MRMs.
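One way to read "rule-based RL task" here is that the model's stated preference is checked against the labeled preference by a fixed rule. The sketch below illustrates that reading only. The `<answer>` tag format, the 0/1 reward values, and the name `rule_based_reward` are hypothetical, and the paper's actual reward design (which the abstract says StableReinforce refines) may differ.

```python
import re

# Hypothetical output format: the model ends its reasoning with
# <answer>1</answer> or <answer>2</answer> to name the preferred response.
ANSWER_PATTERN = re.compile(r"<answer>\s*([12])\s*</answer>")


def rule_based_reward(model_output: str, preferred: int) -> float:
    """Return 1.0 if the parsed choice matches the preference label, else 0.0.

    Outputs with no parsable answer tag also receive 0.0 (an assumption).
    """
    match = ANSWER_PATTERN.search(model_output)
    if match is None:
        return 0.0
    return 1.0 if int(match.group(1)) == preferred else 0.0


# Usage: score three sampled judgments against a label that prefers response 2.
samples = [
    "Response 2 describes the image more accurately. <answer>2</answer>",
    "Response 1 is shorter. <answer>1</answer>",
    "I cannot decide.",
]
rewards = [rule_based_reward(s, preferred=2) for s in samples]
```

A rule of this kind needs no learned critic to score a rollout, which is what makes the task "rule-based"; the scalar rewards would then feed whichever policy-gradient update is used.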

Title: Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation

Authors: Lu Ling, Chen-Hsuan Lin, Tsung-Yi Lin, Yifan Ding, Yu Zeng, Yichen Sheng, Yunhao Ge, Ming-Yu Liu, Aniket Bera, Zhaoshuo Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02836
Pdf URL: https://arxiv.org/pdf/2505.02836
Copy Paste: [[2505.02836]] Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation(https://arxiv.org/abs/2505.02836)
Keywords: large language model
Abstract: Synthesizing interactive 3D scenes from text is essential for gaming, virtual reality, and embodied AI. However, existing methods face several challenges. Learning-based approaches depend on small-scale indoor datasets, limiting scene diversity and layout complexity. While large language models (LLMs) can leverage diverse text-domain knowledge, they struggle with spatial realism, often producing unnatural object placements that fail to respect common sense. Our key insight is that vision perception can bridge this gap by providing realistic spatial guidance that LLMs lack. To this end, we introduce Scenethesis, a training-free agentic framework that integrates LLM-based scene planning with vision-guided layout refinement. Given a text prompt, Scenethesis first employs an LLM to draft a coarse layout. A vision module then refines it by generating image guidance and extracting scene structure to capture inter-object relations. Next, an optimization module iteratively enforces accurate pose alignment and physical plausibility, preventing artifacts like object penetration and instability. Finally, a judge module verifies spatial coherence. Comprehensive experiments show that Scenethesis generates diverse, realistic, and physically plausible 3D interactive scenes, making it valuable for virtual content creation, simulation environments, and embodied AI research.