2025-05-19

Title: Robust Emotion Recognition via Bi-Level Self-Supervised Continual Learning

Authors: Adnan Ahmad, Bahareh Nakisa, Mohammad Naim Rastgoo
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.10575
Pdf URL: https://arxiv.org/pdf/2505.10575
Copy Paste: [[2505.10575]] Robust Emotion Recognition via Bi-Level Self-Supervised Continual Learning(https://arxiv.org/abs/2505.10575)
Keywords: robust
Abstract: Emotion recognition through physiological signals such as electroencephalogram (EEG) has become an essential aspect of affective computing and provides an objective way to capture human emotions. However, physiological data characterized by cross-subject variability and noisy labels hinder the performance of emotion recognition models. Existing domain adaptation and continual learning methods struggle to address these issues, especially under realistic conditions where data is continuously streamed and unlabeled. To overcome these limitations, we propose a novel bi-level self-supervised continual learning framework, SSOCL, based on a dynamic memory buffer. This bi-level architecture iteratively refines the dynamic buffer and pseudo-label assignments to effectively retain representative samples, enabling generalization from continuous, unlabeled physiological data streams for emotion recognition. The assigned pseudo-labels are subsequently leveraged for accurate emotion prediction. Key components of the framework, including a fast adaptation module and a cluster-mapping module, enable robust learning and effective handling of evolving data streams. Experimental validation on two mainstream EEG tasks demonstrates the framework's ability to adapt to continuous data streams while maintaining strong generalization across subjects, outperforming existing approaches.

Title: Bias and Generalizability of Foundation Models across Datasets in Breast Mammography

Authors: Germani Elodie, Selin Türk Ilayda, Zeineddine Fatima, Mourad Charbel, Albarqouni Shadi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.10579
Pdf URL: https://arxiv.org/pdf/2505.10579
Copy Paste: [[2505.10579]] Bias and Generalizability of Foundation Models across Datasets in Breast Mammography(https://arxiv.org/abs/2505.10579)
Keywords: fair
Abstract: Over the past decades, computer-aided diagnosis tools for breast cancer have been developed to enhance screening procedures, yet their clinical adoption remains challenged by data variability and inherent biases. Although foundation models (FMs) have recently demonstrated impressive generalizability and transfer learning capabilities by leveraging vast and diverse datasets, their performance can be undermined by spurious correlations that arise from variations in image quality, labeling uncertainty, and sensitive patient attributes. In this work, we explore the fairness and bias of FMs for breast mammography classification by leveraging a large pool of datasets from diverse sources, including data from underrepresented regions and an in-house dataset. Our extensive experiments show that while modality-specific pre-training of FMs enhances performance, classifiers trained on features from individual datasets fail to generalize across domains. Aggregating datasets improves overall performance, yet does not fully mitigate biases, leading to significant disparities across underrepresented subgroups such as extreme breast densities and age groups. Furthermore, while domain-adaptation strategies can reduce these disparities, they often incur a performance trade-off. In contrast, fairness-aware techniques yield more stable and equitable performance across subgroups. These findings underscore the necessity of incorporating rigorous fairness evaluations and mitigation strategies into FM-based models to foster inclusive and generalizable AI.

Title: Relative Drawing Identification Complexity is Invariant to Modality in Vision-Language Models

Authors: Diogo Freitas, Brigt Håvardstun, Cèsar Ferri, Darío Garigliotti, Jan Arne Telle, José Hernández-Orallo
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.10583
Pdf URL: https://arxiv.org/pdf/2505.10583
Copy Paste: [[2505.10583]] Relative Drawing Identification Complexity is Invariant to Modality in Vision-Language Models(https://arxiv.org/abs/2505.10583)
Keywords: large language model
Abstract: Large language models have become multimodal, and many of them are said to integrate their modalities using common representations. If this were true, a drawing of a car as an image, for instance, should map to a similar area in the latent space as a textual description of the strokes that make up the drawing. To explore this in a black-box access regime to these models, we propose the use of machine teaching, a theory that studies the minimal set of examples a teacher needs to choose so that the learner captures the concept. In this paper we evaluate the complexity of teaching vision-language models a subset of objects in the Quick, Draw! dataset using two presentations: raw images as bitmaps and trace coordinates in TikZ format. The results indicate that image-based representations generally require fewer segments and achieve higher accuracy than coordinate-based representations. But, surprisingly, the teaching size usually ranks concepts similarly across both modalities, even when controlling for (a human proxy of) concept priors, suggesting that the simplicity of concepts may be an inherent property that transcends modality representations.

Title: Aquarius: A Family of Industry-Level Video Generation Models for Marketing Scenarios

Authors: Huafeng Shi, Jianzhong Liang, Rongchang Xie, Xian Wu, Cheng Chen, Chang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.10584
Pdf URL: https://arxiv.org/pdf/2505.10584
Copy Paste: [[2505.10584]] Aquarius: A Family of Industry-Level Video Generation Models for Marketing Scenarios(https://arxiv.org/abs/2505.10584)
Keywords: diffusion, generative
Abstract: This report introduces Aquarius, a family of industry-level video generation models for marketing scenarios, designed for clusters with thousands of xPUs and models with hundreds of billions of parameters. Leveraging efficient engineering architecture and algorithmic innovation, Aquarius demonstrates exceptional performance in high-fidelity, multi-aspect-ratio, and long-duration video synthesis. By disclosing the framework's design details, we aim to demystify industrial-scale video generation systems and catalyze advancements in the generative video community. The Aquarius framework consists of five components: (1) Distributed Graph and Video Data Processing Pipeline: manages tens of thousands of CPUs and thousands of xPUs via automated task distribution, enabling efficient video data processing. Additionally, we are about to open-source the entire data processing framework, named "Aquarius-Datapipe". (2) Model Architectures for Different Scales: include a Single-DiT architecture for 2B models and a Multimodal-DiT architecture for 13.4B models, supporting multi-aspect-ratio, multi-resolution, and multi-duration video generation. (3) High-Performance Infrastructure for Video Generation Model Training: incorporating hybrid parallelism and fine-grained memory optimization strategies, this infrastructure achieves 36% MFU at large scale. (4) Multi-xPU Parallel Inference Acceleration: utilizes diffusion cache and attention optimization to achieve a 2.35x inference speedup. (5) Multiple Marketing-Scenario Applications: including image-to-video, text-to-video (avatar), video inpainting, and video personalization, among others. More downstream applications and multi-dimensional evaluation metrics will be added in upcoming version updates.

Title: Efficient Malicious UAV Detection Using Autoencoder-TSMamba Integration

Authors: Azim Akhtarshenas, Ramin Toosi, David López-Pérez, Tohid Alizadeh, Alireza Hosseini
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2505.10585
Pdf URL: https://arxiv.org/pdf/2505.10585
Copy Paste: [[2505.10585]] Efficient Malicious UAV Detection Using Autoencoder-TSMamba Integration(https://arxiv.org/abs/2505.10585)
Keywords: robust
Abstract: Malicious Unmanned Aerial Vehicles (UAVs) present a significant threat to next-generation networks (NGNs), posing risks such as unauthorized surveillance, data theft, and the delivery of hazardous materials. This paper proposes an integrated autoencoder (AE)-classifier system to detect malicious UAVs. The proposed AE, based on a 4-layer Tri-orientated Spatial Mamba (TSMamba) architecture, effectively captures complex spatial relationships crucial for identifying malicious UAV activities. The first phase involves generating residual values through the AE, which are subsequently processed by a ResNet-based classifier. This classifier leverages the residual values to achieve lower complexity and higher accuracy. Our experiments demonstrate significant improvements in both binary and multi-class classification scenarios, achieving up to 99.8% recall compared to 96.7% in the benchmark. Additionally, our method reduces computational complexity, making it more suitable for large-scale deployment. These results highlight the robustness and scalability of our approach, offering an effective solution for malicious UAV detection in NGN environments.

Title: Super-Resolution Generative Adversarial Networks based Video Enhancement

Authors: Kağan ÇETİN
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2505.10589
Pdf URL: https://arxiv.org/pdf/2505.10589
Copy Paste: [[2505.10589]] Super-Resolution Generative Adversarial Networks based Video Enhancement(https://arxiv.org/abs/2505.10589)
Keywords: generative
Abstract: This study introduces an enhanced approach to video super-resolution by extending the ordinary Single-Image Super-Resolution (SISR) structure of the Super-Resolution Generative Adversarial Network (SRGAN) to handle spatio-temporal data. While SRGAN has proven effective for single-image enhancement, its design does not account for the temporal continuity required in video processing. To address this, a modified framework that incorporates 3D Non-Local Blocks is proposed, enabling the model to capture relationships across both spatial and temporal dimensions. An experimental training pipeline is developed, based on patch-wise learning and advanced data degradation techniques, to simulate real-world video conditions and learn from both local and global structures and details. This helps the model generalize better and remain stable across varying video content while preserving the general structure in addition to pixel-wise correctness. Two model variants, one larger and one more lightweight, are presented to explore the trade-offs between performance and efficiency. The results demonstrate improved temporal coherence, sharper textures, and fewer visual artifacts compared to traditional single-image methods. This work contributes to the development of practical, learning-based solutions for video enhancement tasks, with potential applications in streaming, gaming, and digital restoration.

Title: ARFC-WAHNet: Adaptive Receptive Field Convolution and Wavelet-Attentive Hierarchical Network for Infrared Small Target Detection

Authors: Xingye Cui, Junhai Luo, Jiakun Deng, Kexuan Li, Xiangyu Qiu, Zhenming Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.10595
Pdf URL: https://arxiv.org/pdf/2505.10595
Copy Paste: [[2505.10595]] ARFC-WAHNet: Adaptive Receptive Field Convolution and Wavelet-Attentive Hierarchical Network for Infrared Small Target Detection(https://arxiv.org/abs/2505.10595)
Keywords: robust
Abstract: Infrared small target detection (ISTD) is critical in both civilian and military applications. However, the limited texture and structural information in infrared images makes accurate detection particularly challenging. Although recent deep learning-based methods have improved performance, their use of conventional convolution kernels limits adaptability to complex scenes and diverse targets. Moreover, pooling operations often cause feature loss and insufficient exploitation of image information. To address these issues, we propose an adaptive receptive field convolution and wavelet-attentive hierarchical network for infrared small target detection (ARFC-WAHNet). This network incorporates a multi-receptive field feature interaction convolution (MRFFIConv) module to adaptively extract discriminative features by integrating multiple convolutional branches with a gated unit. A wavelet frequency enhancement downsampling (WFED) module leverages Haar wavelet transform and frequency-domain reconstruction to enhance target features and suppress background noise. Additionally, we introduce a high-low feature fusion (HLFF) module for integrating low-level details with high-level semantics, and a global median enhancement attention (GMEA) module to improve feature diversity and expressiveness via global attention. Experiments on public datasets SIRST, NUDT-SIRST, and IRSTD-1k demonstrate that ARFC-WAHNet outperforms recent state-of-the-art methods in both detection accuracy and robustness, particularly under complex backgrounds. The code is publicly available; the link is on the arXiv abstract page.
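Sketch (not taken from the paper): the abstract contrasts pooling, which discards information, with downsampling based on the Haar wavelet transform. The block below shows only a generic one-level 2D Haar transform on a toy grid; the function name `haar_dwt2`, the sub-band names, and the 4x4 image are illustrative assumptions, and the WFED module's frequency-domain reconstruction is not reproduced.

```python
def haar_dwt2(image):
    """One-level orthonormal 2D Haar transform of an even-sized grid.

    Returns four sub-bands (approximation plus three detail bands),
    each with half the height and width of the input.
    """
    approx, detail_h, detail_v, detail_d = [], [], [], []
    for r in range(0, len(image), 2):
        rows = ([], [], [], [])
        for c in range(0, len(image[0]), 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            rows[0].append((a + b + d + e) / 2)  # local average (low-pass)
            rows[1].append((a - b + d - e) / 2)  # left-right differences
            rows[2].append((a + b - d - e) / 2)  # top-bottom differences
            rows[3].append((a - b - d + e) / 2)  # diagonal differences
        approx.append(rows[0])
        detail_h.append(rows[1])
        detail_v.append(rows[2])
        detail_d.append(rows[3])
    return approx, detail_h, detail_v, detail_d


# Hypothetical 4x4 "infrared" patch with a single bright pixel as the target.
image = [
    [0, 0, 0, 0],
    [0, 0, 8, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
approx, detail_h, detail_v, detail_d = haar_dwt2(image)

# The transform is orthonormal, so signal energy is preserved across sub-bands.
energy_in = sum(v * v for row in image for v in row)
energy_out = sum(
    v * v for band in (approx, detail_h, detail_v, detail_d) for row in band for v in row
)
```

Unlike 2x2 average pooling, which would keep only the approximation band, the three detail bands retain where inside each 2x2 block the bright pixel sat, so the halved-resolution output loses no information.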

Title: Two Minds Better Than One: Collaborative Reward Modeling for LLM Alignment

Authors: Jiazheng Zhang, Wenqing Jing, Zizhuo Zhang, Zhiheng Xi, Shihan Dou, Rongxiang Weng, Jiahuan Li, Jingang Wang, MingXu Cai, Shibo Hong, Tao Gui, Qi Zhang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.10597
Pdf URL: https://arxiv.org/pdf/2505.10597
Copy Paste: [[2505.10597]] Two Minds Better Than One: Collaborative Reward Modeling for LLM Alignment(https://arxiv.org/abs/2505.10597)
Keywords: robust, large language model
Abstract: Reward models (RMs) are essential for aligning large language models (LLMs) with human values. However, noisy preferences in human feedback often lead to reward misgeneralization, where RMs overfit to spurious patterns and provide misleading signals during policy optimization. We systematically analyze the training dynamics of preference pairs and identify that noisy examples are harder to fit and introduce instability. Empirical evidence shows that LLMs optimized using reward models trained on full noisy datasets perform worse than those trained on filtered, high-quality preferences. To address this, we propose Collaborative Reward Modeling (CRM), an online framework that enhances robustness by combining peer review and curriculum learning. Two reward models are trained in parallel and assess each other's data selections to filter out potential noise. Curriculum learning structures the preference data from easy to hard, ensuring synchronized training and stable feedback. Extensive experiments demonstrate that CRM improves generalization, with up to 9.94 points of accuracy gain on RewardBench under 40 percent label noise. CRM is also compatible with implicit-reward alignment methods, offering a practical and versatile strategy for robust alignment.
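Sketch (not taken from the paper): the abstract describes two reward models that assess each other's data selections to filter noisy preference pairs. One common way to realize this kind of peer filtering is small-loss selection, where each model picks the pairs it fits best and hands them to its peer. The block below illustrates that idea under stated assumptions: the Bradley-Terry loss, the `keep_ratio` parameter, and the toy reward margins are all hypothetical, and the curriculum component is omitted.

```python
import math


def bt_loss(margin):
    """Bradley-Terry preference loss: -log(sigmoid(r_chosen - r_rejected))."""
    return math.log1p(math.exp(-margin))


def select_for_peer(margins, keep_ratio):
    """Return indices of the lowest-loss pairs under one model.

    The selected pairs would be used to update the *other* model, so that
    neither model reinforces its own mistakes.
    """
    k = max(1, int(len(margins) * keep_ratio))
    ranked = sorted(range(len(margins)), key=lambda i: bt_loss(margins[i]))
    return sorted(ranked[:k])


# Hypothetical reward margins (chosen minus rejected) for six preference pairs.
margins_a = [2.1, 1.5, -3.0, 0.8, 1.9, -0.2]  # scored by reward model A
margins_b = [1.8, -2.5, -2.8, 1.1, 2.2, 0.3]  # scored by reward model B

batch_for_b = select_for_peer(margins_a, keep_ratio=0.5)  # A selects data for B
batch_for_a = select_for_peer(margins_b, keep_ratio=0.5)  # B selects data for A
```

In this toy batch, pair 2 has a strongly negative margin under both models and is selected by neither, which is the behavior one would hope for on a mislabeled preference.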

Title: Enhancing IoT Cyber Attack Detection in the Presence of Highly Imbalanced Data

Authors: Md. Ehsanul Haque, Md. Saymon Hosen Polash, Md Al-Imran Sanjida Simla, Md Alomgir Hossain, Sarwar Jahan
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2505.10600
Pdf URL: https://arxiv.org/pdf/2505.10600
Copy Paste: [[2505.10600]] Enhancing IoT Cyber Attack Detection in the Presence of Highly Imbalanced Data(https://arxiv.org/abs/2505.10600)
Keywords: security, attack, robust
Abstract: Due to the rapid growth in the number of Internet of Things (IoT) networks, cyber risk has increased exponentially, creating a need for effective intrusion detection systems (IDS) that work well with highly imbalanced datasets. Traditional machine learning models tend to struggle to identify attacks when the volume of normal data is much higher than the volume of attacks, which can result in a high rate of missed threats. For example, the dataset used in this study exhibits a strong class imbalance, with 94,659 instances of the majority class and only 28 instances of the minority class, making it quite challenging to detect rare attacks accurately. This research addresses these challenges with hybrid sampling techniques designed to improve detection accuracy on imbalanced data in IoT domains. After applying these techniques, we evaluate the performance of several machine learning models, namely Random Forest, Soft Voting, Support Vector Classifier (SVC), K-Nearest Neighbors (KNN), Multi-Layer Perceptron (MLP), and Logistic Regression, on the classification of cyber-attacks. The results indicate that the Random Forest model achieved the best performance, with a Kappa score of 0.9903, test accuracy of 0.9961, and AUC of 0.9994. The Soft Voting model also shows strong performance, with an accuracy of 0.9952 and AUC of 0.9997, indicating the benefits of combining model predictions. Overall, this work demonstrates the value of hybrid sampling combined with robust model and feature selection for significantly improving IoT security against cyber-attacks, especially in highly imbalanced data environments.
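Worked example (not taken from the paper): with the class counts quoted in the abstract, raw accuracy says little by itself, which is presumably why a Kappa score is reported alongside it. The block below computes Cohen's kappa from a binary confusion matrix for a hypothetical baseline that always predicts the majority class; the function name and the baseline are illustrative assumptions.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a binary confusion matrix."""
    total = tp + fp + fn + tn
    observed = (tp + tn) / total
    # Agreement expected by chance, from the row and column marginals.
    p_pos = ((tp + fp) / total) * ((tp + fn) / total)
    p_neg = ((fn + tn) / total) * ((fp + tn) / total)
    expected = p_pos + p_neg
    if expected == 1.0:
        return 0.0  # degenerate case: agreement beyond chance is undefined
    return (observed - expected) / (1.0 - expected)


MAJORITY, MINORITY = 94_659, 28  # class counts quoted in the abstract

# Hypothetical baseline: predict the majority ("negative") class for everything.
tp, fp, fn, tn = 0, 0, MINORITY, MAJORITY
accuracy = (tp + tn) / (MAJORITY + MINORITY)
kappa = cohen_kappa(tp, fp, fn, tn)

print(f"accuracy = {accuracy:.5f}, kappa = {kappa:.5f}")
```

The baseline misses every attack yet scores above 99.9% accuracy, while its kappa is zero because its agreement with the labels is exactly what chance would give.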

Title: Continuity and Isolation Lead to Doubts or Dilemmas in Large Language Models

Authors: Hector Pasten, Felipe Urrutia, Hector Jimenez, Cristian B. Calderon, Cristóbal Rojas, Alexander Kozachinskiy
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.10606
Pdf URL: https://arxiv.org/pdf/2505.10606
Copy Paste: [[2505.10606]] Continuity and Isolation Lead to Doubts or Dilemmas in Large Language Models(https://arxiv.org/abs/2505.10606)
Keywords: transformer, large language model
Abstract: Understanding how Transformers work and how they process information is key to the theoretical and empirical advancement of these machines. In this work, we demonstrate the existence of two phenomena in Transformers, namely isolation and continuity. Both of these phenomena hinder Transformers from learning even simple pattern sequences. Isolation expresses that any learnable sequence must be isolated from another learnable sequence, and hence some sequences cannot be learned by a single Transformer at the same time. Continuity entails that an attractor basin forms around a learned sequence, such that any sequence falling in that basin will collapse towards the learned sequence. Here, we mathematically prove that these phenomena emerge in all Transformers that use compact positional encoding, and we design rigorous experiments demonstrating that the theoretical limitations we shed light on also occur at practical scale.

Title: MONAQ: Multi-Objective Neural Architecture Querying for Time-Series Analysis on Resource-Constrained Devices

Authors: Patara Trirat, Jae-Gil Lee
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.10607
Pdf URL: https://arxiv.org/pdf/2505.10607
Copy Paste: [[2505.10607]] MONAQ: Multi-Objective Neural Architecture Querying for Time-Series Analysis on Resource-Constrained Devices(https://arxiv.org/abs/2505.10607)
Keywords: large language model
Abstract: The growing use of smartphones and IoT devices necessitates efficient time-series analysis on resource-constrained hardware, which is critical for sensing applications such as human activity recognition and air quality prediction. Recent efforts in hardware-aware neural architecture search (NAS) automate architecture discovery for specific platforms; however, none focus on general time-series analysis with edge deployment. Leveraging the problem-solving and reasoning capabilities of large language models (LLMs), we propose MONAQ, a novel framework that reformulates NAS into Multi-Objective Neural Architecture Querying tasks. MONAQ is equipped with multimodal query generation for processing multimodal time-series inputs and hardware constraints, alongside an LLM agent-based multi-objective search to achieve deployment-ready models via code generation. By integrating numerical data, time-series images, and textual descriptions, MONAQ improves an LLM's understanding of time-series data. Experiments on fifteen datasets demonstrate that MONAQ-discovered models outperform both handcrafted models and NAS baselines while being more efficient.

Title: Agent Name Service (ANS): A Universal Directory for Secure AI Agent Discovery and Interoperability

Authors: Ken Huang, Vineeth Sai Narajala, Idan Habler, Akram Sheriff
Subjects: cs.CR, cs.AI, cs.MA, cs.NI
Abstract URL: https://arxiv.org/abs/2505.10609
Pdf URL: https://arxiv.org/pdf/2505.10609
Copy Paste: [[2505.10609]] Agent Name Service (ANS): A Universal Directory for Secure AI Agent Discovery and Interoperability(https://arxiv.org/abs/2505.10609)
Keywords: secure, robust
Abstract: The proliferation of AI agents requires robust mechanisms for secure discovery. This paper introduces the Agent Name Service (ANS), a novel DNS-based architecture that addresses the lack of a public agent discovery framework. ANS provides a protocol-agnostic registry infrastructure that leverages Public Key Infrastructure (PKI) certificates for verifiable agent identity and trust. The architecture features several key innovations: a formalized agent registration and renewal mechanism for lifecycle management; DNS-inspired naming conventions with capability-aware resolution; a modular Protocol Adapter Layer supporting diverse communication standards (A2A, MCP, ACP, etc.); and precisely defined algorithms for secure resolution. We implement structured communication using JSON Schema and conduct a comprehensive threat analysis of our proposal. The result is a foundational directory service addressing the core challenges of secure discovery and interaction in multi-agent systems, paving the way for future interoperable, trustworthy, and scalable agent ecosystems.

Title: MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly

Authors: Zhaowei Wang, Wenhao Yu, Xiyu Ren, Jipeng Zhang, Yu Zhao, Rohit Saxena, Liang Cheng, Ginny Wong, Simon See, Pasquale Minervini, Yangqiu Song, Mark Steedman
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.10610
Pdf URL: https://arxiv.org/pdf/2505.10610
Copy Paste: [[2505.10610]] MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly(https://arxiv.org/abs/2505.10610)
Keywords: robust
Abstract: The rapid extension of context windows in large vision-language models has given rise to long-context vision-language models (LCVLMs), which are capable of handling hundreds of images with interleaved text tokens in a single forward pass. In this work, we introduce MMLongBench, the first benchmark covering a diverse set of long-context vision-language tasks, to evaluate LCVLMs effectively and thoroughly. MMLongBench is composed of 13,331 examples spanning five different categories of downstream tasks, such as Visual RAG and Many-Shot ICL. It also provides broad coverage of image types, including various natural and synthetic images. To assess the robustness of the models to different input lengths, all examples are delivered at five standardized input lengths (8K-128K tokens) via a cross-modal tokenization scheme that combines vision patches and text tokens. Through a thorough benchmarking of 46 closed-source and open-source LCVLMs, we provide a comprehensive analysis of the current models' vision-language long-context ability. Our results show that: i) performance on a single task is a weak proxy for overall long-context capability; ii) both closed-source and open-source models face challenges in long-context vision-language tasks, indicating substantial room for future improvement; iii) models with stronger reasoning ability tend to exhibit better long-context performance. By offering wide task coverage, various image types, and rigorous length control, MMLongBench provides the missing foundation for diagnosing and advancing the next generation of LCVLMs.

Title: How many measurements are enough? Bayesian recovery in inverse problems with general distributions

Authors: Ben Adcock, Nick Huang
Subjects: cs.LG, math.ST
Abstract URL: https://arxiv.org/abs/2505.10630
Pdf URL: https://arxiv.org/pdf/2505.10630
Copy Paste: [[2505.10630]] How many measurements are enough? Bayesian recovery in inverse problems with general distributions(https://arxiv.org/abs/2505.10630)
Keywords: generative
Abstract: We study the sample complexity of Bayesian recovery for solving inverse problems with general prior, forward operator and noise distributions. We consider posterior sampling according to an approximate prior $\mathcal{P}$, and establish sufficient conditions for stable and accurate recovery with high probability. Our main result is a non-asymptotic bound that shows that the sample complexity depends on (i) the intrinsic complexity of $\mathcal{P}$, quantified by its so-called approximate covering number, and (ii) concentration bounds for the forward operator and noise distributions. As a key application, we specialize to generative priors, where $\mathcal{P}$ is the pushforward of a latent distribution via a Deep Neural Network (DNN). We show that the sample complexity scales log-linearly with the latent dimension $k$, thus establishing the efficacy of DNN-based priors. Generalizing existing results on deterministic (i.e., non-Bayesian) recovery for the important problem of random sampling with an orthogonal matrix $U$, we show how the sample complexity is determined by the coherence of $U$ with respect to the support of $\mathcal{P}$. Hence, we establish that coherence plays a fundamental role in Bayesian recovery as well. Overall, our framework unifies and extends prior work, providing rigorous guarantees for the sample complexity of solving Bayesian inverse problems with arbitrary distributions.

Title: Mitigate Language Priors in Large Vision-Language Models by Cross-Images Contrastive Decoding

Authors: Jianfei Zhao, Feng Zhang, Xin Sun, Chong Feng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.10634
Pdf URL: https://arxiv.org/pdf/2505.10634
Copy Paste: [[2505.10634]] Mitigate Language Priors in Large Vision-Language Models by Cross-Images Contrastive Decoding(https://arxiv.org/abs/2505.10634)
Keywords: large language model
Abstract: Language priors constitute one of the primary causes of hallucinations in Large Vision-Language Models (LVLMs), driving the models to generate linguistically plausible yet visually inconsistent content. The language priors in LVLMs originate from the linguistic knowledge inherited from their pre-trained Large Language Model (LLM) backbone. Consequently, this characteristic is an intrinsic property of the model that remains independent of visual inputs. Inspired by the finding that language priors are consistent across images, we propose Cross-Image Contrastive Decoding (CICD), a simple yet effective training-free method to alleviate language priors in LVLMs. CICD first identifies essential and detrimental priors, and then employs contrastive decoding to eliminate the detrimental ones. This approach simultaneously prevents LVLMs from generating hallucinated content while maintaining textual fluency and coherence. Furthermore, the limited information overlap between images helps prevent visual information loss during contrastive decoding. We validate the effectiveness of CICD on four benchmarks with six LVLMs. Our experiments demonstrate that CICD performs remarkably well in mitigating language priors, especially in the image captioning task, where such priors are most pronounced. Code will be released once accepted.
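Sketch (not taken from the paper): the abstract relies on the observation that language priors stay the same across images, so contrasting next-token logits computed with two different images can cancel them. The block below shows a generic contrastive-decoding adjustment on a toy vocabulary; the formula `(1 + alpha) * target - alpha * contrast`, the `alpha` parameter, and all logits are assumptions for illustration, and CICD's step of separating essential from detrimental priors is not reproduced.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(target_logits, contrast_logits, alpha=1.0):
    """Amplify what the target image supports and subtract what is image-independent."""
    return [
        (1 + alpha) * t - alpha * c
        for t, c in zip(target_logits, contrast_logits)
    ]


vocab = ["dog", "frisbee", "grass"]
# Hypothetical next-token logits for the same text prefix with two different images.
target = [2.0, 2.2, 0.5]    # image being captioned (contains a dog, no frisbee)
contrast = [0.1, 2.0, 0.4]  # unrelated image: "frisbee" stays high, so it is a language prior

adjusted = contrastive_logits(target, contrast, alpha=1.0)
before = vocab[target.index(max(target))]
after = vocab[adjusted.index(max(adjusted))]
probs = softmax(adjusted)
```

Here the greedy choice moves from "frisbee" to "dog": the token whose score does not depend on the image is suppressed, while the visually grounded token is boosted.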

Title: FRET: Feature Redundancy Elimination for Test Time Adaptation

Authors: Linjing You, Jiabao Lu, Xiayuan Huang, Xiangli Nie
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.10641
Pdf URL: https://arxiv.org/pdf/2505.10641
Copy Paste: [[2505.10641]] FRET: Feature Redundancy Elimination for Test Time Adaptation(https://arxiv.org/abs/2505.10641)
Keywords: privacy, robust
Abstract: Test-Time Adaptation (TTA) aims to enhance the generalization of deep learning models when faced with test data that exhibits distribution shifts from the training data. In this context, only a pre-trained model and unlabeled test data are available, making it particularly relevant for privacy-sensitive applications. In practice, we observe that feature redundancy in embeddings tends to increase as domain shifts intensify in TTA. However, existing TTA methods often overlook this redundancy, which can hinder the model's adaptability to new data. To address this issue, we introduce Feature Redundancy Elimination for Test-time Adaptation (FRET), a novel perspective for TTA. A straightforward approach (S-FRET) is to directly minimize the feature redundancy score as an optimization objective to improve adaptation. Despite its simplicity and effectiveness, S-FRET struggles with label shifts, limiting its robustness in real-world scenarios. To mitigate this limitation, we further propose Graph-based FRET (G-FRET), which integrates a Graph Convolutional Network (GCN) with contrastive learning. This design not only reduces feature redundancy but also enhances feature discriminability in both the representation and prediction layers. Extensive experiments across multiple model architectures, tasks, and datasets demonstrate the effectiveness of S-FRET and show that G-FRET achieves state-of-the-art performance. Further analysis reveals that G-FRET enables the model to extract non-redundant and highly discriminative features during inference, thereby facilitating more robust test-time adaptation.
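Sketch (not taken from the paper): the abstract does not define its feature redundancy score. One plausible stand-in, assumed here purely for illustration, is the sum of squared off-diagonal entries of the correlation matrix between embedding dimensions, which is zero when dimensions are uncorrelated and grows as they duplicate each other. The function names and toy embeddings are hypothetical, and every dimension is assumed to have non-zero variance.

```python
import math


def correlation(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def redundancy_score(embeddings):
    """Sum of squared correlations over all ordered pairs of distinct dimensions."""
    dims = list(zip(*embeddings))  # one tuple of values per embedding dimension
    score = 0.0
    for i in range(len(dims)):
        for j in range(len(dims)):
            if i != j:
                score += correlation(dims[i], dims[j]) ** 2
    return score


# Hypothetical batches of 2-D embeddings (rows are samples).
redundant = [[1, 2], [2, 4], [3, 6], [4, 8]]   # second dimension duplicates the first
diverse = [[1, 1], [2, -1], [3, -1], [4, 1]]   # dimensions are uncorrelated

score_redundant = redundancy_score(redundant)
score_diverse = redundancy_score(diverse)
```

A test-time objective in the spirit of S-FRET would then adjust model parameters to push such a score down on each unlabeled test batch.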

Title: A Conformal Predictive Measure for Assessing Catastrophic Forgetting

Authors: Ioannis Pitsiorlas, Nour Jamoussi, Marios Kountouris
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.10677
Pdf URL: https://arxiv.org/pdf/2505.10677
Copy Paste: [[2505.10677]] A Conformal Predictive Measure for Assessing Catastrophic Forgetting(https://arxiv.org/abs/2505.10677)
Keywords: robust, interpretability
Abstract: This work introduces a novel methodology for assessing catastrophic forgetting (CF) in continual learning. We propose a new conformal prediction (CP)-based metric, termed the Conformal Prediction Confidence Factor (CPCF), to quantify and evaluate CF effectively. Our framework leverages adaptive CP to estimate forgetting by monitoring the model's confidence on previously learned tasks. This approach provides a dynamic and practical solution for monitoring and measuring CF of previous tasks as new ones are introduced, offering greater suitability for real-world applications. Experimental results on four benchmark datasets demonstrate a strong correlation between CPCF and the accuracy of previous tasks, validating the reliability and interpretability of the proposed metric. Our results highlight the potential of CPCF as a robust and effective tool for assessing and understanding CF in dynamic learning environments.
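Sketch (not taken from the paper): the CPCF metric is not defined in the abstract, so the block below shows only standard split conformal prediction, the building block such a metric would rest on. Calibration scores are taken as one minus the probability assigned to the true label; the function names, the calibration scores, and the probabilities are illustrative assumptions.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Finite-sample quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(cal_scores)[k - 1]


def prediction_set(probs, threshold):
    """All labels whose nonconformity score (1 - probability) is within the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]


# Hypothetical calibration scores from held-out data of a previously learned task.
cal_scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60]
threshold = conformal_threshold(cal_scores, alpha=0.2)

confident = prediction_set([0.70, 0.20, 0.10], threshold)  # model is sure of class 0
uncertain = prediction_set([0.50, 0.48, 0.02], threshold)  # model hesitates between 0 and 1
```

Larger prediction sets signal lower confidence, so tracking how set sizes on an old task grow after training on new tasks is one way, assumed here, to turn conformal prediction into a forgetting signal.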

Title: Clustering Rooftop PV Systems via Probabilistic Embeddings

Authors: Kutay Bölat, Tarek Alskaif, Peter Palensky, Simon Tindemans
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2505.10699
Pdf URL: https://arxiv.org/pdf/2505.10699
Copy Paste: [[2505.10699]] Clustering Rooftop PV Systems via Probabilistic Embeddings(https://arxiv.org/abs/2505.10699)
Keywords: robust
Abstract: As the number of rooftop photovoltaic (PV) installations increases, aggregators and system operators are required to monitor and analyze these systems, raising the challenge of integrating and managing large, spatially distributed time-series data that are both high-dimensional and affected by missing values. In this work, a probabilistic entity embedding-based clustering framework is proposed to address these problems. This method encodes each PV system's characteristic power generation patterns and uncertainty as a probability distribution, then groups systems using statistical distances between these distributions and agglomerative clustering. Applied to a multi-year residential PV dataset, it produces concise, uncertainty-aware cluster profiles that outperform a physics-based baseline in representativeness and robustness, and support reliable missing-value imputation. A systematic hyperparameter study further offers practical guidance for balancing model performance and robustness.
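Sketch (not taken from the paper): the abstract groups PV systems by statistical distances between probabilistic embeddings, followed by agglomerative clustering. The block below illustrates the idea with strong simplifying assumptions: each system is embedded as a one-dimensional Gaussian (mean, standard deviation), the distance is the 2-Wasserstein distance between Gaussians, and average linkage is used. The paper's actual embedding model, distance, and linkage are not stated in the abstract.

```python
import math
from itertools import combinations


def wasserstein2(g1, g2):
    """2-Wasserstein distance between two 1-D Gaussians given as (mean, std)."""
    (m1, s1), (m2, s2) = g1, g2
    return math.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)


def agglomerative(items, n_clusters):
    """Naive average-linkage agglomerative clustering; returns lists of item indices."""
    clusters = [[i] for i in range(len(items))]

    def linkage(a, b):
        # Mean pairwise distance between members of two clusters.
        total = sum(wasserstein2(items[i], items[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]                          # j > i, so index i stays valid
    return [sorted(c) for c in clusters]


# Hypothetical embeddings: (mean, std) of a generation feature for five PV systems.
systems = [(3.0, 0.5), (3.1, 0.6), (5.0, 1.5), (5.2, 1.4), (3.05, 0.55)]
clusters = agglomerative(systems, n_clusters=2)
```

Because the distance compares both means and spreads, two systems with similar average output but very different uncertainty would not be merged early.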

Title: SafeTrans: LLM-assisted Transpilation from C to Rust

Authors: Muhammad Farrukh (1), Smeet Shah (1), Baris Coskun (2), Michalis Polychronakis (1) ((1) Stony Brook University, (2) Amazon Web Services)
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2505.10708
Pdf URL: https://arxiv.org/pdf/2505.10708
Copy Paste: [[2505.10708]] SafeTrans: LLM-assisted Transpilation from C to Rust(https://arxiv.org/abs/2505.10708)
Keywords: security, large language model
Abstract: Rust is a strong contender for a memory-safe alternative to C as a "systems" programming language, but porting the vast amount of existing C code to Rust is a daunting task. In this paper, we evaluate the potential of large language models (LLMs) to automate the transpilation of C code to idiomatic Rust, while ensuring that the generated code mitigates any memory-related vulnerabilities present in the original code. To that end, we present the design and implementation of SafeTrans, a framework that uses LLMs to i) transpile C code into Rust and ii) iteratively fix any compilation and runtime errors in the resulting code. A key novelty of our approach is the introduction of a few-shot guided repair technique for translation errors, which provides contextual information and example code snippets for specific error types, guiding the LLM toward the correct solution. Another novel aspect of our work is the evaluation of the security implications of the transpilation process, i.e., whether potential vulnerabilities in the original C code have been properly addressed in the translated Rust code. We experimentally evaluated SafeTrans with six leading LLMs and a set of 2,653 C programs accompanied by comprehensive unit tests, which were used for validating the correctness of the translated code. Our results show that our iterative repair strategy improves the rate of successful translations from 54% to 80% for the best-performing LLM (GPT-4o), and that all types of identified vulnerabilities in the original C code are effectively mitigated in the translated Rust code.

Title: GNN-Suite: a Graph Neural Network Benchmarking Framework for Biomedical Informatics

Authors: Sebestyén Kamp, Giovanni Stracquadanio, T. Ian Simpson
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.10711
Pdf URL: https://arxiv.org/pdf/2505.10711
Copy Paste: [[2505.10711]] GNN-Suite: a Graph Neural Network Benchmarking Framework for Biomedical Informatics(https://arxiv.org/abs/2505.10711)
Keywords: robust, fair, interpretability
Abstract: We present GNN-Suite, a robust modular framework for constructing and benchmarking Graph Neural Network (GNN) architectures in computational biology. GNN-Suite standardises experimentation and reproducibility using the Nextflow workflow to evaluate GNN performance. We demonstrate its utility in identifying cancer-driver genes by constructing molecular networks from protein-protein interaction (PPI) data from STRING and BioGRID and annotating nodes with features from the PCAWG, PID, and COSMIC-CGC repositories. Our design enables fair comparisons among diverse GNN architectures including GAT, GAT3H, GCN, GCN2, GIN, GTN, HGCN, PHGCN, and GraphSAGE and a baseline Logistic Regression (LR) model. All GNNs were configured as standardised two-layer models and trained with uniform hyperparameters (dropout = 0.2; Adam optimiser with learning rate = 0.01; and an adjusted binary cross-entropy loss to address class imbalance) over an 80/20 train-test split for 300 epochs. Each model was evaluated over 10 independent runs with different random seeds to yield statistically robust performance metrics, with balanced accuracy (BACC) as the primary measure. Notably, GCN2 achieved the highest BACC (0.807 +/- 0.035) on a STRING-based network, although all GNN types outperformed the LR baseline, highlighting the advantage of network-based learning over feature-only approaches. Our results show that a common framework for implementing and evaluating GNN architectures aids in identifying not only the best model but also the most effective means of incorporating complementary data. By making GNN-Suite publicly available, we aim to foster reproducible research and promote improved benchmarking standards in computational biology. Future work will explore additional omics datasets and further refine network architectures to enhance predictive accuracy and interpretability in biomedical applications.
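
Balanced accuracy, the primary measure named above, is the mean of sensitivity and specificity, which is why it suits imbalanced tasks such as cancer-driver gene identification. The sketch below computes it for binary labels, alongside one common way of adjusting binary cross-entropy for class imbalance. The abstract does not say how the loss was adjusted, so the `pos_weight` factor is an assumption, and both function names are illustrative.

```python
import math


def balanced_accuracy(y_true, y_pred):
    """Mean of recall on the positive class and recall on the negative class.
    Assumes binary labels in {0, 1} with at least one example of each class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)


def weighted_bce(y_true, probs, pos_weight=1.0):
    """Binary cross-entropy with positive examples scaled by pos_weight
    (an assumed form of imbalance adjustment)."""
    eps = 1e-12  # clamp probabilities to keep log() finite
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= pos_weight * t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)
```

For example, a classifier that finds one of two positives among ten samples and makes no false positives has plain accuracy 0.9 but balanced accuracy 0.75.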

Title: A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment

Authors: Jean-Philippe Corbeil, Amin Dada, Jean-Michel Attendu, Asma Ben Abacha, Alessandro Sordoni, Lucas Caccia, François Beaulieu, Thomas Lin, Jens Kleesiek, Paul Vozila
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.10717
Pdf URL: https://arxiv.org/pdf/2505.10717
Copy Paste: [[2505.10717]] A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment(https://arxiv.org/abs/2505.10717)
Keywords: large language model
Abstract: High computation costs and latency of large language models such as GPT-4 have limited their deployment in clinical settings. Small language models (SLMs) offer a cost-effective alternative, but their limited capacity requires biomedical domain adaptation, which remains challenging. An additional bottleneck is the unavailability and high sensitivity of clinical data. To address these challenges, we propose a novel framework for adapting SLMs into high-performing clinical models. We introduce the MediPhi collection of 3.8B-parameter SLMs developed with our novel framework: pre-instruction tuning of experts on relevant medical and clinical corpora (PMC, Medical Guideline, MedWiki, etc.), model merging, and clinical-tasks alignment. To cover most clinical tasks, we extended the CLUE benchmark to CLUE+, doubling its size. Our expert models deliver relative improvements on this benchmark over the base model without any task-specific fine-tuning: 64.3% on medical entities, 49.5% on radiology reports, and 44% on ICD-10 coding (outperforming GPT-4-0125 by 14%). We unify the expert models into MediPhi via model merging, preserving gains across benchmarks. Furthermore, we built the MediFlow collection, a synthetic dataset of 2.5 million high-quality instructions on 14 medical NLP tasks, 98 fine-grained document types, and JSON format support. Alignment of MediPhi using supervised fine-tuning and direct preference optimization achieves further gains of 18.9% on average.

Title: AI-enhanced semantic feature norms for 786 concepts

Authors: Siddharth Suresh, Kushin Mukherjee, Tyler Giallanza, Xizheng Yu, Mia Patil, Jonathan D. Cohen, Timothy T. Rogers
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2505.10718
Pdf URL: https://arxiv.org/pdf/2505.10718
Copy Paste: [[2505.10718]] AI-enhanced semantic feature norms for 786 concepts(https://arxiv.org/abs/2505.10718)
Keywords: large language model
Abstract: Semantic feature norms have been foundational in the study of human conceptual knowledge, yet traditional methods face trade-offs between concept/feature coverage and verifiability of quality due to the labor-intensive nature of norming studies. Here, we introduce a novel approach that augments a dataset of human-generated feature norms with responses from large language models (LLMs) while verifying the quality of norms against reliable human judgments. We find that our AI-enhanced feature norm dataset, NOVA: Norms Optimized Via AI, shows much higher feature density and overlap among concepts while outperforming a comparable human-only norm dataset and word-embedding models in predicting people's semantic similarity judgments. Taken together, we demonstrate that human conceptual knowledge is richer than captured in previous norm datasets and show that, with proper validation, LLMs can serve as powerful tools for cognitive science research.

Title: Tracr-Injection: Distilling Algorithms into Pre-trained Language Models

Authors: Tomás Vergara-Browne, Álvaro Soto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.10719
Pdf URL: https://arxiv.org/pdf/2505.10719
Copy Paste: [[2505.10719]] Tracr-Injection: Distilling Algorithms into Pre-trained Language Models(https://arxiv.org/abs/2505.10719)
Keywords: transformer, large language model
Abstract: Motivated by the surge of large language models, there has been a push to formally characterize the symbolic abilities intrinsic to the transformer architecture. A programming language, called RASP, has been proposed, which can be directly compiled into transformer weights to implement these algorithms. However, the tasks that can be implemented in RASP are rarely learned from natural unsupervised data, showing a mismatch between the theoretical capabilities of the transformer architecture and the practical learnability of these capabilities from unsupervised data. We propose tracr-injection, a method that allows us to distill algorithms written in RASP directly into a pre-trained language model. We showcase our method by injecting 3 different algorithms into a language model. We show how our method creates an interpretable subspace within the model's residual stream, which can be decoded into the variables present in the code of the RASP algorithm. Additionally, we found that the proposed method can improve out-of-distribution performance compared to our baseline, indicating that a more symbolic mechanism is indeed taking place in the inner workings of the model. We release the code used to run our experiments.

Title: Automating Security Audit Using Large Language Model based Agent: An Exploration Experiment

Authors: Jia Hui Chin, Pu Zhang, Yu Xin Cheong, Jonathan Pan
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2505.10732
Pdf URL: https://arxiv.org/pdf/2505.10732
Copy Paste: [[2505.10732]] Automating Security Audit Using Large Language Model based Agent: An Exploration Experiment(https://arxiv.org/abs/2505.10732)
Keywords: secure, security, large language model
Abstract: In the current rapidly changing digital environment, businesses are under constant stress to ensure that their systems are secured. Security audits help to maintain a strong security posture by ensuring that policies are in place, controls are implemented, and gaps are identified for cybersecurity risk mitigation. However, audits are usually manual and require much time and cost. This paper looks at the possibility of developing a framework to leverage Large Language Models (LLMs) as an autonomous agent to execute part of the security audit, namely the field audit (password policy compliance for the Windows operating system). Through an exploration experiment using GPT-4 with Langchain, the agent executed the audit tasks by accurately flagging password policy violations and appeared to be more efficient than traditional manual audits. Despite its potential limitations in operational consistency in complex and dynamic environments, the framework suggests possibilities to extend further to real-time threat monitoring and compliance checks.

Title: Model Performance-Guided Evaluation Data Selection for Effective Prompt Optimization

Authors: Ximing Dong, Shaowei Wang, Dayi Lin, Ahmed E. Hassan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.10736
Pdf URL: https://arxiv.org/pdf/2505.10736
Copy Paste: [[2505.10736]] Model Performance-Guided Evaluation Data Selection for Effective Prompt Optimization(https://arxiv.org/abs/2505.10736)
Keywords: large language model
Abstract: Optimizing Large Language Model (LLM) performance requires well-crafted prompts, but manual prompt engineering is labor-intensive and often ineffective. Automated prompt optimization techniques address this challenge but the majority of them rely on randomly selected evaluation subsets, which fail to represent the full dataset, leading to unreliable evaluations and suboptimal prompts. Existing coreset selection methods, designed for LLM benchmarking, are unsuitable for prompt optimization due to challenges in clustering similar samples, high data collection costs, and the unavailability of performance data for new or private datasets. To overcome these issues, we propose IPOMP, an Iterative evaluation data selection for effective Prompt Optimization using real-time Model Performance. IPOMP is a two-stage approach that selects representative and diverse samples using semantic clustering and boundary analysis, followed by iterative refinement with real-time model performance data to replace redundant samples. Evaluations on the BIG-bench dataset show that IPOMP improves effectiveness by 1.6% to 5.3% and stability by at least 57% compared with SOTA baselines, with minimal computational overhead below 1%. Furthermore, the results demonstrate that our real-time performance-guided refinement approach can be universally applied to enhance existing coreset selection methods.

Title: IMAGE-ALCHEMY: Advancing subject fidelity in personalised text-to-image generation

Authors: Amritanshu Tiwari, Cherish Puniani, Kaustubh Sharma, Ojasva Nema
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.10743
Pdf URL: https://arxiv.org/pdf/2505.10743
Copy Paste: [[2505.10743]] IMAGE-ALCHEMY: Advancing subject fidelity in personalised text-to-image generation(https://arxiv.org/abs/2505.10743)
Keywords: diffusion, generative, segmentation
Abstract: Recent advances in text-to-image diffusion models, particularly Stable Diffusion, have enabled the generation of highly detailed and semantically rich images. However, personalizing these models to represent novel subjects based on a few reference images remains challenging. This often leads to catastrophic forgetting, overfitting, or large computational overhead. We propose a two-stage pipeline that addresses these limitations by leveraging LoRA-based fine-tuning on the attention weights within the U-Net of the Stable Diffusion XL (SDXL) model. First, we use the unmodified SDXL to generate a generic scene by replacing the subject with its class label. Then, we selectively insert the personalized subject through a segmentation-driven image-to-image (Img2Img) pipeline that uses the trained LoRA weights. This framework isolates the subject encoding from the overall composition, thus preserving SDXL's broader generative capabilities while integrating the new subject in a high-fidelity manner. Our method achieves a DINO similarity score of 0.789 on SDXL, outperforming existing personalized text-to-image approaches.

Title: Mapping Semantic Segmentation to Point Clouds Using Structure from Motion for Forest Analysis

Authors: Francisco Raverta Capua, Pablo De Cristoforis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.10751
Pdf URL: https://arxiv.org/pdf/2505.10751
Copy Paste: [[2505.10751]] Mapping Semantic Segmentation to Point Clouds Using Structure from Motion for Forest Analysis(https://arxiv.org/abs/2505.10751)
Keywords: segmentation
Abstract: Although the use of remote sensing technologies for monitoring forested environments has gained increasing attention, publicly available point cloud datasets remain scarce due to the high costs, sensor requirements, and time-intensive nature of their acquisition. Moreover, as far as we are aware, there are no public annotated datasets generated through Structure From Motion (SfM) algorithms applied to imagery, which may be due to the lack of SfM algorithms that can map semantic segmentation information into an accurate point cloud, especially in a challenging environment like forests. In this work, we present a novel pipeline for generating semantically segmented point clouds of forest environments. Using a custom-built forest simulator, we generate realistic RGB images of diverse forest scenes along with their corresponding semantic segmentation masks. These labeled images are then processed using modified open-source SfM software capable of preserving semantic information during 3D reconstruction. The resulting point clouds provide both geometric and semantic detail, offering a valuable resource for training and evaluating deep learning models aimed at segmenting real forest point clouds obtained via SfM.

Title: Random Client Selection on Contrastive Federated Learning for Tabular Data

Authors: Achmad Ginanjar, Xue Li, Priyanka Singh, Wen Hua
Subjects: cs.LG, cs.CR, cs.DC
Abstract URL: https://arxiv.org/abs/2505.10759
Pdf URL: https://arxiv.org/pdf/2505.10759
Copy Paste: [[2505.10759]] Random Client Selection on Contrastive Federated Learning for Tabular Data(https://arxiv.org/abs/2505.10759)
Keywords: secure, security, privacy, attack, robust, federate
Abstract: Vertical Federated Learning (VFL) has revolutionised collaborative machine learning by enabling privacy-preserving model training across multiple parties. However, it remains vulnerable to information leakage during intermediate computation sharing. While Contrastive Federated Learning (CFL) was introduced to mitigate these privacy concerns through representation learning, it still faces challenges from gradient-based attacks. This paper presents a comprehensive experimental analysis of gradient-based attacks in CFL environments and evaluates random client selection as a defensive strategy. Through extensive experimentation, we demonstrate that random client selection proves particularly effective in defending against gradient attacks in the CFL network. Our findings provide valuable insights for implementing robust security measures in contrastive federated learning systems, contributing to the development of more secure collaborative learning frameworks.

Title: Deep Symbolic Optimization: Reinforcement Learning for Symbolic Mathematics

Authors: Conor F. Hayes, Felipe Leno Da Silva, Jiachen Yang, T. Nathan Mundhenk, Chak Shing Lee, Jacob F. Pettit, Claudio Santiago, Sookyung Kim, Joanne T. Kim, Ignacio Aravena Solis, Ruben Glatt, Andre R. Goncalves, Alexander Ladd, Ahmet Can Solak, Thomas Desautels, Daniel Faissol, Brenden K. Petersen, Mikel Landajuela
Subjects: cs.LG, cs.NE, cs.SC
Abstract URL: https://arxiv.org/abs/2505.10762
Pdf URL: https://arxiv.org/pdf/2505.10762
Copy Paste: [[2505.10762]] Deep Symbolic Optimization: Reinforcement Learning for Symbolic Mathematics(https://arxiv.org/abs/2505.10762)
Keywords: robust, interpretability, generative
Abstract: Deep Symbolic Optimization (DSO) is a novel computational framework that enables symbolic optimization for scientific discovery, particularly in applications involving the search for intricate symbolic structures. One notable example is equation discovery, which aims to automatically derive mathematical models expressed in symbolic form. In DSO, the discovery process is formulated as a sequential decision-making task. A generative neural network learns a probabilistic model over a vast space of candidate symbolic expressions, while reinforcement learning strategies guide the search toward the most promising regions. This approach integrates gradient-based optimization with evolutionary and local search techniques, and it incorporates in-situ constraints, domain-specific priors, and advanced policy optimization methods. The result is a robust framework capable of efficiently exploring extensive search spaces to identify interpretable and physically meaningful models. Extensive evaluations on benchmark problems have demonstrated that DSO achieves state-of-the-art performance in both accuracy and interpretability. In this chapter, we provide a comprehensive overview of the DSO framework and illustrate its transformative potential for automating symbolic optimization in scientific discovery.

Title: Benchmarking performance, explainability, and evaluation strategies of vision-language models for surgery: Challenges and opportunities

Authors: Jiajun Cheng, Xianwu Zhao, Shan Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.10764
Pdf URL: https://arxiv.org/pdf/2505.10764
Copy Paste: [[2505.10764]] Benchmarking performance, explainability, and evaluation strategies of vision-language models for surgery: Challenges and opportunities(https://arxiv.org/abs/2505.10764)
Keywords: explainability
Abstract: Minimally invasive surgery (MIS) presents significant visual and technical challenges, including surgical instrument classification and understanding surgical action involving instruments, verbs, and anatomical targets. While many machine learning-based methods have been developed for surgical understanding, they typically rely on procedure- and task-specific models trained on small, manually annotated datasets. In contrast, the recent success of vision-language models (VLMs) trained on large volumes of raw image-text pairs has demonstrated strong adaptability to diverse visual data and a range of downstream tasks. This opens meaningful research questions: how well do these general-purpose VLMs perform in the surgical domain? In this work, we explore those questions by benchmarking several VLMs across diverse surgical datasets, including general laparoscopic procedures and endoscopic submucosal dissection, to assess their current capabilities and limitations. Our benchmark reveals key gaps in the models' ability to consistently link language to the correct regions in surgical scenes.

Title: Unifying Segment Anything in Microscopy with Multimodal Large Language Model

Authors: Manyu Li, Ruian He, Zixian Zhang, Weimin Tan, Bo Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.10769
Pdf URL: https://arxiv.org/pdf/2505.10769
Copy Paste: [[2505.10769]] Unifying Segment Anything in Microscopy with Multimodal Large Language Model(https://arxiv.org/abs/2505.10769)
Keywords: large language model, segmentation
Abstract: Accurate segmentation of regions of interest in biomedical images holds substantial value in image analysis. Although several foundation models for biomedical segmentation have achieved excellent performance on certain datasets, they typically demonstrate sub-optimal performance on unseen domain data. We attribute this deficiency to a lack of vision-language knowledge before segmentation. Multimodal Large Language Models (MLLMs) bring outstanding understanding and reasoning capabilities to multimodal tasks, which inspires us to leverage MLLMs to inject Vision-Language Knowledge (VLK), thereby enabling vision models to demonstrate superior generalization capabilities on cross-domain datasets. In this paper, we propose using MLLMs to guide SAM in learning microscopy cross-domain data, unifying Segment Anything in Microscopy, named uLLSAM. Specifically, we propose the Vision-Language Semantic Alignment (VLSA) module, which injects VLK into Segment Anything Model (SAM). We find that after SAM receives global VLK prompts, its performance improves significantly, but there are deficiencies in boundary contour perception. Therefore, we further propose Semantic Boundary Regularization (SBR) to prompt SAM. Our method achieves performance improvements of 7.71% in Dice and 12.10% in SA across 9 in-domain microscopy datasets, achieving state-of-the-art performance. Our method also demonstrates improvements of 6.79% in Dice and 10.08% in SA across 10 out-of-domain datasets, exhibiting strong generalization capabilities. Code is available at this https URL.

Title: Ranked Voting based Self-Consistency of Large Language Models

Authors: Weiqin Wang, Yile Wang, Hui Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.10772
Pdf URL: https://arxiv.org/pdf/2505.10772
Copy Paste: [[2505.10772]] Ranked Voting based Self-Consistency of Large Language Models(https://arxiv.org/abs/2505.10772)
Keywords: large language model
Abstract: Majority voting is considered an effective method to enhance chain-of-thought reasoning, as it selects the answer with the highest "self-consistency" among different reasoning paths (Wang et al., 2023). However, previous chain-of-thought reasoning methods typically generate only a single answer in each trial, thereby ignoring the possibility of other potential answers. As a result, these alternative answers are often overlooked in subsequent voting processes. In this work, we propose to generate ranked answers in each reasoning process and conduct ranked voting among multiple ranked answers from different responses, thereby making the overall self-consistency more reliable. Specifically, we use three ranked voting methods: Instant-runoff voting, Borda count voting, and mean reciprocal rank voting. We validate our methods on six datasets, including three multiple-choice and three open-ended question-answering tasks, using both advanced open-source and closed-source large language models. Extensive experimental results indicate that our proposed method outperforms the baselines, showcasing the potential of leveraging the information of ranked answers and using ranked voting to improve reasoning performance. The code is available at this https URL.
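
The three ranked voting rules named in this abstract can be sketched as follows, where each "ballot" is one response's ranked list of candidate answers, best first. The function names, the Borda scoring (an answer at position i of an n-item ballot earns n - i points), and the alphabetical tie-breaking are assumptions for illustration; the paper may use different conventions.

```python
from collections import defaultdict


def borda_count(ballots):
    """Answer at position i of a ballot of length n earns n - i points."""
    scores = defaultdict(float)
    for ballot in ballots:
        n = len(ballot)
        for i, answer in enumerate(ballot):
            scores[answer] += n - i
    # sorted() makes ties resolve alphabetically (assumed tie-break rule).
    return max(sorted(scores), key=lambda a: scores[a])


def mean_reciprocal_rank(ballots):
    """Score each answer by the mean of 1/rank across ballots (0 if absent)."""
    scores = defaultdict(float)
    for ballot in ballots:
        for i, answer in enumerate(ballot):
            scores[answer] += 1.0 / (i + 1)
    return max(sorted(scores), key=lambda a: scores[a] / len(ballots))


def instant_runoff(ballots):
    """Eliminate the answer with the fewest first-choice votes until one
    answer holds a strict majority. Assumes at least one non-empty ballot."""
    remaining = {a for ballot in ballots for a in ballot}
    while True:
        firsts = defaultdict(int)
        for ballot in ballots:
            for answer in ballot:
                if answer in remaining:  # top surviving choice on this ballot
                    firsts[answer] += 1
                    break
        total = sum(firsts.values())
        leader = max(sorted(firsts), key=lambda a: firsts[a])
        if firsts[leader] * 2 > total or len(remaining) == 1:
            return leader
        loser = min(sorted(remaining), key=lambda a: firsts.get(a, 0))
        remaining.discard(loser)


ballots = [
    ["A", "B", "C"],
    ["A", "C", "B"],
    ["B", "C", "A"],
    ["C", "B", "A"],
    ["B", "A", "C"],
]
```

On these hypothetical ballots, "A" and "B" tie on first choices (two each), so plain majority voting is ambiguous, while all three ranked rules settle on "B" because it is ranked highly more consistently.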

Title: Context-Aware Probabilistic Modeling with LLM for Multimodal Time Series Forecasting

Authors: Yueyang Yao, Jiajun Li, Xingyuan Dai, MengMeng Zhang, Xiaoyan Gong, Fei-Yue Wang, Yisheng Lv
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.10774
Pdf URL: https://arxiv.org/pdf/2505.10774
Copy Paste: [[2505.10774]] Context-Aware Probabilistic Modeling with LLM for Multimodal Time Series Forecasting(https://arxiv.org/abs/2505.10774)
Keywords: robust, large language model
Abstract: Time series forecasting is important for applications spanning energy markets, climate analysis, and traffic management. However, existing methods struggle to effectively integrate exogenous texts and align them with the probabilistic nature of large language models (LLMs). Current approaches either employ shallow text-time series fusion via basic prompts or rely on deterministic numerical decoding that conflicts with LLMs' token-generation paradigm, which limits contextual awareness and distribution modeling. To address these limitations, we propose CAPTime, a context-aware probabilistic multimodal time series forecasting method that leverages text-informed abstraction and autoregressive LLM decoding. Our method first encodes temporal patterns using a pretrained time series encoder, then aligns them with textual contexts via learnable interactions to produce joint multimodal representations. By combining a mixture of distribution experts with frozen LLMs, we enable context-aware probabilistic forecasting while preserving LLMs' inherent distribution modeling capabilities. Experiments on diverse time series forecasting tasks demonstrate the superior accuracy and generalization of CAPTime, particularly in multimodal scenarios. Additional analysis highlights its robustness in data-scarce scenarios through hybrid probabilistic decoding.

Title: A Systematic Analysis of Base Model Choice for Reward Modeling

Authors: Kian Ahrabian, Pegah Jandaghi, Negar Mokhberian, Sai Praneeth Karimireddy, Jay Pujara
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.10775
Pdf URL: https://arxiv.org/pdf/2505.10775
Copy Paste: [[2505.10775]] A Systematic Analysis of Base Model Choice for Reward Modeling(https://arxiv.org/abs/2505.10775)
Keywords: large language model
Abstract: Reinforcement learning from human feedback (RLHF) and, at its core, reward modeling have become a crucial part of training powerful large language models (LLMs). One commonly overlooked factor in training high-quality reward models (RMs) is the effect of the base model, which is becoming more challenging to choose given the rapidly growing pool of LLMs. In this work, we present a systematic analysis of the effect of base model selection on reward modeling performance. Our results show that the performance can be improved by up to 14% compared to the most common (i.e., default) choice. Moreover, we showcase the strong statistical relation between some existing benchmarks and downstream performances. We also demonstrate that the results from a small set of benchmarks could be combined to boost the model selection (+18% on average in the top 5-10). Lastly, we illustrate the impact of different post-training steps on the final performance and explore using estimated data distributions to reduce performance prediction error.

Title: Completely Weakly Supervised Class-Incremental Learning for Semantic Segmentation

Authors: David Minkwan Kim, Soeun Lee, Byeongkeun Kang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.10781
Pdf URL: https://arxiv.org/pdf/2505.10781
Copy Paste: [[2505.10781]] Completely Weakly Supervised Class-Incremental Learning for Semantic Segmentation(https://arxiv.org/abs/2505.10781)
Keywords: robust, segmentation
Abstract: This work addresses the task of completely weakly supervised class-incremental learning for semantic segmentation to learn segmentation for both base and additional novel classes using only image-level labels. While class-incremental semantic segmentation (CISS) is crucial for handling diverse and newly emerging objects in the real world, traditional CISS methods require expensive pixel-level annotations for training. To overcome this limitation, partially weakly-supervised approaches have recently been proposed. However, to the best of our knowledge, this is the first work to introduce a completely weakly-supervised method for CISS. To achieve this, we propose to generate robust pseudo-labels by combining pseudo-labels from a localizer and a sequence of foundation models based on their uncertainty. Moreover, to mitigate catastrophic forgetting, we introduce an exemplar-guided data augmentation method that generates diverse images containing both previous and novel classes with guidance. Finally, we conduct experiments in three common experimental settings: 15-5 VOC, 10-10 VOC, and COCO-to-VOC, and in two scenarios: disjoint and overlap. The experimental results demonstrate that our completely weakly supervised method outperforms even partially weakly supervised methods in the 15-5 VOC and 10-10 VOC settings while achieving competitive accuracy in the COCO-to-VOC setting.

Title: SynRailObs: A Synthetic Dataset for Obstacle Detection in Railway Scenarios

Authors: Qiushi Guo, Jason Rambach
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.10784
Pdf URL: https://arxiv.org/pdf/2505.10784
Copy Paste: [[2505.10784]] SynRailObs: A Synthetic Dataset for Obstacle Detection in Railway Scenarios(https://arxiv.org/abs/2505.10784)
Keywords: security, diffusion
Abstract: Detecting potential obstacles in railway environments is critical for preventing serious accidents. Identifying a broad range of obstacle categories under complex conditions requires large-scale datasets with precisely annotated, high-quality images. However, existing publicly available datasets fail to meet these requirements, thereby hindering progress in railway safety research. To address this gap, we introduce SynRailObs, a high-fidelity synthetic dataset designed to represent a diverse range of weather conditions and geographical features. Furthermore, diffusion models are employed to generate rare obstacles that are difficult to capture in real-world scenarios. To evaluate the effectiveness of SynRailObs, we perform experiments in real-world railway environments, testing on both ballasted and ballastless tracks across various weather conditions. The results demonstrate that SynRailObs holds substantial potential for advancing obstacle detection in railway safety applications. Models trained on this dataset show consistent performance across different distances and environmental conditions. Moreover, the model trained on SynRailObs exhibits zero-shot capabilities, which are essential for applications in security-sensitive domains. The data is available at this https URL.

Title: Neural-Inspired Advances in Integral Cryptanalysis

Authors: Liu Zhang, Yiran Yao, Danping Shi, Dongchen Chai, Jian Guo, Zilong Wang
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2505.10790
Pdf URL: https://arxiv.org/pdf/2505.10790
Copy Paste: [[2505.10790]] Neural-Inspired Advances in Integral Cryptanalysis(https://arxiv.org/abs/2505.10790)
Keywords: attack
Abstract: The study by Gohr et al. at CRYPTO 2019 and subsequent related works have shown that neural networks can uncover previously unused features, offering novel insights into cryptanalysis. Motivated by these findings, we employ neural networks to learn features specifically related to integral properties and integrate the corresponding insights into optimized search frameworks. These findings validate the framework of using neural networks for feature exploration, providing researchers with novel insights that advance established cryptanalysis methods. Neural networks have inspired the development of more precise integral search models. By comparing the integral distinguishers obtained via neural networks with those identified by classical methods, we observe that existing automated search models often fail to find optimal distinguishers. To address this issue, we develop a meet-in-the-middle search framework that balances model accuracy and computational efficiency. As a result, we reduce the number of active plaintext bits required for an 11-round integral distinguisher on SKINNY64/64, and further identify a 12-round key-dependent integral distinguisher, achieving one additional round over the previous best-known result. The integral distinguishers discovered by neural networks enable key recovery attacks on more rounds. We identify a 7-round key-independent integral distinguisher from neural networks with only one active plaintext cell, which is based on linear combinations of bits. This distinguisher enables a 15-round key recovery attack on SKINNYn/n, improving upon the previous record by one round. Additionally, we discover an 8-round key-dependent integral distinguisher using neural networks that further reduces the time complexity of key recovery attacks against SKINNY.

Title: Finetune-RAG: Fine-Tuning Language Models to Resist Hallucination in Retrieval-Augmented Generation

Authors: Zhan Peng Lee, Andre Lin, Calvin Tan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.10792
Pdf URL: https://arxiv.org/pdf/2505.10792
Copy Paste: [[2505.10792]] Finetune-RAG: Fine-Tuning Language Models to Resist Hallucination in Retrieval-Augmented Generation(https://arxiv.org/abs/2505.10792)
Keywords: large language model
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful framework to improve factuality in large language models (LLMs) by grounding their outputs in retrieved documents. However, ensuring perfect retrieval of relevant information remains challenging, and when irrelevant content is passed downstream to an LLM, it can lead to hallucinations. In this work, we propose Finetune-RAG, a simple and effective fine-tuning approach that features the first-of-its-kind RAG training dataset constructed to mimic real-world imperfections. Experimental results show that Finetune-RAG improves factual accuracy by 21.2% over the base model. We also propose Bench-RAG, an LLM-as-a-judge evaluation pipeline that stress-tests models under realistic imperfect retrieval scenarios. Our codebase and dataset are fully open-sourced for community use.

Title: Relation Extraction Across Entire Books to Reconstruct Community Networks: The AffilKG Datasets

Authors: Erica Cai, Sean McQuade, Kevin Young, Brendan O'Connor
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.10798
Pdf URL: https://arxiv.org/pdf/2505.10798
Copy Paste: [[2505.10798]] Relation Extraction Across Entire Books to Reconstruct Community Networks: The AffilKG Datasets(https://arxiv.org/abs/2505.10798)
Keywords: extraction
Abstract: When knowledge graphs (KGs) are automatically extracted from text, are they accurate enough for downstream analysis? Unfortunately, current annotated datasets cannot be used to evaluate this question, since their KGs are highly disconnected, too small, or overly complex. To address this gap, we introduce AffilKG (this https URL), which is a collection of six datasets that are the first to pair complete book scans with large, labeled knowledge graphs. Each dataset features affiliation graphs, which are simple KGs that capture Member relationships between Person and Organization entities -- useful in studies of migration, community interactions, and other social phenomena. In addition, three datasets include expanded KGs with a wider variety of relation types. Our preliminary experiments demonstrate significant variability in model performance across datasets, underscoring AffilKG's ability to enable two critical advances: (1) benchmarking how extraction errors propagate to graph-level analyses (e.g., community structure), and (2) validating KG extraction methods for real-world social science research.

Title: Attention-Based Reward Shaping for Sparse and Delayed Rewards

Authors: Ian Holmes, Min Chi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.10802
Pdf URL: https://arxiv.org/pdf/2505.10802
Copy Paste: [[2505.10802]] Attention-Based Reward Shaping for Sparse and Delayed Rewards(https://arxiv.org/abs/2505.10802)
Keywords: robust, transformer
Abstract: Sparse and delayed reward functions pose a significant obstacle for real-world Reinforcement Learning (RL) applications. In this work, we propose Attention-based REward Shaping (ARES), a general and robust algorithm which uses a transformer's attention mechanism to generate shaped rewards and create a dense reward function for any environment. ARES requires a set of episodes and their final returns as input. It can be trained entirely offline and is able to generate meaningful shaped rewards even when using small datasets or episodes produced by agents taking random actions. ARES is compatible with any RL algorithm and can handle any level of reward sparsity. In our experiments, we focus on the most challenging case where rewards are fully delayed until the end of each episode. We evaluate ARES across a diverse range of environments, widely used RL algorithms, and baseline methods to assess the effectiveness of the shaped rewards it produces. Our results show that ARES can significantly improve learning in delayed reward settings, enabling RL agents to train in scenarios that would otherwise require impractical amounts of data or even be unlearnable. To our knowledge, ARES is the first approach that works fully offline, remains robust to extreme reward delays and low-quality data, and is not limited to goal-based tasks.
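
The idea of turning one delayed episode return into a dense per-step signal can be illustrated with a minimal sketch. This is not the ARES algorithm: the per-step scores below are hypothetical placeholders standing in for attention weights, and the softmax normalization and the function name `redistribute_return` are assumptions made for illustration only.

```python
import math


def redistribute_return(step_scores, final_return):
    """Spread a single delayed episode return over the timesteps.

    step_scores: hypothetical per-step relevance scores (a stand-in for
    attention weights; no transformer is involved in this sketch).
    final_return: the scalar return observed at the end of the episode.
    """
    # Softmax turns arbitrary scores into weights that sum to 1.
    highest = max(step_scores)
    exps = [math.exp(s - highest) for s in step_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Each step receives its share of the final return, so the shaped
    # rewards add up to the original episode return.
    return [w * final_return for w in weights]


# Four-step episode whose only reward (10.0) arrives at the end.
shaped = redistribute_return([0.0, 2.0, 0.0, 1.0], final_return=10.0)
```

Because the weights sum to one, the shaped rewards preserve the total episode return while giving the largest share to the step with the highest score.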

Title: MoCLIP: Motion-Aware Fine-Tuning and Distillation of CLIP for Human Motion Generation

Authors: Gabriel Maldonado, Armin Danesh Pazho, Ghazal Alinezhad Noghre, Vinit Katariya, Hamed Tabkhi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.10810
Pdf URL: https://arxiv.org/pdf/2505.10810
Copy Paste: [[2505.10810]] MoCLIP: Motion-Aware Fine-Tuning and Distillation of CLIP for Human Motion Generation(https://arxiv.org/abs/2505.10810)
Keywords: robust
Abstract: Human motion generation is essential for fields such as animation, robotics, and virtual reality, requiring models that effectively capture motion dynamics from text descriptions. Existing approaches often rely on Contrastive Language-Image Pretraining (CLIP)-based text encoders, but their training on text-image pairs constrains their ability to understand temporal and kinematic structures inherent in motion and motion generation. This work introduces MoCLIP, a fine-tuned CLIP model with an additional motion encoding head, trained on motion sequences using contrastive learning and tethering loss. By explicitly incorporating motion-aware representations, MoCLIP enhances motion fidelity while remaining compatible with existing CLIP-based pipelines and seamlessly integrating into various CLIP-based methods. Experiments demonstrate that MoCLIP improves Top-1, Top-2, and Top-3 accuracy while maintaining competitive FID, leading to improved text-to-motion alignment results. These results highlight MoCLIP's versatility and effectiveness, establishing it as a robust framework for enhancing motion generation.

Title: RAN Tester UE: An Automated Declarative UE Centric Security Testing Platform

Authors: Charles Marion Ueltschey, Joshua Moore, Aly Sabri Abdalla, Vuk Marojevic
Subjects: cs.CR, cs.SE, eess.SY
Abstract URL: https://arxiv.org/abs/2505.10812
Pdf URL: https://arxiv.org/pdf/2505.10812
Copy Paste: [[2505.10812]] RAN Tester UE: An Automated Declarative UE Centric Security Testing Platform(https://arxiv.org/abs/2505.10812)
Keywords: security, protect, robust
Abstract: Cellular networks require strict security procedures and measures across various network components, from core to radio access network (RAN) and end-user devices. As networks become increasingly complex and interconnected, as in O-RAN deployments, they are exposed to numerous security threats. Therefore, ensuring robust security is critical for O-RAN to protect network integrity and safeguard user data. This requires rigorous testing methodologies to mitigate threats. This paper introduces an automated, adaptive, and scalable user equipment (UE) based RAN security testing framework designed to address the shortcomings of existing RAN testing solutions. Experimental results on a 5G software radio testbed built with commercial off-the-shelf hardware and open source software validate the efficiency and reproducibility of sample security test procedures developed on the RAN Tester UE framework.

Title: Enhancing Secrecy Energy Efficiency in RIS-Aided Aerial Mobile Edge Computing Networks: A Deep Reinforcement Learning Approach

Authors: Aly Sabri Abdalla, Vuk Marojevic
Subjects: cs.CR, cs.DC, eess.SY
Abstract URL: https://arxiv.org/abs/2505.10815
Pdf URL: https://arxiv.org/pdf/2505.10815
Copy Paste: [[2505.10815]] Enhancing Secrecy Energy Efficiency in RIS-Aided Aerial Mobile Edge Computing Networks: A Deep Reinforcement Learning Approach(https://arxiv.org/abs/2505.10815)
Keywords: secure
Abstract: This paper studies the problem of securing task offloading transmissions from ground users against ground eavesdropping threats. Our study introduces a reconfigurable intelligent surface (RIS)-aided unmanned aerial vehicle (UAV)-mobile edge computing (MEC) scheme to enhance secure task offloading while minimizing the energy consumption of the UAV subject to task completion constraints. Leveraging a data-driven approach, we propose a comprehensive optimization strategy that jointly optimizes the aerial MEC (AMEC)'s trajectory, task offloading partitioning, UE transmission scheduling, and RIS phase shifts. Our objective centers on optimizing the secrecy energy efficiency (SEE) of UE task offloading transmissions while preserving the AMEC's energy resources and meeting the task completion time requirements. Numerical results show that the proposed solution can effectively safeguard legitimate task offloading transmissions while preserving AMEC energy.

Title: Distilled Circuits: A Mechanistic Study of Internal Restructuring in Knowledge Distillation

Authors: Reilly Haskins, Benjamin Adams
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.10822
Pdf URL: https://arxiv.org/pdf/2505.10822
Copy Paste: [[2505.10822]] Distilled Circuits: A Mechanistic Study of Internal Restructuring in Knowledge Distillation(https://arxiv.org/abs/2505.10822)
Keywords: robust, interpretability
Abstract: Knowledge distillation compresses a larger neural model (teacher) into smaller, faster student models by training them to match teacher outputs. However, the internal computational transformations that occur during this process remain poorly understood. We apply techniques from mechanistic interpretability to analyze how internal circuits, representations, and activation patterns differ between teacher and student. Focusing on GPT2-small and its distilled counterpart DistilGPT2, we find that student models reorganize, compress, and discard teacher components, often resulting in stronger reliance on fewer individual components. To quantify functional alignment beyond output similarity, we introduce an alignment metric based on influence-weighted component similarity, validated across multiple tasks. Our findings reveal that while knowledge distillation preserves broad functional behaviors, it also causes significant shifts in internal computation, with important implications for the robustness and generalization capacity of distilled models.

Title: From Embeddings to Accuracy: Comparing Foundation Models for Radiographic Classification

Authors: Xue Li, Jameson Merkow, Noel C. F. Codella, Alberto Santamaria-Pang, Naiteek Sangani, Alexander Ersoy, Christopher Burt, John W. Garrett, Richard J. Bruce, Joshua D. Warner, Tyler Bradshaw, Ivan Tarapov, Matthew P. Lungren, Alan B. McMillan
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2505.10823
Pdf URL: https://arxiv.org/pdf/2505.10823
Copy Paste: [[2505.10823]] From Embeddings to Accuracy: Comparing Foundation Models for Radiographic Classification(https://arxiv.org/abs/2505.10823)
Keywords: robust, fair
Abstract: Foundation models, pretrained on extensive datasets, have significantly advanced machine learning by providing robust and transferable embeddings applicable to various domains, including medical imaging diagnostics. This study evaluates the utility of embeddings derived from both general-purpose and medical domain-specific foundation models for training lightweight adapter models in multi-class radiography classification, focusing specifically on tube placement assessment. A dataset comprising 8842 radiographs classified into seven distinct categories was employed to extract embeddings using six foundation models: DenseNet121, BiomedCLIP, Med-Flamingo, MedImageInsight, Rad-DINO, and CXR-Foundation. Adapter models were subsequently trained using classical machine learning algorithms. Among these combinations, MedImageInsight embeddings paired with a support vector machine adapter yielded the highest mean area under the curve (mAUC) at 93.8%, followed closely by Rad-DINO (91.1%) and CXR-Foundation (89.0%). In comparison, BiomedCLIP and DenseNet121 exhibited moderate performance with mAUC scores of 83.0% and 81.8%, respectively, whereas Med-Flamingo delivered the lowest performance at 75.1%. Notably, most adapter models demonstrated computational efficiency, achieving training within one minute and inference within seconds on CPU, underscoring their practicality for clinical applications. Furthermore, fairness analyses on adapters trained on MedImageInsight-derived embeddings indicated minimal disparities, with gender differences in performance within 2% and standard deviations across age groups not exceeding 3%. These findings confirm that foundation model embeddings, especially those from MedImageInsight, facilitate accurate, computationally efficient, and equitable diagnostic classification using lightweight adapters for radiographic image analysis.

Title: Enhancing Low-Resource Minority Language Translation with LLMs and Retrieval-Augmented Generation for Cultural Nuances

Authors: Chen-Chi Chang, Chong-Fu Li, Chu-Hsuan Lee, Hung-Shin Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.10829
Pdf URL: https://arxiv.org/pdf/2505.10829
Copy Paste: [[2505.10829]] Enhancing Low-Resource Minority Language Translation with LLMs and Retrieval-Augmented Generation for Cultural Nuances(https://arxiv.org/abs/2505.10829)
Keywords: large language model
Abstract: This study investigates the challenges of translating low-resource languages by integrating Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG). Various model configurations were tested on Hakka translations, with BLEU scores ranging from 12% (dictionary-only) to 31% (RAG with Gemini 2.0). The best-performing model (Model 4) combined retrieval and advanced language modeling, improving lexical coverage, particularly for specialized or culturally nuanced terms, and enhancing grammatical coherence. A two-stage method (Model 3) using dictionary outputs refined by Gemini 2.0 achieved a BLEU score of 26%, highlighting iterative correction's value and the challenges of domain-specific expressions. Static dictionary-based approaches struggled with context-sensitive content, demonstrating the limitations of relying solely on predefined resources. These results emphasize the need for curated resources, domain knowledge, and ethical collaboration with local communities, offering a framework that improves translation accuracy and fluency while supporting cultural preservation.

Title: Multimodal Event Detection: Current Approaches and Defining the New Playground through LLMs and VLMs

Authors: Abhishek Dey, Aabha Bothera, Samhita Sarikonda, Rishav Aryan, Sanjay Kumar Podishetty, Akshay Havalgi, Gaurav Singh, Saurabh Srivastava
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.10836
Pdf URL: https://arxiv.org/pdf/2505.10836
Copy Paste: [[2505.10836]] Multimodal Event Detection: Current Approaches and Defining the New Playground through LLMs and VLMs(https://arxiv.org/abs/2505.10836)
Keywords: generative
Abstract: In this paper, we study the challenges of detecting events on social media, where traditional unimodal systems struggle due to the rapid and multimodal nature of data dissemination. We employ a range of models, including unimodal ModernBERT and ConvNeXt-V2, multimodal fusion techniques, and advanced generative models like GPT-4o and LLaVA. We also study the effect of providing multimodal generative models (such as GPT-4o) with a single modality to assess their efficacy. Our results indicate that while multimodal approaches notably outperform unimodal counterparts, generative approaches, despite having a large number of parameters, lag behind supervised methods in precision. Furthermore, we found that they lag behind instruction-tuned models because of their inability to generate event classes correctly. During our error analysis, we discovered that common social media issues such as leet speak and text elongation are effectively handled by generative approaches but are hard to tackle using supervised approaches.

Title: LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs

Authors: Ran Li, Hao Wang, Chengzhi Mao
Subjects: cs.LG, cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2505.10838
Pdf URL: https://arxiv.org/pdf/2505.10838
Copy Paste: [[2505.10838]] LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs(https://arxiv.org/abs/2505.10838)
Keywords: attack, steal, large language model
Abstract: Efficient red-teaming methods to uncover vulnerabilities in Large Language Models (LLMs) are crucial. While recent attacks often use LLMs as optimizers, the discrete language space makes gradient-based methods struggle. We introduce LARGO (Latent Adversarial Reflection through Gradient Optimization), a novel latent self-reflection attack that reasserts the power of gradient-based optimization for generating fluent jailbreaking prompts. By operating within the LLM's continuous latent space, LARGO first optimizes an adversarial latent vector and then recursively calls the same LLM to decode the latent into natural language. This methodology yields a fast, effective, and transferable attack that produces fluent and stealthy prompts. On standard benchmarks like AdvBench and JailbreakBench, LARGO surpasses leading jailbreaking techniques, including AutoDAN, by 44 points in attack success rate. Our findings demonstrate a potent alternative to agentic LLM prompting, highlighting the efficacy of interpreting and attacking LLM internals through gradient optimization.

Title: RefPose: Leveraging Reference Geometric Correspondences for Accurate 6D Pose Estimation of Unseen Objects

Authors: Jaeguk Kim, Jaewoo Park, Keuntek Lee, Nam Ik Cho
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.10841
Pdf URL: https://arxiv.org/pdf/2505.10841
Copy Paste: [[2505.10841]] RefPose: Leveraging Reference Geometric Correspondences for Accurate 6D Pose Estimation of Unseen Objects(https://arxiv.org/abs/2505.10841)
Keywords: robust
Abstract: Estimating the 6D pose of unseen objects from monocular RGB images remains a challenging problem, especially due to the lack of prior object-specific knowledge. To tackle this issue, we propose RefPose, an innovative approach to object pose estimation that leverages a reference image and geometric correspondence as guidance. RefPose first predicts an initial pose by using object templates to render the reference image and establish the geometric correspondence needed for the refinement stage. During the refinement stage, RefPose estimates the geometric correspondence of the query based on the generated references and iteratively refines the pose through a render-and-compare approach. To enhance this estimation, we introduce a correlation volume-guided attention mechanism that effectively captures correlations between the query and reference images. Unlike traditional methods that depend on pre-defined object models, RefPose dynamically adapts to new object shapes by leveraging a reference image and geometric correspondence. This results in robust performance across previously unseen objects. Extensive evaluation on the BOP benchmark datasets shows that RefPose achieves state-of-the-art results while maintaining a competitive runtime.

Title: AutoRAN: Weak-to-Strong Jailbreaking of Large Reasoning Models

Authors: Jiacheng Liang, Tanqiu Jiang, Yuhui Wang, Rongyi Zhu, Fenglong Ma, Ting Wang
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2505.10846
Pdf URL: https://arxiv.org/pdf/2505.10846
Copy Paste: [[2505.10846]] AutoRAN: Weak-to-Strong Jailbreaking of Large Reasoning Models(https://arxiv.org/abs/2505.10846)
Keywords: attack, robust
Abstract: This paper presents AutoRAN, the first automated, weak-to-strong jailbreak attack framework targeting large reasoning models (LRMs). At its core, AutoRAN leverages a weak, less-aligned reasoning model to simulate the target model's high-level reasoning structures, generates narrative prompts, and iteratively refines candidate prompts by incorporating the target model's intermediate reasoning steps. We evaluate AutoRAN against state-of-the-art LRMs including GPT-o3/o4-mini and Gemini-2.5-Flash across multiple benchmark datasets (AdvBench, HarmBench, and StrongReject). Results demonstrate that AutoRAN achieves remarkable success rates (approaching 100%) within one or a few turns across different LRMs, even when judged by a robustly aligned external model. This work reveals that leveraging weak reasoning models can effectively exploit the critical vulnerabilities of much more capable reasoning models, highlighting the need for improved safety measures specifically designed for reasoning-based models. The code for replicating AutoRAN and running records are available at: (this https URL). (warning: this paper contains potentially harmful content generated by LRMs.)

Title: On DeepSeekMoE: Statistical Benefits of Shared Experts and Normalized Sigmoid Gating

Authors: Huy Nguyen, Thong T. Doan, Quang Pham, Nghi D. Q. Bui, Nhat Ho, Alessandro Rinaldo
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.10860
Pdf URL: https://arxiv.org/pdf/2505.10860
Copy Paste: [[2505.10860]] On DeepSeekMoE: Statistical Benefits of Shared Experts and Normalized Sigmoid Gating(https://arxiv.org/abs/2505.10860)
Keywords: large language model
Abstract: Mixture of experts (MoE) methods are a key component in most large language model architectures, including the recent series of DeepSeek models. Compared to other MoE implementations, DeepSeekMoE stands out because of two unique features: the deployment of a shared expert strategy and of the normalized sigmoid gating mechanism. Despite the prominent role of DeepSeekMoE in the success of the DeepSeek series of models, there have been only a few attempts to justify theoretically the value of the shared expert strategy, while its normalized sigmoid gating has remained unexplored. To bridge this gap, we undertake a comprehensive theoretical study of these two features of DeepSeekMoE from a statistical perspective. We perform a convergence analysis of the expert estimation task to highlight the gains in sample efficiency for both the shared expert strategy and the normalized sigmoid gating, offering useful insights into the design of expert and gating structures. To verify empirically our theoretical findings, we carry out several experiments on both synthetic data and real-world datasets for (vision) language modeling tasks. Finally, we conduct an extensive empirical analysis of the router behaviors, ranging from router saturation, router change rate, to expert utilization.
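
The two features named in this abstract, a shared expert and normalized sigmoid gating, can be sketched in a few lines. The abstract does not give the gating formula, so the form below is an assumption: sigmoid affinities are computed per routed expert, the top-k are kept, and their affinities are rescaled to sum to one, while the shared expert is always applied. The scalar "experts" and all names are hypothetical and chosen only to keep the sketch self-contained.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def normalized_sigmoid_gate(scores, top_k):
    """Return {expert_index: gate_weight} for the top_k routed experts.

    Assumed form: per-expert sigmoid affinity, keep the top_k largest,
    then divide by their sum so the kept gates add up to 1.
    """
    affinities = [sigmoid(s) for s in scores]
    chosen = sorted(range(len(scores)),
                    key=lambda i: affinities[i], reverse=True)[:top_k]
    norm = sum(affinities[i] for i in chosen)
    return {i: affinities[i] / norm for i in chosen}


def moe_output(x, shared_expert, routed_experts, scores, top_k):
    """Shared expert output plus the gate-weighted routed expert outputs."""
    gates = normalized_sigmoid_gate(scores, top_k)
    routed = sum(g * routed_experts[i](x) for i, g in gates.items())
    return shared_expert(x) + routed


# Toy scalar experts (hypothetical) and router scores for one input.
experts = [lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x, lambda x: 5 * x]
router_scores = [2.0, -1.0, 0.5, 0.0]
gates = normalized_sigmoid_gate(router_scores, top_k=2)
y = moe_output(1.0, lambda x: x, experts, router_scores, top_k=2)
```

Unlike a softmax gate, each sigmoid affinity is computed independently of the other experts' scores; the normalization step is what couples the selected experts.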

Title: Improving the Data-efficiency of Reinforcement Learning by Warm-starting with LLM

Authors: Thang Duong, Minglai Yang, Chicheng Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.10861
Pdf URL: https://arxiv.org/pdf/2505.10861
Copy Paste: [[2505.10861]] Improving the Data-efficiency of Reinforcement Learning by Warm-starting with LLM(https://arxiv.org/abs/2505.10861)
Keywords: large language model
Abstract: We investigate the use of Large Language Models (LLMs) in collecting high-quality data to warm-start Reinforcement Learning (RL) algorithms for learning in some classical Markov Decision Process (MDP) environments. In this work, we focus on using an LLM to generate an off-policy dataset that sufficiently covers state-actions visited by optimal policies, then later using an RL algorithm to explore the environment and improve the policy suggested by the LLM. Our algorithm, LORO, can both converge to an optimal policy and have a high sample efficiency thanks to the LLM's good starting policy. On multiple OpenAI Gym environments, such as CartPole and Pendulum, we empirically demonstrate that LORO outperforms baseline algorithms such as pure LLM-based policies, pure RL, and a naive combination of the two, achieving up to $4 \times$ the cumulative rewards of the pure RL baseline.

Title: Have Multimodal Large Language Models (MLLMs) Really Learned to Tell the Time on Analog Clocks?

Authors: Tairan Fu, Miguel González, Javier Conde, Elena Merino-Gómez, Pedro Reviriego
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.10862
Pdf URL: https://arxiv.org/pdf/2505.10862
Copy Paste: [[2505.10862]] Have Multimodal Large Language Models (MLLMs) Really Learned to Tell the Time on Analog Clocks?(https://arxiv.org/abs/2505.10862)
Keywords: large language model
Abstract: Multimodal Large Language Models which can answer complex questions on an image struggle to tell the time on analog clocks. This is probably due to the lack of images with clocks at different times in their training set. In this work we explore this issue with one of the latest MLLMs: GPT-4.1 to understand why MLLMs fail to tell the time and whether fine-tuning can solve the problem. The results show how models are making progress in reading the time on analog clocks. But have they really learned to do it, or have they only learned patterns in their training datasets? In this work we put the models to the test with different clocks to illustrate the limitations of MLLMs to abstract and generalize.

Title: Improve Rule Retrieval and Reasoning with Self-Induction and Relevance ReEstimate

Authors: Ziyang Huang, Wangtao Sun, Jun Zhao, Kang Liu
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2505.10870
Pdf URL: https://arxiv.org/pdf/2505.10870
Copy Paste: [[2505.10870]] Improve Rule Retrieval and Reasoning with Self-Induction and Relevance ReEstimate(https://arxiv.org/abs/2505.10870)
Keywords: large language model
Abstract: This paper systematically addresses the challenges of rule retrieval, a crucial yet underexplored area. Vanilla retrieval methods, which use sparse or dense retrievers to search directly for relevant rules to support downstream reasoning, often suffer from low accuracy. This is primarily due to a significant semantic gap between the instantiated facts in the queries and the abstract representations of the rules. Such misalignment results in suboptimal retrieval quality, which in turn negatively impacts reasoning performance. To overcome these challenges, we propose Self-Induction Augmented Retrieval (SIAR), a novel approach that utilizes Large Language Models (LLMs) to induce potential inferential rules that might offer benefits for reasoning by abstracting the underlying knowledge and logical structure in queries. These induced rules are then used for query augmentation to improve retrieval effectiveness. Additionally, we introduce Rule Relevance ReEstimate (R$^3$), a method that re-estimates the relevance of retrieved rules by assessing whether the abstract knowledge they contain can be instantiated to align with the facts in the queries and the helpfulness for reasoning. Extensive experiments across various settings demonstrate the effectiveness and versatility of our proposed methods.

Title: Optimal Allocation of Privacy Budget on Hierarchical Data Release

Authors: Joonhyuk Ko, Juba Ziani, Ferdinando Fioretto
Subjects: cs.CR, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2505.10871
Pdf URL: https://arxiv.org/pdf/2505.10871
Copy Paste: [[2505.10871]] Optimal Allocation of Privacy Budget on Hierarchical Data Release(https://arxiv.org/abs/2505.10871)
Keywords: privacy, protect
Abstract: Releasing useful information from datasets with hierarchical structures while preserving individual privacy presents a significant challenge. Standard privacy-preserving mechanisms, and in particular Differential Privacy, often require careful allocation of a finite privacy budget across different levels and components of the hierarchy. Sub-optimal allocation can lead to either excessive noise, rendering the data useless, or to insufficient protections for sensitive information. This paper addresses the critical problem of optimal privacy budget allocation for hierarchical data release. It formulates this challenge as a constrained optimization problem, aiming to maximize data utility subject to a total privacy budget while considering the inherent trade-offs between data granularity and privacy loss. The proposed approach is supported by theoretical analysis and validated through comprehensive experiments on real hierarchical datasets. These experiments demonstrate that optimal privacy budget allocation significantly enhances the utility of the released data and improves the performance of downstream tasks.
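
The trade-off described here can be made concrete with a deliberately simplified sketch, which is not the paper's formulation. Assumptions: each hierarchy level releases a count with sensitivity 1 through the Laplace mechanism, the per-level budgets add up to the total under sequential composition, and utility loss is a weighted sum of noise variances (Laplace noise with scale 1/eps has variance 2/eps^2). Under these assumptions the minimizer has the closed form eps_i proportional to w_i^(1/3). The weights and function names are hypothetical.

```python
def allocate_budget(weights, total_epsilon):
    """Split total_epsilon across levels to minimize sum_i w_i * 2 / eps_i**2.

    Setting the Lagrangian derivative to zero gives eps_i proportional
    to w_i ** (1/3); the result is rescaled to sum to total_epsilon.
    """
    roots = [w ** (1.0 / 3.0) for w in weights]
    scale = total_epsilon / sum(roots)
    return [r * scale for r in roots]


def weighted_variance(weights, epsilons):
    """Weighted sum of Laplace noise variances (variance = 2 / eps**2)."""
    return sum(w * 2.0 / (e * e) for w, e in zip(weights, epsilons))


# Three hierarchy levels; the first is assumed to matter most for utility.
weights = [8.0, 1.0, 1.0]
optimal = allocate_budget(weights, total_epsilon=1.0)
uniform = [1.0 / 3.0] * 3

loss_optimal = weighted_variance(weights, optimal)
loss_uniform = weighted_variance(weights, uniform)
```

With these weights the closed-form split gives the first level half of the budget, and its weighted variance is lower than that of an even three-way split of the same total budget.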

Title: MultiLink: Multi-class Structure Recovery via Agglomerative Clustering and Model Selection

Authors: Luca Magri, Filippo Leveni, Giacomo Boracchi
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.10874
Pdf URL: https://arxiv.org/pdf/2505.10874
Copy Paste: [[2505.10874]] MultiLink: Multi-class Structure Recovery via Agglomerative Clustering and Model Selection(https://arxiv.org/abs/2505.10874)
Keywords: robust
Abstract: We address the problem of recovering multiple structures of different classes in a dataset contaminated by noise and outliers. In particular, we consider geometric structures defined by a mixture of underlying parametric models (e.g. planes and cylinders, homographies and fundamental matrices), and we tackle the robust fitting problem by preference analysis and clustering. We present a new algorithm, termed MultiLink, that simultaneously deals with multiple classes of models. MultiLink combines on-the-fly model fitting and model selection in a novel linkage scheme that determines whether two clusters are to be merged. The resulting method features many practical advantages with respect to methods based on preference analysis, being faster, less sensitive to the inlier threshold, and able to compensate for limitations deriving from hypotheses sampling. Experiments on several public datasets demonstrate that MultiLink compares favourably with state-of-the-art alternatives, both in multi-class and single-class problems. Code is publicly available for download.

Title: A Light and Smart Wearable Platform with Multimodal Foundation Model for Enhanced Spatial Reasoning in People with Blindness and Low Vision

Authors: Alexey Magay, Dhurba Tripathi, Yu Hao, Yi Fang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.10875
Pdf URL: https://arxiv.org/pdf/2505.10875
Copy Paste: [[2505.10875]] A Light and Smart Wearable Platform with Multimodal Foundation Model for Enhanced Spatial Reasoning in People with Blindness and Low Vision(https://arxiv.org/abs/2505.10875)
Keywords: robust, large language model
Abstract: People with blindness and low vision (pBLV) face significant challenges, struggling to navigate environments and locate objects due to limited visual cues. Spatial reasoning is crucial for these individuals, as it enables them to understand and interpret the spatial relationships in their surroundings, enhancing their ability to navigate and interact more safely and independently. Current multi-modal large language models (MLLMs) for people with low vision lack the spatial reasoning capabilities needed to effectively assist in these tasks. Moreover, there is a notable absence of lightweight, easy-to-use systems that allow pBLV to effectively perceive and interact with their surrounding environment. In this paper, we propose a novel spatially enhanced MLLM-based approach for visually impaired individuals. By fine-tuning the MLLM to incorporate spatial reasoning capabilities, our method significantly improves the understanding of environmental context, which is critical for navigation and object recognition. The innovation extends to a hardware component, designed as an attachment for glasses, ensuring increased accessibility and ease of use. This integration leverages advanced VLMs to interpret visual data and provide real-time, spatially aware feedback to the user. Our approach aims to bridge the gap between advanced machine learning models and practical, user-friendly assistive devices, offering a robust solution for visually impaired users to navigate their surroundings more effectively and independently. The paper includes an in-depth evaluation using the VizWiz dataset and a comprehensive dataset that we design to evaluate our method's effectiveness in real-world situations, demonstrating substantial improvements in accuracy and user experience.

Title: Approximation and Generalization Abilities of Score-based Neural Network Generative Models for Sub-Gaussian Distributions

Authors: Guoji Fu, Wee Sun Lee
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.10880
Pdf URL: https://arxiv.org/pdf/2505.10880
Copy Paste: [[2505.10880]] Approximation and Generalization Abilities of Score-based Neural Network Generative Models for Sub-Gaussian Distributions(https://arxiv.org/abs/2505.10880)
Keywords: generative
Abstract: This paper studies the approximation and generalization abilities of score-based neural network generative models (SGMs) in estimating an unknown distribution $P_0$ from $n$ i.i.d. observations in $d$ dimensions. Assuming merely that $P_0$ is $\alpha$-sub-Gaussian, we prove that for any time step $t \in [t_0, n^{O(1)}]$, where $t_0 \geq O(\alpha^2n^{-2/d}\log n)$, there exists a deep ReLU neural network with width $\leq O(\log^3n)$ and depth $\leq O(n^{3/d}\log_2n)$ that can approximate the scores with $\tilde{O}(n^{-1})$ mean square error and achieve a nearly optimal rate of $\tilde{O}(n^{-1}t_0^{-d/2})$ for score estimation, as measured by the score matching loss. Our framework is universal and can be used to establish convergence rates for SGMs under milder assumptions than previous work. For example, assuming further that the target density function $p_0$ lies in Sobolev or Besov classes, with an appropriate early stopping strategy, we demonstrate that neural network-based SGMs can attain nearly minimax convergence rates up to logarithmic factors. Our analysis removes several crucial assumptions, such as Lipschitz continuity of the score function or a strictly positive lower bound on the target density.

Title: Prior-Guided Diffusion Planning for Offline Reinforcement Learning

Authors: Donghyeon Ki, JunHyeok Oh, Seong-Woong Shim, Byung-Jun Lee
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.10881
Pdf URL: https://arxiv.org/pdf/2505.10881
Copy Paste: [[2505.10881]] Prior-Guided Diffusion Planning for Offline Reinforcement Learning(https://arxiv.org/abs/2505.10881)
Keywords: diffusion
Abstract: Diffusion models have recently gained prominence in offline reinforcement learning due to their ability to effectively learn high-performing, generalizable policies from static datasets. Diffusion-based planners facilitate long-horizon decision-making by generating high-quality trajectories through iterative denoising, guided by return-maximizing objectives. However, existing guided sampling strategies such as Classifier Guidance, Classifier-Free Guidance, and Monte Carlo Sample Selection either produce suboptimal multi-modal actions, struggle with distributional drift, or incur prohibitive inference-time costs. To address these challenges, we propose Prior Guidance (PG), a novel guided sampling framework that replaces the standard Gaussian prior of a behavior-cloned diffusion model with a learnable distribution, optimized via a behavior-regularized objective. PG directly generates high-value trajectories without costly reward optimization of the diffusion model itself, and eliminates the need to sample multiple candidates at inference for sample selection. We present an efficient training strategy that applies behavior regularization in latent space, and empirically demonstrate that PG outperforms state-of-the-art diffusion policies and planners across diverse long-horizon offline RL benchmarks.

Title: PoseBench3D: A Cross-Dataset Analysis Framework for 3D Human Pose Estimation

Authors: Saad Manzur, Bryan Vela, Brandon Vela, Aditya Agrawal, Lan-Anh Dang-Vu, David Li, Wayne Hayes
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.10888
Pdf URL: https://arxiv.org/pdf/2505.10888
Copy Paste: [[2505.10888]] PoseBench3D: A Cross-Dataset Analysis Framework for 3D Human Pose Estimation(https://arxiv.org/abs/2505.10888)
Keywords: fair
Abstract: Reliable three-dimensional human pose estimation is becoming increasingly important for real-world applications, yet much of prior work has focused solely on the performance within a single dataset. In practice, however, systems must adapt to diverse viewpoints, environments, and camera setups -- conditions that differ significantly from those encountered during training, which is often the case in real-world scenarios. To address these challenges, we present a standardized testing environment in which each method is evaluated on a variety of datasets, ensuring consistent and fair cross-dataset comparisons -- allowing for the analysis of methods on previously unseen data. Therefore, we propose PoseBench3D, a unified framework designed to systematically re-evaluate prior and future models across four of the most widely used datasets for human pose estimation -- with the framework able to support novel and future datasets as the field progresses. Through a unified interface, our framework provides datasets in a pre-configured yet easily modifiable format, ensuring compatibility with diverse model architectures. We re-evaluated the work of 18 methods, either trained or gathered from existing literature, and reported results using both Mean Per Joint Position Error (MPJPE) and Procrustes Aligned Mean Per Joint Position Error (PA-MPJPE) metrics, yielding more than 100 novel cross-dataset evaluation results. Additionally, we analyze performance differences resulting from various pre-processing techniques and dataset preparation parameters -- offering further insight into model generalization capabilities.
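
For readers unfamiliar with the first metric named above, the following is a minimal sketch of MPJPE in pure Python: the mean Euclidean distance between predicted and ground-truth joint positions. The function name and the toy two-joint skeleton are hypothetical, and the units are whatever the inputs use. PA-MPJPE computes the same quantity after first aligning the prediction to the ground truth with a similarity (Procrustes) transform; that alignment step is not shown here.

```python
import math


def mpjpe(predicted, target):
    """Mean Per Joint Position Error over one pose.

    Both arguments are sequences of (x, y, z) joint coordinates in the
    same order and the same units.
    """
    if len(predicted) != len(target) or not predicted:
        raise ValueError("poses must be non-empty and have the same joint count")
    distances = [math.dist(p, t) for p, t in zip(predicted, target)]
    return sum(distances) / len(distances)


# Toy pose with two joints: one is off by 3 units, the other by 4.
predicted_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target_pose = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
error = mpjpe(predicted_pose, target_pose)  # (3 + 4) / 2 = 3.5
```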

Title: Multi-Objective Preference Optimization: Improving Human Alignment of Generative Models

Authors: Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Zheng Wen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.10892
Pdf URL: https://arxiv.org/pdf/2505.10892
Copy Paste: [[2505.10892]] Multi-Objective Preference Optimization: Improving Human Alignment of Generative Models(https://arxiv.org/abs/2505.10892)
Keywords: robust, generative
Abstract: Post-training of LLMs with RLHF, and subsequently preference optimization algorithms such as DPO, IPO, etc., made a big difference in improving human alignment. However, all such techniques can only work with a single (human) objective. In practice, human users have multiple objectives, such as helpfulness and harmlessness, and there is no natural way to aggregate them into a single objective. In this paper, we address the multi-objective preference-alignment problem, where a policy must optimize several, potentially conflicting, objectives. We introduce the Multi-Objective Preference Optimization (MOPO) algorithm, which frames alignment as a constrained KL-regularized optimization: the primary objective is maximized while secondary objectives are lower-bounded by tunable safety thresholds. Unlike prior work, MOPO operates directly on pairwise preference data, requires no point-wise reward assumption, and avoids heuristic prompt-context engineering. The method recovers policies on the Pareto front whenever the front is attainable; practically, it reduces to simple closed-form iterative updates suitable for large-scale training. On synthetic benchmarks with diverse canonical preference structures, we show that MOPO approximates the Pareto front. When fine-tuning a 1.3B-parameter language model on real-world human-preference datasets, MOPO attains higher rewards and yields policies that Pareto-dominate baselines; ablation studies confirm optimization stability and robustness to hyperparameters.

Title: CTP: A hybrid CNN-Transformer-PINN model for ocean front forecasting

Authors: Yishuo Wang, Feng Zhou, Muping Zhou, Qicheng Meng, Zhijun Hu, Yi Wang
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.10894
Pdf URL: https://arxiv.org/pdf/2505.10894
Copy Paste: [[2505.10894]] CTP: A hybrid CNN-Transformer-PINN model for ocean front forecasting(https://arxiv.org/abs/2505.10894)
Keywords: transformer
Abstract: This paper proposes CTP, a novel deep learning framework that integrates a convolutional neural network (CNN), Transformer architectures, and a physics-informed neural network (PINN) for ocean front prediction. Ocean fronts, as dynamic interfaces between distinct water masses, play critical roles in marine biogeochemical and physical processes. Existing methods such as LSTM, ConvLSTM, and AttentionConv often struggle to maintain spatial continuity and physical consistency over multi-step forecasts. CTP addresses these challenges by combining localized spatial encoding, long-range temporal attention, and physical constraint enforcement. Experimental results across the South China Sea (SCS) and Kuroshio (KUR) regions from 1993 to 2020 demonstrate that CTP achieves state-of-the-art (SOTA) performance in both single-step and multi-step predictions, significantly outperforming baseline models in accuracy, $F_1$ score, and temporal stability.

Title: On the Security Risks of ML-based Malware Detection Systems: A Survey

Authors: Ping He, Yuhao Mao, Changjiang Li, Lorenzo Cavallaro, Ting Wang, Shouling Ji
Subjects: cs.CR, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2505.10903
Pdf URL: https://arxiv.org/pdf/2505.10903
Copy Paste: [[2505.10903]] On the Security Risks of ML-based Malware Detection Systems: A Survey(https://arxiv.org/abs/2505.10903)
Keywords: security, privacy, defense, attack
Abstract: Malware presents a persistent threat to user privacy and data integrity. To combat this, machine learning-based (ML-based) malware detection (MD) systems have been developed. However, these systems have increasingly been attacked in recent years, undermining their effectiveness in practice. While the security risks associated with ML-based MD systems have garnered considerable attention, the majority of prior work is limited to adversarial malware examples, lacking a comprehensive analysis of practical security risks. This paper addresses this gap by utilizing the CIA principles to define the scope of security risks. We then deconstruct ML-based MD systems into distinct operational stages, thus developing a stage-based taxonomy. Utilizing this taxonomy, we summarize the technical progress and discuss the gaps in the attack and defense proposals related to ML-based MD systems within each stage. Subsequently, we conduct two case studies, using both inter-stage and intra-stage analyses according to the stage-based taxonomy to provide new empirical insights. Based on these analyses and insights, we suggest potential future directions from both inter-stage and intra-stage perspectives.

Title: VISTA: Enhancing Vision-Text Alignment in MLLMs via Cross-Modal Mutual Information Maximization

Authors: Mingxiao Li, Na Su, Fang Qu, Zhizhou Zhong, Ziyang Chen, Zhaopeng Tu, Xiaolong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.10917
Pdf URL: https://arxiv.org/pdf/2505.10917
Copy Paste: [[2505.10917]] VISTA: Enhancing Vision-Text Alignment in MLLMs via Cross-Modal Mutual Information Maximization(https://arxiv.org/abs/2505.10917)
Keywords: large language model
Abstract: Current multimodal large language models (MLLMs) face a critical challenge in modality alignment, often exhibiting a bias towards textual information at the expense of other modalities like vision. This paper conducts a systematic information-theoretic analysis of the widely used cross-entropy loss in MLLMs, uncovering its implicit alignment objective. Our theoretical investigation reveals that this implicit objective has inherent limitations, leading to a degradation of cross-modal alignment as text sequence length increases, thereby hindering effective multimodal information fusion. To overcome these drawbacks, we propose Vision-Text Alignment (VISTA), a novel approach guided by our theoretical insights. VISTA introduces an explicit alignment objective designed to maximize cross-modal mutual information, preventing the degradation of visual alignment. Notably, VISTA enhances the visual understanding capabilities of existing MLLMs without requiring any additional trainable modules or extra training data, making it both efficient and practical. Our method significantly outperforms baseline models across more than a dozen benchmark datasets, including VQAv2, MMStar, and MME, paving the way for new directions in MLLM modal alignment research.

Title: A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?

Authors: Ada Chen, Yongjiang Wu, Junyuan Zhang, Shu Yang, Jen-tse Huang, Kun Wang, Wenxuan Wang, Shuai Wang
Subjects: cs.CL, cs.AI, cs.CR, cs.CV, cs.SE
Abstract URL: https://arxiv.org/abs/2505.10924
Pdf URL: https://arxiv.org/pdf/2505.10924
Copy Paste: [[2505.10924]] A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?(https://arxiv.org/abs/2505.10924)
Keywords: secure, security
Abstract: Recently, AI-driven interactions with computing devices have advanced from basic prototype tools to sophisticated, LLM-based systems that emulate human-like operations in graphical user interfaces. We are now witnessing the emergence of Computer-Using Agents (CUAs), capable of autonomously performing tasks such as navigating desktop applications, web pages, and mobile apps. However, as these agents grow in capability, they also introduce novel safety and security risks. Vulnerabilities in LLM-driven reasoning, with the added complexity of integrating multiple software components and multimodal inputs, further complicate the security landscape. In this paper, we present a systematization of knowledge on the safety and security threats of CUAs. We conduct a comprehensive literature review and distill our findings along four research objectives: (i) define the CUA that suits safety analysis; (ii) categorize current safety threats among CUAs; (iii) propose a comprehensive taxonomy of existing defensive strategies; (iv) summarize prevailing benchmarks, datasets, and evaluation metrics used to assess the safety and performance of CUAs. Building on these insights, our work provides future researchers with a structured foundation for exploring unexplored vulnerabilities and offers practitioners actionable guidance in designing and deploying secure Computer-Using Agents.

Title: A Dataset for Spatiotemporal-Sensitive POI Question Answering

Authors: Xiao Han, Dayan Pan, Xiangyu Zhao, Xuyuan Hu, Zhaolin Deng, Xiangjie Kong, Guojiang Shen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.10928
Pdf URL: https://arxiv.org/pdf/2505.10928
Copy Paste: [[2505.10928]] A Dataset for Spatiotemporal-Sensitive POI Question Answering(https://arxiv.org/abs/2505.10928)
Keywords: robust
Abstract: Spatiotemporal relationships are critical in data science, as many prediction and reasoning tasks require analysis across both spatial and temporal dimensions--for instance, navigating an unfamiliar city involves planning itineraries that sequence locations and timing cultural experiences. However, existing Question-Answering (QA) datasets lack sufficient spatiotemporal-sensitive questions, making them inadequate benchmarks for evaluating models' spatiotemporal reasoning capabilities. To address this gap, we introduce POI-QA, a novel spatiotemporal-sensitive QA dataset centered on Point of Interest (POI), constructed through three key steps: mining and aligning open-source vehicle trajectory data from GAIA with high-precision geographic POI data, rigorous manual validation of noisy spatiotemporal facts, and generating bilingual (Chinese/English) QA pairs that reflect human-understandable spatiotemporal reasoning tasks. Our dataset challenges models to parse complex spatiotemporal dependencies, and evaluations of state-of-the-art multilingual LLMs (e.g., Qwen2.5-7B, Llama3.1-8B) reveal stark limitations: even the top-performing model (Qwen2.5-7B fine-tuned with RAG+LoRA) achieves a top 10 Hit Ratio (HR@10) of only 0.41 on the easiest task, far below human performance at 0.56. This underscores persistent weaknesses in LLMs' ability to perform consistent spatiotemporal reasoning, while highlighting POI-QA as a robust benchmark to advance algorithms sensitive to spatiotemporal dynamics. The dataset is publicly available at this https URL.
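
The HR@10 figures quoted above can be read with the usual definition of a top-k hit ratio: the fraction of questions whose correct answer appears among the model's k highest-ranked candidates. The sketch below assumes that standard definition; the paper's exact evaluation protocol may differ, and the function name and toy data are hypothetical.

```python
def hit_ratio_at_k(ranked_candidates, correct_answers, k=10):
    """Fraction of queries whose correct answer is in the top-k candidates.

    ranked_candidates: one list per query, best candidate first.
    correct_answers:   the single ground-truth item for each query.
    """
    if len(ranked_candidates) != len(correct_answers) or not correct_answers:
        raise ValueError("need one ranked list per ground-truth answer")
    hits = sum(
        1
        for ranked, truth in zip(ranked_candidates, correct_answers)
        if truth in ranked[:k]
    )
    return hits / len(correct_answers)


# Toy example with k=2: the first query is a hit, the second is a miss.
rankings = [["poi_b", "poi_a", "poi_c"], ["poi_d", "poi_e", "poi_f"]]
answers = ["poi_a", "poi_f"]
score = hit_ratio_at_k(rankings, answers, k=2)  # 0.5
```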

Title: Physics-informed Temporal Alignment for Auto-regressive PDE Foundation Models

Authors: Congcong Zhu, Xiaoyan Xu, Jiayue Han, Jingrun Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.10930
Pdf URL: https://arxiv.org/pdf/2505.10930
Copy Paste: [[2505.10930]] Physics-informed Temporal Alignment for Auto-regressive PDE Foundation Models(https://arxiv.org/abs/2505.10930)
Keywords: robust
Abstract: Auto-regressive partial differential equation (PDE) foundation models have shown great potential in handling time-dependent data. However, these models suffer from the shortcut problem deeply rooted in auto-regressive prediction, causing error accumulation. The challenge becomes particularly evident for out-of-distribution data, as the pretraining performance may approach random model initialization for downstream tasks with long-term dynamics. To deal with this problem, we propose physics-informed temporal alignment (PITA), a self-supervised learning framework inspired by inverse problem solving. Specifically, PITA aligns the physical dynamics discovered at different time steps on each given PDE trajectory by integrating physics-informed constraints into the self-supervision signal. The alignment is derived from observation data without relying on known physics priors, indicating strong generalization ability to the out-of-distribution data. Extensive experiments show that PITA significantly enhances the accuracy and robustness of existing foundation models on diverse time-dependent PDE data. The code is available at this https URL.

Title: M4-SAR: A Multi-Resolution, Multi-Polarization, Multi-Scene, Multi-Source Dataset and Benchmark for Optical-SAR Fusion Object Detection

Authors: Chao Wang, Wei Lu, Xiang Li, Jian Yang, Lei Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.10931
Pdf URL: https://arxiv.org/pdf/2505.10931
Copy Paste: [[2505.10931]] M4-SAR: A Multi-Resolution, Multi-Polarization, Multi-Scene, Multi-Source Dataset and Benchmark for Optical-SAR Fusion Object Detection(https://arxiv.org/abs/2505.10931)
Keywords: robust
Abstract: Single-source remote sensing object detection using optical or SAR images struggles in complex environments. Optical images offer rich textural details but are often affected by low-light, cloud-obscured, or low-resolution conditions, reducing the detection performance. SAR images are robust to weather, but suffer from speckle noise and limited semantic expressiveness. Optical and SAR images provide complementary advantages, and fusing them can significantly improve the detection accuracy. However, progress in this field is hindered by the lack of large-scale, standardized datasets. To address these challenges, we propose the first comprehensive dataset for optical-SAR fusion object detection, named Multi-resolution, Multi-polarization, Multi-scene, Multi-source SAR dataset (M4-SAR). It contains 112,184 precisely aligned image pairs and nearly one million labeled instances with arbitrary orientations, spanning six key categories. To enable standardized evaluation, we develop a unified benchmarking toolkit that integrates six state-of-the-art multi-source fusion methods. Furthermore, we propose E2E-OSDet, a novel end-to-end multi-source fusion detection framework that mitigates cross-domain discrepancies and establishes a robust baseline for future studies. Extensive experiments on M4-SAR demonstrate that fusing optical and SAR data can improve mAP by 5.7% over single-source inputs, with particularly significant gains in complex environments. The dataset and code are publicly available at this https URL.

Title: Connecting the Dots: A Chain-of-Collaboration Prompting Framework for LLM Agents

Authors: Jiaxing Zhao, Hongbin Xie, Yuzhen Lei, Xuan Song, Zhuoran Shi, Lianxin Li, Shuangxue Liu, Haoran Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.10936
Pdf URL: https://arxiv.org/pdf/2505.10936
Copy Paste: [[2505.10936]] Connecting the Dots: A Chain-of-Collaboration Prompting Framework for LLM Agents(https://arxiv.org/abs/2505.10936)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated impressive performance in executing complex reasoning tasks. Chain-of-thought effectively enhances reasoning capabilities by unlocking the potential of large models, while multi-agent systems provide more comprehensive solutions by integrating the collective intelligence of multiple agents. However, both approaches face significant limitations. A single agent with chain-of-thought faces collaboration challenges due to the inherent complexity of designing cross-domain prompts. Meanwhile, multi-agent systems consume substantial tokens and inevitably dilute the primary problem, which is particularly problematic in business workflow tasks. To address these challenges, we propose Cochain, a collaboration prompting framework that effectively solves the business workflow collaboration problem by combining knowledge and prompts at a reduced cost. Specifically, we construct an integrated knowledge graph that incorporates knowledge from multiple stages. Furthermore, by maintaining and retrieving a prompt tree, we can obtain prompt information relevant to other stages of the business workflow. We perform extensive evaluations of Cochain across multiple datasets, demonstrating that Cochain outperforms all baselines in both prompt engineering and multi-agent LLMs. Additionally, expert evaluation results indicate that the use of a small model in combination with Cochain outperforms GPT-4.

Title: Accurate KV Cache Quantization with Outlier Tokens Tracing

Authors: Yi Su, Yuechi Zhou, Quantong Qiu, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.10938
Pdf URL: https://arxiv.org/pdf/2505.10938
Copy Paste: [[2505.10938]] Accurate KV Cache Quantization with Outlier Tokens Tracing(https://arxiv.org/abs/2505.10938)
Keywords: large language model
Abstract: The impressive capabilities of Large Language Models (LLMs) come at the cost of substantial computational resources during deployment. While KV Cache can significantly reduce recomputation during inference, it also introduces additional memory overhead. KV Cache quantization presents a promising solution, striking a good balance between memory usage and accuracy. Previous research has shown that the Keys are distributed by channel, while the Values are distributed by token. Consequently, the common practice is to apply channel-wise quantization to the Keys and token-wise quantization to the Values. However, our further investigation reveals that a small subset of unusual tokens exhibit unique characteristics that deviate from this pattern, which can substantially impact quantization accuracy. To address this, we develop a simple yet effective method to identify these tokens accurately during the decoding process and exclude them from quantization as outlier tokens, significantly improving overall accuracy. Extensive experiments show that our method achieves significant accuracy improvements under 2-bit quantization and can deliver a 6.4 times reduction in memory usage and a 2.3 times increase in throughput.
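
The difference between the two quantization granularities mentioned above can be sketched in a few lines of pure Python. This is only an illustration of channel-wise versus token-wise grouping with a simple min/max uniform quantizer; it does not implement the paper's outlier-token tracing, and the function names and toy matrix are hypothetical. A cache is modeled as a list of token rows, each holding one value per channel.

```python
def fake_quantize(values, bits=2):
    """Quantize one group to 2**bits levels, then map back to floats.

    All values in the group share one min/max range, so a single
    extreme value widens the step size for the whole group.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels
    return [lo + round((v - lo) / step) * step for v in values]


def quantize_per_token(matrix, bits=2):
    """Token-wise grouping: every row (token) gets its own range."""
    return [fake_quantize(row, bits) for row in matrix]


def quantize_per_channel(matrix, bits=2):
    """Channel-wise grouping: every column (channel) gets its own range."""
    columns = [fake_quantize(list(col), bits) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]


# Toy cache: 2 tokens x 2 channels, with one large-magnitude channel.
cache = [[0.0, 10.0], [3.0, 40.0]]
by_channel = quantize_per_channel(cache)  # each column keeps its own scale
by_token = quantize_per_token(cache)      # each row keeps its own scale
```

Because each group shares a single range, a token whose values behave unlike the rest of its channel stretches that channel's range and coarsens the step for every other token in it, which is the kind of effect that motivates treating such tokens separately.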

Title: GenKnowSub: Improving Modularity and Reusability of LLMs through General Knowledge Subtraction

Authors: Mohammadtaha Bagherifard, Sahar Rajabi, Ali Edalat, Yadollah Yaghoobzadeh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.10939
Pdf URL: https://arxiv.org/pdf/2505.10939
Copy Paste: [[2505.10939]] GenKnowSub: Improving Modularity and Reusability of LLMs through General Knowledge Subtraction(https://arxiv.org/abs/2505.10939)
Keywords: large language model
Abstract: Large language models often struggle with zero-shot generalization, and several modular approaches have been proposed to address this challenge. Yet, we hypothesize that a key limitation remains: the entanglement of general knowledge and task-specific adaptations. To overcome this, we propose a modular framework that disentangles these components by constructing a library of task-specific LoRA modules alongside a general-domain LoRA. By subtracting this general knowledge component from each task-specific module, we obtain residual modules that focus more exclusively on task-relevant information, a method we call general knowledge subtraction (GenKnowSub). Leveraging the refined task-specific modules and the Arrow routing algorithm (Ostapenko et al., 2024), we dynamically select and combine modules for new inputs without additional training. Our studies on the Phi-3 model and standard Arrow as baselines reveal that using general knowledge LoRAs derived from diverse languages, including English, French, and German, yields consistent performance gains in both monolingual and cross-lingual settings across a wide set of benchmarks. Further experiments on Phi-2 demonstrate how GenKnowSub generalizes to weaker LLMs. The complete code and data are available at this https URL.
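
The subtraction step described above can be illustrated with a small sketch. It assumes, as a simplification, that the subtraction is applied to the full low-rank weight updates (the product of the two LoRA factors) of a single layer; the abstract does not specify this detail, so treat it as an assumption. All names and the toy rank-1 factors are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(b_factor, a_factor):
    """Weight update of one LoRA module: delta_W = B @ A."""
    return matmul(b_factor, a_factor)


def subtract_general(task_delta, general_delta):
    """Residual update: task-specific delta minus the general-knowledge delta."""
    return [
        [t - g for t, g in zip(task_row, general_row)]
        for task_row, general_row in zip(task_delta, general_delta)
    ]


# Toy rank-1 modules for a 2x2 weight matrix.
task_update = lora_delta([[1.0], [2.0]], [[3.0, 4.0]])     # [[3, 4], [6, 8]]
general_update = lora_delta([[1.0], [0.0]], [[1.0, 1.0]])  # [[1, 1], [0, 0]]
residual = subtract_general(task_update, general_update)   # [[2, 3], [6, 8]]
```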

Title: Privacy-Aware Lifelong Learning

Authors: Ozan Özdenizci, Elmar Rueckert, Robert Legenstein
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.10941
Pdf URL: https://arxiv.org/pdf/2505.10941
Copy Paste: [[2505.10941]] Privacy-Aware Lifelong Learning(https://arxiv.org/abs/2505.10941)
Keywords: privacy
Abstract: Lifelong learning algorithms enable models to incrementally acquire new knowledge without forgetting previously learned information. Conversely, the field of machine unlearning focuses on explicitly forgetting certain previous knowledge from pretrained models when requested, in order to comply with data privacy regulations on the right-to-be-forgotten. Enabling efficient lifelong learning with the capability to selectively unlearn sensitive information from models presents a critical and largely unaddressed challenge with contradicting objectives. We address this problem from the perspective of simultaneously preventing catastrophic forgetting and allowing forward knowledge transfer during task-incremental learning, while ensuring exact task unlearning and minimizing memory requirements, based on a single neural network model to be adapted. Our proposed solution, privacy-aware lifelong learning (PALL), involves optimization of task-specific sparse subnetworks with parameter sharing within a single architecture. We additionally utilize an episodic memory rehearsal mechanism to facilitate exact unlearning without performance degradation. We empirically demonstrate the scalability of PALL across various architectures in image classification, and provide a state-of-the-art solution that uniquely integrates lifelong learning and privacy-aware unlearning mechanisms for responsible AI applications.

Title: Nosy Layers, Noisy Fixes: Tackling DRAs in Federated Learning Systems using Explainable AI

Authors: Meghali Nandi, Arash Shaghaghi, Nazatul Haque Sultan, Gustavo Batista, Raymond K. Zhao, Sanjay Jha
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.10942
Pdf URL: https://arxiv.org/pdf/2505.10942
Copy Paste: [[2505.10942]] Nosy Layers, Noisy Fixes: Tackling DRAs in Federated Learning Systems using Explainable AI(https://arxiv.org/abs/2505.10942)
Keywords: privacy, protect, defense, attack, federate
Abstract: Federated Learning (FL) has emerged as a powerful paradigm for collaborative model training while keeping client data decentralized and private. However, it is vulnerable to Data Reconstruction Attacks (DRA) such as "LoKI" and "Robbing the Fed", where malicious models sent from the server to the client can reconstruct sensitive user data. To counter this, we introduce DRArmor, a novel defense mechanism that integrates Explainable AI with targeted detection and mitigation strategies for DRA. Unlike existing defenses that focus on the entire model, DRArmor identifies and addresses the root cause (i.e., malicious layers within the model that send gradients with malicious intent) by analyzing their contribution to the output and detecting inconsistencies in gradient values. Once these malicious layers are identified, DRArmor applies defense techniques such as noise injection, pixelation, and pruning to these layers rather than the whole model, minimizing the attack surface and preserving client data privacy. We evaluate DRArmor's performance against the advanced LoKI attack across diverse datasets, including MNIST, CIFAR-10, CIFAR-100, and ImageNet, in a 200-client FL setup. Our results demonstrate DRArmor's effectiveness in mitigating data leakage, achieving high True Positive and True Negative Rates of 0.910 and 0.890, respectively. Additionally, DRArmor maintains an average accuracy of 87%, effectively protecting client privacy without compromising model performance. Compared to existing defense mechanisms, DRArmor reduces the data leakage rate by 62.5% with datasets containing 500 samples per client.

Title: Semantic Aware Linear Transfer by Recycling Pre-trained Language Models for Cross-lingual Transfer

Authors: Seungyoon Lee, Seongtae Hong, Hyeonseok Moon, Heuiseok Lim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.10945
Pdf URL: https://arxiv.org/pdf/2505.10945
Copy Paste: [[2505.10945]] Semantic Aware Linear Transfer by Recycling Pre-trained Language Models for Cross-lingual Transfer(https://arxiv.org/abs/2505.10945)
Keywords: large language model
Abstract: Large Language Models (LLMs) increasingly incorporate multilingual capabilities, fueling the demand to transfer them into target language-specific models. However, most approaches, which blend the source model's embedding by replacing the source vocabulary with the target language-specific vocabulary, may constrain expressive capacity in the target language since the source model is predominantly trained on English data. In this paper, we propose Semantic Aware Linear Transfer (SALT), a novel cross-lingual transfer technique that recycles embeddings from target language Pre-trained Language Models (PLMs) to transmit the deep representational strengths of PLM-derived embeddings to LLMs. SALT derives unique regression lines based on the similarity in the overlap of the source and target vocabularies to handle each non-overlapping token's embedding space. Our extensive experiments show that SALT significantly outperforms other transfer methods and achieves lower loss and faster convergence during language adaptation. Notably, SALT obtains remarkable performance in cross-lingual understanding setups compared to other methods. Furthermore, we highlight the scalable use of PLMs to enhance the functionality of contemporary LLMs by conducting experiments with varying architectures.

Title: The Way We Prompt: Conceptual Blending, Neural Dynamics, and Prompt-Induced Transitions in LLMs

Authors: Makoto Sato
Subjects: cs.CL, q-bio.NC
Abstract URL: https://arxiv.org/abs/2505.10948
Pdf URL: https://arxiv.org/pdf/2505.10948
Copy Paste: [[2505.10948]] The Way We Prompt: Conceptual Blending, Neural Dynamics, and Prompt-Induced Transitions in LLMs(https://arxiv.org/abs/2505.10948)
Keywords: large language model
Abstract: Large language models (LLMs), inspired by neuroscience, exhibit behaviors that often evoke a sense of personality and intelligence, yet the mechanisms behind these effects remain elusive. Here, we operationalize Conceptual Blending Theory (CBT) as an experimental framework, using prompt-based methods to reveal how LLMs blend and compress meaning. By systematically investigating Prompt-Induced Transitions (PIT) and Prompt-Induced Hallucinations (PIH), we uncover structural parallels and divergences between artificial and biological cognition. Our approach bridges linguistics, neuroscience, and empirical AI research, demonstrating that human-AI collaboration can serve as a living prototype for the future of cognitive science. This work proposes prompt engineering not just as a technical tool, but as a scientific method for probing the deep structure of meaning itself.

Title: Shackled Dancing: A Bit-Locked Diffusion Algorithm for Lossless and Controllable Image Steganography

Authors: Tianshuo Zhang, Gao Jia, Wenzhe Zhai, Rui Yann, Xianglei Xing
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.10950
Pdf URL: https://arxiv.org/pdf/2505.10950
Copy Paste: [[2505.10950]] Shackled Dancing: A Bit-Locked Diffusion Algorithm for Lossless and Controllable Image Steganography(https://arxiv.org/abs/2505.10950)
Keywords: secure, security, robust, diffusion, generative
Abstract: Data steganography aims to conceal information within visual content, yet existing spatial- and frequency-domain approaches suffer from trade-offs between security, capacity, and perceptual quality. Recent advances in generative models, particularly diffusion models, offer new avenues for adaptive image synthesis, but integrating precise information embedding into the generative process remains challenging. We introduce Shackled Dancing Diffusion, or SD$^2$, a plug-and-play generative steganography method that combines bit-position locking with diffusion sampling injection to enable controllable information embedding within the generative trajectory. SD$^2$ leverages the expressive power of diffusion models to synthesize diverse carrier images while maintaining full message recovery with 100% accuracy. Our method achieves a favorable balance between randomness and constraint, enhancing robustness against steganalysis without compromising image fidelity. Extensive experiments show that SD$^2$ substantially outperforms prior methods in security, embedding capacity, and stability. This algorithm offers new insights into controllable generation and opens promising directions for secure visual communication.

Title: SubGCache: Accelerating Graph-based RAG with Subgraph-level KV Cache

Authors: Qiuyu Zhu, Liang Zhang, Qianxiong Xu, Cheng Long, Jie Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.10951
Pdf URL: https://arxiv.org/pdf/2505.10951
Copy Paste: [[2505.10951]] SubGCache: Accelerating Graph-based RAG with Subgraph-level KV Cache(https://arxiv.org/abs/2505.10951)
Keywords: large language model
Abstract: Graph-based retrieval-augmented generation (RAG) enables large language models (LLMs) to incorporate structured knowledge via graph retrieval as contextual input, enabling more accurate and context-aware reasoning. We observe that different queries may retrieve similar subgraphs as prompts, and thus we propose SubGCache, which aims to reduce inference latency by reusing computation across queries with similar structural prompts (i.e., subgraphs). Specifically, SubGCache clusters queries based on subgraph embeddings, constructs a representative subgraph for each cluster, and pre-computes the key-value (KV) cache of the representative subgraph. For each query whose retrieved subgraph falls within a cluster, it reuses the pre-computed KV cache of that cluster's representative subgraph rather than recomputing the KV tensors, saving computation. Experiments on two new datasets across multiple LLM backbones and graph-based RAG frameworks demonstrate that SubGCache consistently reduces inference latency with comparable and even improved generation quality, achieving up to 6.68× reduction in time-to-first-token (TTFT).
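
Note: the cluster-then-reuse idea in this abstract can be illustrated with a toy lookup. This is only a sketch under stated assumptions, not the SubGCache implementation: the names `nearest_cluster`, `RepresentativeCache`, and `build_kv` are hypothetical, Euclidean distance to fixed centroids stands in for whatever clustering the paper uses, and the cache entry is built lazily here for brevity, whereas the abstract describes pre-computing it.

```python
import math


def nearest_cluster(embedding, centroids):
    """Return the index of the centroid closest to the query's subgraph embedding."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(embedding, centroids[i]))


class RepresentativeCache:
    """Toy per-cluster cache: one expensive build per cluster, then reuse."""

    def __init__(self, centroids, build_kv):
        self.centroids = centroids   # assumed to come from an earlier clustering step
        self.build_kv = build_kv     # hypothetical stand-in for the LLM prefill pass
        self._cache = {}
        self.builds = 0              # counts how often the expensive step ran

    def get(self, embedding):
        cluster_id = nearest_cluster(embedding, self.centroids)
        if cluster_id not in self._cache:
            self._cache[cluster_id] = self.build_kv(cluster_id)
            self.builds += 1
        return cluster_id, self._cache[cluster_id]


cache = RepresentativeCache(
    centroids=[(0.0, 0.0), (10.0, 10.0)],
    build_kv=lambda cid: f"kv-for-cluster-{cid}",  # placeholder payload, not real KV tensors
)
first = cache.get((1.0, 1.0))
second = cache.get((0.5, 0.0))   # same cluster, so the cached entry is reused
third = cache.get((9.0, 9.0))    # different cluster, so one more build
```

Queries that land in the same cluster share one cached entry, which is where the latency saving described in the abstract would come from.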

Title: Relational Graph Transformer

Authors: Vijay Prakash Dwivedi, Sri Jaladi, Yangyi Shen, Federico López, Charilaos I. Kanatsoulis, Rishi Puri, Matthias Fey, Jure Leskovec
Subjects: cs.LG, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2505.10960
Pdf URL: https://arxiv.org/pdf/2505.10960
Copy Paste: [[2505.10960]] Relational Graph Transformer(https://arxiv.org/abs/2505.10960)
Keywords: transformer
Abstract: Relational Deep Learning (RDL) is a promising approach for building state-of-the-art predictive models on multi-table relational data by representing it as a heterogeneous temporal graph. However, commonly used Graph Neural Network models suffer from fundamental limitations in capturing complex structural patterns and long-range dependencies that are inherent in relational data. While Graph Transformers have emerged as powerful alternatives to GNNs on general graphs, applying them to relational entity graphs presents unique challenges: (i) Traditional positional encodings fail to generalize to massive, heterogeneous graphs; (ii) existing architectures cannot model the temporal dynamics and schema constraints of relational data; (iii) existing tokenization schemes lose critical structural information. Here we introduce the Relational Graph Transformer (RelGT), the first graph transformer architecture designed specifically for relational tables. RelGT employs a novel multi-element tokenization strategy that decomposes each node into five components (features, type, hop distance, time, and local structure), enabling efficient encoding of heterogeneity, temporality, and topology without expensive precomputation. Our architecture combines local attention over sampled subgraphs with global attention to learnable centroids, incorporating both local and database-wide representations. Across 21 tasks from the RelBench benchmark, RelGT consistently matches or outperforms GNN baselines by up to 18%, establishing Graph Transformers as a powerful architecture for Relational Deep Learning.

Title: Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio

Authors: Xinlu He, Jacob Whitehill
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.10975
Pdf URL: https://arxiv.org/pdf/2505.10975
Copy Paste: [[2505.10975]] Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio(https://arxiv.org/abs/2505.10975)
Keywords: robust, segmentation
Abstract: Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing and attributing words to individual speakers, particularly in overlapping speech. Recent advances have driven the shift from cascade systems to end-to-end (E2E) architectures, which reduce error propagation and better exploit the synergy between speech content and speaker identity. Despite rapid progress in E2E multi-speaker ASR, the field lacks a comprehensive review of recent developments. This survey provides a systematic taxonomy of E2E neural approaches for multi-speaker ASR, highlighting recent advances and comparative analysis. Specifically, we analyze: (1) architectural paradigms (SIMO vs. SISO) for pre-segmented audio, analyzing their distinct characteristics and trade-offs; (2) recent architectural and algorithmic improvements based on these two paradigms; (3) extensions to long-form speech, including segmentation strategy and speaker-consistent hypothesis stitching. Further, we (4) evaluate and compare methods across standard benchmarks. We conclude with a discussion of open challenges and future research directions towards building robust and scalable multi-speaker ASR.

Title: Group-in-Group Policy Optimization for LLM Agent Training

Authors: Lang Feng, Zhenghai Xue, Tingcong Liu, Bo An
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.10978
Pdf URL: https://arxiv.org/pdf/2505.10978
Copy Paste: [[2505.10978]] Group-in-Group Policy Optimization for LLM Agent Training(https://arxiv.org/abs/2505.10978)
Keywords: large language model
Abstract: Recent advances in group-based reinforcement learning (RL) have driven frontier large language models (LLMs) in single-turn tasks like mathematical reasoning. However, their scalability to long-horizon LLM agent training remains limited. Unlike static tasks, agent-environment interactions unfold over many steps and often yield sparse or delayed rewards, making credit assignment across individual steps significantly more challenging. In this work, we propose Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that achieves fine-grained credit assignment for LLM agents while preserving the appealing properties of group-based RL: critic-free, low memory, and stable convergence. GiGPO introduces a two-level structure for estimating relative advantage: (i) At the episode-level, GiGPO computes macro relative advantages based on groups of complete trajectories; (ii) At the step-level, GiGPO introduces an anchor state grouping mechanism that retroactively constructs step-level groups by identifying repeated environment states across trajectories. Actions stemming from the same state are grouped together, enabling micro relative advantage estimation. This hierarchical structure effectively captures both global trajectory quality and local step effectiveness without relying on auxiliary models or additional rollouts. We evaluate GiGPO on two challenging agent benchmarks, ALFWorld and WebShop, using Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct. Crucially, GiGPO delivers fine-grained per-step credit signals and achieves performance gains of > 12% on ALFWorld and > 9% on WebShop over the GRPO baseline, all while maintaining the same GPU memory overhead, identical LLM rollout, and incurring little to no additional time cost.
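
Note: the two-level grouping described in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' algorithm: the mean/standard-deviation normalization, the per-step return values, and the function names are assumptions, and the abstract does not say how the two levels are combined, so the sketch keeps them separate.

```python
from collections import defaultdict
from statistics import mean, pstdev


def relative_advantages(values, eps=1e-8):
    """Normalize values within one group (assumed form: (v - mean) / std)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def episode_advantages(episode_returns):
    """Episode level: one group made of complete trajectories."""
    return relative_advantages(episode_returns)


def step_advantages(trajectories):
    """Step level: group steps across trajectories that share the same state.

    `trajectories` is a list of trajectories, each a list of
    (state, step_return) pairs; states must be hashable.
    Returns {(trajectory_index, step_index): advantage}.
    """
    groups = defaultdict(list)
    for t_idx, trajectory in enumerate(trajectories):
        for s_idx, (state, step_return) in enumerate(trajectory):
            groups[state].append((t_idx, s_idx, step_return))

    advantages = {}
    for members in groups.values():
        scores = relative_advantages([ret for _, _, ret in members])
        for (t_idx, s_idx, _), score in zip(members, scores):
            advantages[(t_idx, s_idx)] = score
    return advantages


# Two toy trajectories that both pass through the state "s0".
toy = [[("s0", 1.0), ("s1", 0.5)],
       [("s0", 0.0), ("s2", 0.3)]]
macro = episode_advantages([1.0, 0.0])
micro = step_advantages(toy)
```

In this toy run the shared state "s0" yields a non-zero step-level signal, while states visited only once form singleton groups and receive an advantage of zero.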

Title: GenoArmory: A Unified Evaluation Framework for Adversarial Attacks on Genomic Foundation Models

Authors: Haozheng Luo, Chenghao Qiu, Yimin Wang, Shang Wu, Jiahao Yu, Han Liu, Binghui Wang, Yan Chen
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2505.10983
Pdf URL: https://arxiv.org/pdf/2505.10983
Copy Paste: [[2505.10983]] GenoArmory: A Unified Evaluation Framework for Adversarial Attacks on Genomic Foundation Models(https://arxiv.org/abs/2505.10983)
Keywords: defense, attack, robust, generative
Abstract: We propose the first unified adversarial attack benchmark for Genomic Foundation Models (GFMs), named GenoArmory. Unlike existing GFM benchmarks, GenoArmory offers the first comprehensive evaluation framework to systematically assess the vulnerability of GFMs to adversarial attacks. Methodologically, we evaluate the adversarial robustness of five state-of-the-art GFMs using four widely adopted attack algorithms and three defense strategies. Importantly, our benchmark provides an accessible and comprehensive framework to analyze GFM vulnerabilities with respect to model architecture, quantization schemes, and training datasets. Additionally, we introduce GenoAdv, a new adversarial sample dataset designed to improve GFM safety. Empirically, classification models exhibit greater robustness to adversarial perturbations compared to generative models, highlighting the impact of task type on model vulnerability. Moreover, adversarial attacks frequently target biologically significant genomic regions, suggesting that these models effectively capture meaningful sequence features.

Title: ReaCritic: Large Reasoning Transformer-based DRL Critic-model Scaling For Heterogeneous Networks

Authors: Feiran You, Hongyang Du
Subjects: cs.LG, cs.NI
Abstract URL: https://arxiv.org/abs/2505.10992
Pdf URL: https://arxiv.org/pdf/2505.10992
Copy Paste: [[2505.10992]] ReaCritic: Large Reasoning Transformer-based DRL Critic-model Scaling For Heterogeneous Networks(https://arxiv.org/abs/2505.10992)
Keywords: transformer, large language model
Abstract: Heterogeneous Networks (HetNets) pose critical challenges for intelligent management due to the diverse user requirements and time-varying wireless conditions. These factors introduce significant decision complexity, which limits the adaptability of existing Deep Reinforcement Learning (DRL) methods. In many DRL algorithms, especially those involving value-based or actor-critic structures, the critic component plays a key role in guiding policy learning by estimating value functions. However, conventional critic models often use shallow architectures that map observations directly to scalar estimates, limiting their ability to handle multi-task complexity. In contrast, recent progress in inference-time scaling of Large Language Models (LLMs) has shown that generating intermediate reasoning steps can significantly improve decision quality. Motivated by this, we propose ReaCritic, a large reasoning transformer-based critic-model scaling scheme that brings reasoning ability into DRL. ReaCritic performs horizontal reasoning over parallel state-action inputs and vertical reasoning through deep transformer stacks. It is compatible with a broad range of value-based and actor-critic DRL algorithms and enhances generalization in dynamic wireless environments. Extensive experiments demonstrate that ReaCritic improves convergence speed and final performance across various HetNet settings and standard OpenAI Gym control tasks.

Title: Visual Anomaly Detection under Complex View-Illumination Interplay: A Large-Scale Benchmark

Authors: Yunkang Cao, Yuqi Cheng, Xiaohao Xu, Yiheng Zhang, Yihan Sun, Yuxiang Tan, Yuxin Zhang, Xiaonan Huang, Weiming Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.10996
Pdf URL: https://arxiv.org/pdf/2505.10996
Copy Paste: [[2505.10996]] Visual Anomaly Detection under Complex View-Illumination Interplay: A Large-Scale Benchmark(https://arxiv.org/abs/2505.10996)
Keywords: robust
Abstract: The practical deployment of Visual Anomaly Detection (VAD) systems is hindered by their sensitivity to real-world imaging variations, particularly the complex interplay between viewpoint and illumination which drastically alters defect visibility. Current benchmarks largely overlook this critical challenge. We introduce Multi-View Multi-Illumination Anomaly Detection (M2AD), a new large-scale benchmark comprising 119,880 high-resolution images designed explicitly to probe VAD robustness under such interacting conditions. By systematically capturing 999 specimens across 10 categories using 12 synchronized views and 10 illumination settings (120 configurations total), M2AD enables rigorous evaluation. We establish two evaluation protocols: M2AD-Synergy tests the ability to fuse information across diverse configurations, and M2AD-Invariant measures single-image robustness against realistic view-illumination effects. Our extensive benchmarking shows that state-of-the-art VAD methods struggle significantly on M2AD, demonstrating the profound challenge posed by view-illumination interplay. This benchmark serves as an essential tool for developing and validating VAD methods capable of overcoming real-world complexities. Our full dataset and test suite will be released at this https URL to facilitate the field.
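
Note: the dataset size quoted in this abstract is consistent with its other figures. The check below assumes each specimen is imaged exactly once under every view-illumination pair, which the abstract implies but does not state outright.

```python
# Figures quoted in the abstract.
specimens = 999
views = 12
illuminations = 10

# Assumption: every specimen is captured under every view-illumination pair.
configurations = views * illuminations
total_images = specimens * configurations

print(configurations)  # 120
print(total_images)    # 119880
```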

Title: DDAE++: Enhancing Diffusion Models Towards Unified Generative and Discriminative Learning

Authors: Weilai Xiang, Hongyu Yang, Di Huang, Yunhong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.10999
Pdf URL: https://arxiv.org/pdf/2505.10999
Copy Paste: [[2505.10999]] DDAE++: Enhancing Diffusion Models Towards Unified Generative and Discriminative Learning(https://arxiv.org/abs/2505.10999)
Keywords: diffusion, generative
Abstract: While diffusion models have gained prominence in image synthesis, their generative pre-training has been shown to yield discriminative representations, paving the way towards unified visual generation and understanding. However, two key questions remain: 1) Can these representations be leveraged to improve the training of diffusion models themselves, rather than solely benefiting downstream tasks? 2) Can the feature quality be enhanced to rival or even surpass modern self-supervised learners, without compromising generative capability? This work addresses these questions by introducing self-conditioning, a straightforward yet effective mechanism that internally leverages the rich semantics inherent in the denoising network to guide its own decoding layers, forming a tighter bottleneck that condenses high-level semantics to improve generation. Results are compelling: our method boosts both generation FID and recognition accuracy with 1% computational overhead and generalizes across diverse diffusion architectures. Crucially, self-conditioning facilitates an effective integration of discriminative techniques, such as contrastive self-distillation, directly into diffusion models without sacrificing generation quality. Extensive experiments on pixel-space and latent-space datasets show that in linear evaluations, our enhanced diffusion models, particularly UViT and DiT, serve as strong representation learners, surpassing various self-supervised models.

Title: Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning

Authors: Jingcheng Niu, Subhabrata Dutta, Ahmed Elshabrawy, Harish Tayyar Madabushi, Iryna Gurevych
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11004
Pdf URL: https://arxiv.org/pdf/2505.11004
Copy Paste: [[2505.11004]] Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning(https://arxiv.org/abs/2505.11004)
Keywords: security, interpretability, transformer
Abstract: Large-scale Transformer language models (LMs) trained solely on next-token prediction with web-scale data can solve a wide range of tasks after seeing just a few examples. The mechanism behind this capability, known as in-context learning (ICL), remains both controversial and poorly understood. Some studies argue that it is merely the result of memorizing vast amounts of data, while others contend that it reflects a fundamental, symbolic algorithmic development in LMs. In this work, we introduce a suite of investigative tasks and a novel method to systematically investigate ICL by leveraging the full Pythia scaling suite, including interim checkpoints that capture progressively larger amounts of training data. By carefully exploring ICL performance on downstream tasks and simultaneously conducting a mechanistic analysis of the residual stream's subspace, we demonstrate that ICL extends beyond mere "memorization" of the training corpus, yet does not amount to the implementation of an independent symbolic algorithm. Our results also clarify several aspects of ICL, including the influence of training dynamics, model capabilities, and elements of mechanistic interpretability. Overall, our work advances the understanding of ICL and its implications, offering model developers insights into potential improvements and providing AI security practitioners with a basis for more informed guidelines.

Title: Reconstructing Syllable Sequences in Abugida Scripts with Incomplete Inputs

Authors: Ye Kyaw Thu, Thazin Myint Oo
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.11008
Pdf URL: https://arxiv.org/pdf/2505.11008
Copy Paste: [[2505.11008]] Reconstructing Syllable Sequences in Abugida Scripts with Incomplete Inputs(https://arxiv.org/abs/2505.11008)
Keywords: robust, transformer
Abstract: This paper explores syllable sequence prediction in Abugida languages using Transformer-based models, focusing on six languages: Bengali, Hindi, Khmer, Lao, Myanmar, and Thai, from the Asian Language Treebank (ALT) dataset. We investigate the reconstruction of complete syllable sequences from various incomplete input types, including consonant sequences, vowel sequences, partial syllables (with random character deletions), and masked syllables (with fixed syllable deletions). Our experiments reveal that consonant sequences play a critical role in accurate syllable prediction, achieving high BLEU scores, while vowel sequences present a significantly greater challenge. The model demonstrates robust performance across tasks, particularly in handling partial and masked syllable reconstruction, with strong results for tasks involving consonant information and syllable masking. This study advances the understanding of sequence prediction for Abugida languages and provides practical insights for applications such as text prediction, spelling correction, and data augmentation in these scripts.

Title: Review-Instruct: A Review-Driven Multi-Turn Conversations Generation Method for Large Language Models

Authors: Jiangxu Wu, Cong Wang, TianHuang Su, Jun Yang, Haozhi Lin, Chao Zhang, Ming Peng, Kai Shi, SongPan Yang, BinQing Pan, ZiXian Li, Ni Yang, ZhenYu Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11010
Pdf URL: https://arxiv.org/pdf/2505.11010
Copy Paste: [[2505.11010]] Review-Instruct: A Review-Driven Multi-Turn Conversations Generation Method for Large Language Models(https://arxiv.org/abs/2505.11010)
Keywords: large language model
Abstract: The effectiveness of large language models (LLMs) in conversational AI is hindered by their reliance on single-turn supervised fine-tuning (SFT) data, which limits contextual coherence in multi-turn dialogues. Existing methods for generating multi-turn dialogue data struggle to ensure both diversity and quality in instructions. To address this, we propose Review-Instruct, a novel framework that synthesizes multi-turn conversations through an iterative "Ask-Respond-Review" process involving three agent roles: a Candidate, multiple Reviewers, and a Chairman. The framework iteratively refines instructions by incorporating Reviewer feedback, enhancing dialogue diversity and difficulty. We construct a multi-turn dataset using the Alpaca dataset and fine-tune the LLaMA2-13B model. Evaluations on MT-Bench, MMLU-Pro, and Auto-Arena demonstrate significant improvements, achieving absolute gains of 2.9% on MMLU-Pro and 2% on MT-Bench compared to prior state-of-the-art models based on LLaMA2-13B. Ablation studies confirm the critical role of the Review stage and the use of multiple Reviewers in boosting instruction diversity and difficulty. Our work highlights the potential of review-driven, multi-agent frameworks for generating high-quality conversational data at scale.

Title: Towards Robust and Controllable Text-to-Motion via Masked Autoregressive Diffusion

Authors: Zongye Zhang, Bohan Kong, Qingjie Liu, Yunhong Wang
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2505.11013
Pdf URL: https://arxiv.org/pdf/2505.11013
Copy Paste: [[2505.11013]] Towards Robust and Controllable Text-to-Motion via Masked Autoregressive Diffusion(https://arxiv.org/abs/2505.11013)
Keywords: robust, diffusion
Abstract: Generating 3D human motion from text descriptions remains challenging due to the diverse and complex nature of human motion. While existing methods excel within the training distribution, they often struggle with out-of-distribution motions, limiting their applicability in real-world scenarios. Existing VQVAE-based methods often fail to represent novel motions faithfully using discrete tokens, which hampers their ability to generalize beyond seen data. Meanwhile, diffusion-based methods operating on continuous representations often lack fine-grained control over individual frames. To address these challenges, we propose a robust motion generation framework MoMADiff, which combines masked modeling with diffusion processes to generate motion using frame-level continuous representations. Our model supports flexible user-provided keyframe specification, enabling precise control over both spatial and temporal aspects of motion synthesis. MoMADiff demonstrates strong generalization capability on novel text-to-motion datasets with sparse keyframes as motion prompts. Extensive experiments on two held-out datasets and two standard benchmarks show that our method consistently outperforms state-of-the-art models in motion quality, instruction fidelity, and keyframe adherence.

Title: WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild?

Authors: An-Lan Wang, Jingqun Tang, Liao Lei, Hao Feng, Qi Liu, Xiang Fei, Jinghui Lu, Han Wang, Weiwei Liu, Hao Liu, Yuliang Liu, Xiang Bai, Can Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.11015
Pdf URL: https://arxiv.org/pdf/2505.11015
Copy Paste: [[2505.11015]] WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild?(https://arxiv.org/abs/2505.11015)
Keywords: robust, large language model
Abstract: The rapid advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced capabilities in Document Understanding. However, prevailing benchmarks like DocVQA and ChartQA predominantly comprise scanned or digital documents, inadequately reflecting the intricate challenges posed by diverse real-world scenarios, such as variable illumination and physical distortions. This paper introduces WildDoc, the inaugural benchmark designed specifically for assessing document understanding in natural environments. WildDoc incorporates a diverse set of manually captured document images reflecting real-world conditions and leverages document sources from established benchmarks to facilitate comprehensive comparisons with digital or scanned documents. Further, to rigorously evaluate model robustness, each document is captured four times under different conditions. Evaluations of state-of-the-art MLLMs on WildDoc expose substantial performance declines and underscore the models' inadequate robustness compared to traditional benchmarks, highlighting the unique challenges posed by real-world document understanding. Our project homepage is available at this https URL.

Title: GoLeash: Mitigating Golang Software Supply Chain Attacks with Runtime Policy Enforcement

Authors: Carmine Cesarano, Martin Monperrus, Roberto Natella
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.11016
Pdf URL: https://arxiv.org/pdf/2505.11016
Copy Paste: [[2505.11016]] GoLeash: Mitigating Golang Software Supply Chain Attacks with Runtime Policy Enforcement(https://arxiv.org/abs/2505.11016)
Keywords: security, attack
Abstract: Modern software supply chain attacks consist of introducing new, malicious capabilities into trusted third-party software components, in order to propagate to a victim through a package dependency chain. These attacks are especially concerning for the Go language ecosystem, which is extensively used in critical cloud infrastructures. We present GoLeash, a novel system that applies the principle of least privilege at the package-level granularity, by enforcing distinct security policies for each package in the supply chain. This finer granularity enables GoLeash to detect malicious packages more precisely than traditional sandboxing that handles security policies at process- or container-level. Moreover, GoLeash remains effective under obfuscation, can overcome the limitations of static analysis, and incurs acceptable runtime overhead.

Title: Logo-LLM: Local and Global Modeling with Large Language Models for Time Series Forecasting

Authors: Wenjie Ou, Zhishuo Zhao, Dongyue Guo, Yi Lin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.11017
Pdf URL: https://arxiv.org/pdf/2505.11017
Copy Paste: [[2505.11017]] Logo-LLM: Local and Global Modeling with Large Language Models for Time Series Forecasting(https://arxiv.org/abs/2505.11017)
Keywords: transformer, large language model
Abstract: Time series forecasting is critical across multiple domains, where time series data exhibits both local patterns and global dependencies. While Transformer-based methods effectively capture global dependencies, they often overlook short-term local variations in time series. Recent methods that adapt large language models (LLMs) into time series forecasting inherit this limitation by treating LLMs as black-box encoders, relying solely on the final-layer output and underutilizing hierarchical representations. To address this limitation, we propose Logo-LLM, a novel LLM-based framework that explicitly extracts and models multi-scale temporal features from different layers of a pre-trained LLM. Through empirical analysis, we show that shallow layers of LLMs capture local dynamics in time series, while deeper layers encode global trends. Moreover, Logo-LLM introduces lightweight Local-Mixer and Global-Mixer modules to align and integrate features with the temporal input across layers. Extensive experiments demonstrate that Logo-LLM achieves superior performance across diverse benchmarks, with strong generalization in few-shot and zero-shot settings while maintaining low computational overhead.

Title: Rethinking the Mean Teacher Strategy from the Perspective of Self-paced Learning

Authors: Pengchen Zhang, Alan J.X. Guo, Sipin Luo, Zhe Han, Lin Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.11018
Pdf URL: https://arxiv.org/pdf/2505.11018
Copy Paste: [[2505.11018]] Rethinking the Mean Teacher Strategy from the Perspective of Self-paced Learning(https://arxiv.org/abs/2505.11018)
Keywords: segmentation
Abstract: Semi-supervised medical image segmentation has attracted significant attention due to its potential to reduce manual annotation costs. The mean teacher (MT) strategy, commonly understood as introducing smoothed, temporally lagged consistency regularization, has demonstrated strong performance across various tasks in this field. In this work, we reinterpret the MT strategy on supervised data as a form of self-paced learning, regulated by the output agreement between the temporally lagged teacher model and the ground truth labels. This idea is further extended to incorporate agreement between a temporally lagged model and a cross-architectural model, which offers greater flexibility in regulating the learning pace and enables application to unlabeled data. Specifically, we propose dual teacher-student learning (DTSL), a framework that introduces two groups of teacher-student models with different architectures. The output agreement between the cross-group teacher and student models is used as pseudo-labels, generated via a Jensen-Shannon divergence-based consensus label generator (CLG). Extensive experiments on popular datasets demonstrate that the proposed method consistently outperforms existing state-of-the-art approaches. Ablation studies further validate the effectiveness of the proposed modules.
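
Note: the Jensen-Shannon divergence mentioned in this abstract measures how far two predictive distributions disagree. The sketch below computes it for two class-probability vectors; the `consensus_label` rule and its `threshold` are hypothetical illustrations of how agreement might gate a pseudo-label, not the paper's consensus label generator.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats, using the convention 0 * log(0 / q) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def consensus_label(p, q, threshold=0.1):
    """Hypothetical rule: emit the argmax of the averaged prediction only
    when the two models agree closely; otherwise return None."""
    if js_divergence(p, q) > threshold:
        return None
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return max(range(len(m)), key=m.__getitem__)


agree = consensus_label([0.9, 0.1], [0.8, 0.2])     # close predictions
disagree = consensus_label([1.0, 0.0], [0.0, 1.0])  # maximal disagreement
```

Two identical distributions give a divergence of zero, and two one-hot distributions on different classes give the maximum, log 2.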

Title: Informed, but Not Always Improved: Challenging the Benefit of Background Knowledge in GNNs

Authors: Kutalmış Coşkun, Ivo Kavisanczki, Amin Mirzaei, Tom Siegl, Bjarne C. Hiller, Stefan Lüdtke, Martin Becker
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.11023
Pdf URL: https://arxiv.org/pdf/2505.11023
Copy Paste: [[2505.11023]] Informed, but Not Always Improved: Challenging the Benefit of Background Knowledge in GNNs(https://arxiv.org/abs/2505.11023)
Keywords: robust
Abstract: In complex and low-data domains such as biomedical research, incorporating background knowledge (BK) graphs, such as protein-protein interaction (PPI) networks, into graph-based machine learning pipelines is a promising research direction. However, while BK is often assumed to improve model performance, its actual contribution and the impact of imperfect knowledge remain poorly understood. In this work, we investigate the role of BK in an important real-world task: cancer subtype classification. Surprisingly, we find that (i) state-of-the-art GNNs using BK perform no better than uninformed models like linear regression, and (ii) their performance remains largely unchanged even when the BK graph is heavily perturbed. To understand these unexpected results, we introduce an evaluation framework, which employs (i) a synthetic setting where the BK is clearly informative and (ii) a set of perturbations that simulate various imperfections in BK graphs. With this, we test the robustness of BK-aware models in both synthetic and real-world biomedical settings. Our findings reveal that careful alignment of GNN architectures and BK characteristics is necessary but holds the potential for significant performance improvements.

Title: OntoURL: A Benchmark for Evaluating Large Language Models on Symbolic Ontological Understanding, Reasoning and Learning

Authors: Xiao Zhang, Huiyuan Lai, Qianru Meng, Johan Bos
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11031
Pdf URL: https://arxiv.org/pdf/2505.11031
Copy Paste: [[2505.11031]] OntoURL: A Benchmark for Evaluating Large Language Models on Symbolic Ontological Understanding, Reasoning and Learning(https://arxiv.org/abs/2505.11031)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a range of natural language processing tasks, yet their ability to process structured symbolic knowledge remains underexplored. To address this gap, we propose a taxonomy of LLMs' ontological capabilities and introduce OntoURL, the first comprehensive benchmark designed to systematically evaluate LLMs' proficiency in handling ontologies -- formal, symbolic representations of domain knowledge through concepts, relationships, and instances. Based on the proposed taxonomy, OntoURL systematically assesses three dimensions: understanding, reasoning, and learning through 15 distinct tasks comprising 58,981 questions derived from 40 ontologies across 8 domains. Experiments with 20 open-source LLMs reveal significant performance differences across models, tasks, and domains, with current LLMs showing proficiency in understanding ontological knowledge but substantial weaknesses in reasoning and learning tasks. These findings highlight fundamental limitations in LLMs' capability to process symbolic knowledge and establish OntoURL as a critical benchmark for advancing the integration of LLMs with formal knowledge representations.

Title: CleanPatrick: A Benchmark for Image Data Cleaning

Authors: Fabian Gröger, Simone Lionetti, Philippe Gottfrois, Alvaro Gonzalez-Jimenez, Ludovic Amruthalingam, Elisabeth Victoria Goessinger, Hanna Lindemann, Marie Bargiela, Marie Hofbauer, Omar Badri, Philipp Tschandl, Arash Koochek, Matthew Groh, Alexander A. Navarini, Marc Pouly
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.11034
Pdf URL: https://arxiv.org/pdf/2505.11034
Copy Paste: [[2505.11034]] CleanPatrick: A Benchmark for Image Data Cleaning(https://arxiv.org/abs/2505.11034)
Keywords: robust
Abstract: Robust machine learning depends on clean data, yet current image data cleaning benchmarks rely on synthetic noise or narrow human studies, limiting comparison and real-world relevance. We introduce CleanPatrick, the first large-scale benchmark for data cleaning in the image domain, built upon the publicly available Fitzpatrick17k dermatology dataset. We collect 496,377 binary annotations from 933 medical crowd workers, identify off-topic samples (4%), near-duplicates (21%), and label errors (22%), and employ an aggregation model inspired by item-response theory followed by expert review to derive high-quality ground truth. CleanPatrick formalizes issue detection as a ranking task and adopts typical ranking metrics mirroring real audit workflows. Benchmarking classical anomaly detectors, perceptual hashing, SSIM, Confident Learning, NoiseRank, and SelfClean, we find that, on CleanPatrick, self-supervised representations excel at near-duplicate detection, classical methods achieve competitive off-topic detection under constrained review budgets, and label-error detection remains an open challenge for fine-grained medical classification. By releasing both the dataset and the evaluation framework, CleanPatrick enables a systematic comparison of image-cleaning strategies and paves the way for more reliable data-centric artificial intelligence.

Title: Deep Latent Variable Model based Vertical Federated Learning with Flexible Alignment and Labeling Scenarios

Authors: Kihun Hong, Sejun Park, Ganguk Hwang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.11035
Pdf URL: https://arxiv.org/pdf/2505.11035
Copy Paste: [[2505.11035]] Deep Latent Variable Model based Vertical Federated Learning with Flexible Alignment and Labeling Scenarios(https://arxiv.org/abs/2505.11035)
Keywords: federate
Abstract: Federated learning (FL) has attracted significant attention for enabling collaborative learning without exposing private data. Among the primary variants of FL, vertical federated learning (VFL) addresses feature-partitioned data held by multiple institutions, each holding complementary information for the same set of users. However, existing VFL methods often impose restrictive assumptions such as a small number of participating parties, fully aligned data, or only using labeled data. In this work, we reinterpret alignment gaps in VFL as missing data problems and propose a unified framework that accommodates both training and inference under arbitrary alignment and labeling scenarios, while supporting diverse missingness mechanisms. In the experiments on 168 configurations spanning four benchmark datasets, six training-time missingness patterns, and seven testing-time missingness patterns, our method outperforms all baselines in 160 cases with an average gap of 9.6 percentage points over the next-best competitors. To the best of our knowledge, this is the first VFL framework to jointly handle arbitrary data alignment, unlabeled data, and multi-party collaboration all at once.

Title: Efficient Attention via Pre-Scoring: Prioritizing Informative Keys in Transformers

Authors: Zhexiang Li, Haoyu Wang, Yutong Bao, David Woodruff
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.11040
Pdf URL: https://arxiv.org/pdf/2505.11040
Copy Paste: [[2505.11040]] Efficient Attention via Pre-Scoring: Prioritizing Informative Keys in Transformers(https://arxiv.org/abs/2505.11040)
Keywords: transformer
Abstract: Recent advances in transformer architectures have greatly enhanced long-context language modeling. Among them, HyperAttention achieves competitive efficiency by combining a single-level LSH-based clustering with uniform residual sampling. However, such sampling limits the capture of crucial keys, which in turn raises the overall perplexity. In this paper, we propose a pre-scoring mechanism that helps HyperAttention prioritize significant keys. Specifically, we introduce three scoring methods: K-means clustering, K-median clustering, and leverage score-based ranking (inspired by LevAttention) to filter keys effectively. We further replace HyperAttention's original uniform residual sampling entirely, relying exclusively on our pre-scoring mechanism. Experiments on ChatGLM2 (131k token context) reduce perplexity from 12 to 8.3, which outperforms standard HyperAttention. Moreover, when running on the Vision-Transformer (ViT), our method achieves accuracy comparable to LevAttention and surpasses it for specific parameter settings. Although this method introduces computational overhead, its combination with HyperAttention remains 20 times faster than FlashAttention, providing a balanced trade-off between speed and modeling accuracy. Our results highlight the effectiveness of integrating pre-scoring into hierarchical attention mechanisms, significantly improving Transformer efficiency.

Title: NeuralSurv: Deep Survival Analysis with Bayesian Uncertainty Quantification

Authors: Mélodie Monod, Alessandro Micheli, Samir Bhatt
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.11054
Pdf URL: https://arxiv.org/pdf/2505.11054
Copy Paste: [[2505.11054]] NeuralSurv: Deep Survival Analysis with Bayesian Uncertainty Quantification(https://arxiv.org/abs/2505.11054)
Keywords: robust
Abstract: We introduce NeuralSurv, the first deep survival model to incorporate Bayesian uncertainty quantification. Our non-parametric, architecture-agnostic framework flexibly captures time-varying covariate-risk relationships in continuous time via a novel two-stage data-augmentation scheme, for which we establish theoretical guarantees. For efficient posterior inference, we introduce a mean-field variational algorithm with coordinate-ascent updates that scale linearly in model size. By locally linearizing the Bayesian neural network, we obtain full conjugacy and derive all coordinate updates in closed form. In experiments, NeuralSurv delivers superior calibration compared to state-of-the-art deep survival models, while matching or exceeding their discriminative performance across both synthetic benchmarks and real-world datasets. Our results demonstrate the value of Bayesian principles in data-scarce regimes by enhancing model calibration and providing robust, well-calibrated uncertainty estimates for the survival function.

Title: Side Channel Analysis in Homomorphic Encryption

Authors: Baraq Ghaleb, William J Buchanan
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.11058
Pdf URL: https://arxiv.org/pdf/2505.11058
Copy Paste: [[2505.11058]] Side Channel Analysis in Homomorphic Encryption(https://arxiv.org/abs/2505.11058)
Keywords: privacy, attack
Abstract: Homomorphic encryption provides many opportunities for privacy-aware processing, including methods related to machine learning. Many existing cryptographic methods have been shown to be susceptible to side channel attacks, in which the implementation of a cryptographic method can reveal information about the private keys used, the result, or even the original plaintext. An example is the processing of the RSA exponent using the Montgomery method, where 0's and 1's differ in their processing time for modular exponentiation. Fully homomorphic encryption (FHE) typically uses lattice methods, whose implementations can have particular problems with side channel leakage. This paper aims to outline a range of weaknesses within FHE implementations as related to side channel analysis. It outlines a categorization for side-channel analysis, some case studies, and mitigation strategies.

Title: Assessing the Performance of Analog Training for Transfer Learning

Authors: Omobayode Fagbohungbe, Corey Lammie, Malte J. Rasch, Takashi Ando, Tayfun Gokmen, Vijay Narayanan
Subjects: cs.LG, cs.AI, cs.AR, cs.CV, cs.DC, cs.NE
Abstract URL: https://arxiv.org/abs/2505.11067
Pdf URL: https://arxiv.org/pdf/2505.11067
Copy Paste: [[2505.11067]] Assessing the Performance of Analog Training for Transfer Learning(https://arxiv.org/abs/2505.11067)
Keywords: robust
Abstract: Analog in-memory computing is a next-generation computing paradigm that promises fast, parallel, and energy-efficient deep learning training and transfer learning (TL). However, achieving this promise has remained elusive due to a lack of suitable training algorithms. Analog memory devices exhibit asymmetric and non-linear switching behavior in addition to device-to-device variation, meaning that most, if not all, of the current off-the-shelf training algorithms cannot achieve good training outcomes. Also, recently introduced algorithms have enjoyed limited attention, as they require bi-directionally switching devices of unrealistically high symmetry and precision and are highly sensitive. A new algorithm, chopped TTv2 (c-TTv2), has been introduced, which leverages the chopped technique to address many of the challenges mentioned above. In this paper, we assess the performance of the c-TTv2 algorithm for analog TL using a Swin-ViT model on a subset of the CIFAR100 dataset. We also investigate the robustness of our algorithm to changes in some device specifications, including weight transfer noise, symmetry point skew, and symmetry point variability.

Title: Towards Self-Improvement of Diffusion Models via Group Preference Optimization

Authors: Renjie Chen, Wenfeng Lin, Yichen Zhang, Jiangchuan Wei, Boyuan Liu, Chao Feng, Jiao Ran, Mingyu Guo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11070
Pdf URL: https://arxiv.org/pdf/2505.11070
Copy Paste: [[2505.11070]] Towards Self-Improvement of Diffusion Models via Group Preference Optimization(https://arxiv.org/abs/2505.11070)
Keywords: diffusion
Abstract: Aligning text-to-image (T2I) diffusion models with Direct Preference Optimization (DPO) has shown notable improvements in generation quality. However, applying DPO to T2I faces two challenges: the sensitivity of DPO to preference pairs and the labor-intensive process of collecting and annotating high-quality data. In this work, we demonstrate that preference pairs with marginal differences can degrade DPO performance. Since DPO relies exclusively on relative ranking while disregarding the absolute difference of pairs, it may misclassify losing samples as wins, or vice versa. We empirically show that extending the DPO from pairwise to groupwise and incorporating reward standardization for reweighting leads to performance gains without explicit data selection. Furthermore, we propose Group Preference Optimization (GPO), an effective self-improvement method that enhances performance by leveraging the model's own capabilities without requiring external data. Extensive experiments demonstrate that GPO is effective across various diffusion models and tasks. Specifically, combining with widely used computer vision models, such as YOLO and OCR, the GPO improves the accurate counting and text rendering capabilities of the Stable Diffusion 3.5 Medium by 20 percentage points. Notably, as a plug-and-play method, no extra overhead is introduced during inference.
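
Illustrative sketch (not from the paper): the abstract mentions moving from pairwise to groupwise comparison with reward standardization for reweighting. The snippet below shows only the generic standardization step, (r - mean) / std within one group of samples for the same prompt. The function name, the `eps` constant, and the use of the population standard deviation are assumptions; how GPO turns these values into loss weights is not stated in the abstract.

```python
import statistics

def standardize_rewards(rewards, eps=1e-8):
    """Return per-sample standardized rewards for one group of generations."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical reward scores for four images generated from one prompt.
weights = standardize_rewards([0.2, 0.4, 0.6, 0.8])
```

Samples above the group mean receive positive values and samples below it receive negative values, so the result reflects each sample's position within the group instead of a single win/lose ranking.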

Title: Pseudo-Label Quality Decoupling and Correction for Semi-Supervised Instance Segmentation

Authors: Jianghang Lin, Yilin Lu, Yunhang Shen, Chaoyang Zhu, Shengchuan Zhang, Liujuan Cao, Rongrong Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.11075
Pdf URL: https://arxiv.org/pdf/2505.11075
Copy Paste: [[2505.11075]] Pseudo-Label Quality Decoupling and Correction for Semi-Supervised Instance Segmentation(https://arxiv.org/abs/2505.11075)
Keywords: segmentation
Abstract: Semi-Supervised Instance Segmentation (SSIS) involves classifying and grouping image pixels into distinct object instances using limited labeled data. This learning paradigm usually faces a significant challenge of unstable performance caused by noisy pseudo-labels of instance categories and pixel masks. We find that the prevalent practice of filtering instance pseudo-labels by assessing both class and mask quality with a single score threshold frequently forces a compromise between the quality of class labels and the quality of mask labels. In this paper, we introduce a novel Pseudo-Label Quality Decoupling and Correction (PL-DC) framework for SSIS to tackle the above challenges. Firstly, at the instance level, a decoupled dual-threshold filtering mechanism is designed to decouple class and mask quality estimations for instance-level pseudo-labels, thereby independently controlling pixel classifying and grouping qualities. Secondly, at the category level, we introduce a dynamic instance category correction module to dynamically correct the pseudo-labels of instance categories, effectively alleviating category confusion. Lastly, at the pixel level, we introduce a mask uncertainty-aware mechanism to re-weight the mask loss for different pixels, thereby reducing the impact of noise introduced by pixel-level mask pseudo-labels. Extensive experiments on the COCO and Cityscapes datasets demonstrate that the proposed PL-DC achieves significant performance improvements, setting new state-of-the-art results for SSIS. Notably, our PL-DC shows substantial gains even with minimal labeled data, achieving an improvement of +11.6 mAP with just 1% COCO labeled data and +15.5 mAP with 5% Cityscapes labeled data. The code will be public.
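
Illustrative sketch (not from the paper): the idea of decoupled dual-threshold filtering can be pictured as giving each pseudo-labelled instance two independent decisions, one for its class label and one for its mask, instead of a single keep/drop decision. The data class, field names, and threshold values below are hypothetical, and the abstract does not say how PL-DC computes the two scores.

```python
from dataclasses import dataclass

@dataclass
class PseudoInstance:
    class_score: float  # assumed confidence of the predicted category
    mask_score: float   # assumed quality estimate of the predicted mask

def decoupled_filter(instances, t_cls, t_mask):
    """Return (use_class_label, use_mask_label) flags for each instance."""
    return [(inst.class_score >= t_cls, inst.mask_score >= t_mask)
            for inst in instances]

candidates = [
    PseudoInstance(class_score=0.95, mask_score=0.40),  # good class, poor mask
    PseudoInstance(class_score=0.30, mask_score=0.90),  # poor class, good mask
    PseudoInstance(class_score=0.90, mask_score=0.85),  # both usable
]
flags = decoupled_filter(candidates, t_cls=0.7, t_mask=0.6)
```

With separate thresholds, the first candidate can still supervise classification and the second can still supervise mask prediction, whereas a single combined threshold would have to keep or discard each one entirely.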

Title: Addition is almost all you need: Compressing neural networks with double binary factorization

Authors: Vladimír Boža, Vladimír Macko
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.11076
Pdf URL: https://arxiv.org/pdf/2505.11076
Copy Paste: [[2505.11076]] Addition is almost all you need: Compressing neural networks with double binary factorization(https://arxiv.org/abs/2505.11076)
Keywords: large language model
Abstract: Binary quantization approaches, which replace weight matrices with binary matrices and substitute costly multiplications with cheaper additions, offer a computationally efficient approach to address the increasing computational and storage requirements of Large Language Models (LLMs). However, the severe quantization constraint ($\pm1$) can lead to significant accuracy degradation. In this paper, we propose Double Binary Factorization (DBF), a novel method that factorizes dense weight matrices into products of two binary (sign) matrices, each accompanied by scaling vectors. DBF preserves the efficiency advantages of binary representations while achieving compression rates that are competitive with or superior to state-of-the-art methods. Specifically, in a 1-bit per weight range, DBF is better than existing binarization approaches. In a 2-bit per weight range, DBF is competitive with the best quantization methods like QuIP# and QTIP. Unlike most existing compression techniques, which offer limited compression level choices, DBF allows fine-grained control over compression ratios by adjusting the factorization's intermediate dimension. Based on this advantage, we further introduce an algorithm for estimating non-uniform layer-wise compression ratios for DBF, based on previously developed channel pruning criteria. Code available at: this https URL
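
Illustrative sketch (not from the paper): one way to read "two binary (sign) matrices, each accompanied by scaling vectors" is y = diag(a) A diag(m) B diag(b) x, with A and B holding only +1/-1 entries. The exact placement of the scaling vectors, all names, and the bit-count formula (which ignores the scaling vectors) are assumptions made for illustration. The snippet shows that the sign-matrix products need only additions and subtractions, and how the intermediate dimension sets the storage cost.

```python
def sign_matvec(signs, x):
    """Multiply a +1/-1 matrix by x using only additions and subtractions."""
    out = []
    for row in signs:
        acc = 0.0
        for s, xj in zip(row, x):
            acc = acc + xj if s > 0 else acc - xj
        out.append(acc)
    return out

def scale(v, x):
    """Elementwise scaling, i.e. diag(v) @ x."""
    return [vi * xi for vi, xi in zip(v, x)]

def dbf_matvec(a, A, m, B, b, x):
    """Assumed form: y = diag(a) A diag(m) B diag(b) x."""
    hidden = scale(m, sign_matvec(B, scale(b, x)))
    return scale(a, sign_matvec(A, hidden))

def bits_per_weight(n_out, n_in, n_mid):
    """Sign bits stored per original weight, ignoring the scaling vectors."""
    return n_mid * (n_out + n_in) / (n_out * n_in)

A = [[1, -1], [1, 1]]          # n_out x n_mid sign matrix
B = [[1, 1, -1], [-1, 1, 1]]   # n_mid x n_in sign matrix
y = dbf_matvec([0.5, 1.0], A, [1.0, 2.0], B, [1.0, 1.0, 1.0], [1.0, 2.0, 3.0])
# y == [-4.0, 8.0]
```

Under this accounting, a square 4096 x 4096 layer with an intermediate dimension of 2048 costs 1 bit per weight, and changing the intermediate dimension moves that figure in small steps.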

Title: ShiQ: Bringing back Bellman to LLMs

Authors: Pierre Clavier, Nathan Grinsztajn, Raphael Avalos, Yannis Flet-Berliac, Irem Ergun, Omar D. Domingues, Eugene Tarassov, Olivier Pietquin, Pierre H. Richemond, Florian Strub, Matthieu Geist
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.11081
Pdf URL: https://arxiv.org/pdf/2505.11081
Copy Paste: [[2505.11081]] ShiQ: Bringing back Bellman to LLMs(https://arxiv.org/abs/2505.11081)
Keywords: large language model
Abstract: The fine-tuning of pre-trained large language models (LLMs) using reinforcement learning (RL) is generally formulated as direct policy optimization. This approach was naturally favored as it efficiently improves a pretrained LLM, seen as an initial policy. Another RL paradigm, Q-learning methods, has received far less attention in the LLM community while demonstrating major success in various non-LLM RL tasks. In particular, Q-learning effectiveness comes from its sample efficiency and ability to learn offline, which is particularly valuable given the high computational cost of sampling with LLMs. However, naively applying a Q-learning-style update to the model's logits is ineffective due to the specificity of LLMs. Our core contribution is to derive theoretically grounded loss functions from Bellman equations to adapt Q-learning methods to LLMs. To do so, we carefully adapt insights from the RL literature to account for LLM-specific characteristics, ensuring that the logits become reliable Q-value estimates. We then use this loss to build a practical algorithm, ShiQ for Shifted-Q, that supports off-policy, token-wise learning while remaining simple to implement. Finally, we evaluate ShiQ on both synthetic data and real-world benchmarks, e.g., UltraFeedback and BFCL-V3, demonstrating its effectiveness in both single-turn and multi-turn LLM settings.

Title: Blockchain-Enabled Decentralized Privacy-Preserving Group Purchasing for Energy Plans

Authors: Sid Chi-Kin Chau, Yue Zhou
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.11094
Pdf URL: https://arxiv.org/pdf/2505.11094
Copy Paste: [[2505.11094]] Blockchain-Enabled Decentralized Privacy-Preserving Group Purchasing for Energy Plans(https://arxiv.org/abs/2505.11094)
Keywords: secure, privacy, fair
Abstract: Retail energy markets are increasingly consumer-oriented, thanks to a growing number of energy plans offered by a plethora of energy suppliers, retailers and intermediaries. To maximize the benefits of competitive retail energy markets, group purchasing is an emerging paradigm that aggregates consumers' purchasing power by coordinating switch decisions to specific energy providers for discounted energy plans. Traditionally, group purchasing is mediated by a trusted third party, which suffers from a lack of privacy and transparency. In this paper, we introduce a novel paradigm of decentralized privacy-preserving group purchasing, empowered by privacy-preserving blockchain and secure multi-party computation, to enable users to form a coalition for coordinated switch decisions in a decentralized manner, without a trusted third party. The coordinated switch decisions are determined by a competitive online algorithm, based on users' private consumption data and current energy plan tariffs. Remarkably, no private user consumption data is revealed to others in the online decision-making process, which is carried out in a transparently verifiable manner to eliminate fraud from dishonest users and supports fair mutual compensation by sharing the switching costs to incentivize group purchasing. We implemented our decentralized group purchasing solution as a smart contract on a Solidity-supported blockchain platform (e.g., Ethereum), and provide an extensive empirical evaluation.

Title: Towards Better Evaluation for Generated Patent Claims

Authors: Lekang Jiang, Pascal A Scherz, Stephan Goetz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11095
Pdf URL: https://arxiv.org/pdf/2505.11095
Copy Paste: [[2505.11095]] Towards Better Evaluation for Generated Patent Claims(https://arxiv.org/abs/2505.11095)
Keywords: protect, large language model
Abstract: Patent claims define the scope of protection and establish the legal boundaries of an invention. Drafting these claims is a complex and time-consuming process that usually requires the expertise of skilled patent attorneys, which can form a large access barrier for many small enterprises. To solve these challenges, researchers have investigated the use of large language models (LLMs) for automating patent claim generation. However, existing studies highlight inconsistencies between automated evaluation metrics and human expert assessments. To bridge this gap, we introduce Patent-CE, the first comprehensive benchmark for evaluating patent claims. Patent-CE includes comparative claim evaluations annotated by patent experts, focusing on five key criteria: feature completeness, conceptual clarity, terminology consistency, logical linkage, and overall quality. Additionally, we propose PatClaimEval, a novel multi-dimensional evaluation method specifically designed for patent claims. Our experiments demonstrate that PatClaimEval achieves the highest correlation with human expert evaluations across all assessment criteria among all tested metrics. This research provides the groundwork for more accurate evaluations of automated patent claim generation systems.

Title: Verifiably Forgotten? Gradient Differences Still Enable Data Reconstruction in Federated Unlearning

Authors: Fuyao Zhang, Wenjie Li, Yurong Hao, Xinyu Yan, Yang Cao, Wei Yang Bryan Lim
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.11097
Pdf URL: https://arxiv.org/pdf/2505.11097
Copy Paste: [[2505.11097]] Verifiably Forgotten? Gradient Differences Still Enable Data Reconstruction in Federated Unlearning(https://arxiv.org/abs/2505.11097)
Keywords: privacy, defense, attack, robust, extraction, federate
Abstract: Federated Unlearning (FU) has emerged as a critical compliance mechanism for data privacy regulations, requiring unlearned clients to provide verifiable Proof of Federated Unlearning (PoFU) to auditors upon data removal requests. However, we uncover a significant privacy vulnerability: when gradient differences are used as PoFU, honest-but-curious auditors may exploit mathematical correlations between gradient differences and forgotten samples to reconstruct the latter. Such reconstruction, if feasible, would face three key challenges: (i) restricted auditor access to client-side data, (ii) limited samples derivable from individual PoFU, and (iii) high-dimensional redundancy in gradient differences. To overcome these challenges, we propose Inverting Gradient difference to Forgotten data (IGF), a novel learning-based reconstruction attack framework that employs Singular Value Decomposition (SVD) for dimensionality reduction and feature extraction. IGF incorporates a tailored pixel-level inversion model optimized via a composite loss that captures both structural and semantic cues. This enables efficient and high-fidelity reconstruction of large-scale samples, surpassing existing methods. To counter this novel attack, we design an orthogonal obfuscation defense that preserves PoFU verification utility while preventing sensitive forgotten data reconstruction. Experiments across multiple datasets validate the effectiveness of the attack and the robustness of the defense. The code is available at this https URL.

Title: Hybrid-Emba3D: Geometry-Aware and Cross-Path Feature Hybrid Enhanced State Space Model for Point Cloud Classification

Authors: Bin Liu, Chunyang Wang, Xuelian Liu, Guan Xi, Ge Zhang, Ziteng Yao, Mengxue Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.11099
Pdf URL: https://arxiv.org/pdf/2505.11099
Copy Paste: [[2505.11099]] Hybrid-Emba3D: Geometry-Aware and Cross-Path Feature Hybrid Enhanced State Space Model for Point Cloud Classification(https://arxiv.org/abs/2505.11099)
Keywords: extraction, transformer
Abstract: The point cloud classification tasks face the dual challenge of efficiently extracting local geometric features while maintaining model complexity. The Mamba architecture utilizes the linear complexity advantage of state space models (SSMs) to overcome the computational bottleneck of Transformers while balancing global modeling capabilities. However, the inherent contradiction between its unidirectional dependency and the unordered nature of point clouds impedes modeling spatial correlation in local neighborhoods, thus constraining geometric feature extraction. This paper proposes Hybrid-Emba3D, a bidirectional Mamba model enhanced by geometry-feature coupling and cross-path feature hybridization. The Local geometric pooling with geometry-feature coupling mechanism significantly enhances local feature discriminative power via coordinated propagation and dynamic aggregation of geometric information between local center points and their neighborhoods, without introducing additional parameters. The designed Collaborative feature enhancer adopts dual-path hybridization, effectively handling local mutations and sparse key signals, breaking through the limitations of traditional SSM long-range modeling. Experimental results demonstrate that the proposed model achieves a new SOTA classification accuracy of 95.99% on ModelNet40 with only 0.03M additional.

Title: MAVOS-DD: Multilingual Audio-Video Open-Set Deepfake Detection Benchmark

Authors: Florinel-Alin Croitoru, Vlad Hondru, Marius Popescu, Radu Tudor Ionescu, Fahad Shahbaz Khan, Mubarak Shah
Subjects: cs.CV, cs.AI, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2505.11109
Pdf URL: https://arxiv.org/pdf/2505.11109
Copy Paste: [[2505.11109]] MAVOS-DD: Multilingual Audio-Video Open-Set Deepfake Detection Benchmark(https://arxiv.org/abs/2505.11109)
Keywords: generative
Abstract: We present the first large-scale open-set benchmark for multilingual audio-video deepfake detection. Our dataset comprises over 250 hours of real and fake videos across eight languages, with 60% of data being generated. For each language, the fake videos are generated with seven distinct deepfake generation models, selected based on the quality of the generated content. We organize the training, validation and test splits such that only a subset of the chosen generative models and languages are available during training, thus creating several challenging open-set evaluation setups. We perform experiments with various pre-trained and fine-tuned deepfake detectors proposed in recent literature. Our results show that state-of-the-art detectors are not currently able to maintain their performance levels when tested in our open-set scenarios. We publicly release our data and code at: this https URL.

Title: Deepfake Forensic Analysis: Source Dataset Attribution and Legal Implications of Synthetic Media Manipulation

Authors: Massimiliano Cassia, Luca Guarnera, Mirko Casu, Ignazio Zangara, Sebastiano Battiato
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.11110
Pdf URL: https://arxiv.org/pdf/2505.11110
Copy Paste: [[2505.11110]] Deepfake Forensic Analysis: Source Dataset Attribution and Legal Implications of Synthetic Media Manipulation(https://arxiv.org/abs/2505.11110)
Keywords: privacy, protect, generative
Abstract: Synthetic media generated by Generative Adversarial Networks (GANs) pose significant challenges in verifying authenticity and tracing dataset origins, raising critical concerns in copyright enforcement, privacy protection, and legal compliance. This paper introduces a novel forensic framework for identifying the training dataset (e.g., CelebA or FFHQ) of GAN-generated images through interpretable feature analysis. By integrating spectral transforms (Fourier/DCT), color distribution metrics, and local feature descriptors (SIFT), our pipeline extracts discriminative statistical signatures embedded in synthetic outputs. Supervised classifiers (Random Forest, SVM, XGBoost) achieve 98-99% accuracy in binary classification (real vs. synthetic) and multi-class dataset attribution across diverse GAN architectures (StyleGAN, AttGAN, GDWCT, StarGAN, and StyleGAN2). Experimental results highlight the dominance of frequency-domain features (DCT/FFT) in capturing dataset-specific artifacts, such as upsampling patterns and spectral irregularities, while color histograms reveal implicit regularization strategies in GAN training. We further examine legal and ethical implications, showing how dataset attribution can address copyright infringement, unauthorized use of personal data, and regulatory compliance under frameworks like GDPR and California's AB 602. Our framework advances accountability and governance in generative modeling, with applications in digital forensics, content moderation, and intellectual property litigation.

Title: FairSHAP: Preprocessing for Fairness Through Attribution-Based Data Augmentation

Authors: Lin Zhu, Yijun Bian, Lei You
Subjects: cs.LG, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2505.11111
Pdf URL: https://arxiv.org/pdf/2505.11111
Copy Paste: [[2505.11111]] FairSHAP: Preprocessing for Fairness Through Attribution-Based Data Augmentation(https://arxiv.org/abs/2505.11111)
Keywords: fair
Abstract: Ensuring fairness in machine learning models is critical, particularly in high-stakes domains where biased decisions can lead to serious societal consequences. Existing preprocessing approaches generally lack transparent mechanisms for identifying which features or instances are responsible for unfairness. This obscures the rationale behind data modifications. We introduce FairSHAP, a novel pre-processing framework that leverages Shapley value attribution to improve both individual and group fairness. FairSHAP identifies fairness-critical instances in the training data using an interpretable measure of feature importance, and systematically modifies them through instance-level matching across sensitive groups. This process reduces discriminative risk - an individual fairness metric - while preserving data integrity and model accuracy. We demonstrate that FairSHAP significantly improves demographic parity and equality of opportunity across diverse tabular datasets, achieving fairness gains with minimal data perturbation and, in some cases, improved predictive performance. As a model-agnostic and transparent method, FairSHAP integrates seamlessly into existing machine learning pipelines and provides actionable insights into the sources of bias. Our code is on this https URL.

Title: Dual-Balancing for Physics-Informed Neural Networks

Authors: Chenhong Zhou, Jie Chen, Zaifeng Yang, Ching Eng Png
Subjects: cs.LG, math.NA
Abstract URL: https://arxiv.org/abs/2505.11117
Pdf URL: https://arxiv.org/pdf/2505.11117
Copy Paste: [[2505.11117]] Dual-Balancing for Physics-Informed Neural Networks(https://arxiv.org/abs/2505.11117)
Keywords: robust
Abstract: Physics-informed neural networks (PINNs) have emerged as a new learning paradigm for solving partial differential equations (PDEs) by embedding the constraints of physical equations, boundary conditions (BCs), and initial conditions (ICs) into the loss function. Despite their successes, vanilla PINNs still suffer from poor accuracy and slow convergence due to the intractable multi-objective optimization issue. In this paper, we propose a novel Dual-Balanced PINN (DB-PINN), which dynamically adjusts loss weights by integrating inter-balancing and intra-balancing to alleviate two imbalance issues in PINNs. Inter-balancing aims to mitigate the gradient imbalance between PDE residual loss and condition-fitting losses by determining an aggregated weight that offsets their gradient distribution discrepancies. Intra-balancing acts on condition-fitting losses to tackle the imbalance in fitting difficulty across diverse conditions. By evaluating the fitting difficulty based on the loss records, intra-balancing can allocate the aggregated weight proportionally to each condition loss according to its fitting difficulty. We further introduce a robust weight update strategy to prevent abrupt spikes and arithmetic overflow in instantaneous weight values caused by large loss variances, enabling smooth weight updating and stable training. Extensive experiments demonstrate that DB-PINN significantly outperforms popular gradient-based weighting methods in terms of convergence speed and prediction accuracy. Our code and supplementary material are available at this https URL.

Title: FedDuA: Doubly Adaptive Federated Learning

Authors: Shokichi Takakura, Seng Pei Liew, Satoshi Hasegawa
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.11126
Pdf URL: https://arxiv.org/pdf/2505.11126
Copy Paste: [[2505.11126]] FedDuA: Doubly Adaptive Federated Learning(https://arxiv.org/abs/2505.11126)
Keywords: robust, federate
Abstract: Federated learning is a distributed learning framework where clients collaboratively train a global model without sharing their raw data. FedAvg is a popular algorithm for federated learning, but it often suffers from slow convergence due to the heterogeneity of local datasets and anisotropy in the parameter space. In this work, we formalize the central server optimization procedure through the lens of mirror descent and propose a novel framework, called FedDuA, which adaptively selects the global learning rate based on both inter-client and coordinate-wise heterogeneity in the local updates. We prove that our proposed doubly adaptive step-size rule is minimax optimal and provide a convergence analysis for convex objectives. Although the proposed method does not require additional communication or computational cost on clients, extensive numerical experiments show that our proposed framework outperforms baselines in various settings and is robust to the choice of hyperparameters.

Title: What's Inside Your Diffusion Model? A Score-Based Riemannian Metric to Explore the Data Manifold

Authors: Simone Azeglio, Arianna Di Bernardo
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.11128
Pdf URL: https://arxiv.org/pdf/2505.11128
Copy Paste: [[2505.11128]] What's Inside Your Diffusion Model? A Score-Based Riemannian Metric to Explore the Data Manifold(https://arxiv.org/abs/2505.11128)
Keywords: diffusion
Abstract: Recent advances in diffusion models have demonstrated their remarkable ability to capture complex image distributions, but the geometric properties of the learned data manifold remain poorly understood. We address this gap by introducing a score-based Riemannian metric that leverages the Stein score function from diffusion models to characterize the intrinsic geometry of the data manifold without requiring explicit parameterization. Our approach defines a metric tensor in the ambient space that stretches distances perpendicular to the manifold while preserving them along tangential directions, effectively creating a geometry where geodesics naturally follow the manifold's contours. We develop efficient algorithms for computing these geodesics and demonstrate their utility for both interpolation between data points and extrapolation beyond the observed data distribution. Through experiments on synthetic data with known geometry, Rotated MNIST, and complex natural images via Stable Diffusion, we show that our score-based geodesics capture meaningful transformations that respect the underlying data distribution. Our method consistently outperforms baseline approaches on perceptual metrics (LPIPS) and distribution-level metrics (FID, KID), producing smoother, more realistic image transitions. These results reveal the implicit geometric structure learned by diffusion models and provide a principled way to navigate the manifold of natural images through the lens of Riemannian geometry.
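
To make the "stretch perpendicular, preserve tangential" behaviour concrete, here is a minimal sketch. It assumes a metric of the form g = I + lam * s s^T, where s is the score vector and lam is a hypothetical strength parameter; the abstract does not give the exact formula, so this form is an illustration rather than the authors' definition. It also assumes that near the data the score points roughly normal to the manifold.

```python
import math

def metric_length(v, score, lam=1.0):
    """Length of a small step v under the ASSUMED metric g = I + lam * s s^T.

    v^T g v = |v|^2 + lam * (s . v)^2, so only the part of v that is
    aligned with the score s gets stretched.
    """
    sq_euclid = sum(vi * vi for vi in v)
    s_dot_v = sum(si * vi for si, vi in zip(score, v))
    return math.sqrt(sq_euclid + lam * s_dot_v * s_dot_v)

# Toy setup: the "manifold" is the x-axis, so the score points along y.
score = [0.0, 3.0]
along = metric_length([1.0, 0.0], score)   # tangential step, length unchanged
across = metric_length([0.0, 1.0], score)  # normal step, length stretched
```

In this toy case the tangential step keeps its Euclidean length while the normal step becomes longer, so shortest paths under such a metric prefer to stay close to the manifold.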

Title: PhiNet v2: A Mask-Free Brain-Inspired Vision Foundation Model from Video

Authors: Makoto Yamada, Kian Ming A. Chai, Ayoub Rhim, Satoki Ishikawa, Mohammad Sabokrou, Yao-Hung Hubert Tsai
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.11129
Pdf URL: https://arxiv.org/pdf/2505.11129
Copy Paste: [[2505.11129]] PhiNet v2: A Mask-Free Brain-Inspired Vision Foundation Model from Video(https://arxiv.org/abs/2505.11129)
Keywords: robust, transformer
Abstract: Recent advances in self-supervised learning (SSL) have revolutionized computer vision through innovative architectures and learning objectives, yet they have not fully leveraged insights from biological visual processing systems. Recently, a brain-inspired SSL model named PhiNet was proposed; it is based on a ResNet backbone and operates on static image inputs with strong augmentation. In this paper, we introduce PhiNet v2, a novel Transformer-based architecture that processes temporal visual input (that is, sequences of images) without relying on strong augmentation. Our model leverages variational inference to learn robust visual representations from continuous input streams, similar to human visual processing. Through extensive experimentation, we demonstrate that PhiNet v2 achieves competitive performance compared to state-of-the-art vision foundation models, while maintaining the ability to learn from sequential input without strong data augmentation. This work represents a significant step toward more biologically plausible computer vision systems that process visual information in a manner more closely aligned with human cognitive processes.

Title: One Image is Worth a Thousand Words: A Usability Preservable Text-Image Collaborative Erasing Framework

Authors: Feiran Li, Qianqian Xu, Shilong Bao, Zhiyong Yang, Xiaochun Cao, Qingming Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11131
Pdf URL: https://arxiv.org/pdf/2505.11131
Copy Paste: [[2505.11131]] One Image is Worth a Thousand Words: A Usability Preservable Text-Image Collaborative Erasing Framework(https://arxiv.org/abs/2505.11131)
Keywords: diffusion
Abstract: Concept erasing has recently emerged as an effective paradigm to prevent text-to-image diffusion models from generating visually undesirable or even harmful content. However, current removal methods heavily rely on manually crafted text prompts, making it challenging to achieve a high erasure (efficacy) while minimizing the impact on other benign concepts (usability). In this paper, we attribute the limitations to the inherent gap between the text and image modalities, which makes it hard to transfer the intricately entangled concept knowledge from text prompts to the image generation process. To address this, we propose a novel solution by directly integrating visual supervision into the erasure process, introducing the first text-image Collaborative Concept Erasing (Co-Erasing) framework. Specifically, Co-Erasing describes the concept jointly by text prompts and the corresponding undesirable images induced by the prompts, and then reduces the generating probability of the target concept through negative guidance. This approach effectively bypasses the knowledge gap between text and image, significantly enhancing erasure efficacy. Additionally, we design a text-guided image concept refinement strategy that directs the model to focus on visual features most relevant to the specified text concept, minimizing disruption to other benign concepts. Finally, comprehensive experiments suggest that Co-Erasing outperforms state-of-the-art erasure approaches significantly with a better trade-off between efficacy and usability. Codes are available at this https URL.

Title: Fairness-aware Anomaly Detection via Fair Projection

Authors: Feng Xiao, Xiaoying Tang, Jicong Fan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.11132
Pdf URL: https://arxiv.org/pdf/2505.11132
Copy Paste: [[2505.11132]] Fairness-aware Anomaly Detection via Fair Projection(https://arxiv.org/abs/2505.11132)
Keywords: security, fair
Abstract: Unsupervised anomaly detection is a critical task in many high-social-impact applications such as finance, healthcare, social media, and cybersecurity, where demographics involving age, gender, race, disease, etc., are used frequently. In these scenarios, possible bias from anomaly detection systems can lead to unfair treatment for different groups and even exacerbate social bias. In this work, first, we thoroughly analyze the feasibility and necessary assumptions for ensuring group fairness in unsupervised anomaly detection. Second, we propose a novel fairness-aware anomaly detection method FairAD. From the normal training data, FairAD learns a projection to map data of different demographic groups to a common target distribution that is simple and compact, and hence provides a reliable base to estimate the density of the data. The density can be directly used to identify anomalies while the common target distribution ensures fairness between different groups. Furthermore, we propose a threshold-free fairness metric that provides a global view of a model's fairness, eliminating dependence on manual threshold selection. Experiments on real-world benchmarks demonstrate that our method achieves an improved trade-off between detection accuracy and fairness under both balanced and skewed data across different groups.

Title: Towards Robust Spiking Neural Networks: Mitigating Heterogeneous Training Vulnerability via Dominant Eigencomponent Projection

Authors: Desong Zhang, Jia Hu, Geyong Min
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.11134
Pdf URL: https://arxiv.org/pdf/2505.11134
Copy Paste: [[2505.11134]] Towards Robust Spiking Neural Networks: Mitigating Heterogeneous Training Vulnerability via Dominant Eigencomponent Projection(https://arxiv.org/abs/2505.11134)
Keywords: robust
Abstract: Spiking Neural Networks (SNNs) process information via discrete spikes, enabling them to operate at remarkably low energy levels. However, our experimental observations reveal a striking vulnerability when SNNs are trained using the mainstream method--direct encoding combined with backpropagation through time (BPTT): even a single backward pass on data drawn from a slightly different distribution can lead to catastrophic network collapse. Our theoretical analysis attributes this vulnerability to the repeated inputs inherent in direct encoding and the gradient accumulation characteristic of BPTT, which together produce an exceptionally large Hessian spectral radius. To address this challenge, we develop a hyperparameter-free method called Dominant Eigencomponent Projection (DEP). By orthogonally projecting gradients to precisely remove their dominant components, DEP effectively reduces the Hessian spectral radius, thereby preventing SNNs from settling into sharp minima. Extensive experiments demonstrate that DEP not only mitigates the vulnerability of SNNs to heterogeneous data poisoning, but also significantly enhances overall robustness compared to key baselines, providing strong support for safer and more reliable SNN deployment.
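
The idea of orthogonally projecting a gradient to remove its dominant component can be sketched generically as follows. The abstract does not say how DEP defines or computes the dominant component, so this example makes an assumption: it takes the top right singular direction v of a gradient matrix G (found by power iteration) and returns G - (G v) v^T. The function names are hypothetical and this is not the authors' implementation.

```python
import math

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def dominant_right_vector(G, iters=200):
    """Power iteration on G^T G for the top right singular vector of G."""
    Gt = transpose(G)
    # Non-symmetric start vector; a start exactly orthogonal to the
    # dominant direction would need a different initialisation.
    v = [1.0 / (i + 1) for i in range(len(G[0]))]
    for _ in range(iters):
        v = matvec(Gt, matvec(G, v))
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

def remove_dominant_component(G):
    """Return (G - (G v) v^T, v): G projected orthogonally to direction v."""
    v = dominant_right_vector(G)
    Gv = matvec(G, v)
    G_proj = [[g - gv * vj for g, vj in zip(row, v)] for row, gv in zip(G, Gv)]
    return G_proj, v

# Toy gradient matrix whose dominant right direction is (1, 1) / sqrt(2).
G = [[3.0, 1.0], [1.0, 3.0], [0.5, 0.5]]
G_proj, v = remove_dominant_component(G)
```

After the projection, G_proj has no component left along v, which is the generic property an orthogonal projection of this kind provides.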

Title: Covariance Density Neural Networks

Authors: Om Roy, Yashar Moshfeghi, Keith Smith
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.11139
Pdf URL: https://arxiv.org/pdf/2505.11139
Copy Paste: [[2505.11139]] Covariance Density Neural Networks(https://arxiv.org/abs/2505.11139)
Keywords: robust
Abstract: Graph neural networks have re-defined how we model and predict on network data, but there is no consensus on how to choose the correct underlying graph structure on which to model signals. CoVariance Neural Networks (VNNs) address this issue by using the sample covariance matrix as a Graph Shift Operator (GSO). Here, we improve on the performance of VNNs by constructing a Density Matrix where we consider the sample Covariance matrix as a quasi-Hamiltonian of the system in the space of random variables. Crucially, using this density matrix as the GSO allows components of the data to be extracted at different scales, allowing enhanced discriminability and performance. We show that this approach allows explicit control of the stability-discriminability trade-off of the network, provides enhanced robustness to noise compared to VNNs, and outperforms them in useful real-life applications where the underlying covariance matrix is informative. In particular, we show that our model can achieve strong performance in subject-independent Brain Computer Interface EEG motor imagery classification, outperforming EEGNet while being faster. This shows how covariance density neural networks provide a basis for the notoriously difficult task of transferability of BCIs when evaluated on unseen individuals.
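
As an illustration of treating a covariance matrix as a Hamiltonian, the sketch below builds a thermal-style density matrix rho = exp(-beta * C) / Tr(exp(-beta * C)). The abstract does not state the exact construction, so the sign convention, the inverse-temperature parameter beta, and the function names are all assumptions. The matrix exponential uses a truncated Taylor series, which is adequate only for small matrices with modest norm.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def expm(A, terms=40):
    """Matrix exponential via truncated Taylor series (small norms only)."""
    n = len(A)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term becomes A^k / k!
        term = [[x / k for x in row] for row in matmul(term, A)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result

def density_matrix(C, beta=1.0):
    """ASSUMED form: exp(-beta * C) normalised to unit trace."""
    E = expm([[-beta * c for c in row] for row in C])
    Z = sum(E[i][i] for i in range(len(E)))
    return [[e / Z for e in row] for row in E]

# Toy 2x2 sample covariance matrix.
C = [[2.0, 0.5], [0.5, 1.0]]
rho = density_matrix(C, beta=0.5)
trace = rho[0][0] + rho[1][1]
```

Under this assumed form, varying beta changes how strongly the covariance eigen-directions are weighted relative to one another, which is one way to read the abstract's remark about extracting components at different scales.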

Title: Scaling Reasoning can Improve Factuality in Large Language Models

Authors: Mike Zhang, Johannes Bjerva, Russa Biswas
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11140
Pdf URL: https://arxiv.org/pdf/2505.11140
Copy Paste: [[2505.11140]] Scaling Reasoning can Improve Factuality in Large Language Models(https://arxiv.org/abs/2505.11140)
Keywords: large language model
Abstract: Recent studies on large language model (LLM) reasoning capabilities have demonstrated promising improvements in model performance by leveraging a lengthy thinking process and additional computational resources during inference, primarily in tasks involving mathematical reasoning (Muennighoff et al., 2025). However, it remains uncertain if longer reasoning chains inherently enhance factual accuracy, particularly beyond mathematical contexts. In this work, we thoroughly examine LLM reasoning within complex open-domain question-answering (QA) scenarios. We initially distill reasoning traces from advanced, large-scale reasoning models (QwQ-32B and DeepSeek-R1-671B), then fine-tune a variety of models ranging from smaller, instruction-tuned variants to larger architectures based on Qwen2.5. To enrich reasoning traces, we introduce factual information from knowledge graphs in the form of paths into our reasoning traces. Our experimental setup includes four baseline approaches and six different instruction-tuned models evaluated across a benchmark of six datasets, encompassing over 22.6K questions. Overall, we carry out 168 experimental runs and analyze approximately 1.7 million reasoning traces. Our findings indicate that, within a single run, smaller reasoning models achieve noticeable improvements in factual accuracy compared to their original instruction-tuned counterparts. Moreover, our analysis demonstrates that with added test-time compute and token budgets, factual accuracy consistently improves by 2-8%, further confirming the effectiveness of test-time scaling for enhancing performance and consequently improving reasoning accuracy in open-domain QA tasks. We release all the experimental artifacts for further research.

Title: Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans

Authors: Yansheng Qiu, Li Xiao, Zhaopan Xu, Pengfei Zhou, Zheng Wang, Kaipeng Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11141
Pdf URL: https://arxiv.org/pdf/2505.11141
Copy Paste: [[2505.11141]] Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans(https://arxiv.org/abs/2505.11141)
Keywords: large language model
Abstract: The goal of achieving Artificial General Intelligence (AGI) is to imitate humans and surpass them. Models such as OpenAI's o1, o3, and DeepSeek's R1 have demonstrated that large language models (LLMs) with human-like reasoning capabilities exhibit exceptional performance and are being gradually integrated into multimodal large language models (MLLMs). However, whether these models possess capabilities comparable to humans in handling reasoning tasks remains unclear at present. In this paper, we propose Human-Aligned Bench, a benchmark for fine-grained alignment of multimodal reasoning with human performance. Specifically, we collected 9,794 multimodal questions that solely rely on contextual reasoning, including bilingual (Chinese and English) multimodal questions and pure text-based questions, encompassing four question types: visual reasoning, definition judgment, analogical reasoning, and logical judgment. More importantly, each question is accompanied by human success rates and options that humans are prone to choosing incorrectly. Extensive experiments on the Human-Aligned Bench reveal notable differences between the performance of current MLLMs in multimodal reasoning and human performance. The findings on our benchmark provide insights into the development of the next-generation models.

Title: Learning Dense Hand Contact Estimation from Imbalanced Data

Authors: Daniel Sungho Jung, Kyoung Mu Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.11152
Pdf URL: https://arxiv.org/pdf/2505.11152
Copy Paste: [[2505.11152]] Learning Dense Hand Contact Estimation from Imbalanced Data(https://arxiv.org/abs/2505.11152)
Keywords: fair
Abstract: Hands are essential to human interaction, and understanding contact between hands and the world can promote comprehensive understanding of their function. Recently, there has been a growing number of hand interaction datasets that cover interaction with objects, other hands, scenes, and the body. Despite the significance of the task and the increasing availability of high-quality data, how to effectively learn dense hand contact estimation remains largely underexplored. There are two major challenges for learning dense hand contact estimation. First, hand contact datasets exhibit a class imbalance issue, where the majority of samples are not in contact. Second, hand contact datasets contain a spatial imbalance issue, with most hand contact occurring at the fingertips, which makes it challenging to generalize to contacts in other hand regions. To tackle these issues, we present a framework that learns dense HAnd COntact estimation (HACO) from imbalanced data. To resolve the class imbalance issue, we introduce balanced contact sampling, which builds and samples from multiple sampling groups that fairly represent diverse contact statistics for both contact and non-contact samples. Moreover, to address the spatial imbalance issue, we propose a vertex-level class-balanced (VCB) loss, which incorporates the spatially varying contact distribution by separately reweighting the loss contribution of each vertex based on its contact frequency across the dataset. As a result, we effectively learn to predict dense hand contact with large-scale hand contact data without suffering from the class and spatial imbalance issues. The code will be released.

Title: Bi-directional Recurrence Improves Transformer in Partially Observable Markov Decision Processes

Authors: Ashok Arora, Neetesh Kumar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.11153
Pdf URL: https://arxiv.org/pdf/2505.11153
Copy Paste: [[2505.11153]] Bi-directional Recurrence Improves Transformer in Partially Observable Markov Decision Processes(https://arxiv.org/abs/2505.11153)
Keywords: transformer
Abstract: In real-world reinforcement learning (RL) scenarios, agents often encounter partial observability, where incomplete or noisy information obscures the true state of the environment. Partially Observable Markov Decision Processes (POMDPs) are commonly used to model these environments, but effective performance requires memory mechanisms to utilise past observations. While recurrent networks have traditionally addressed this need, transformer-based models have recently shown improved sample efficiency in RL tasks. However, their application to POMDPs remains underdeveloped, and their real-world deployment is constrained due to the high parameter count. This work introduces a novel bi-recurrent model architecture that improves sample efficiency and reduces model parameter count in POMDP scenarios. The architecture replaces the multiple feed-forward layers with a single layer of bi-directional recurrence units to better capture and utilize sequential dependencies and contextual information. This approach improves the model's ability to handle partial observability and increases sample efficiency, enabling effective learning from comparatively fewer interactions. To evaluate the performance of the proposed model architecture, experiments were conducted on a total of 23 POMDP environments. The proposed model architecture outperforms existing transformer-based, attention-based, and recurrence-based methods by a margin ranging from 87.39% to 482.04% on average across the 23 POMDP environments.

Title: MPMA: Preference Manipulation Attack Against Model Context Protocol

Authors: Zihan Wang, Hongwei Li, Rui Zhang, Yu Liu, Wenbo Jiang, Wenshu Fan, Qingchuan Zhao, Guowen Xu
Subjects: cs.CR, cs.CL
Abstract URL: https://arxiv.org/abs/2505.11154
Pdf URL: https://arxiv.org/pdf/2505.11154
Copy Paste: [[2505.11154]] MPMA: Preference Manipulation Attack Against Model Context Protocol(https://arxiv.org/abs/2505.11154)
Keywords: security, defense, attack, robust, steal, fair, large language model
Abstract: Model Context Protocol (MCP) standardizes interface mapping for large language models (LLMs) to access external data and tools, which revolutionizes the paradigm of tool selection and facilitates the rapid expansion of the LLM agent tool ecosystem. However, as the MCP is increasingly adopted, third-party customized versions of the MCP server expose potential security vulnerabilities. In this paper, we first introduce a novel security threat, which we term the MCP Preference Manipulation Attack (MPMA). An attacker deploys a customized MCP server to manipulate LLMs, causing them to prioritize it over other competing MCP servers. This can result in economic benefits for attackers, such as revenue from paid MCP services or advertising income generated from free servers. To achieve MPMA, we first design a Direct Preference Manipulation Attack (DPMA) that achieves significant effectiveness by inserting manipulative words and phrases into the tool name and description. However, such a direct modification is obvious to users and lacks stealthiness. To address these limitations, we further propose the Genetic-based Advertising Preference Manipulation Attack (GAPMA). GAPMA employs four commonly used strategies to initialize descriptions and integrates a Genetic Algorithm (GA) to enhance stealthiness. The experimental results demonstrate that GAPMA balances high effectiveness and stealthiness. Our study reveals a critical vulnerability of the MCP in open ecosystems, highlighting an urgent need for robust defense mechanisms to ensure the fairness of the MCP ecosystem.

Title: Attention on the Sphere

Authors: Boris Bonev, Max Rietmann, Andrea Paris, Alberto Carpentieri, Thorsten Kurth
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11157
Pdf URL: https://arxiv.org/pdf/2505.11157
Copy Paste: [[2505.11157]] Attention on the Sphere(https://arxiv.org/abs/2505.11157)
Keywords: transformer, segmentation
Abstract: We introduce a generalized attention mechanism for spherical domains, enabling Transformer architectures to natively process data defined on the two-dimensional sphere - a critical need in fields such as atmospheric physics, cosmology, and robotics, where preserving spherical symmetries and topology is essential for physical accuracy. By integrating numerical quadrature weights into the attention mechanism, we obtain a geometrically faithful spherical attention that is approximately rotationally equivariant, providing strong inductive biases and leading to better performance than Cartesian approaches. To further enhance both scalability and model performance, we propose neighborhood attention on the sphere, which confines interactions to geodesic neighborhoods. This approach reduces computational complexity and introduces the additional inductive bias for locality, while retaining the symmetry properties of our method. We provide optimized CUDA kernels and memory-efficient implementations to ensure practical applicability. The method is validated on three diverse tasks: simulating shallow water equations on the rotating sphere, spherical image segmentation, and spherical depth estimation. Across all tasks, our spherical Transformers consistently outperform their planar counterparts, highlighting the advantage of geometric priors for learning on spherical domains.
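
One way to read "integrating numerical quadrature weights into the attention mechanism" is that each key point on the sphere contributes to the softmax in proportion to the surface area it represents. The sketch below implements that reading: out_i = sum_j w_j exp(q_i . k_j) v_j / sum_j w_j exp(q_i . k_j). The function name, the absence of a 1/sqrt(d) scaling, and the toy weights are assumptions; this is not the paper's CUDA implementation.

```python
import math

def spherical_attention(queries, keys, values, quad_weights):
    """Softmax attention where key point j carries quadrature weight w_j.

    With uniform weights this reduces to ordinary softmax attention.
    """
    dim = len(values[0])
    out = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) for k in keys]
        m = max(logits)  # subtract the max for numerical stability
        scores = [w * math.exp(l - m) for w, l in zip(quad_weights, logits)]
        total = sum(scores)
        out.append(
            [sum(s * v[d] for s, v in zip(scores, values)) / total for d in range(dim)]
        )
    return out

# Toy check: identical keys give equal logits, so the output is just the
# quadrature-weighted average of the values.
queries = [[1.0, 0.0]]
keys = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
values = [[1.0], [2.0], [3.0]]
weights = [0.25, 0.5, 0.25]  # hypothetical quadrature weights
result = spherical_attention(queries, keys, values, weights)
```

On a latitude-longitude grid, such weights would typically shrink towards the poles, which stops the densely sampled polar rows from dominating the attention sum.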

Title: SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization

Authors: Huashan Sun, Shengyi Liao, Yansen Han, Yu Bai, Yang Gao, Cheng Fu, Weizhou Shen, Fanqi Wan, Ming Yan, Ji Zhang, Fei Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11166
Pdf URL: https://arxiv.org/pdf/2505.11166
Copy Paste: [[2505.11166]] SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization(https://arxiv.org/abs/2505.11166)
Keywords: large language model
Abstract: Despite advances in pretraining with extended context lengths, large language models (LLMs) still face challenges in effectively utilizing real-world long-context information, primarily due to insufficient long-context alignment caused by data quality issues, training inefficiencies, and the lack of well-designed optimization objectives. To address these limitations, we propose a framework named Short-to-Long Preference Optimization (SoLoPO), decoupling long-context preference optimization (PO) into two components: short-context PO and short-to-long reward alignment (SoLo-RA), supported by both theoretical and empirical evidence. Specifically, short-context PO leverages preference pairs sampled from short contexts to enhance the model's contextual knowledge utilization ability. Meanwhile, SoLo-RA explicitly encourages reward score consistency for the responses when conditioned on both short and long contexts that contain identical task-relevant information. This facilitates transferring the model's ability to handle short contexts into long-context scenarios. SoLoPO is compatible with mainstream preference optimization algorithms, while substantially improving the efficiency of data construction and training processes. Experimental results show that SoLoPO enhances all these algorithms with respect to stronger length and domain generalization abilities across various long-context benchmarks, while achieving notable improvements in both computational and memory efficiency.

Title: CheX-DS: Improving Chest X-ray Image Classification with Ensemble Learning Based on DenseNet and Swin Transformer

Authors: Xinran Li, Yu Liu, Xiujuan Xu, Xiaowei Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11168
Pdf URL: https://arxiv.org/pdf/2505.11168
Copy Paste: [[2505.11168]] CheX-DS: Improving Chest X-ray Image Classification with Ensemble Learning Based on DenseNet and Swin Transformer(https://arxiv.org/abs/2505.11168)
Keywords: transformer
Abstract: The automatic diagnosis of chest diseases is a popular and challenging task. Most current methods are based on convolutional neural networks (CNNs), which focus on local features while neglecting global features. Recently, self-attention mechanisms have been introduced into the field of computer vision, demonstrating superior performance. Therefore, this paper proposes an effective model, CheX-DS, for classifying long-tail multi-label data in the medical field of chest X-rays. The model is based on the excellent CNN model DenseNet for medical imaging and the newly popular Swin Transformer model, utilizing ensemble deep learning techniques to combine the two models and leverage the advantages of both CNNs and Transformers. The loss function of CheX-DS combines weighted binary cross-entropy loss with asymmetric loss, effectively addressing the issue of data imbalance. The NIH ChestX-ray14 dataset is selected to evaluate the model's effectiveness. The model outperforms previous studies with an excellent average AUC score of 83.76%, demonstrating its superior performance.
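
A combination of weighted binary cross-entropy and asymmetric loss for multi-label data might look like the sketch below. The abstract does not say how the two terms are combined or which hyperparameters are used, so the mixing factor alpha, the per-label positive weights, and the gamma and margin values are all assumptions. The asymmetric term follows the commonly used formulation with a probability margin for negatives.

```python
import math

EPS = 1e-8  # avoids log(0)

def weighted_bce(p, y, pos_weight):
    """Binary cross-entropy with the positive term scaled by pos_weight."""
    return -(pos_weight * y * math.log(p + EPS) + (1 - y) * math.log(1 - p + EPS))

def asymmetric_loss(p, y, gamma_pos=0.0, gamma_neg=4.0, margin=0.05):
    """Focal-style loss that down-weights easy negatives more than positives."""
    p_neg = max(p - margin, 0.0)  # shift: very easy negatives contribute nothing
    pos = y * (1 - p) ** gamma_pos * math.log(p + EPS)
    neg = (1 - y) * p_neg ** gamma_neg * math.log(1 - p_neg + EPS)
    return -(pos + neg)

def combined_loss(probs, labels, pos_weights, alpha=0.5):
    """ASSUMED combination: convex mix of the two losses, averaged over labels."""
    per_label = [
        alpha * weighted_bce(p, y, w) + (1 - alpha) * asymmetric_loss(p, y)
        for p, y, w in zip(probs, labels, pos_weights)
    ]
    return sum(per_label) / len(per_label)

# One image, three disease labels; the rare second label gets a larger weight.
loss = combined_loss(probs=[0.9, 0.2, 0.03], labels=[1, 1, 0], pos_weights=[1.0, 5.0, 1.0])
```

The positive weights address how rarely a label occurs, while the asymmetric term limits the gradient contribution of the many easy negatives, which is why the two are plausible complements for long-tail multi-label data.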

Title: Gaussian Weight Sampling for Scalable, Efficient and Stable Pseudo-Quantization Training

Authors: Myeonghwan Ahn, Sungjoo Yoo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.11170
Pdf URL: https://arxiv.org/pdf/2505.11170
Copy Paste: [[2505.11170]] Gaussian Weight Sampling for Scalable, Efficient and Stable Pseudo-Quantization Training(https://arxiv.org/abs/2505.11170)
Keywords: large language model
Abstract: The ever-growing scale of large language models (LLMs) is pushing for improved efficiency, favoring fully quantized training (FQT) over BF16. While FQT accelerates training, it faces consistency challenges and requires searching over an exponential number of cases, each needing over 200B tokens to ensure stability. Pseudo-quantization training (PQT) addresses the issues of FQT, although it is not well-studied. We explore the practical implications of PQT in detail and propose a noise distribution $R$ that is floating-point (FP)-friendly, with ideal properties including stochastic precision annealing. As a result, the proposed method serves as an effective theoretical foundation for low-precision FP parameters through PQT, utilizing efficient fake quantization via an addition and subsequent FP casting. We demonstrate that Gaussian weight sampling is (1) scalable: supports low-precision FP parameters down to FP6 and high-precision noise up to 9-bit with BF16 operators. The proposed method is (2) efficient: incurring computational overhead as low as 1.40% on the A100 GPU in terms of Llama2 training tokens per second, and requiring 2 bytes per parameter in GPU memory. We demonstrate that PQT with Gaussian weight sampling is (3) stable: closely following or even surpassing the performance of the BF16 baseline while pre-training GPT2 and Llama2 models with up to 1B parameters and 300B tokens.
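
The phrase "fake quantization via an addition and subsequent FP casting" can be illustrated with a small sketch: add Gaussian noise to a weight, then round the sum to a lower-precision float. The noise scale, the function names, and the choice of BF16 as the cast target are assumptions for illustration; the paper's noise distribution R is not specified in the abstract. The BF16 cast is emulated in pure Python by rounding the float32 bit pattern to its top 16 bits (ties to even), and NaN inputs are not handled.

```python
import random
import struct

def to_bf16(x):
    """Round a float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # Standard round-to-nearest-even trick on the 16 discarded bits.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFFFFFF))[0]

def pseudo_quantize(w, noise_scale, rng):
    """ASSUMED form: add Gaussian noise, then cast the sum to BF16."""
    return to_bf16(w + noise_scale * rng.gauss(0.0, 1.0))

rng = random.Random(0)  # fixed seed for a repeatable sketch
weights = [0.1, -1.5, 3.25]
noisy = [pseudo_quantize(w, noise_scale=0.01, rng=rng) for w in weights]
```

Because the only extra work per weight is one addition and one cast, a scheme of this shape is cheap compared with explicit quantize and dequantize steps.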

Title: Low-Resource Language Processing: An OCR-Driven Summarization and Translation Pipeline

Authors: Hrishit Madhavi, Jacob Cherian, Yuvraj Khamkar, Dhananjay Bhagat
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11177
Pdf URL: https://arxiv.org/pdf/2505.11177
Copy Paste: [[2505.11177]] Low-Resource Language Processing: An OCR-Driven Summarization and Translation Pipeline(https://arxiv.org/abs/2505.11177)
Keywords: extraction, transformer, large language model
Abstract: This paper presents an end-to-end suite for multilingual information extraction and processing from image-based documents. The system uses Optical Character Recognition (Tesseract) to extract text in languages such as English, Hindi, and Tamil, and then a pipeline involving large language model APIs (Gemini) for cross-lingual translation, abstractive summarization, and re-translation into a target language. Additional modules add sentiment analysis (TensorFlow), topic classification (Transformers), and date extraction (Regex) for better document comprehension. Made available in an accessible Gradio interface, the current research shows a real-world application of libraries, models, and APIs to close the language gap and enhance access to information in image media across different linguistic environments.

Title: CompAlign: Improving Compositional Text-to-Image Generation with a Complex Benchmark and Fine-Grained Feedback

Authors: Yixin Wan, Kai-Wei Chang
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.11178
Pdf URL: https://arxiv.org/pdf/2505.11178
Copy Paste: [[2505.11178]] CompAlign: Improving Compositional Text-to-Image Generation with a Complex Benchmark and Fine-Grained Feedback(https://arxiv.org/abs/2505.11178)
Keywords: diffusion
Abstract: State-of-the-art T2I models are capable of generating high-resolution images given textual prompts. However, they still struggle with accurately depicting compositional scenes that specify multiple objects, attributes, and spatial relations. We present CompAlign, a challenging benchmark with an emphasis on assessing the depiction of 3D-spatial relationships, for evaluating and improving models on compositional image generation. CompAlign consists of 900 complex multi-subject image generation prompts that combine numerical and 3D-spatial relationships with varied attribute bindings. Our benchmark is remarkably challenging, incorporating generation tasks with 3+ generation subjects with complex 3D-spatial relationships. Additionally, we propose CompQuest, an interpretable and accurate evaluation framework that decomposes complex prompts into atomic sub-questions, then utilizes an MLLM to provide fine-grained binary feedback on the correctness of each aspect of generation elements in model-generated images. This enables precise quantification of alignment between generated images and compositional prompts. Furthermore, we propose an alignment framework that uses CompQuest's feedback as preference signals to improve diffusion models' compositional image generation abilities. Using adjustable per-image preferences, our method is easily scalable and flexible for different tasks. Evaluation of 9 T2I models reveals that: (1) models struggle markedly more with compositional tasks with more complex 3D-spatial configurations, and (2) a noticeable performance gap exists between open-source accessible models and closed-source commercial models. Further empirical study on using CompAlign for model alignment yields promising results: post-alignment diffusion models achieve remarkable improvements in compositional accuracy, especially on complex generation tasks, outperforming previous approaches.

Title: Imputation-free and Alignment-free: Incomplete Multi-view Clustering Driven by Consensus Semantic Learning

Authors: Yuzhuo Dai, Jiaqi Jin, Zhibin Dong, Siwei Wang, Xinwang Liu, En Zhu, Xihong Yang, Xinbiao Gan, Yu Feng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11182
Pdf URL: https://arxiv.org/pdf/2505.11182
Copy Paste: [[2505.11182]] Imputation-free and Alignment-free: Incomplete Multi-view Clustering Driven by Consensus Semantic Learning(https://arxiv.org/abs/2505.11182)
Keywords: robust
Abstract: In incomplete multi-view clustering (IMVC), missing data induce prototype shifts within views and semantic inconsistencies across views. A feasible solution is to explore cross-view consistency in paired complete observations, further imputing and aligning the similarity relationships inherently shared across views. Nevertheless, existing methods are constrained by two-tiered limitations: (1) Neither instance- nor cluster-level consistency learning constructs a semantic space shared across views to learn consensus semantics. The former enforces cross-view instance alignment, and wrongly regards unpaired observations with semantic consistency as negative pairs; the latter focuses on cross-view cluster counterparts while coarsely handling fine-grained intra-cluster relationships within views. (2) Excessive reliance on consistency results in unreliable imputation and alignment without incorporating view-specific cluster information. Thus, we propose an IMVC framework, imputation- and alignment-free for consensus semantics learning (FreeCSL). To bridge semantic gaps across all observations, we learn consensus prototypes from available data to discover a shared space, where semantically similar observations are pulled closer for consensus semantics learning. To capture semantic relationships within specific views, we design a heuristic graph clustering based on modularity to recover cluster structure with intra-cluster compactness and inter-cluster separation for cluster semantics enhancement. Extensive experiments demonstrate that, compared to state-of-the-art competitors, FreeCSL achieves more confident and robust assignments on the IMVC task.

Title: FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Pretraining

Authors: Myunsoo Kim, Seong-Woong Shim, Byung-Jun Lee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11192
Pdf URL: https://arxiv.org/pdf/2505.11192
Copy Paste: [[2505.11192]] FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Pretraining(https://arxiv.org/abs/2505.11192)
Keywords: robust
Abstract: False negatives pose a critical challenge in vision-language pretraining (VLP) due to the many-to-many correspondence between images and texts in large-scale datasets. These false negatives introduce conflicting supervision signals that degrade the learned embedding space and diminish the effectiveness of hard negative sampling. In this paper, we propose FALCON (False-negative Aware Learning of COntrastive Negatives), a learning-based mini-batch construction strategy that adaptively balances the trade-off between hard and false negatives during VLP. Rather than relying on fixed heuristics, FALCON employs a negative mining scheduler that dynamically selects negative samples of appropriate hardness for each anchor instance during mini-batch construction, guided by a proxy for cross-modal alignment improvement. Experimental results demonstrate that FALCON significantly improves performance across two widely adopted VLP frameworks (ALBEF, BLIP-2) and a broad range of downstream tasks and evaluation settings, underscoring its effectiveness and robustness in mitigating the impact of false negatives.

Title: DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling

Authors: Yuang Ai, Qihang Fan, Xuefeng Hu, Zhenheng Yang, Ran He, Huaibo Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.11196
Pdf URL: https://arxiv.org/pdf/2505.11196
Copy Paste: [[2505.11196]] DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling(https://arxiv.org/abs/2505.11196)
Keywords: diffusion, transformer, generative
Abstract: Diffusion Transformer (DiT), a promising diffusion model for visual generation, demonstrates impressive performance but incurs significant computational overhead. Intriguingly, analysis of pre-trained DiT models reveals that global self-attention is often redundant, predominantly capturing local patterns, which highlights the potential for more efficient alternatives. In this paper, we revisit convolution as an alternative building block for constructing efficient and expressive diffusion models. However, naively replacing self-attention with convolution typically results in degraded performance. Our investigations attribute this performance gap to the higher channel redundancy in ConvNets compared to Transformers. To resolve this, we introduce a compact channel attention mechanism that promotes the activation of more diverse channels, thereby enhancing feature diversity. This leads to Diffusion ConvNet (DiCo), a family of diffusion models built entirely from standard ConvNet modules, offering strong generative performance with significant efficiency gains. On class-conditional ImageNet benchmarks, DiCo outperforms previous diffusion models in both image quality and generation speed. Notably, DiCo-XL achieves an FID of 2.05 at 256x256 resolution and 2.53 at 512x512, with a 2.7x and 3.1x speedup over DiT-XL/2, respectively. Furthermore, our largest model, DiCo-H, scaled to 1B parameters, reaches an FID of 1.90 on ImageNet 256x256, without any additional supervision during training. Code: this https URL.

Title: NoPE: The Counting Power of Transformers with No Positional Encodings

Authors: Chris Köcher, Alexander Kozachinskiy, Anthony Widjaja Lin, Marco Sälzer, Georg Zetzsche
Subjects: cs.CL, cs.FL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.11199
Pdf URL: https://arxiv.org/pdf/2505.11199
Copy Paste: [[2505.11199]] NoPE: The Counting Power of Transformers with No Positional Encodings(https://arxiv.org/abs/2505.11199)
Keywords: transformer
Abstract: Positional Encodings (PEs) seem to be indispensable for ensuring expressiveness of transformers; without them attention transformers reduce to a bag-of-words model. NoPE-transformers (i.e. with No PEs) with unique hard attention mechanisms were very recently shown to only be able to express regular languages, i.e., with limited counting ability. This paper shows that, with average hard attention mechanisms, NoPE-transformers are still surprisingly expressive: they can express counting languages corresponding to nonnegative integer solutions to multivariate polynomial equations (i.e. Diophantine equations), reasoning about which is well-known to be undecidable. In fact, we provide a precise characterization of languages expressible by Average Hard Attention NoPE-Transformers (NoPE-AHATs): they correspond precisely to what we call semi-algebraic sets, i.e., finite unions of sets of nonnegative integer solutions to systems of multivariate polynomial inequations. We obtain several interesting consequences of our characterization. Firstly, NoPE-transformers can express counting properties that are far more complex than established models like simplified counter machines and Petri nets, but cannot express a very simple counting property of PARITY. Secondly, the problem of analyzing NoPE-transformers is undecidable, e.g., whether a given NoPE transformer classifies all input strings in one class. To complement our results, we exhibit a counting language that is not expressible by average hard attention transformers even with arbitrary PEs but is expressible in the circuit complexity class TC$^0$, answering an open problem.
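
To make "average hard attention" concrete, the sketch below averages the value vectors over all positions that attain the maximum attention score. It is a toy illustration of the general mechanism only, not the paper's construction; the function name and the one-hot encoding of the string are choices made for this example.

```python
import numpy as np

def average_hard_attention(scores, values):
    """Return the mean of the value vectors at every position whose
    score equals the maximum score (ties are averaged, not broken)."""
    scores = np.asarray(scores, dtype=float)
    values = np.asarray(values, dtype=float)
    is_max = scores == scores.max()
    return values[is_max].mean(axis=0)

# Usage: the string "aab" with one-hot symbol embeddings and no positional
# information. With all scores tied, the output is the symbol frequencies.
values = [[1, 0], [1, 0], [0, 1]]   # a, a, b
freq = average_hard_attention([0.0, 0.0, 0.0], values)
```

With tied scores the result depends only on how often each symbol occurs, which is the kind of counting information a transformer without positional encodings can access.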

Title: Minimizing False-Positive Attributions in Explanations of Non-Linear Models

Authors: Anders Gjølbye, Stefan Haufe, Lars Kai Hansen
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.11210
Pdf URL: https://arxiv.org/pdf/2505.11210
Copy Paste: [[2505.11210]] Minimizing False-Positive Attributions in Explanations of Non-Linear Models(https://arxiv.org/abs/2505.11210)
Keywords: generative
Abstract: Suppressor variables can influence model predictions without being dependent on the target outcome and they pose a significant challenge for Explainable AI (XAI) methods. These variables may cause false-positive feature attributions, undermining the utility of explanations. Although effective remedies exist for linear models, their extension to non-linear models and to instance-based explanations has remained limited. We introduce PatternLocal, a novel XAI technique that addresses this gap. PatternLocal begins with a locally linear surrogate, e.g. LIME, KernelSHAP, or gradient-based methods, and transforms the resulting discriminative model weights into a generative representation, thereby suppressing the influence of suppressor variables while preserving local fidelity. In extensive hyperparameter optimization on the XAI-TRIS benchmark, PatternLocal consistently outperformed other XAI methods and reduced false-positive attributions when explaining non-linear tasks, thereby enabling more reliable and actionable insights.

Title: HAPO: Training Language Models to Reason Concisely via History-Aware Policy Optimization

Authors: Chengyu Huang, Zhengxin Zhang, Claire Cardie
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.11225
Pdf URL: https://arxiv.org/pdf/2505.11225
Copy Paste: [[2505.11225]] HAPO: Training Language Models to Reason Concisely via History-Aware Policy Optimization(https://arxiv.org/abs/2505.11225)
Keywords: large language model
Abstract: While scaling the length of responses at test-time has been shown to markedly improve the reasoning abilities and performance of large language models (LLMs), it often results in verbose outputs and increases inference cost. Prior approaches for efficient test-time scaling, typically using universal budget constraints or query-level length optimization, do not leverage historical information from previous encounters with the same problem during training. We hypothesize that this limits their ability to progressively make solutions more concise over time. To address this, we present History-Aware Policy Optimization (HAPO), which keeps track of a history state (e.g., the minimum length over previously generated correct responses) for each problem. HAPO employs a novel length reward function based on this history state to incentivize the discovery of correct solutions that are more concise than those previously found. Crucially, this reward structure avoids overly penalizing shorter incorrect responses with the goal of facilitating exploration towards more efficient solutions. By combining this length reward with a correctness reward, HAPO jointly optimizes for correctness and efficiency. We use HAPO to train DeepSeek-R1-Distill-Qwen-1.5B, DeepScaleR-1.5B-Preview, and Qwen-2.5-1.5B-Instruct, and evaluate HAPO on several math benchmarks that span various difficulty levels. Experiment results demonstrate that HAPO effectively induces LLMs' concise reasoning abilities, producing length reductions of 33-59% with accuracy drops of only 2-5%.
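
The history-based length reward described above can be illustrated with a small sketch. The abstract does not give the reward formula, so the functional form below (a relative length difference, clipped at zero for incorrect responses) and the names `length_reward` and `update_history` are hypothetical and are not the authors' definition.

```python
def length_reward(length, is_correct, history_min):
    """Hypothetical length reward relative to the shortest correct
    response seen so far for this problem (history_min)."""
    if history_min is None:
        return 0.0  # no correct response seen yet: no length signal
    raw = (history_min - length) / max(history_min, length)  # > 0 if shorter
    if is_correct:
        return raw
    # Incorrect responses are never rewarded for brevity, and are
    # penalized only when they are longer than the history state.
    return min(raw, 0.0)

def update_history(history_min, length, is_correct):
    """Track the minimum length over correct responses only."""
    if not is_correct:
        return history_min
    return length if history_min is None else min(history_min, length)

# Usage: history state is 100 tokens.
r_short_correct = length_reward(50, True, 100)
r_short_wrong = length_reward(50, False, 100)
r_long_wrong = length_reward(200, False, 100)
```

In a full training loop this term would be added to a correctness reward; how the two are weighted is not stated in the abstract.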

Title: Learning traffic flows: Graph Neural Networks for Metamodelling Traffic Assignment

Authors: Oskar Bohn Lassen, Serio Agriesti, Mohamed Eldafrawi, Daniele Gammelli, Guido Cantelmo, Guido Gentile, Francisco Camara Pereira
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.11230
Pdf URL: https://arxiv.org/pdf/2505.11230
Copy Paste: [[2505.11230]] Learning traffic flows: Graph Neural Networks for Metamodelling Traffic Assignment(https://arxiv.org/abs/2505.11230)
Keywords: robust
Abstract: The Traffic Assignment Problem is a fundamental, yet computationally expensive, task in transportation modeling, especially for large-scale networks. Traditional methods require iterative simulations to reach equilibrium, making real-time or large-scale scenario analysis challenging. In this paper, we propose a learning-based approach using Message-Passing Neural Networks as a metamodel to approximate the equilibrium flow of the Stochastic User Equilibrium assignment. Our model is designed to mimic the algorithmic structure used in conventional traffic simulators allowing it to better capture the underlying process rather than just the data. We benchmark it against other conventional deep learning techniques and evaluate the model's robustness by testing its ability to predict traffic flows on input data outside the domain on which it was trained. This approach offers a promising solution for accelerating out-of-distribution scenario assessments, reducing computational costs in large-scale transportation planning, and enabling real-time decision-making.

Title: AW-GATCN: Adaptive Weighted Graph Attention Convolutional Network for Event Camera Data Joint Denoising and Object Recognition

Authors: Haiyu Li, Charith Abhayaratne
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.11232
Pdf URL: https://arxiv.org/pdf/2505.11232
Copy Paste: [[2505.11232]] AW-GATCN: Adaptive Weighted Graph Attention Convolutional Network for Event Camera Data Joint Denoising and Object Recognition(https://arxiv.org/abs/2505.11232)
Keywords: robust, segmentation
Abstract: Event cameras, which capture brightness changes with high temporal resolution, inherently generate a significant amount of redundant and noisy data beyond essential object structures. The primary challenge in event-based object recognition lies in effectively removing this noise without losing critical spatial-temporal information. To address this, we propose an Adaptive Graph-based Noisy Data Removal framework for Event-based Object Recognition. Specifically, our approach integrates adaptive event segmentation based on normalized density analysis, a multifactorial edge-weighting mechanism, and adaptive graph-based denoising strategies. These innovations significantly enhance the integration of spatiotemporal information, effectively filtering noise while preserving critical structural features for robust recognition. Experimental evaluations on four challenging datasets demonstrate that our method achieves superior recognition accuracies of 83.77%, 76.79%, 99.30%, and 96.89%, surpassing existing graph-based methods by up to 8.79%, and improving noise reduction performance by up to 19.57%, with an additional accuracy gain of 6.26% compared to traditional Euclidean-based techniques.

Title: Massive-STEPS: Massive Semantic Trajectories for Understanding POI Check-ins -- Dataset and Benchmarks

Authors: Wilson Wongso, Hao Xue, Flora D. Salim
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.11239
Pdf URL: https://arxiv.org/pdf/2505.11239
Copy Paste: [[2505.11239]] Massive-STEPS: Massive Semantic Trajectories for Understanding POI Check-ins -- Dataset and Benchmarks(https://arxiv.org/abs/2505.11239)
Keywords: generative
Abstract: Understanding human mobility through Point-of-Interest (POI) recommendation is increasingly important for applications such as urban planning, personalized services, and generative agent simulation. However, progress in this field is hindered by two key challenges: the over-reliance on older datasets from 2012-2013 and the lack of reproducible, city-level check-in datasets that reflect diverse global regions. To address these gaps, we present Massive-STEPS (Massive Semantic Trajectories for Understanding POI Check-ins), a large-scale, publicly available benchmark dataset built upon the Semantic Trails dataset and enriched with semantic POI metadata. Massive-STEPS spans 12 geographically and culturally diverse cities and features more recent (2017-2018) and longer-duration (24 months) check-in data than prior datasets. We benchmarked a wide range of POI recommendation models on Massive-STEPS using both supervised and zero-shot approaches, and evaluated their performance across multiple urban contexts. By releasing Massive-STEPS, we aim to facilitate reproducible and equitable research in human mobility and POI recommendation. The dataset and benchmarking code are available at: this https URL

Title: Diffusion-NPO: Negative Preference Optimization for Better Preference Aligned Generation of Diffusion Models

Authors: Fu-Yun Wang, Yunhao Shui, Jingtan Piao, Keqiang Sun, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.11245
Pdf URL: https://arxiv.org/pdf/2505.11245
Copy Paste: [[2505.11245]] Diffusion-NPO: Negative Preference Optimization for Better Preference Aligned Generation of Diffusion Models(https://arxiv.org/abs/2505.11245)
Keywords: diffusion
Abstract: Diffusion models have made substantial advances in image generation, yet models trained on large, unfiltered datasets often yield outputs misaligned with human preferences. Numerous methods have been proposed to fine-tune pre-trained diffusion models, achieving notable improvements in aligning generated outputs with human preferences. However, we argue that existing preference alignment methods neglect the critical role of handling unconditional/negative-conditional outputs, leading to a diminished capacity to avoid generating undesirable outcomes. This oversight limits the efficacy of classifier-free guidance (CFG), which relies on the contrast between conditional generation and unconditional/negative-conditional generation to optimize output quality. In response, we propose a straightforward but versatile and effective approach that involves training a model specifically attuned to negative preferences. This method does not require new training strategies or datasets but rather involves minor modifications to existing techniques. Our approach integrates seamlessly with models such as SD1.5, SDXL, video diffusion models and models that have undergone preference optimization, consistently enhancing their alignment with human preferences.
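
For context, the standard classifier-free guidance combination that the abstract refers to can be written in a few lines. The sketch assumes noise predictions are plain arrays; feeding `eps_neg` from a separately trained negative-preference model is an assumption about how such a model would plug in, not a detail given in the abstract.

```python
import numpy as np

def cfg_combine(eps_cond, eps_neg, guidance_scale):
    """Standard CFG: move from the unconditional/negative-conditional
    prediction toward the conditional one by guidance_scale."""
    eps_cond = np.asarray(eps_cond, dtype=float)
    eps_neg = np.asarray(eps_neg, dtype=float)
    return eps_neg + guidance_scale * (eps_cond - eps_neg)

# Usage with toy two-dimensional "noise predictions".
eps_c = np.array([1.0, 2.0])
eps_n = np.array([0.0, 1.0])
guided = cfg_combine(eps_c, eps_n, guidance_scale=7.5)
```

A scale of 1 reproduces the conditional prediction, and larger scales push the output further away from whatever the negative branch predicts, which is why the quality of that branch matters.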

Title: Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction

Authors: Jeffrey Willette, Heejun Lee, Sung Ju Hwang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.11254
Pdf URL: https://arxiv.org/pdf/2505.11254
Copy Paste: [[2505.11254]] Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction(https://arxiv.org/abs/2505.11254)
Keywords: transformer
Abstract: The attention mechanism of a transformer has a quadratic complexity, leading to high inference costs and latency for long sequences. However, attention matrices are mostly sparse, which implies that many entries may be omitted from computation for efficient inference. Sparse attention inference methods aim to reduce this computational burden; however, they also come with a troublesome performance degradation. We discover that one reason for this degradation is that the sparse calculation induces a distributional shift in the attention outputs. The distributional shift causes decoding-time queries to fail to align well with the appropriate keys from the prefill stage, leading to a drop in performance. We propose a simple, novel, and effective procedure for correcting this distributional shift, bringing the distribution of sparse attention outputs closer to that of quadratic attention. Our method can be applied on top of any sparse attention method, and results in an average performance increase of 36 percentage points, recovering 88% of quadratic attention accuracy on the 131K RULER benchmark when applied on top of sliding window attention with sink tokens while only adding a small overhead. Our method can maintain approximately 98.5% sparsity over full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M token prefills.

Title: DRAGON: A Large-Scale Dataset of Realistic Images Generated by Diffusion Models

Authors: Giulia Bertazzini, Daniele Baracchi, Dasara Shullani, Isao Echizen, Alessandro Piva
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.11257
Pdf URL: https://arxiv.org/pdf/2505.11257
Copy Paste: [[2505.11257]] DRAGON: A Large-Scale Dataset of Realistic Images Generated by Diffusion Models(https://arxiv.org/abs/2505.11257)
Keywords: robust, diffusion, large language model
Abstract: The remarkable ease of use of diffusion models for image generation has led to a proliferation of synthetic content online. While these models are often employed for legitimate purposes, they are also used to generate fake images that support misinformation and hate speech. Consequently, it is crucial to develop robust tools capable of detecting whether an image has been generated by such models. Many current detection methods, however, require large volumes of sample images for training. Unfortunately, due to the rapid evolution of the field, existing datasets often cover only a limited range of models and quickly become outdated. In this work, we introduce DRAGON, a comprehensive dataset comprising images from 25 diffusion models, spanning both recent advancements and older, well-established architectures. The dataset contains a broad variety of images representing diverse subjects. To enhance image realism, we propose a simple yet effective pipeline that leverages a large language model to expand input prompts, thereby generating more diverse and higher-quality outputs, as evidenced by improvements in standard quality metrics. The dataset is provided in multiple sizes (ranging from extra-small to extra-large) to accommodate different research scenarios. DRAGON is designed to support the forensic community in developing and evaluating detection and attribution techniques for synthetic content. Additionally, the dataset is accompanied by a dedicated test set, intended to serve as a benchmark for assessing the performance of newly developed methods.

Title: Equal is Not Always Fair: A New Perspective on Hyperspectral Representation Non-Uniformity

Authors: Wuzhou Quan, Mingqiang Wei, Jinhui Tang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11267
Pdf URL: https://arxiv.org/pdf/2505.11267
Copy Paste: [[2505.11267]] Equal is Not Always Fair: A New Perspective on Hyperspectral Representation Non-Uniformity(https://arxiv.org/abs/2505.11267)
Keywords: fair
Abstract: Hyperspectral image (HSI) representation is fundamentally challenged by pervasive non-uniformity, where spectral dependencies, spatial continuity, and feature efficiency exhibit complex and often conflicting behaviors. Most existing models rely on a unified processing paradigm that assumes homogeneity across dimensions, leading to suboptimal performance and biased representations. To address this, we propose FairHyp, a fairness-directed framework that explicitly disentangles and resolves the threefold non-uniformity through cooperative yet specialized modules. We introduce a Runge-Kutta-inspired spatial variability adapter to restore spatial coherence under resolution discrepancies, a multi-receptive field convolution module with sparse-aware refinement to enhance discriminative features while respecting inherent sparsity, and a spectral-context state space model that captures stable and long-range spectral dependencies via bidirectional Mamba scanning and statistical aggregation. Unlike one-size-fits-all solutions, FairHyp achieves dimension-specific adaptation while preserving global consistency and mutual reinforcement. This design is grounded in the view that non-uniformity arises from the intrinsic structure of HSI representations, rather than any particular task setting. To validate this, we apply FairHyp across four representative tasks including classification, denoising, super-resolution, and inpainting, demonstrating its effectiveness in modeling a shared structural flaw. Extensive experiments show that FairHyp consistently outperforms state-of-the-art methods under varied imaging conditions. Our findings redefine fairness as a structural necessity in HSI modeling and offer a new paradigm for balancing adaptability, efficiency, and fidelity in high-dimensional vision tasks.

Title: Semantic Caching of Contextual Summaries for Efficient Question-Answering with Language Models

Authors: Camille Couturier, Spyros Mastorakis, Haiying Shen, Saravan Rajmohan, Victor Rühle
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.11271
Pdf URL: https://arxiv.org/pdf/2505.11271
Copy Paste: [[2505.11271]] Semantic Caching of Contextual Summaries for Efficient Question-Answering with Language Models(https://arxiv.org/abs/2505.11271)
Keywords: large language model
Abstract: Large Language Models (LLMs) are increasingly deployed across edge and cloud platforms for real-time question-answering and retrieval-augmented generation. However, processing lengthy contexts in distributed systems incurs high computational overhead, memory usage, and network bandwidth. This paper introduces a novel semantic caching approach for storing and reusing intermediate contextual summaries, enabling efficient information reuse across similar queries in LLM-based QA workflows. Our method reduces redundant computations by up to 50-60% while maintaining answer accuracy comparable to full document processing, as demonstrated on NaturalQuestions, TriviaQA, and a synthetic ArXiv dataset. This approach balances computational cost and response quality, critical for real-time AI assistants.
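
The idea of reusing contextual summaries across similar queries can be sketched as a similarity-keyed cache. This is a minimal illustration under stated assumptions: the class name `SemanticSummaryCache`, the cosine-similarity lookup, and the 0.9 threshold are all hypothetical, and embeddings are passed in as plain vectors rather than produced by any particular model.

```python
import numpy as np

class SemanticSummaryCache:
    """Toy cache: store summaries keyed by unit-normalized query
    embeddings and return the best match above a cosine threshold."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.keys = []        # unit-norm embedding vectors
        self.summaries = []   # cached summary strings

    @staticmethod
    def _unit(vec):
        v = np.asarray(vec, dtype=float)
        return v / np.linalg.norm(v)

    def put(self, embedding, summary):
        self.keys.append(self._unit(embedding))
        self.summaries.append(summary)

    def get(self, embedding):
        if not self.keys:
            return None
        sims = np.stack(self.keys) @ self._unit(embedding)  # cosine similarities
        best = int(np.argmax(sims))
        return self.summaries[best] if sims[best] >= self.threshold else None

# Usage with toy two-dimensional embeddings.
cache = SemanticSummaryCache(threshold=0.9)
cache.put([1.0, 0.0], "summary of section 3")
hit = cache.get([0.99, 0.05])   # near-duplicate query
miss = cache.get([0.0, 1.0])    # unrelated query
```

On a hit the cached summary would be passed to the model instead of the full document; on a miss the caller would summarize the context and `put` the result.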

Title: Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs

Authors: Yaorui Shi, Shihan Li, Chang Wu, Zhiyuan Liu, Junfeng Fang, Hengxing Cai, An Zhang, Xiang Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11277
Pdf URL: https://arxiv.org/pdf/2505.11277
Copy Paste: [[2505.11277]] Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs(https://arxiv.org/abs/2505.11277)
Keywords: large language model
Abstract: Large language models have demonstrated impressive reasoning capabilities but are inherently limited by their knowledge reservoir. Retrieval-augmented reasoning mitigates this limitation by allowing LLMs to query external resources, but existing methods often retrieve irrelevant or noisy information, hindering accurate reasoning. In this paper, we propose AutoRefine, a reinforcement learning post-training framework that adopts a new "search-and-refine-during-think" paradigm. AutoRefine introduces explicit knowledge refinement steps between successive search calls, enabling the model to iteratively filter, distill, and organize evidence before generating an answer. Furthermore, we incorporate tailored retrieval-specific rewards alongside answer correctness rewards using group relative policy optimization. Experiments on single-hop and multi-hop QA benchmarks demonstrate that AutoRefine significantly outperforms existing approaches, particularly in complex, multi-hop reasoning scenarios. Detailed analysis shows that AutoRefine issues frequent, higher-quality searches and synthesizes evidence effectively.

Title: Temporal fine-tuning for early risk detection

Authors: Horacio Thompson, Esaú Villatoro-Tello, Manuel Montes-y-Gómez, Marcelo Errecalde
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11280
Pdf URL: https://arxiv.org/pdf/2505.11280
Copy Paste: [[2505.11280]] Temporal fine-tuning for early risk detection(https://arxiv.org/abs/2505.11280)
Keywords: transformer
Abstract: Early Risk Detection (ERD) on the Web aims to promptly identify users facing social and health issues. Users are analyzed post-by-post, and it is necessary to guarantee correct and quick answers, which is particularly challenging in critical scenarios. ERD involves optimizing classification precision and minimizing detection delay. Standard classification metrics may not suffice, so specific metrics such as ERDE(theta), which explicitly consider precision and delay, are used instead. The current research focuses on applying a multi-objective approach, prioritizing classification performance and establishing a separate criterion for decision time. In this work, we propose a completely different strategy, temporal fine-tuning, which allows tuning transformer-based models by explicitly incorporating time within the learning process. Our method allows us to analyze complete user post histories, tune models considering different contexts, and evaluate training performance using temporal metrics. We evaluated our proposal in the depression and eating disorders tasks for the Spanish language, achieving competitive results compared to the best models of MentalRiskES 2023. We found that temporal fine-tuning optimized decisions considering context and time progress. In this way, by properly taking advantage of the power of transformers, it is possible to address ERD by combining precision and speed as a single objective.
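
As background for the ERDE(theta) metric named above, the sketch below follows the commonly cited per-user definition: fixed costs for false positives and false negatives, and a true-positive cost that grows with the number of posts read before deciding. The abstract itself does not spell the metric out, so the formula and the default cost values here are assumptions for illustration.

```python
import math

def latency_cost(k, theta):
    """Sigmoid penalty in [0, 1): close to 0 for decisions made well
    before theta posts, close to 1 for decisions made well after."""
    if k - theta > 700:          # avoid overflow in exp for very late decisions
        return 1.0
    return 1.0 - 1.0 / (1.0 + math.exp(k - theta))

def erde(decision, truth, k, theta, c_fp, c_fn=1.0, c_tp=1.0):
    """Per-user error: decision and truth are booleans (at risk or not),
    k is the number of posts seen when the decision was issued."""
    if decision and not truth:
        return c_fp                              # false positive
    if not decision and truth:
        return c_fn                              # false negative
    if decision and truth:
        return latency_cost(k, theta) * c_tp     # true positive, delay-penalized
    return 0.0                                   # true negative

# Usage: a correct alarm raised exactly at the theta-th post.
cost_on_time = erde(True, True, k=5, theta=5, c_fp=0.1)
cost_early = erde(True, True, k=1, theta=5, c_fp=0.1)
```

The reported score is typically the mean of this per-user error, so lower is better, and a correct but late alarm can cost nearly as much as a miss.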

Title: Bidirectional Information Flow (BIF) -- A Sample Efficient Hierarchical Gaussian Process for Bayesian Optimization

Authors: Juan D. Guerra (1 and 3), Thomas Garbay (1 and 3), Guillaume Lajoie (2 and 3), Marco Bonizzato (1, 2 and 3) ((1) Polytechnique Montréal, (2) Université de Montréal, (3) Mila - Quebec Artificial Intelligence Institute)
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.11294
Pdf URL: https://arxiv.org/pdf/2505.11294
Copy Paste: [[2505.11294]] Bidirectional Information Flow (BIF) -- A Sample Efficient Hierarchical Gaussian Process for Bayesian Optimization(https://arxiv.org/abs/2505.11294)
Keywords: robust
Abstract: Hierarchical Gaussian Process (H-GP) models divide problems into different subtasks, allowing for different models to address each part, making them well-suited for problems with inherent hierarchical structure. However, typical H-GP models do not fully take advantage of this structure, only sending information up or down the hierarchy. This one-way coupling limits sample efficiency and slows convergence. We propose Bidirectional Information Flow (BIF), an efficient H-GP framework that establishes bidirectional information exchange between parent and child models in H-GPs for online training. BIF retains the modular structure of hierarchical models - the parent combines subtask knowledge from children GPs - while introducing top-down feedback to continually refine children models during online learning. This mutual exchange improves sample efficiency, enables robust training, and allows modular reuse of learned subtask models. BIF outperforms conventional H-GP Bayesian Optimization methods, achieving up to 85% and 5x higher $R^2$ scores for the parent and children respectively, on synthetic and real-world neurostimulation optimization tasks.

Title: Probing Subphonemes in Morphology Models

Authors: Gal Astrach, Yuval Pinter
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11297
Pdf URL: https://arxiv.org/pdf/2505.11297
Copy Paste: [[2505.11297]] Probing Subphonemes in Morphology Models(https://arxiv.org/abs/2505.11297)
Keywords: transformer
Abstract: Transformers have achieved state-of-the-art performance in morphological inflection tasks, yet their ability to generalize across languages and morphological rules remains limited. One possible explanation for this behavior can be the degree to which these models are able to capture implicit phenomena at the phonological and subphonemic levels. We introduce a language-agnostic probing method to investigate phonological feature encoding in transformers trained directly on phonemes, and perform it across seven morphologically diverse languages. We show that phonological features which are local, such as final-obstruent devoicing in Turkish, are captured well in phoneme embeddings, whereas long-distance dependencies like vowel harmony are better represented in the transformer's encoder. Finally, we discuss how these findings inform empirical strategies for training morphological models, particularly regarding the role of subphonemic feature acquisition.

Title: Heterogeneity-Aware Client Sampling: A Unified Solution for Consistent Federated Learning

Authors: Shudi Weng, Chao Ren, Ming Xiao, Mikael Skoglund
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11304
Pdf URL: https://arxiv.org/pdf/2505.11304
Copy Paste: [[2505.11304]] Heterogeneity-Aware Client Sampling: A Unified Solution for Consistent Federated Learning(https://arxiv.org/abs/2505.11304)
Keywords: federate
Abstract: Federated learning (FL) commonly involves clients with diverse communication and computational capabilities. Such heterogeneity can significantly distort the optimization dynamics and lead to objective inconsistency, where the global model converges to an incorrect stationary point potentially far from the pursued optimum. Despite its critical impact, the joint effect of communication and computation heterogeneity has remained largely unexplored, due to the intrinsic complexity of their interaction. In this paper, we reveal the fundamentally distinct mechanisms through which heterogeneous communication and computation drive inconsistency in FL. To the best of our knowledge, this is the first unified theoretical analysis of general heterogeneous FL, offering a principled understanding of how these two forms of heterogeneity jointly distort the optimization trajectory under arbitrary choices of local solvers. Motivated by these insights, we propose Federated Heterogeneity-Aware Client Sampling, FedACS, a universal method to eliminate all types of objective inconsistency. We theoretically prove that FedACS converges to the correct optimum at a rate of $O(1/\sqrt{R})$, even in dynamic heterogeneous environments. Extensive experiments across multiple datasets show that FedACS outperforms state-of-the-art and category-specific baselines by 4.3%-36%, while reducing communication costs by 22%-89% and computation loads by 14%-105%, respectively.

Title: Effective Probabilistic Time Series Forecasting with Fourier Adaptive Noise-Separated Diffusion

Authors: Xinyan Wang, Rui Dai, Kaikui Liu, Xiangxiang Chu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.11306
Pdf URL: https://arxiv.org/pdf/2505.11306
Copy Paste: [[2505.11306]] Effective Probabilistic Time Series Forecasting with Fourier Adaptive Noise-Separated Diffusion(https://arxiv.org/abs/2505.11306)
Keywords: diffusion
Abstract: We propose the Fourier Adaptive Lite Diffusion Architecture (FALDA), a novel probabilistic framework for time series forecasting. First, we introduce the Diffusion Model for Residual Regression (DMRR) framework, which unifies diffusion-based probabilistic regression methods. Within this framework, FALDA leverages Fourier-based decomposition to incorporate a component-specific architecture, enabling tailored modeling of individual temporal components. A conditional diffusion model is utilized to estimate the future noise term, while our proposed lightweight denoiser, DEMA (Decomposition MLP with AdaLN), conditions on the historical noise term to enhance denoising performance. Through mathematical analysis and empirical validation, we demonstrate that FALDA effectively reduces epistemic uncertainty, allowing probabilistic learning to primarily focus on aleatoric uncertainty. Experiments on six real-world benchmarks demonstrate that FALDA consistently outperforms existing probabilistic forecasting approaches across most datasets for long-term time series forecasting while achieving enhanced computational efficiency without compromising accuracy. Notably, FALDA also achieves superior overall performance compared to state-of-the-art (SOTA) point forecasting approaches, with improvements of up to 9%.

Title: Diffusion Learning with Partial Agent Participation and Local Updates

Authors: Elsa Rizk, Kun Yuan, Ali H. Sayed
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.11307
Pdf URL: https://arxiv.org/pdf/2505.11307
Copy Paste: [[2505.11307]] Diffusion Learning with Partial Agent Participation and Local Updates(https://arxiv.org/abs/2505.11307)
Keywords: privacy, protect, diffusion
Abstract: Diffusion learning is a framework that endows edge devices with advanced intelligence. By processing and analyzing data locally and allowing each agent to communicate with its immediate neighbors, diffusion effectively protects the privacy of edge devices, enables real-time response, and reduces reliance on central servers. However, traditional diffusion learning relies on communication at every iteration, leading to communication overhead, especially with large learning models. Furthermore, the inherent volatility of edge devices, stemming from power outages or signal loss, poses challenges to reliable communication between neighboring agents. To mitigate these issues, this paper investigates an enhanced diffusion learning approach incorporating local updates and partial agent participation. Local updates will curtail communication frequency, while partial agent participation will allow for the inclusion of agents based on their availability. We prove that the resulting algorithm is stable in the mean-square error sense and provide a tight analysis of its Mean-Square-Deviation (MSD) performance. Various numerical experiments are conducted to illustrate our theoretical findings.
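
Note: to make the two ingredients in this abstract concrete, the sketch below runs one round of a diffusion-style (adapt, then combine) update on scalar quadratic costs. Active agents take several local gradient steps and then average with whichever neighbors are also active. The uniform averaging rule, the Bernoulli availability model, and all names are assumptions for illustration, not the algorithm analyzed in the paper.

```python
import random


def diffusion_round(w, targets, neighbors, step, local_steps, p_active, rng):
    """One round for agents minimizing 0.5 * (w_k - targets[k]) ** 2."""
    active = [rng.random() < p_active for _ in w]
    # Adapt: each active agent runs several local gradient steps.
    psi = list(w)
    for k in range(len(w)):
        if active[k]:
            for _ in range(local_steps):
                psi[k] -= step * (psi[k] - targets[k])
    # Combine: active agents average with their active neighbors only;
    # inactive agents keep their previous iterate.
    new_w = list(w)
    for k in range(len(w)):
        if active[k]:
            group = [k] + [j for j in neighbors[k] if active[j]]
            new_w[k] = sum(psi[j] for j in group) / len(group)
    return new_w


# Ring of five agents with hypothetical local targets.
ring = [[(k - 1) % 5, (k + 1) % 5] for k in range(5)]
rng = random.Random(0)
w = [0.0] * 5
for _ in range(200):
    w = diffusion_round(w, [0.0, 1.0, 2.0, 3.0, 4.0], ring, 0.05, 2, 0.8, rng)
```

Communication happens once per round rather than once per gradient step, which is the saving that local updates are meant to provide.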

Title: CROC: Evaluating and Training T2I Metrics with Pseudo- and Human-Labeled Contrastive Robustness Checks

Authors: Christoph Leiter, Yuki M. Asano, Margret Keuper, Steffen Eger
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.11314
Pdf URL: https://arxiv.org/pdf/2505.11314
Copy Paste: [[2505.11314]] CROC: Evaluating and Training T2I Metrics with Pseudo- and Human-Labeled Contrastive Robustness Checks(https://arxiv.org/abs/2505.11314)
Keywords: robust
Abstract: The assessment of evaluation metrics (meta-evaluation) is crucial for determining the suitability of existing metrics in text-to-image (T2I) generation tasks. Human-based meta-evaluation is costly and time-intensive, and automated alternatives are scarce. We address this gap and propose CROC: a scalable framework for automated Contrastive Robustness Checks that systematically probes and quantifies metric robustness by synthesizing contrastive test cases across a comprehensive taxonomy of image properties. With CROC, we generate a pseudo-labeled dataset (CROC$^{syn}$) of over one million contrastive prompt-image pairs to enable a fine-grained comparison of evaluation metrics. We also use the dataset to train CROCScore, a new metric that achieves state-of-the-art performance among open-source methods, demonstrating an additional key application of our framework. To complement this dataset, we introduce a human-supervised benchmark (CROC$^{hum}$) targeting especially challenging categories. Our results highlight robustness issues in existing metrics: for example, many fail on prompts involving negation, and all tested open-source metrics fail on at least 25% of cases involving correct identification of body parts.

Title: Anomaly Detection for Non-stationary Time Series using Recurrent Wavelet Probabilistic Neural Network

Authors: Pu Yang, J. A. Barria
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2505.11321
Pdf URL: https://arxiv.org/pdf/2505.11321
Copy Paste: [[2505.11321]] Anomaly Detection for Non-stationary Time Series using Recurrent Wavelet Probabilistic Neural Network(https://arxiv.org/abs/2505.11321)
Keywords: robust
Abstract: In this paper, an unsupervised Recurrent Wavelet Probabilistic Neural Network (RWPNN) is proposed, which aims at detecting anomalies in non-stationary environments by modelling the temporal features using a nonparametric density estimation network. The novel framework consists of two components, a Stacked Recurrent Encoder-Decoder (SREnc-Dec) module that captures temporal features in a latent space, and a Multi-Receptive-field Wavelet Probabilistic Network (MRWPN) that creates an ensemble probabilistic model to characterise the latent space. This formulation extends the standard wavelet probabilistic networks to wavelet deep probabilistic networks, which can handle higher data dimensionality. The MRWPN module can adapt to different rates of data variation in different datasets without imposing strong distribution assumptions, resulting in a more robust and accurate detection for Time Series Anomaly Detection (TSAD) tasks in the non-stationary environment. We carry out the assessment on 45 real-world time series datasets from various domains, verify the performance of RWPNN in TSAD tasks with several constraints, and show its ability to provide early warnings for anomalous events.

Title: MARRS: Masked Autoregressive Unit-based Reaction Synthesis

Authors: Y.B. Wang, S. Wang, J.N. Zhang, J.F. Wu, Q.D. He, C.C. Fu, C.J. Wang, Y. Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.11334
Pdf URL: https://arxiv.org/pdf/2505.11334
Copy Paste: [[2505.11334]] MARRS: Masked Autoregressive Unit-based Reaction Synthesis(https://arxiv.org/abs/2505.11334)
Keywords: diffusion
Abstract: This work aims at a challenging task: human action-reaction synthesis, i.e., generating human reactions based on the action sequence of the other as conditions. Currently, autoregressive modeling approaches have achieved remarkable performance in motion generation tasks, e.g. text-to-motion. However, vector quantization (VQ) accompanying autoregressive generation has inherent disadvantages, including loss of quantization information, low codebook utilization, etc. Moreover, unlike text-to-motion, which focuses solely on the movement of body joints, human action-reaction synthesis also encompasses fine-grained hand movements. In this work, we propose MARRS, a novel framework designed to generate coordinated and fine-grained reaction motions in continuous representations. Initially, we present the Unit-distinguished Motion Variational AutoEncoder (UD-VAE), which segments the entire body into distinct body and hand units, encoding them independently. Subsequently, we propose Action-Conditioned Fusion (ACF), which involves randomly masking a subset of reactive tokens and extracting specific information about the body and hands from the active tokens. Furthermore, we introduce Adaptive Unit Modulation (AUM) to facilitate interaction between body and hand units by using the information from one unit to adaptively modulate the other. Finally, for the diffusion model, we employ a compact MLP as a noise predictor for each distinct body unit and incorporate the diffusion loss to model the probability distribution of each token. Quantitative and qualitative results demonstrate that our method achieves superior performance. The code will be released upon acceptance.

Title: XtraGPT: LLMs for Human-AI Collaboration on Controllable Academic Paper Revision

Authors: Nuo Chen, Andre Lin HuiKai, Jiaying Wu, Junyi Hou, Zining Zhang, Qian Wang, Xidong Wang, Bingsheng He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11336
Pdf URL: https://arxiv.org/pdf/2505.11336
Copy Paste: [[2505.11336]] XtraGPT: LLMs for Human-AI Collaboration on Controllable Academic Paper Revision(https://arxiv.org/abs/2505.11336)
Keywords: large language model
Abstract: Despite the growing adoption of large language models (LLMs) in academic workflows, their capabilities remain limited when it comes to supporting high-quality scientific writing. Most existing systems are designed for general-purpose scientific text generation and fail to meet the sophisticated demands of research communication beyond surface-level polishing, such as conceptual coherence across sections. Furthermore, academic writing is inherently iterative and revision-driven, a process not well supported by direct prompting-based paradigms. To address these scenarios, we propose a human-AI collaboration framework for academic paper revision. We first introduce a comprehensive dataset of 7,040 research papers from top-tier venues annotated with over 140,000 instruction-response pairs that reflect realistic, section-level scientific revisions. Building on the dataset, we develop XtraGPT, the first suite of open-source LLMs, designed to provide context-aware, instruction-guided writing assistance, ranging from 1.5B to 14B parameters. Extensive experiments validate that XtraGPT significantly outperforms same-scale baselines and approaches the quality of proprietary systems. Both automated preference assessments and human evaluations confirm the effectiveness of our models in improving scientific drafts.

Title: Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models

Authors: Blanca Calvo Figueras, Rodrigo Agerri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11341
Pdf URL: https://arxiv.org/pdf/2505.11341
Copy Paste: [[2505.11341]] Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models(https://arxiv.org/abs/2505.11341)
Keywords: large language model
Abstract: The task of Critical Questions Generation (CQs-Gen) aims to foster critical thinking by enabling systems to generate questions that expose assumptions and challenge the reasoning in arguments. Despite growing interest in this area, progress has been hindered by the lack of suitable datasets and automatic evaluation standards. This work presents a comprehensive approach to support the development and benchmarking of systems for this task. We construct the first large-scale manually-annotated dataset. We also investigate automatic evaluation methods and identify a reference-based technique using large language models (LLMs) as the strategy that best correlates with human judgments. Our zero-shot evaluation of 11 LLMs establishes a strong baseline while showcasing the difficulty of the task. Data, code, and a public leaderboard are provided to encourage further research not only in terms of model performance, but also to explore the practical benefits of CQs-Gen for both automated reasoning and human critical thinking.

Title: Dynamic Base model Shift for Delta Compression

Authors: Chenyu Huang, Peng Ye, Shenghe Zheng, Xiaohui Wang, Lei Bai, Tao Chen, Wanli Ouyang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.11344
Pdf URL: https://arxiv.org/pdf/2505.11344
Copy Paste: [[2505.11344]] Dynamic Base model Shift for Delta Compression(https://arxiv.org/abs/2505.11344)
Keywords: transformer
Abstract: Transformer-based models with the pretrain-finetune paradigm bring about significant progress, along with the heavy storage and deployment costs of finetuned models on multiple tasks. Delta compression attempts to lower the costs by reducing the redundancy of delta parameters (i.e., the difference between the finetuned and pre-trained model weights) through pruning or quantization. However, existing methods by default employ the pretrained model as the base model and compress the delta parameters for every task, which may cause significant performance degradation, especially when the compression rate is extremely high. To tackle this issue, we investigate the impact of different base models on the performance of delta compression and find that the pre-trained base model can hardly be optimal. To this end, we propose Dynamic Base Model Shift (DBMS), which dynamically adapts the base model to the target task before performing delta compression. Specifically, we adjust two parameters, which respectively determine the magnitude of the base model shift and the overall scale of delta compression, to boost the compression performance on each task. Through low-cost learning of these two parameters, our DBMS can maintain most of the finetuned model's performance even under an extremely high compression ratio setting, significantly surpassing existing methods. Moreover, our DBMS is orthogonal and can be integrated with a variety of other methods, and it has been evaluated across different types of models including language, vision transformer, and multi-modal models.
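
Note: as a reminder of what plain delta compression does before any base shift is applied, the sketch below prunes delta parameters by magnitude and rebuilds the weights from the base plus the sparse delta. Weights are flat Python lists for brevity, and the function names are hypothetical. The base shift and scale parameters that DBMS learns are deliberately not modeled, since the abstract does not specify their form.

```python
def prune_delta(finetuned, base, keep_ratio):
    """Keep only the largest-magnitude entries of (finetuned - base)."""
    delta = [f - b for f, b in zip(finetuned, base)]
    keep = max(1, round(keep_ratio * len(delta)))
    # Indices sorted by decreasing absolute delta value.
    order = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    kept = set(order[:keep])
    return [d if i in kept else 0.0 for i, d in enumerate(delta)]


def reconstruct(base, sparse_delta):
    """Approximate the finetuned weights from the base and the pruned delta."""
    return [b + d for b, d in zip(base, sparse_delta)]


# Hypothetical weights: half of the delta entries are kept.
base = [1.0, 1.9, 3.0, 2.0]
sparse = prune_delta([1.0, 2.0, 3.5, 4.0], base, keep_ratio=0.5)
approx = reconstruct(base, sparse)
```

Only the base model and the sparse delta need to be stored per task, and the pruned entries are the source of the accuracy loss the abstract describes at high compression rates.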

Title: Context parroting: A simple but tough-to-beat baseline for foundation models in scientific machine learning

Authors: Yuanzhao Zhang, William Gilpin
Subjects: cs.LG, nlin.CD, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2505.11349
Pdf URL: https://arxiv.org/pdf/2505.11349
Copy Paste: [[2505.11349]] Context parroting: A simple but tough-to-beat baseline for foundation models in scientific machine learning(https://arxiv.org/abs/2505.11349)
Keywords: large language model
Abstract: Recently-developed time series foundation models for scientific machine learning exhibit emergent abilities to predict physical systems. These abilities include zero-shot forecasting, in which a model forecasts future states of a system given only a short trajectory as context. Here, we show that foundation models applied to physical systems can give accurate predictions, but that they fail to develop meaningful representations of the underlying physics. Instead, foundation models often forecast by context parroting, a simple zero-shot forecasting strategy that copies directly from the context. As a result, a naive direct context parroting model scores higher than state-of-the-art time-series foundation models on predicting a diverse range of dynamical systems, at a tiny fraction of the computational cost. We draw a parallel between context parroting and induction heads, which explains why large language models trained on text can be repurposed for time series forecasting. Our dynamical systems perspective also ties the scaling between forecast accuracy and context length to the fractal dimension of the attractor, providing insight into the previously observed in-context neural scaling laws. Context parroting thus serves as a simple but tough-to-beat baseline for future time-series foundation models and can help identify in-context learning strategies beyond parroting.
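
Note: the abstract describes context parroting as copying directly from the context. One simple reading, sketched below under that assumption, is to find the earlier window that best matches the most recent points and then replay whatever followed it. The window length, the squared-error match, and the function name are illustrative choices, not necessarily the authors' exact baseline.

```python
def parrot_forecast(context, query_len, horizon):
    """Forecast by replaying what followed the best-matching earlier window."""
    n = len(context)
    if n <= query_len:
        raise ValueError("context must be longer than the query window")
    query = context[n - query_len:]
    best_start, best_dist = 0, float("inf")
    # Only earlier windows with at least one value after them are candidates.
    for start in range(n - query_len):
        window = context[start:start + query_len]
        dist = sum((a - b) ** 2 for a, b in zip(window, query))
        if dist < best_dist:
            best_start, best_dist = start, dist
    # Copy forward from just after the match, reading from the growing
    # sequence so the copy can run past the end of the original context.
    seq = list(context)
    src = best_start + query_len
    for _ in range(horizon):
        seq.append(seq[src])
        src += 1
    return seq[n:]


forecast = parrot_forecast([0, 1, 2, 3] * 3, query_len=3, horizon=5)
```

On a periodic signal the copy reproduces the cycle exactly, which hints at why such a baseline can be hard to beat on recurrent dynamics.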

Title: LegoSLM: Connecting LLM with Speech Encoder using CTC Posteriors

Authors: Rao Ma, Tongzhou Chen, Kartik Audhkhasi, Bhuvana Ramabhadran
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.11352
Pdf URL: https://arxiv.org/pdf/2505.11352
Copy Paste: [[2505.11352]] LegoSLM: Connecting LLM with Speech Encoder using CTC Posteriors(https://arxiv.org/abs/2505.11352)
Keywords: large language model
Abstract: Recently, large-scale pre-trained speech encoders and Large Language Models (LLMs) have been released, which show state-of-the-art performance on a range of spoken language processing tasks including Automatic Speech Recognition (ASR). To effectively combine both models for better performance, continuous speech prompts and ASR error correction have been adopted. However, these methods are prone to suboptimal performance or are inflexible. In this paper, we propose a new paradigm, LegoSLM, that bridges speech encoders and LLMs using the ASR posterior matrices. The speech encoder is trained to generate Connectionist Temporal Classification (CTC) posteriors over the LLM vocabulary, which are used to reconstruct pseudo-audio embeddings by computing a weighted sum of the LLM input embeddings. These embeddings are concatenated with text embeddings in the LLM input space. Using the well-performing USM and Gemma models as an example, we demonstrate that our proposed LegoSLM method yields good performance on both ASR and speech translation tasks. By connecting USM with Gemma models, we can get an average of 49% WERR over the USM-CTC baseline on 8 MLS test sets. The trained model also exhibits modularity in a range of settings -- after fine-tuning the Gemma model weights, the speech encoder can be switched and combined with the LLM in a zero-shot fashion. Additionally, we propose to control the decode-time influence of the USM and LLM using a softmax temperature, which shows effectiveness in domain adaptation.
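
Note: the weighted-sum step in this abstract can be illustrated for a single audio frame. The sketch below turns frame logits over the vocabulary into a posterior with a temperature softmax and mixes the rows of an embedding table accordingly. Applying the temperature to the CTC logits is an assumption about where the knob sits, and the tiny table and function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_audio_embedding(frame_logits, embedding_table, temperature=1.0):
    """Posterior-weighted sum of the embedding rows for one frame."""
    probs = softmax(frame_logits, temperature)
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]


# Two-token vocabulary with 2-dimensional embeddings (toy values).
table = [[1.0, 0.0], [0.0, 1.0]]
uniform = pseudo_audio_embedding([0.0, 0.0], table)
sharp = pseudo_audio_embedding([2.0, 0.0], table, temperature=0.01)
```

A low temperature pushes the mixture toward the embedding of the single most likely token, while a high temperature spreads it across the vocabulary.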

Title: GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents

Authors: Lingxiao Diao, Xinyue Xu, Wanxuan Sun, Cheng Yang, Zhuosheng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11368
Pdf URL: https://arxiv.org/pdf/2505.11368
Copy Paste: [[2505.11368]] GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents(https://arxiv.org/abs/2505.11368)
Keywords: robust, large language model
Abstract: Large language models (LLMs) have been widely deployed as autonomous agents capable of following user instructions and making decisions in real-world applications. Previous studies have made notable progress in benchmarking the instruction following capabilities of LLMs in general domains, with a primary focus on their inherent commonsense knowledge. Recently, LLMs have been increasingly deployed as domain-oriented agents, which rely on domain-oriented guidelines that may conflict with their commonsense knowledge. These guidelines exhibit two key characteristics: they consist of a wide range of domain-oriented rules and are subject to frequent updates. Despite these challenges, the absence of comprehensive benchmarks for evaluating the domain-oriented guideline following capabilities of LLMs presents a significant obstacle to their effective assessment and further development. In this paper, we introduce GuideBench, a comprehensive benchmark designed to evaluate guideline following performance of LLMs. GuideBench evaluates LLMs on three critical aspects: (i) adherence to diverse rules, (ii) robustness to rule updates, and (iii) alignment with human preferences. Experimental results on a range of LLMs indicate substantial opportunities for improving their ability to follow domain-oriented guidelines.

Title: MutualNeRF: Improve the Performance of NeRF under Limited Samples with Mutual Information Theory

Authors: Zifan Wang, Jingwei Li, Yitang Li, Yunze Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.11386
Pdf URL: https://arxiv.org/pdf/2505.11386
Copy Paste: [[2505.11386]] MutualNeRF: Improve the Performance of NeRF under Limited Samples with Mutual Information Theory(https://arxiv.org/abs/2505.11386)
Keywords: robust
Abstract: This paper introduces MutualNeRF, a framework enhancing Neural Radiance Field (NeRF) performance under limited samples using Mutual Information Theory. While NeRF excels in 3D scene synthesis, challenges arise with limited data, and existing methods that aim to introduce prior knowledge lack theoretical support in a unified framework. We introduce a simple but theoretically robust concept, Mutual Information, as a metric to uniformly measure the correlation between images, considering both macro (semantic) and micro (pixel) levels. For sparse view sampling, we strategically select additional viewpoints containing more non-overlapping scene information by minimizing mutual information without knowing ground truth images beforehand. Our framework employs a greedy algorithm, offering a near-optimal solution. For few-shot view synthesis, we maximize the mutual information between inferred images and ground truth, expecting inferred images to gain more relevant information from known images. This is achieved by incorporating efficient, plug-and-play regularization terms. Experiments under limited samples show consistent improvement over state-of-the-art baselines in different settings, affirming the efficacy of our framework.

Title: IISE PG&E Energy Analytics Challenge 2025: Hourly-Binned Regression Models Beat Transformers in Load Forecasting

Authors: Millend Roy, Vladimir Pyltsov, Yinbo Hu
Subjects: cs.LG, econ.EM, eess.SY
Abstract URL: https://arxiv.org/abs/2505.11390
Pdf URL: https://arxiv.org/pdf/2505.11390
Copy Paste: [[2505.11390]] IISE PG&E Energy Analytics Challenge 2025: Hourly-Binned Regression Models Beat Transformers in Load Forecasting(https://arxiv.org/abs/2505.11390)
Keywords: transformer
Abstract: Accurate electricity load forecasting is essential for grid stability, resource optimization, and renewable energy integration. While transformer-based deep learning models like TimeGPT have gained traction in time-series forecasting, their effectiveness in long-term electricity load prediction remains uncertain. This study evaluates forecasting models ranging from classical regression techniques to advanced deep learning architectures using data from the ESD 2025 competition. The dataset includes two years of historical electricity load data, alongside temperature and global horizontal irradiance (GHI) across five sites, with a one-day-ahead forecasting horizon. Since actual test set load values remain undisclosed, leveraging predicted values would accumulate errors, making this a long-term forecasting challenge. We (i) employ Principal Component Analysis (PCA) for dimensionality reduction, (ii) frame the task as a regression problem, using temperature and GHI as covariates to predict load for each hour, and (iii) ultimately stack 24 models to generate yearly forecasts. Our results reveal that deep learning models, including TimeGPT, fail to consistently outperform simpler statistical and machine learning approaches due to the limited availability of training data and exogenous variables. In contrast, XGBoost, with minimal feature engineering, delivers the lowest error rates across all test cases while maintaining computational efficiency. This highlights the limitations of deep learning in long-term electricity forecasting and reinforces the importance of model selection based on dataset characteristics rather than complexity. Our study provides insights into practical forecasting applications and contributes to the ongoing discussion on the trade-offs between traditional and modern forecasting methods.
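
Note: the hourly-binned idea in this abstract (one regression model per hour of day) can be sketched with ordinary least squares. To stay dependency-free, the example below uses temperature as the only covariate and a straight-line fit per hour. The paper itself reports using temperature and GHI with models such as XGBoost, so the model class, data layout, and function names here are simplifying assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, my - slope * mx


def fit_hourly_models(records):
    """records: iterable of (hour, temperature, load); one model per hour."""
    by_hour = {}
    for hour, temp, load in records:
        temps, loads = by_hour.setdefault(hour, ([], []))
        temps.append(temp)
        loads.append(load)
    return {h: fit_line(t, l) for h, (t, l) in by_hour.items()}


def predict(models, hour, temp):
    slope, intercept = models[hour]
    return slope * temp + intercept


# Toy data: each hour has its own temperature-load relationship.
data = [(0, t, 2 * t + 5) for t in (10, 20, 30)]
data += [(1, t, 50 - t) for t in (10, 20, 30)]
models = fit_hourly_models(data)
```

Because each hour gets its own model, no forecast depends on a previously predicted load value, which avoids the error accumulation the abstract mentions.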

Title: Finding Counterfactual Evidences for Node Classification

Authors: Dazhuo Qiu, Jinwen Chen, Arijit Khan, Yan Zhao, Francesco Bonchi
Subjects: cs.LG, cs.DB
Abstract URL: https://arxiv.org/abs/2505.11396
Pdf URL: https://arxiv.org/pdf/2505.11396
Copy Paste: [[2505.11396]] Finding Counterfactual Evidences for Node Classification(https://arxiv.org/abs/2505.11396)
Keywords: fair, interpretability
Abstract: Counterfactual learning is emerging as an important paradigm, rooted in causality, which promises to alleviate common issues of graph neural networks (GNNs), such as fairness and interpretability. However, as in many real-world application domains where conducting randomized controlled trials is impractical, one has to rely on available observational (factual) data to detect counterfactuals. In this paper, we introduce and tackle the problem of searching for counterfactual evidences for the GNN-based node classification task. A counterfactual evidence is a pair of nodes such that, although they exhibit great similarity both in the features and in their neighborhood subgraph structures, they are classified differently by the GNN. We develop effective and efficient search algorithms and a novel indexing solution that leverages both node features and structural information to identify counterfactual evidences, and generalizes beyond any specific GNN. Through various downstream applications, we demonstrate the potential of counterfactual evidences to enhance fairness and accuracy of GNNs.
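
Note: the definition of a counterfactual evidence given in this abstract can be made concrete with a brute-force search. The sketch below compares node feature vectors with cosine similarity and reports pairs that are highly similar yet receive different predicted labels. It ignores neighborhood structure and uses no index, so it is a simplified stand-in for the definition rather than the search algorithm the paper proposes. All names and the threshold are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def counterfactual_pairs(features, predictions, min_similarity):
    """Pairs of similar nodes that the classifier labels differently."""
    pairs = []
    for a, b in combinations(sorted(features), 2):
        if predictions[a] == predictions[b]:
            continue  # same label: not a counterfactual pair
        if cosine(features[a], features[b]) >= min_similarity:
            pairs.append((a, b))
    return pairs


# Toy graph: nodes "a" and "b" look alike but are classified differently.
features = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
predictions = {"a": 0, "b": 1, "c": 1}
found = counterfactual_pairs(features, predictions, min_similarity=0.9)
```

The quadratic number of comparisons in this naive version is one reason an indexing scheme becomes attractive on large graphs.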

Title: Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner

Authors: Wenchuan Zhang, Penghao Zhang, Jingru Guo, Tao Cheng, Jie Chen, Shuwan Zhang, Zhang Zhang, Yuhao Yi, Hong Bu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11404
Pdf URL: https://arxiv.org/pdf/2505.11404
Copy Paste: [[2505.11404]] Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner(https://arxiv.org/abs/2505.11404)
Keywords: robust
Abstract: Recent advances in vision language models (VLMs) have enabled broad progress in the general medical field. However, pathology still remains a more challenging subdomain, with current pathology-specific VLMs exhibiting limitations in both diagnostic accuracy and reasoning plausibility. Such shortcomings are largely attributable to the nature of current pathology datasets, which are primarily composed of image-description pairs that lack the depth and structured diagnostic paradigms employed by real-world pathologists. In this study, we leverage pathology textbooks and real-world pathology experts to construct high-quality, reasoning-oriented datasets. Building on this, we introduce Patho-R1, a multimodal RL-based pathology Reasoner, trained through a three-stage pipeline: (1) continued pretraining on 3.5 million image-text pairs for knowledge infusion; (2) supervised fine-tuning on 500k high-quality Chain-of-Thought samples for reasoning incentivizing; (3) reinforcement learning using Group Relative Policy Optimization and Decoupled Clip and Dynamic sAmpling Policy Optimization strategies for multimodal reasoning quality refinement. To further assess the alignment quality of our dataset, we propose PathoCLIP, trained on the same figure-caption corpus used for continued pretraining. Comprehensive experimental results demonstrate that both PathoCLIP and Patho-R1 achieve robust performance across a wide range of pathology-related tasks, including zero-shot classification, cross-modal retrieval, Visual Question Answering, and Multiple Choice Question. Our project is available at the Patho-R1 repository: this https URL.

Title: EmotionHallucer: Evaluating Emotion Hallucinations in Multimodal Large Language Models

Authors: Bohao Xing, Xin Liu, Guoying Zhao, Chengyu Liu, Xiaolan Fu, Heikki Kälviäinen
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.11405
Pdf URL: https://arxiv.org/pdf/2505.11405
Copy Paste: [[2505.11405]] EmotionHallucer: Evaluating Emotion Hallucinations in Multimodal Large Language Models(https://arxiv.org/abs/2505.11405)
Keywords: robust, large language model
Abstract: Emotion understanding is a critical yet challenging task. Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced their capabilities in this area. However, MLLMs often suffer from hallucinations, generating irrelevant or nonsensical content. To the best of our knowledge, despite the importance of this issue, there has been no dedicated effort to evaluate emotion-related hallucinations in MLLMs. In this work, we introduce EmotionHallucer, the first benchmark for detecting and analyzing emotion hallucinations in MLLMs. Unlike humans, whose emotion understanding stems from the interplay of biology and social learning, MLLMs rely solely on data-driven learning and lack innate emotional instincts. Fortunately, emotion psychology provides a solid foundation of knowledge about human emotions. Building on this, we assess emotion hallucinations from two dimensions: emotion psychology knowledge and real-world multimodal perception. To support robust evaluation, we utilize an adversarial binary question-answer (QA) framework, which employs carefully crafted basic and hallucinated pairs to assess the emotion hallucination tendencies of MLLMs. By evaluating 38 LLMs and MLLMs on EmotionHallucer, we reveal that: i) most current models exhibit substantial issues with emotion hallucinations; ii) closed-source models outperform open-source ones in detecting emotion hallucinations, and reasoning capability provides additional advantages; iii) existing models perform better in emotion psychology knowledge than in multimodal emotion perception. As a byproduct, these findings inspire us to propose the PEP-MEK framework, which yields an average improvement of 9.90% in emotion hallucination detection across selected models. Resources will be available at this https URL.

Title: Visual Planning: Let's Think Only with Images

Authors: Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, Ivan Vulić
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.11409
Pdf URL: https://arxiv.org/pdf/2505.11409
Copy Paste: [[2505.11409]] Visual Planning: Let's Think Only with Images(https://arxiv.org/abs/2505.11409)
Keywords: large language model
Abstract: Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations, independent of text. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), empowered by GRPO for post-training large vision models, leading to substantial improvements in planning in a selection of representative visual navigation tasks, FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising alternative to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.

Title: Is Grokking a Computational Glass Relaxation?

Authors: Xiaotian Zhang, Yue Shang, Entao Yang, Ge Zhang
Subjects: cs.LG, cond-mat.dis-nn
Abstract URL: https://arxiv.org/abs/2505.11411
Pdf URL: https://arxiv.org/pdf/2505.11411
Copy Paste: [[2505.11411]] Is Grokking a Computational Glass Relaxation?(https://arxiv.org/abs/2505.11411)
Keywords: transformer
Abstract: Understanding the generalizability of neural networks (NNs) remains a central question in deep learning research. The special phenomenon of grokking, where NNs abruptly generalize long after the training performance reaches a near-perfect level, offers a unique window to investigate the underlying mechanisms of NNs' generalizability. Here we propose an interpretation for grokking by framing it as a computational glass relaxation: viewing NNs as a physical system where parameters are the degrees of freedom and train loss is the system energy, we find that the memorization process resembles a rapid cooling of liquid into a non-equilibrium glassy state at low temperature, and the later generalization is like a slow relaxation towards a more stable configuration. This mapping enables us to sample NNs' Boltzmann entropy (density of states) landscape as a function of training loss and test accuracy. Our experiments in transformers on arithmetic tasks suggest that there is NO entropy barrier in the memorization-to-generalization transition of grokking, challenging previous theory that defines grokking as a first-order phase transition. We identify a high-entropy advantage under grokking, an extension of prior work linking entropy to generalizability but much more significant. Inspired by grokking's far-from-equilibrium nature, we develop a toy optimizer WanD based on Wang-Landau molecular dynamics, which can eliminate grokking without any constraints and find high-norm generalizing solutions. This provides strictly-defined counterexamples to theory attributing grokking solely to weight norm evolution towards the Goldilocks zone and also suggests new potential ways for optimizer design.

Title: Uncertainty quantification with approximate variational learning for wearable photoplethysmography prediction tasks

Authors: Ciaran Bench, Vivek Desai, Mohammad Moulaeifard, Nils Strodthoff, Philip Aston, Andrew Thompson
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2505.11412
Pdf URL: https://arxiv.org/pdf/2505.11412
Copy Paste: [[2505.11412]] Uncertainty quantification with approximate variational learning for wearable photoplethysmography prediction tasks(https://arxiv.org/abs/2505.11412)
Keywords: interpretability
Abstract: Photoplethysmography (PPG) signals encode information about relative changes in blood volume that can be used to assess various aspects of cardiac health non-invasively, e.g. to detect atrial fibrillation (AF) or predict blood pressure (BP). Deep networks are well-equipped to handle the large quantities of data acquired from wearable measurement devices. However, they lack interpretability and are prone to overfitting, leaving considerable risk for poor performance on unseen data and misdiagnosis. Here, we describe the use of two scalable uncertainty quantification techniques: Monte Carlo Dropout and the recently proposed Improved Variational Online Newton. These techniques are used to assess the trustworthiness of models trained to perform AF classification and BP regression from raw PPG time series. We find that the choice of hyperparameters has a considerable effect on the predictive performance of the models and on the quality and composition of predicted uncertainties. For example, the stochasticity of the model parameter sampling determines the proportion of the total uncertainty that is aleatoric, and has varying effects on predictive performance and calibration quality dependent on the chosen uncertainty quantification technique and the chosen expression of uncertainty. We find significant discrepancy in the quality of uncertainties over the predicted classes, emphasising the need for a thorough evaluation protocol that assesses local and adaptive calibration. This work suggests that the choice of hyperparameters must be carefully tuned to balance predictive performance and calibration quality, and that the optimal parameterisation may vary depending on the chosen expression of uncertainty.
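Illustrative sketch (not from the paper): one common way to split the uncertainty of a sampled classifier (for example, several stochastic forward passes under Monte Carlo Dropout) is entropy-based. Total uncertainty is the entropy of the averaged prediction, the aleatoric part is the average entropy of the individual predictions, and the epistemic part is the difference. Whether the authors use this exact expression is an assumption; the function names and the sample probabilities are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in nats of one class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def decompose_uncertainty(samples):
    """Split predictive uncertainty for one input.

    `samples` holds one class-probability vector per stochastic forward pass.
    Returns (total, aleatoric, epistemic) in nats.
    """
    n = len(samples)
    mean_probs = [sum(column) / n for column in zip(*samples)]
    total = entropy(mean_probs)
    aleatoric = sum(entropy(p) for p in samples) / n
    return total, aleatoric, total - aleatoric


# Hypothetical AF / non-AF predictions from two disagreeing forward passes.
total, aleatoric, epistemic = decompose_uncertainty([[0.9, 0.1], [0.1, 0.9]])
aleatoric_share = aleatoric / total
```

When the sampled passes agree, the epistemic term shrinks towards zero and nearly all of the total is attributed to aleatoric uncertainty, which is one way the sampling stochasticity can change the reported proportion.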

Title: CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs

Authors: Sijia Chen, Xiaomin Li, Mengxue Zhang, Eric Hanchen Jiang, Qingcheng Zeng, Chen-Hsiang Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11413
Pdf URL: https://arxiv.org/pdf/2505.11413
Copy Paste: [[2505.11413]] CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs(https://arxiv.org/abs/2505.11413)
Keywords: attack, robust, large language model
Abstract: Large language models (LLMs) are increasingly deployed in medical contexts, raising critical concerns about safety, alignment, and susceptibility to adversarial manipulation. While prior benchmarks assess model refusal capabilities for harmful prompts, they often lack clinical specificity, graded harmfulness levels, and coverage of jailbreak-style attacks. We introduce CARES (Clinical Adversarial Robustness and Evaluation of Safety), a benchmark for evaluating LLM safety in healthcare. CARES includes over 18,000 prompts spanning eight medical safety principles, four harm levels, and four prompting styles: direct, indirect, obfuscated, and role-play, to simulate both malicious and benign use cases. We propose a three-way response evaluation protocol (Accept, Caution, Refuse) and a fine-grained Safety Score metric to assess model behavior. Our analysis reveals that many state-of-the-art LLMs remain vulnerable to jailbreaks that subtly rephrase harmful prompts, while also over-refusing safe but atypically phrased queries. Finally, we propose a mitigation strategy using a lightweight classifier to detect jailbreak attempts and steer models toward safer behavior via reminder-based conditioning. CARES provides a rigorous framework for testing and improving medical LLM safety under adversarial and ambiguous conditions.

Title: MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems

Authors: Yinsicheng Jiang, Yao Fu, Yeqi Huang, Ping Nie, Zhan Lu, Leyang Xue, Congjie He, Man-Kit Sit, Jilong Xue, Li Dong, Ziming Miao, Dayou Du, Tairan Xu, Kai Zou, Edoardo Ponti, Luo Mai
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2505.11415
Pdf URL: https://arxiv.org/pdf/2505.11415
Copy Paste: [[2505.11415]] MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems(https://arxiv.org/abs/2505.11415)
Keywords: large language model
Abstract: The sparse Mixture-of-Experts (MoE) architecture is increasingly favored for scaling Large Language Models (LLMs) efficiently, but it depends on heterogeneous compute and memory resources. These factors jointly affect system Cost, Accuracy, and Performance (CAP), making trade-offs inevitable. Existing benchmarks often fail to capture these trade-offs accurately, complicating practical deployment decisions. To address this, we introduce MoE-CAP, a benchmark specifically designed for MoE systems. Our analysis reveals that achieving an optimal balance across CAP is difficult with current hardware; MoE systems typically optimize two of the three dimensions at the expense of the third, a dynamic we term the MoE-CAP trade-off. To visualize this, we propose the CAP Radar Diagram. We further introduce two sparsity-aware performance metrics, Sparse Memory Bandwidth Utilization (S-MBU) and Sparse Model FLOPS Utilization (S-MFU), to enable accurate performance benchmarking of MoE systems across diverse hardware platforms and deployment scenarios.

Title: When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs

Authors: Xiaomin Li, Zhou Yu, Zhiwei Zhang, Xupeng Chen, Ziji Zhang, Yingying Zhuang, Narayanan Sadagopan, Anurag Beniwal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11423
Pdf URL: https://arxiv.org/pdf/2505.11423
Copy Paste: [[2505.11423]] When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs(https://arxiv.org/abs/2505.11423)
Keywords: large language model
Abstract: Reasoning-enhanced large language models (RLLMs), whether explicitly trained for reasoning or prompted via chain-of-thought (CoT), have achieved state-of-the-art performance on many complex reasoning tasks. However, we uncover a surprising and previously overlooked phenomenon: explicit CoT reasoning can significantly degrade instruction-following accuracy. Evaluating 15 models on two benchmarks: IFEval (with simple, rule-verifiable constraints) and ComplexBench (with complex, compositional constraints), we consistently observe performance drops when CoT prompting is applied. Through large-scale case studies and an attention-based analysis, we identify common patterns where reasoning either helps (e.g., with formatting or lexical precision) or hurts (e.g., by neglecting simple constraints or introducing unnecessary content). We propose a metric, constraint attention, to quantify model focus during generation and show that CoT reasoning often diverts attention away from instruction-relevant tokens. To mitigate these effects, we introduce and evaluate four strategies: in-context learning, self-reflection, self-selective reasoning, and classifier-selective reasoning. Our results demonstrate that selective reasoning strategies, particularly classifier-selective reasoning, can substantially recover lost performance. To our knowledge, this is the first work to systematically expose reasoning-induced failures in instruction-following and offer practical mitigation strategies.

Title: Improving Object Detection Performance through YOLOv8: A Comprehensive Training and Evaluation Study

Authors: Rana Poureskandar, Shiva Razzagzadeh
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.11424
Pdf URL: https://arxiv.org/pdf/2505.11424
Copy Paste: [[2505.11424]] Improving Object Detection Performance through YOLOv8: A Comprehensive Training and Evaluation Study(https://arxiv.org/abs/2505.11424)
Keywords: segmentation
Abstract: This study evaluated the performance of a YOLOv8-based segmentation model for detecting and segmenting wrinkles in facial images.

Title: MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production

Authors: Chao Jin, Ziheng Jiang, Zhihao Bai, Zheng Zhong, Juncai Liu, Xiang Li, Ningxin Zheng, Xi Wang, Cong Xie, Wen Heng, Yiyuan Ma, Wenlei Bao, Size Zheng, Yanghua Peng, Haibin Lin, Xuanzhe Liu, Xin Jin, Xin Liu
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2505.11432
Pdf URL: https://arxiv.org/pdf/2505.11432
Copy Paste: [[2505.11432]] MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production(https://arxiv.org/abs/2505.11432)
Keywords: large language model
Abstract: We present MegaScale-MoE, a production system tailored for the efficient training of large-scale mixture-of-experts (MoE) models. MoE emerges as a promising architecture to scale large language models (LLMs) to unprecedented sizes, thereby enhancing model performance. However, existing MoE training systems experience a degradation in training efficiency, exacerbated by the escalating scale of MoE models and the continuous evolution of hardware. Recognizing the pivotal role of efficient communication in enhancing MoE training, MegaScale-MoE customizes communication-efficient parallelism strategies for attention and FFNs in each MoE layer and adopts a holistic approach to overlap communication with computation at both inter- and intra-operator levels. Additionally, MegaScale-MoE applies communication compression with adjusted communication patterns to lower precision, further improving training efficiency. When training a 352B MoE model on 1,440 NVIDIA Hopper GPUs, MegaScale-MoE achieves a training throughput of 1.41M tokens/s, improving the efficiency by 1.88× compared to Megatron-LM. We share our operational experience in accelerating MoE training and hope that by offering our insights in system design, this work will motivate future research in MoE systems.

Title: GODBench: A Benchmark for Multimodal Large Language Models in Video Comment Art

Authors: Chenkai Zhang, Yiming Lei, Zeming Liu, Haitao Leng, Shaoguo Liu, Tingting Gao, Qingjie Liu, Yunhong Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11436
Pdf URL: https://arxiv.org/pdf/2505.11436
Copy Paste: [[2505.11436]] GODBench: A Benchmark for Multimodal Large Language Models in Video Comment Art(https://arxiv.org/abs/2505.11436)
Keywords: large language model
Abstract: Video Comment Art enhances user engagement by providing creative content that conveys humor, satire, or emotional resonance, requiring a nuanced and comprehensive grasp of cultural and contextual subtleties. Although Multimodal Large Language Models (MLLMs) and Chain-of-Thought (CoT) have demonstrated strong reasoning abilities in STEM tasks (e.g. mathematics and coding), they still struggle to generate creative expressions such as resonant jokes and insightful satire. Moreover, existing benchmarks are constrained by their limited modalities and insufficient categories, hindering the exploration of comprehensive creativity in video-based Comment Art creation. To address these limitations, we introduce GODBench, a novel benchmark that integrates video and text modalities to systematically evaluate MLLMs' abilities to compose Comment Art. Furthermore, inspired by the propagation patterns of waves in physics, we propose Ripple of Thought (RoT), a multi-step reasoning framework designed to enhance the creativity of MLLMs. Extensive experiments reveal that existing MLLMs and CoT methods still face significant challenges in understanding and generating creative video comments. In contrast, RoT provides an effective approach to improve creative composing, highlighting its potential to drive meaningful advancements in MLLM-based creativity. GODBench is publicly available at this https URL.

Title: SurgPose: Generalisable Surgical Instrument Pose Estimation using Zero-Shot Learning and Stereo Vision

Authors: Utsav Rai, Haozheng Xu, Stamatia Giannarou
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2505.11439
Pdf URL: https://arxiv.org/pdf/2505.11439
Copy Paste: [[2505.11439]] SurgPose: Generalisable Surgical Instrument Pose Estimation using Zero-Shot Learning and Stereo Vision(https://arxiv.org/abs/2505.11439)
Keywords: robust, segmentation
Abstract: Accurate pose estimation of surgical tools in Robot-assisted Minimally Invasive Surgery (RMIS) is essential for surgical navigation and robot control. While traditional marker-based methods offer accuracy, they face challenges with occlusions, reflections, and tool-specific designs. Similarly, supervised learning methods require extensive training on annotated datasets, limiting their adaptability to new tools. Despite their success in other domains, zero-shot pose estimation models remain unexplored in RMIS for pose estimation of surgical instruments, creating a gap in generalising to unseen surgical tools. This paper presents a novel 6 Degrees of Freedom (DoF) pose estimation pipeline for surgical instruments, leveraging state-of-the-art zero-shot RGB-D models such as FoundationPose and SAM-6D. We advanced these models by incorporating vision-based depth estimation using the RAFT-Stereo method for robust depth estimation in reflective and textureless environments. Additionally, we enhanced SAM-6D by replacing its instance segmentation module, Segment Anything Model (SAM), with a fine-tuned Mask R-CNN, significantly boosting segmentation accuracy in occluded and complex conditions. Extensive validation reveals that our enhanced SAM-6D surpasses FoundationPose in zero-shot pose estimation of unseen surgical instruments, setting a new benchmark for zero-shot RGB-D pose estimation in RMIS. This work enhances the generalisability of pose estimation for unseen objects and pioneers the application of RGB-D zero-shot methods in RMIS.

Title: Is Compression Really Linear with Code Intelligence?

Authors: Xianzhen Luo, Shijie Xuyang, Tianhao Cheng, Zheng Chu, Houyi Li, ziqi wang, Siming Huang, Qingfu Zhu, Qiufeng Wang, Xiangyu Zhang, Shuigeng Zhou, Wanxiang Che
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11441
Pdf URL: https://arxiv.org/pdf/2505.11441
Copy Paste: [[2505.11441]] Is Compression Really Linear with Code Intelligence?(https://arxiv.org/abs/2505.11441)
Keywords: robust, fair, large language model
Abstract: Understanding the relationship between data compression and the capabilities of Large Language Models (LLMs) is crucial, especially in specialized domains like code intelligence. Prior work posited a linear relationship between compression and general intelligence. However, it overlooked the multifaceted nature of code that encompasses diverse programming languages and tasks, and struggled with fair evaluation of modern Code LLMs. We address this by evaluating a diverse array of open-source Code LLMs on comprehensive multi-language, multi-task code benchmarks. To address the challenge of efficient and fair evaluation of pre-trained LLMs' code intelligence, we introduce Format Annealing, a lightweight, transparent training methodology designed to assess the intrinsic capabilities of these pre-trained models equitably. Compression efficacy, measured as bits-per-character (BPC), is determined using a novel, large-scale, and previously unseen code validation set derived from GitHub. Our empirical results reveal a fundamental logarithmic relationship between measured code intelligence and BPC. This finding refines prior hypotheses of linearity, which we suggest are likely observations of the logarithmic curve's tail under specific, limited conditions. Our work provides a more nuanced understanding of compression's role in developing code intelligence and contributes a robust evaluation framework in the code domain.
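Illustrative sketch (not from the paper): bits-per-character is conventionally the model's total negative log-likelihood of a text, expressed in bits, divided by the number of characters in that text. The function name and the toy inputs below are assumptions; the abstract does not specify the authors' tokenization or validation-set handling.

```python
import math


def bits_per_character(token_logprobs, text):
    """Compute BPC from per-token natural-log probabilities.

    `token_logprobs` are the log-probabilities (base e) a model assigned to
    each token of `text`; dividing by ln(2) converts nats to bits.
    """
    if not text:
        raise ValueError("text must contain at least one character")
    total_bits = -sum(token_logprobs) / math.log(2)
    return total_bits / len(text)


# Hypothetical case: 4 tokens, each predicted with probability 0.5,
# covering an 8-character code snippet.
bpc = bits_per_character([math.log(0.5)] * 4, "def f():")
```

Lower BPC means better compression. Normalizing by characters rather than tokens is what keeps the figure comparable across models with different tokenizers.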

Title: A Generative Framework for Causal Estimation via Importance-Weighted Diffusion Distillation

Authors: Xinran Song, Tianyu Chen, Mingyuan Zhou
Subjects: cs.LG, stat.AP, stat.ME, stat.ML
Abstract URL: https://arxiv.org/abs/2505.11444
Pdf URL: https://arxiv.org/pdf/2505.11444
Copy Paste: [[2505.11444]] A Generative Framework for Causal Estimation via Importance-Weighted Diffusion Distillation(https://arxiv.org/abs/2505.11444)
Keywords: diffusion, generative
Abstract: Estimating individualized treatment effects from observational data is a central challenge in causal inference, largely due to covariate imbalance and confounding bias from non-randomized treatment assignment. While inverse probability weighting (IPW) is a well-established solution to this problem, its integration into modern deep learning frameworks remains limited. In this work, we propose Importance-Weighted Diffusion Distillation (IWDD), a novel generative framework that combines the pretraining of diffusion models with importance-weighted score distillation to enable accurate and fast causal estimation, including potential outcome prediction and treatment effect estimation. We demonstrate how IPW can be naturally incorporated into the distillation of pretrained diffusion models, and further introduce a randomization-based adjustment that eliminates the need to compute IPW explicitly, thereby simplifying computation and, more importantly, provably reducing the variance of gradient estimates. Empirical results show that IWDD achieves state-of-the-art out-of-sample prediction performance, with the highest win rates compared to other baselines, significantly improving causal estimation and supporting the development of individualized treatment strategies. We will release our PyTorch code for reproducibility and future research.
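Illustrative sketch (not from the paper): the classical inverse probability weighting estimator of an average treatment effect reweights each observed outcome by the inverse of the probability of the treatment it actually received. This is the textbook estimator that the abstract refers to as IPW, not the IWDD method itself; the function name and the toy data are assumptions, and the propensity scores are taken as given rather than estimated.

```python
def ipw_average_treatment_effect(outcomes, treatments, propensities):
    """Horvitz-Thompson style IPW estimate of the average treatment effect.

    outcomes:     observed outcome y_i for each unit
    treatments:   1 if unit i was treated, 0 otherwise
    propensities: P(treated | covariates) for each unit, strictly in (0, 1)
    """
    n = len(outcomes)
    treated_term = sum(
        t * y / e for y, t, e in zip(outcomes, treatments, propensities)
    )
    control_term = sum(
        (1 - t) * y / (1 - e)
        for y, t, e in zip(outcomes, treatments, propensities)
    )
    return (treated_term - control_term) / n


# Hypothetical randomized-like data: every unit has propensity 0.5.
ate = ipw_average_treatment_effect(
    outcomes=[3.0, 1.0, 4.0, 2.0],
    treatments=[1, 0, 1, 0],
    propensities=[0.5, 0.5, 0.5, 0.5],
)
```

With constant propensity 0.5 and equally sized groups, the estimate reduces to the difference between the treated and control means. Propensities close to 0 or 1 inflate individual weights, which is the variance problem that motivates avoiding explicit IPW computation.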

Title: LLMs unlock new paths to monetizing exploits

Authors: Nicholas Carlini, Milad Nasr, Edoardo Debenedetti, Barry Wang, Christopher A. Choquette-Choo, Daphne Ippolito, Florian Tramèr, Matthew Jagielski
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11449
Pdf URL: https://arxiv.org/pdf/2505.11449
Copy Paste: [[2505.11449]] LLMs unlock new paths to monetizing exploits(https://arxiv.org/abs/2505.11449)
Keywords: defense, attack, fair, large language model
Abstract: We argue that large language models (LLMs) will soon alter the economics of cyberattacks. Instead of attacking the most commonly used software and monetizing exploits by targeting the lowest common denominator among victims, LLMs enable adversaries to launch tailored attacks on a user-by-user basis. On the exploitation front, instead of human attackers manually searching for one difficult-to-identify bug in a product with millions of users, LLMs can find thousands of easy-to-identify bugs in products with thousands of users. And on the monetization front, instead of generic ransomware that always performs the same attack (encrypt all your data and request payment to decrypt), an LLM-driven ransomware attack could tailor the ransom demand based on the particular content of each exploited device. We show that these two attacks (and several others) are imminently practical using state-of-the-art LLMs. For example, we show that without any human intervention, an LLM finds highly sensitive personal information in the Enron email dataset (e.g., an executive having an affair with another employee) that could be used for blackmail. While some of our attacks are still too expensive to scale widely today, the incentives to implement these attacks will only increase as LLMs get cheaper. Thus, we argue that LLMs create a need for new defense-in-depth approaches.

Title: HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation

Authors: Shaina Raza, Aravind Narayanan, Vahid Reza Khazaie, Ashmal Vayani, Mukund S. Chettiar, Amandeep Singh, Mubarak Shah, Deval Pandya
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11454
Pdf URL: https://arxiv.org/pdf/2505.11454
Copy Paste: [[2505.11454]] HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation(https://arxiv.org/abs/2505.11454)
Keywords: robust, fair
Abstract: Large multimodal models (LMMs) now excel on many vision-language benchmarks; however, they still struggle with human-centered criteria such as fairness, ethics, empathy, and inclusivity, which are key to aligning with human values. We introduce HumaniBench, a holistic benchmark of 32K real-world image-question pairs, annotated via a scalable GPT4o-assisted pipeline and exhaustively verified by domain experts. HumaniBench evaluates seven Human Centered AI (HCAI) principles: fairness, ethics, understanding, reasoning, language inclusivity, empathy, and robustness, across seven diverse tasks, including open- and closed-ended visual question answering (VQA), multilingual QA, visual grounding, empathetic captioning, and robustness tests. Benchmarking 15 state-of-the-art LMMs (open- and closed-source) reveals that proprietary models generally lead, though robustness and visual grounding remain weak points. Some open-source models also struggle to balance accuracy with adherence to human-aligned principles. HumaniBench is the first benchmark purpose-built around HCAI principles. It provides a rigorous testbed for diagnosing alignment gaps and guiding LMMs toward behavior that is both accurate and socially responsible. Dataset, annotation prompts, and evaluation code are available at: this https URL

Title: ProxyPrompt: Securing System Prompts against Prompt Extraction Attacks

Authors: Zhixiong Zhuang, Maria-Irina Nicolae, Hui-Po Wang, Mario Fritz
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.11459
Pdf URL: https://arxiv.org/pdf/2505.11459
Copy Paste: [[2505.11459]] ProxyPrompt: Securing System Prompts against Prompt Extraction Attacks(https://arxiv.org/abs/2505.11459)
Keywords: security, protect, defense, attack, extraction, large language model
Abstract: The integration of large language models (LLMs) into a wide range of applications has highlighted the critical role of well-crafted system prompts, which require extensive testing and domain expertise. These prompts enhance task performance but may also encode sensitive information and filtering criteria, posing security risks if exposed. Recent research shows that system prompts are vulnerable to extraction attacks, while existing defenses are either easily bypassed or require constant updates to address new threats. In this work, we introduce ProxyPrompt, a novel defense mechanism that prevents prompt leakage by replacing the original prompt with a proxy. This proxy maintains the original task's utility while obfuscating the extracted prompt, ensuring attackers cannot reproduce the task or access sensitive information. Comprehensive evaluations on 264 LLM and system prompt pairs show that ProxyPrompt protects 94.70% of prompts from extraction attacks, outperforming the next-best defense, which only achieves 42.80%.

Title: Disentangling Reasoning and Knowledge in Medical Large Language Models

Authors: Rahul Thapa, Qingyang Wu, Kevin Wu, Harrison Zhang, Angela Zhang, Eric Wu, Haotian Ye, Suhana Bedi, Nevin Aresh, Joseph Boen, Shriya Reddy, Ben Athiwaratkun, Shuaiwen Leon Song, James Zou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11462
Pdf URL: https://arxiv.org/pdf/2505.11462
Copy Paste: [[2505.11462]] Disentangling Reasoning and Knowledge in Medical Large Language Models(https://arxiv.org/abs/2505.11462)
Keywords: robust, large language model
Abstract: Medical reasoning in large language models (LLMs) aims to emulate clinicians' diagnostic thinking, but current benchmarks such as MedQA-USMLE, MedMCQA, and PubMedQA often mix reasoning with factual recall. We address this by separating 11 biomedical QA benchmarks into reasoning- and knowledge-focused subsets using a PubMedBERT classifier that reaches 81 percent accuracy, comparable to human performance. Our analysis shows that only 32.8 percent of questions require complex reasoning. We evaluate biomedical models (HuatuoGPT-o1, MedReason, m1) and general-domain models (DeepSeek-R1, o4-mini, Qwen3), finding consistent gaps between knowledge and reasoning performance. For example, m1 scores 60.5 on knowledge but only 47.1 on reasoning. In adversarial tests where models are misled with incorrect initial reasoning, biomedical models degrade sharply, while larger or RL-trained general models show more robustness. To address this, we train BioMed-R1 using fine-tuning and reinforcement learning on reasoning-heavy examples. It achieves the strongest performance among similarly sized models. Further gains may come from incorporating clinical case reports and training with adversarial and backtracking scenarios.

Title: PSDiffusion: Harmonized Multi-Layer Image Generation via Layout and Appearance Alignment

Authors: Dingbang Huang, Wenbo Li, Yifei Zhao, Xinyu Pan, Yanhong Zeng, Bo Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.11468
Pdf URL: https://arxiv.org/pdf/2505.11468
Copy Paste: [[2505.11468]] PSDiffusion: Harmonized Multi-Layer Image Generation via Layout and Appearance Alignment(https://arxiv.org/abs/2505.11468)
Keywords: diffusion
Abstract: Diffusion models have made remarkable advancements in generating high-quality images from textual descriptions. Recent works like LayerDiffuse have extended the previous single-layer, unified image generation paradigm to transparent image layer generation. However, existing multi-layer generation methods fail to handle the interactions among multiple layers, such as rational global layout, physics-plausible contacts, and visual effects like shadows and reflections, while maintaining high alpha quality. To solve this problem, we propose PSDiffusion, a unified diffusion framework for simultaneous multi-layer text-to-image generation. Our model can automatically generate multi-layer images with one RGB background and multiple RGBA foregrounds through a single feed-forward process. Unlike existing methods that combine multiple tools for post-decomposition or generate layers sequentially and separately, our method introduces a global-layer interactive mechanism that generates layered images concurrently and collaboratively, ensuring not only high quality and completeness for each layer, but also spatial and visual interactions among layers for global coherence.

Title: No Gold Standard, No Problem: Reference-Free Evaluation of Taxonomies

Authors: Pascal Wullschleger, Majid Zarharan, Donnacha Daly, Marc Pouly, Jennifer Foster
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11470
Pdf URL: https://arxiv.org/pdf/2505.11470
Copy Paste: [[2505.11470]] No Gold Standard, No Problem: Reference-Free Evaluation of Taxonomies(https://arxiv.org/abs/2505.11470)
Keywords: robust
Abstract: We introduce two reference-free metrics for quality evaluation of taxonomies. The first metric evaluates robustness by calculating the correlation between semantic and taxonomic similarity, covering a type of error not handled by existing metrics. The second uses Natural Language Inference to assess logical adequacy. Both metrics are tested on five taxonomies and are shown to correlate well with F1 against gold-standard taxonomies.
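
Illustrative sketch (not the authors' implementation): the first metric is described as a correlation between semantic and taxonomic similarity. The Python below shows one way such a score could be computed over concept pairs. The toy taxonomy, the path-based taxonomic similarity, the hand-assigned semantic scores, and the choice of Pearson correlation are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy taxonomy: child -> parent (the root maps to None).
PARENT = {
    "animal": None,
    "mammal": "animal",
    "bird": "animal",
    "dog": "mammal",
    "cat": "mammal",
    "sparrow": "bird",
}

# Hypothetical semantic similarities (in practice these might come from embeddings).
SEMANTIC = {
    frozenset(("dog", "cat")): 0.80,
    frozenset(("dog", "sparrow")): 0.30,
    frozenset(("cat", "sparrow")): 0.35,
}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root, `node` first."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def taxonomic_similarity(a, b):
    """Assumed measure: 1 / (1 + edges on the path through the lowest common ancestor)."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    common = next(n for n in path_a if n in path_b)
    distance = path_a.index(common) + path_b.index(common)
    return 1.0 / (1.0 + distance)


def semantic_similarity(a, b):
    """Look up the hand-assigned semantic score for an unordered pair."""
    return SEMANTIC[frozenset((a, b))]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def robustness_score(concepts):
    """Correlate semantic and taxonomic similarity over all concept pairs."""
    pairs = list(combinations(concepts, 2))
    sem = [semantic_similarity(a, b) for a, b in pairs]
    tax = [taxonomic_similarity(a, b) for a, b in pairs]
    return pearson(sem, tax)


score = robustness_score(["dog", "cat", "sparrow"])
print(f"correlation between semantic and taxonomic similarity: {score:.3f}")
```

In this toy setup, a taxonomy that places semantically close concepts near each other yields a correlation close to 1, while a taxonomy that misplaces them would lower the score.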

Title: HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages

Authors: Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Hoo-Chang Shin, Felipe Soares, Alexander Bukharin, Ellie Evans, Yi Dong, Oleksii Kuchaiev
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.11475
Pdf URL: https://arxiv.org/pdf/2505.11475
Copy Paste: [[2505.11475]] HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages(https://arxiv.org/abs/2505.11475)
Keywords: generative, large language model
Abstract: Preference datasets are essential for training general-domain, instruction-following language models with Reinforcement Learning from Human Feedback (RLHF). Each subsequent data release raises expectations for future data collection, meaning there is a constant need to advance the quality and diversity of openly available preference data. To address this need, we introduce HelpSteer3-Preference, a permissively licensed (CC-BY-4.0), high-quality, human-annotated preference dataset comprising over 40,000 samples. These samples span diverse real-world applications of large language models (LLMs), including tasks relating to STEM, coding and multilingual scenarios. Using HelpSteer3-Preference, we train Reward Models (RMs) that achieve top performance on RM-Bench (82.4%) and JudgeBench (73.7%). This represents a substantial improvement (~10% absolute) over the previously best-reported results from existing RMs. We demonstrate HelpSteer3-Preference can also be applied to train Generative RMs and how policy models can be aligned with RLHF using our RMs. Dataset (CC-BY-4.0): this https URL

Title: Improving Assembly Code Performance with Large Language Models via Reinforcement Learning

Authors: Anjiang Wei, Tarun Suresh, Huanmi Tan, Yinglun Xu, Gagandeep Singh, Ke Wang, Alex Aiken
Subjects: cs.CL, cs.AI, cs.PF, cs.PL, cs.SE
Abstract URL: https://arxiv.org/abs/2505.11480
Pdf URL: https://arxiv.org/pdf/2505.11480
Copy Paste: [[2505.11480]] Improving Assembly Code Performance with Large Language Models via Reinforcement Learning(https://arxiv.org/abs/2505.11480)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated strong performance across a wide range of programming tasks, yet their potential for code optimization remains underexplored. This work investigates whether LLMs can optimize the performance of assembly code, where fine-grained control over execution enables improvements that are difficult to express in high-level languages. We present a reinforcement learning framework that trains LLMs using Proximal Policy Optimization (PPO), guided by a reward function that considers both functional correctness, validated through test cases, and execution performance relative to the industry-standard compiler gcc -O3. To support this study, we introduce a benchmark of 8,072 real-world programs. Our model, Qwen2.5-Coder-7B-PPO, achieves 96.0% test pass rates and an average speedup of 1.47x over the gcc -O3 baseline, outperforming all 20 other models evaluated, including Claude-3.7-sonnet. These results indicate that reinforcement learning can unlock the potential of LLMs to serve as effective optimizers for assembly code performance.
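
Illustrative sketch (not the authors' reward function): the abstract describes a reward that considers both functional correctness, checked through test cases, and execution performance relative to gcc -O3, but does not give its exact form. The Python below shows one simple possibility, assuming a reward of zero unless every test passes and otherwise the speedup over the baseline. The function name, parameters, and gating rule are hypothetical.

```python
def speedup_reward(tests_passed: int, tests_total: int,
                   baseline_seconds: float, candidate_seconds: float) -> float:
    """Hypothetical reward: 0 if any test fails, else speedup over the baseline.

    baseline_seconds is the runtime of the reference build (e.g. gcc -O3 output);
    candidate_seconds is the runtime of the model-generated assembly.
    """
    if tests_total <= 0:
        raise ValueError("tests_total must be positive")
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("runtimes must be positive")
    # Assumed gating rule: an incorrect program earns no reward, however fast.
    if tests_passed < tests_total:
        return 0.0
    # Speedup greater than 1.0 means the candidate beats the baseline.
    return baseline_seconds / candidate_seconds


# A correct candidate that runs in 1.0 s against a 1.5 s baseline.
print(speedup_reward(10, 10, baseline_seconds=1.5, candidate_seconds=1.0))
# A faster but incorrect candidate.
print(speedup_reward(9, 10, baseline_seconds=1.5, candidate_seconds=0.5))
```

Gating on correctness first keeps the policy from trading correctness for speed; other designs, such as partial credit for the fraction of tests passed, are equally plausible.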

Title: Unsupervised Detection of Distribution Shift in Inverse Problems using Diffusion Models

Authors: Shirin Shoushtari, Edward P. Chandler, Yuanhao Wang, M. Salman Asif, Ulugbek S. Kamilov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.11482
Pdf URL: https://arxiv.org/pdf/2505.11482
Copy Paste: [[2505.11482]] Unsupervised Detection of Distribution Shift in Inverse Problems using Diffusion Models(https://arxiv.org/abs/2505.11482)
Keywords: diffusion
Abstract: Diffusion models are widely used as priors in imaging inverse problems. However, their performance often degrades under distribution shifts between the training and test-time images. Existing methods for identifying and quantifying distribution shifts typically require access to clean test images, which are almost never available while solving inverse problems (at test time). We propose a fully unsupervised metric for estimating distribution shifts using only indirect (corrupted) measurements and score functions from diffusion models trained on different datasets. We theoretically show that this metric estimates the KL divergence between the training and test image distributions. Empirically, we show that our score-based metric, using only corrupted measurements, closely approximates the KL divergence computed from clean images. Motivated by this result, we show that aligning the out-of-distribution score with the in-distribution score -- using only corrupted measurements -- reduces the KL divergence and leads to improved reconstruction quality across multiple inverse problems.

Title: msf-CNN: Patch-based Multi-Stage Fusion with Convolutional Neural Networks for TinyML

Authors: Zhaolan Huang, Emmanuel Baccelli
Subjects: cs.LG, cs.PF
Abstract URL: https://arxiv.org/abs/2505.11483
Pdf URL: https://arxiv.org/pdf/2505.11483
Copy Paste: [[2505.11483]] msf-CNN: Patch-based Multi-Stage Fusion with Convolutional Neural Networks for TinyML(https://arxiv.org/abs/2505.11483)
Keywords: large language model
Abstract: AI spans from large language models to tiny models running on microcontrollers (MCUs). Extremely memory-efficient model architectures are essential to fit within an MCU's tiny memory budget, e.g., 128 kB of RAM. However, inference latency must remain small to fit real-time constraints. An approach to tackle this is patch-based fusion, which aims to optimize data flows across neural network layers. In this paper, we introduce msf-CNN, a novel technique that efficiently finds optimal fusion settings for convolutional neural networks (CNNs) by walking through the fusion solution space represented as a directed acyclic graph. Compared to previous work on CNN fusion for MCUs, msf-CNN identifies a wider set of solutions. We published an implementation of msf-CNN running on various microcontrollers (ARM Cortex-M, RISC-V, ESP32). We show that msf-CNN can achieve inference using 50% less RAM compared to the prior art (MCUNetV2 and StreamNet). We thus demonstrate how msf-CNN offers additional flexibility for system designers.

Title: Modeling cognitive processes of natural reading with transformer-based Language Models

Authors: Bruno Bianchi, Fermín Travi, Juan E. Kamienkowski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11485
Pdf URL: https://arxiv.org/pdf/2505.11485
Copy Paste: [[2505.11485]] Modeling cognitive processes of natural reading with transformer-based Language Models(https://arxiv.org/abs/2505.11485)
Keywords: transformer
Abstract: Recent advances in Natural Language Processing (NLP) have led to the development of highly sophisticated language models for text generation. In parallel, neuroscience has increasingly employed these models to explore cognitive processes involved in language comprehension. Previous research has shown that models such as N-grams and LSTM networks can partially account for predictability effects in explaining eye movement behaviors, specifically Gaze Duration, during reading. In this study, we extend these findings by evaluating transformer-based models (GPT2, LLaMA-7B, and LLaMA2-7B) to further investigate this relationship. Our results indicate that these architectures outperform earlier models in explaining the variance in Gaze Durations recorded from Rioplantense Spanish readers. However, similar to previous studies, these models still fail to account for the entirety of the variance captured by human predictability. These findings suggest that, despite their advancements, state-of-the-art language models continue to predict language in ways that differ from human readers.

Title: QVGen: Pushing the Limit of Quantized Video Generative Models

Authors: Yushi Huang, Ruihao Gong, Jing Liu, Yifu Ding, Chengtao Lv, Haotong Qin, Jun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.11497
Pdf URL: https://arxiv.org/pdf/2505.11497
Copy Paste: [[2505.11497]] QVGen: Pushing the Limit of Quantized Video Generative Models(https://arxiv.org/abs/2505.11497)
Keywords: diffusion, generative
Abstract: Video diffusion models (DMs) have enabled high-quality video synthesis. Yet, their substantial computational and memory demands pose serious challenges to real-world deployment, even on high-end GPUs. As a commonly adopted solution, quantization has shown notable success in reducing cost for image DMs, while its direct application to video DMs remains ineffective. In this paper, we present QVGen, a novel quantization-aware training (QAT) framework tailored for high-performance and inference-efficient video DMs under extremely low-bit quantization (e.g., 4-bit or below). We begin with a theoretical analysis demonstrating that reducing the gradient norm is essential to facilitate convergence for QAT. To this end, we introduce auxiliary modules ($\Phi$) to mitigate large quantization errors, leading to significantly enhanced convergence. To eliminate the inference overhead of $\Phi$, we propose a rank-decay strategy that progressively eliminates $\Phi$. Specifically, we repeatedly employ singular value decomposition (SVD) and a proposed rank-based regularization $\mathbf{\gamma}$ to identify and decay low-contributing components. This strategy retains performance while zeroing out inference overhead. Extensive experiments across $4$ state-of-the-art (SOTA) video DMs, with parameter sizes ranging from $1.3$B to $14$B, show that QVGen is the first to reach quality comparable to full precision under 4-bit settings. Moreover, it significantly outperforms existing methods. For instance, our 3-bit CogVideoX-2B achieves improvements of $+25.28$ in Dynamic Degree and $+8.43$ in Scene Consistency on VBench.