2025-05-27

Title: InjectLab: A Tactical Framework for Adversarial Threat Modeling Against Large Language Models

Authors: Austin Howard
Subjects: cs.CR, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2505.18156
Pdf URL: https://arxiv.org/pdf/2505.18156
Copy Paste: [[2505.18156]] InjectLab: A Tactical Framework for Adversarial Threat Modeling Against Large Language Models(https://arxiv.org/abs/2505.18156)
Keywords: security, attack, large language model
Abstract: Large Language Models (LLMs) are changing the way people interact with technology. Tools like ChatGPT and Claude AI are now common in business, research, and everyday life. But with that growth comes new risks, especially prompt-based attacks that exploit how these models process language. InjectLab is a security framework designed to address that problem. This paper introduces InjectLab as a structured, open-source matrix that maps real-world techniques used to manipulate LLMs. The framework is inspired by MITRE ATT&CK and focuses specifically on adversarial behavior at the prompt layer. It includes over 25 techniques organized under six core tactics, covering threats like instruction override, identity swapping, and multi-agent exploitation. Each technique in InjectLab includes detection guidance, mitigation strategies, and YAML-based simulation tests. A Python tool supports easy execution of prompt-based test cases. This paper outlines the framework's structure, compares it to other AI threat taxonomies, and discusses its future direction as a practical, community-driven foundation for securing language models.

Title: A Blockchain-Based Approach for Secure and Transparent e-Faktur Issuance in Indonesia's VAT Reporting System

Authors: G. L. Farchan
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.18157
Pdf URL: https://arxiv.org/pdf/2505.18157
Copy Paste: [[2505.18157]] A Blockchain-Based Approach for Secure and Transparent e-Faktur Issuance in Indonesia's VAT Reporting System(https://arxiv.org/abs/2505.18157)
Keywords: secure, security, attack, robust
Abstract: The implementation of blockchain technology in tax administration offers promising improvements in security, transparency, and efficiency. This paper presents the design of a blockchain-based e-Faktur system aimed at addressing the challenges of issuing and verifying tax invoices within Indonesia's VAT reporting process. Utilizing Hyperledger Fabric, a private permissioned blockchain, integrated with Hyperledger FireFly, this system ensures data immutability, access control, and decentralized transaction processing. The proposed system streamlines the issuance of NSFP and validates invoices while reducing reliance on centralized servers, eliminating single points of failure, and enhancing auditability. Hypothetical cyberattack scenarios were explored to assess the system's robustness. The results demonstrate that blockchain technology mitigates potential vulnerabilities and improves the resilience of tax reporting systems. The findings highlight the feasibility of blockchain in public tax administration and provide a foundation for future research on large-scale implementations.

Title: Model-Distributed Inference for Large Language Models at the Edge

Authors: Davide Macario, Hulya Seferoglu, Erdem Koyuncu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18164
Pdf URL: https://arxiv.org/pdf/2505.18164
Copy Paste: [[2505.18164]] Model-Distributed Inference for Large Language Models at the Edge(https://arxiv.org/abs/2505.18164)
Keywords: large language model
Abstract: We introduce Model-Distributed Inference for Large-Language Models (MDI-LLM), a novel framework designed to facilitate the deployment of state-of-the-art large-language models (LLMs) across low-power devices at the edge. This is accomplished by dividing the model into multiple partitions, which are then assigned to different devices/nodes within the network. These nodes exchange intermediate activation vectors via device-to-device links, enabling collaborative computation. To enhance the efficiency of this process, we propose the "recurrent pipeline parallelism" technique, which reduces idle time on each device and facilitates parallel inference during the generation of multiple text sequences. By leveraging the combined computational resources of multiple edge devices, MDI-LLM enables the deployment of LLMs that exceed the memory capacity of individual devices, making it possible to perform inference on low-cost hardware. Furthermore, as the number of participating devices increases, MDI-LLM boosts token generation throughput and reduces memory consumption per device.

Title: Constrained Edge AI Deployment: Fine-Tuning vs Distillation for LLM Compression

Authors: Jacob Sander, David Moe, Achraf Cohen, Brent Venable, Venkat Dasari, Brian Jalaian
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18166
Pdf URL: https://arxiv.org/pdf/2505.18166
Copy Paste: [[2505.18166]] Constrained Edge AI Deployment: Fine-Tuning vs Distillation for LLM Compression(https://arxiv.org/abs/2505.18166)
Keywords: transformer
Abstract: Modern foundational models are often compressed via a combination of structured pruning and re-training to meet the strict compute, memory, and connectivity constraints of edge deployments. While state-of-the-art pruning schemes target the entire Transformer, we adopt a simple, layer-wise L2-norm pruning on only the MLP blocks as a fixed baseline. Our focus is not on achieving maximal compression, but on isolating the impact of the re-training loss function: (i) Fine-tuning with Cross- Entropy (L2PFT), which requires labeled data, versus (ii) Self-Distillation with KL-divergence, which leverages only teacher logits (no labels) (L2PSD). We evaluate both pipelines on the OLMo2- 7B-SFT model for CommonsenseQA suitable for intermittent or denied connectivity scenarios typical of edge networks. Under identical pruning schedules, KL-based distillation matches or exceeds CE fine-tuning in test accuracy, demonstrating that, even with a basic MLP-only pruning, the choice of loss function materially affects compressed model recovery in resource-constrained environments.

Title: Emotion Knowledge Enhancement for Vision Large Language Models: A Self-Verification Approach for High-Quality Emotion Instruction Data Generation

Authors: Feifan Wang, Tengfei Song, Minggui He, Chang Su, Zhanglin Wu, Hao Yang, Wenming Zheng, Osamu Yoshie
Subjects: cs.LG, cs.GR
Abstract URL: https://arxiv.org/abs/2505.18168
Pdf URL: https://arxiv.org/pdf/2505.18168
Copy Paste: [[2505.18168]] Emotion Knowledge Enhancement for Vision Large Language Models: A Self-Verification Approach for High-Quality Emotion Instruction Data Generation(https://arxiv.org/abs/2505.18168)
Keywords: large language model
Abstract: Facial emotion perception in the vision large language model (VLLM) is crucial for achieving natural human-machine interaction. However, creating high-quality annotations for both coarse- and fine-grained facial emotion analysis demands costly expertise. The lack of such high-quality instruction data limits the performance of VLLMs in facial emotion perception. To address this, we propose a self-verification approach with emotion knowledge enhancement (SEKE), which generates high-quality instruction data for multi-grained emotion analysis cost-effectively using closed-source VLLM. This approach integrates prior human knowledge to VLLM inference, guided by the inherent correlations between three grained levels of emotion descriptions, i.e., discrete expression, valence-arousal, and action unit, to reliably generate comprehensive annotations. A self-verification strategy with Uncertainty-Aware Monte Carlo sampling (SV-UAMC) is further embedded to efficiently extract more accurate VLLM predictions, further improving annotation reliability. Consequently, we construct a facial emotion instruction dataset (FEID) containing three comprehensive descriptions, which provides coarse- and fine-grained emotional information for effective model training. Additionally, we introduce a facial emotion analysis benchmark (FEAB) to measure the VLLM's corresponding ability. Our method significantly outperforms state-of-the-art methods on three downstream facial emotion analysis tasks.

Title: Interpretable Multi-Task PINN for Emotion Recognition and EDA Prediction

Authors: Nischal Mandal
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2505.18169
Pdf URL: https://arxiv.org/pdf/2505.18169
Copy Paste: [[2505.18169]] Interpretable Multi-Task PINN for Emotion Recognition and EDA Prediction(https://arxiv.org/abs/2505.18169)
Keywords: interpretability
Abstract: Understanding and predicting human emotional and physiological states using wearable sensors has important applications in stress monitoring, mental health assessment, and affective computing. This study presents a novel Multi-Task Physics-Informed Neural Network (PINN) that performs Electrodermal Activity (EDA) prediction and emotion classification simultaneously, using the publicly available WESAD dataset. The model integrates psychological self-report features (PANAS and SAM) with a physics-inspired differential equation representing EDA dynamics, enforcing biophysically grounded constraints through a custom loss function. This loss combines EDA regression, emotion classification, and a physics residual term for improved interpretability. The architecture supports dual outputs for both tasks and is trained under a unified multi-task framework. Evaluated using 5-fold cross-validation, the model achieves an average EDA RMSE of 0.0362, Pearson correlation of 0.9919, and F1-score of 94.08 percent. These results outperform classical models such as SVR and XGBoost, as well as ablated variants like emotion-only and EDA-only models. In addition, the learned physical parameters including decay rate (alpha_0), emotional sensitivity (beta), and time scaling (gamma) are interpretable and stable across folds, aligning with known principles of human physiology. This work is the first to introduce a multi-task PINN framework for wearable emotion recognition, offering improved performance, generalizability, and model transparency. The proposed system provides a foundation for future interpretable and multimodal applications in healthcare and human-computer interaction.

Title: Robust Knowledge Graph Embedding via Denoising

Authors: Tengwei Song, Xudong Ma, Yang Liu, Jie Luo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18171
Pdf URL: https://arxiv.org/pdf/2505.18171
Copy Paste: [[2505.18171]] Robust Knowledge Graph Embedding via Denoising(https://arxiv.org/abs/2505.18171)
Keywords: robust
Abstract: We focus on obtaining robust knowledge graph embedding under perturbation in the embedding space. To address these challenges, we introduce a novel framework, Robust Knowledge Graph Embedding via Denoising, which enhances the robustness of KGE models on noisy triples. By treating KGE methods as energy-based models, we leverage the established connection between denoising and score matching, enabling the training of a robust denoising KGE model. Furthermore, we propose certified robustness evaluation metrics for KGE methods based on the concept of randomized smoothing. Through comprehensive experiments on benchmark datasets, our framework consistently shows superior performance compared to existing state-of-the-art KGE methods when faced with perturbed entity embedding.

Title: GenAI Security: Outsmarting the Bots with a Proactive Testing Framework

Authors: Sunil Kumar Jang Bahadur, Gopala Dhar, Lavi Nigam
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18172
Pdf URL: https://arxiv.org/pdf/2505.18172
Copy Paste: [[2505.18172]] GenAI Security: Outsmarting the Bots with a Proactive Testing Framework(https://arxiv.org/abs/2505.18172)
Keywords: security, attack, generative
Abstract: The increasing sophistication and integration of Generative AI (GenAI) models into diverse applications introduce new security challenges that traditional methods struggle to address. This research explores the critical need for proactive security measures to mitigate the risks associated with malicious exploitation of GenAI systems. We present a framework encompassing key approaches, tools, and strategies designed to outmaneuver even advanced adversarial attacks, emphasizing the importance of securing GenAI innovation against potential liabilities. We also empirically prove the effectiveness of the said framework by testing it against the SPML Chatbot Prompt Injection Dataset. This work highlights the shift from reactive to proactive security practices essential for the safe and responsible deployment of GenAI technologies

Title: FedGRec: Dynamic Spatio-Temporal Federated Graph Learning for Secure and Efficient Cross-Border Recommendations

Authors: Zhizhong Tan, Jiexin Zheng, Xingxing Yang, Chi Zhang, Weiping Deng, Wenyong Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18177
Pdf URL: https://arxiv.org/pdf/2505.18177
Copy Paste: [[2505.18177]] FedGRec: Dynamic Spatio-Temporal Federated Graph Learning for Secure and Efficient Cross-Border Recommendations(https://arxiv.org/abs/2505.18177)
Keywords: secure, security, privacy, protect, federate
Abstract: Due to the highly sensitive nature of certain data in cross-border sharing, collaborative cross-border recommendations and data sharing are often subject to stringent privacy protection regulations, resulting in insufficient data for model training. Consequently, achieving efficient cross-border business recommendations while ensuring privacy security poses a significant challenge. Although federated learning has demonstrated broad potential in collaborative training without exposing raw data, most existing federated learning-based GNN training methods still rely on federated averaging strategies, which perform suboptimally on highly heterogeneous graph data. To address this issue, we propose FedGRec, a privacy-preserving federated graph learning method for cross-border recommendations. FedGRec captures user preferences from distributed multi-domain data to enhance recommendation performance across all domains without privacy leakage. Specifically, FedGRec leverages collaborative signals from local subgraphs associated with users or items to enrich their representation learning. Additionally, it employs dynamic spatiotemporal modeling to integrate global and local user preferences in real time based on business recommendation states, thereby deriving the final representations of target users and candidate items. By automatically filtering relevant behaviors, FedGRec effectively mitigates noise interference from unreliable neighbors. Furthermore, through a personalized federated aggregation strategy, FedGRec adapts global preferences to heterogeneous domain data, enabling collaborative learning of user preferences across multiple domains. Extensive experiments on three datasets demonstrate that FedGRec consistently outperforms competitive single-domain and cross-domain baselines while effectively preserving data privacy in cross-border recommendations.

Title: GAIA: A Foundation Model for Operational Atmospheric Dynamics

Authors: Ata Akbari Asanjan, Olivia Alexander, Tom Berg, Clara Zhang, Matt Yang, Jad Makki, Disha Shidham, Srija Chakraborty, William Bender, Stephen Peng, Arun Ravindran, Olivier Raiman, David Potere, David Bell
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18179
Pdf URL: https://arxiv.org/pdf/2505.18179
Copy Paste: [[2505.18179]] GAIA: A Foundation Model for Operational Atmospheric Dynamics(https://arxiv.org/abs/2505.18179)
Keywords: robust
Abstract: We present the GAIA (Geospatial Artificial Intelligence for Atmospheres) Foundation Model, a novel model that combines masked autoencoders (MAE) and self-DIstillation with NO labels (DINO) for analyzing global atmospheric patterns in satellite imagery. By integrating these complementary self-supervised learning approaches, our model simultaneously captures both local features and global dependencies. We address two critical challenges in satellite data analysis: reconstructing missing regions and estimating precipitation patterns as our first downstream tasks. The model demonstrates superior temporal pattern capture compared to standard MAE approaches, while maintaining robust performance in downstream tasks. Our experimental results show strong gap-filling capabilities across varying mask ratios and accurate precipitation estimation with limited training data, achieving a false alarm ratio of 0.088 and structural similarity of 0.881. This work represents an advancement in self-supervised learning for atmospheric science, providing a foundation for improved weather monitoring and climate analysis. The trained model weights and accompanying code are publicly available as open-source on Hugging Face here: this https URL.

Title: 2DNMRGym: An Annotated Experimental Dataset for Atom-Level Molecular Representation Learning in 2D NMR via Surrogate Supervision

Authors: Yunrui Li, Hao Xu, Pengyu Hong
Subjects: cs.LG, cs.AI, cs.CE, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2505.18181
Pdf URL: https://arxiv.org/pdf/2505.18181
Copy Paste: [[2505.18181]] 2DNMRGym: An Annotated Experimental Dataset for Atom-Level Molecular Representation Learning in 2D NMR via Surrogate Supervision(https://arxiv.org/abs/2505.18181)
Keywords: transformer
Abstract: Two-dimensional (2D) Nuclear Magnetic Resonance (NMR) spectroscopy, particularly Heteronuclear Single Quantum Coherence (HSQC) spectroscopy, plays a critical role in elucidating molecular structures, interactions, and electronic properties. However, accurately interpreting 2D NMR data remains labor-intensive and error-prone, requiring highly trained domain experts, especially for complex molecules. Machine Learning (ML) holds significant potential in 2D NMR analysis by learning molecular representations and recognizing complex patterns from data. However, progress has been limited by the lack of large-scale and high-quality annotated datasets. In this work, we introduce 2DNMRGym, the first annotated experimental dataset designed for ML-based molecular representation learning in 2D NMR. It includes over 22,000 HSQC spectra, along with the corresponding molecular graphs and SMILES strings. Uniquely, 2DNMRGym adopts a surrogate supervision setup: models are trained using algorithm-generated annotations derived from a previously validated method and evaluated on a held-out set of human-annotated gold-standard labels. This enables rigorous assessment of a model's ability to generalize from imperfect supervision to expert-level interpretation. We provide benchmark results using a series of 2D and 3D GNN and GNN transformer models, establishing a strong foundation for future work. 2DNMRGym supports scalable model training and introduces a chemically meaningful benchmark for evaluating atom-level molecular representations in NMR-guided structural tasks. Our data and code is open-source and available on Huggingface and Github.

Title: Riemannian Flow Matching for Brain Connectivity Matrices via Pullback Geometry

Authors: Antoine Collas, Ce Ju, Nicolas Salvy, Bertrand Thirion
Subjects: cs.LG, eess.SP, stat.ML
Abstract URL: https://arxiv.org/abs/2505.18193
Pdf URL: https://arxiv.org/pdf/2505.18193
Copy Paste: [[2505.18193]] Riemannian Flow Matching for Brain Connectivity Matrices via Pullback Geometry(https://arxiv.org/abs/2505.18193)
Keywords: generative
Abstract: Generating realistic brain connectivity matrices is key to analyzing population heterogeneity in brain organization, understanding disease, and augmenting data in challenging classification problems. Functional connectivity matrices lie in constrained spaces--such as the set of symmetric positive definite or correlation matrices--that can be modeled as Riemannian manifolds. However, using Riemannian tools typically requires redefining core operations (geodesics, norms, integration), making generative modeling computationally inefficient. In this work, we propose DiffeoCFM, an approach that enables conditional flow matching (CFM) on matrix manifolds by exploiting pullback metrics induced by global diffeomorphisms on Euclidean spaces. We show that Riemannian CFM with such metrics is equivalent to applying standard CFM after data transformation. This equivalence allows efficient vector field learning, and fast sampling with standard ODE solvers. We instantiate DiffeoCFM with two different settings: the matrix logarithm for covariance matrices and the normalized Cholesky decomposition for correlation matrices. We evaluate DiffeoCFM on three large-scale fMRI datasets with more than 4600 scans from 2800 subjects (ADNI, ABIDE, OASIS-3) and two EEG motor imagery datasets with over 30000 trials from 26 subjects (BNCI2014-002 and BNCI2015-001). It enables fast training and achieves state-of-the-art performance, all while preserving manifold constraints.

Title: Quantum-Resilient Blockchain for Secure Transactions in UAV-Assisted Smart Agriculture Networks

Authors: Taimoor Ahmad
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.18206
Pdf URL: https://arxiv.org/pdf/2505.18206
Copy Paste: [[2505.18206]] Quantum-Resilient Blockchain for Secure Transactions in UAV-Assisted Smart Agriculture Networks(https://arxiv.org/abs/2505.18206)
Keywords: secure, security, attack
Abstract: The integration of unmanned aerial vehicles (UAVs) into smart agriculture has enabled real-time monitoring, data collection, and automated farming operations. However, the high mobility, decentralized nature, and low-power communication of UAVs pose significant security challenges, particularly in ensuring transaction integrity and trust. This paper presents a quantum-resilient blockchain framework designed to secure data and resource transactions in UAV-assisted smart agriculture networks. The proposed solution incorporates post-quantum cryptographic primitives-specifically lattice-based digital signatures and key encapsulation mechanisms to achieve tamper-proof, low-latency consensus without relying on traditional computationally intensive proof-of-work schemes. A lightweight consensus protocol tailored for UAV communication constraints is developed, and transaction validation is handled through a trust-ranked, multi-layer ledger maintained by edge nodes. Experimental results from simulations using NS-3 and custom blockchain testbeds show that the framework outperforms existing schemes in terms of transaction throughput, energy efficiency, and resistance to quantum attacks. The proposed system provides a scalable, secure, and sustainable solution for precision agriculture, enabling trusted automation and resilient data sharing in post-quantum eras.

Title: CoMet: Metaphor-Driven Covert Communication for Multi-Agent Language Games

Authors: Shuhang Xu, Fangwei Zhong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18218
Pdf URL: https://arxiv.org/pdf/2505.18218
Copy Paste: [[2505.18218]] CoMet: Metaphor-Driven Covert Communication for Multi-Agent Language Games(https://arxiv.org/abs/2505.18218)
Keywords: large language model
Abstract: Metaphors are a crucial way for humans to express complex or subtle ideas by comparing one concept to another, often from a different domain. However, many large language models (LLMs) struggle to interpret and apply metaphors in multi-agent language games, hindering their ability to engage in covert communication and semantic evasion, which are crucial for strategic communication. To address this challenge, we introduce CoMet, a framework that enables LLM-based agents to engage in metaphor processing. CoMet combines a hypothesis-based metaphor reasoner with a metaphor generator that improves through self-reflection and knowledge integration. This enhances the agents' ability to interpret and apply metaphors, improving the strategic and nuanced quality of their interactions. We evaluate CoMet on two multi-agent language games - Undercover and Adversarial Taboo - which emphasize Covert Communication and Semantic Evasion. Experimental results demonstrate that CoMet significantly enhances the agents' ability to communicate strategically using metaphors.

Title: IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis

Authors: Hanyu Li, Haoyu Liu, Tingyu Zhu, Tianyu Guo, Zeyu Zheng, Xiaotie Deng, Michael I. Jordan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18223
Pdf URL: https://arxiv.org/pdf/2505.18223
Copy Paste: [[2505.18223]] IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis(https://arxiv.org/abs/2505.18223)
Keywords: large language model
Abstract: Large Language Models (LLMs) show promise as data analysis agents, but existing benchmarks overlook the iterative nature of the field, where experts' decisions evolve with deeper insights of the dataset. To address this, we introduce IDA-Bench, a novel benchmark evaluating LLM agents in multi-round interactive scenarios. Derived from complex Kaggle notebooks, tasks are presented as sequential natural language instructions by an LLM-simulated user. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Initial results show that even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on < 50% of the tasks, highlighting limitations not evident in single-turn tests. This work underscores the need to improve LLMs' multi-round capabilities for building more reliable data analysis agents, highlighting the necessity of achieving a balance between instruction following and reasoning.

Title: Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality

Authors: Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18227
Pdf URL: https://arxiv.org/pdf/2505.18227
Copy Paste: [[2505.18227]] Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality(https://arxiv.org/abs/2505.18227)
Keywords: robust, interpretability, transformer, generative
Abstract: In Transformer architectures, tokens\textemdash discrete units derived from raw data\textemdash are formed by segmenting inputs into fixed-length chunks. Each token is then mapped to an embedding, enabling parallel attention computations while preserving the input's essential information. Due to the quadratic computational complexity of transformer self-attention mechanisms, token reduction has primarily been used as an efficiency strategy. This is especially true in single vision and language domains, where it helps balance computational costs, memory usage, and inference latency. Despite these advances, this paper argues that token reduction should transcend its traditional efficiency-oriented role in the era of large generative models. Instead, we position it as a fundamental principle in generative modeling, critically influencing both model architecture and broader applications. Specifically, we contend that across vision, language, and multimodal systems, token reduction can: (i) facilitate deeper multimodal integration and alignment, (ii) mitigate "overthinking" and hallucinations, (iii) maintain coherence over long inputs, and (iv) enhance training stability, etc. We reframe token reduction as more than an efficiency measure. By doing so, we outline promising future directions, including algorithm design, reinforcement learning-guided token reduction, token optimization for in-context learning, and broader ML and scientific domains. We highlight its potential to drive new model architectures and learning strategies that improve robustness, increase interpretability, and better align with the objectives of generative modeling.

Title: Follow the Energy, Find the Path: Riemannian Metrics from Energy-Based Models

Authors: Louis Béthune, David Vigouroux, Yilun Du, Rufin VanRullen, Thomas Serre, Victor Boutin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18230
Pdf URL: https://arxiv.org/pdf/2505.18230
Copy Paste: [[2505.18230]] Follow the Energy, Find the Path: Riemannian Metrics from Energy-Based Models(https://arxiv.org/abs/2505.18230)
Keywords: generative
Abstract: What is the shortest path between two data points lying in a high-dimensional space? While the answer is trivial in Euclidean geometry, it becomes significantly more complex when the data lies on a curved manifold -- requiring a Riemannian metric to describe the space's local curvature. Estimating such a metric, however, remains a major challenge in high dimensions. In this work, we propose a method for deriving Riemannian metrics directly from pretrained Energy-Based Models (EBMs) -- a class of generative models that assign low energy to high-density regions. These metrics define spatially varying distances, enabling the computation of geodesics -- shortest paths that follow the data manifold's intrinsic geometry. We introduce two novel metrics derived from EBMs and show that they produce geodesics that remain closer to the data manifold and exhibit lower curvature distortion, as measured by alignment with ground-truth trajectories. We evaluate our approach on increasingly complex datasets: synthetic datasets with known data density, rotated character images with interpretable geometry, and high-resolution natural images embedded in a pretrained VAE latent space. Our results show that EBM-derived metrics consistently outperform established baselines, especially in high-dimensional settings. Our work is the first to derive Riemannian metrics from EBMs, enabling data-aware geodesics and unlocking scalable, geometry-driven learning for generative modeling and simulation.

Title: NSNQuant: A Double Normalization Approach for Calibration-Free Low-Bit Vector Quantization of KV Cache

Authors: Donghyun Son, Euntae Choi, Sungjoo Yoo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18231
Pdf URL: https://arxiv.org/pdf/2505.18231
Copy Paste: [[2505.18231]] NSNQuant: A Double Normalization Approach for Calibration-Free Low-Bit Vector Quantization of KV Cache(https://arxiv.org/abs/2505.18231)
Keywords: robust, large language model
Abstract: Large Language Model (LLM) inference is typically memory-intensive, especially when processing large batch sizes and long sequences, due to the large size of key-value (KV) cache. Vector Quantization (VQ) is recently adopted to alleviate this issue, but we find that the existing approach is susceptible to distribution shift due to its reliance on calibration datasets. To address this limitation, we introduce NSNQuant, a calibration-free Vector Quantization (VQ) technique designed for low-bit compression of the KV cache. By applying a three-step transformation-1) a token-wise normalization (Normalize), 2) a channel-wise centering (Shift), and 3) a second token-wise normalization (Normalize)-with Hadamard transform, NSNQuant effectively aligns the token distribution with the standard normal distribution. This alignment enables robust, calibration-free vector quantization using a single reusable codebook. Extensive experiments show that NSNQuant consistently outperforms prior methods in both 1-bit and 2-bit settings, offering strong generalization and up to 3$\times$ throughput gain over full-precision baselines.

Title: ELDeR: Getting Efficient LLMs through Data-Driven Regularized Layer-wise Pruning

Authors: Mingkuan Feng, Jinyang Wu, Siyuan Liu, Shuai Zhang, Hongjian Fang, Ruihan Jin, Feihu Che, Pengpeng Shao, Zhengqi Wen, Jianhua Tao
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.18232
Pdf URL: https://arxiv.org/pdf/2505.18232
Copy Paste: [[2505.18232]] ELDeR: Getting Efficient LLMs through Data-Driven Regularized Layer-wise Pruning(https://arxiv.org/abs/2505.18232)
Keywords: transformer, large language model
Abstract: The deployment of Large language models (LLMs) in many fields is largely hindered by their high computational and memory costs. Recent studies suggest that LLMs exhibit sparsity, which can be used for pruning. Previous pruning methods typically follow a prune-then-finetune paradigm. Since the pruned parts still contain valuable information, statically removing them without updating the remaining parameters often results in irreversible performance degradation, requiring costly recovery fine-tuning (RFT) to maintain performance. To address this, we propose a novel paradigm: first apply regularization, then prune. Based on this paradigm, we propose ELDeR: Getting Efficient LLMs through Data-Driven Regularized Layer-wise Pruning. We multiply the output of each transformer layer by an initial weight, then we iteratively learn the weights of each transformer layer by using a small amount of data in a simple way. After that, we apply regularization to the difference between the output and input of the layers with smaller weights, forcing the information to be transferred to the remaining layers. Compared with direct pruning, ELDeR reduces the information loss caused by direct parameter removal, thus better preserving the model's language modeling ability. Experimental results show that ELDeR achieves superior performance compared with powerful layer-wise structured pruning methods, while greatly reducing RFT computational costs. Since ELDeR is a layer-wise pruning method, its end-to-end acceleration effect is obvious, making it a promising technique for efficient LLMs.

Title: POSTER: A Multi-Signal Model for Detecting Evasive Smishing

Authors: Shaghayegh Hosseinpour, Sanchari Das
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18233
Pdf URL: https://arxiv.org/pdf/2505.18233
Copy Paste: [[2505.18233]] POSTER: A Multi-Signal Model for Detecting Evasive Smishing(https://arxiv.org/abs/2505.18233)
Keywords: robust
Abstract: Smishing, or SMS-based phishing, poses an increasing threat to mobile users by mimicking legitimate communications through culturally adapted, concise, and deceptive messages, which can result in the loss of sensitive data or financial resources. In such, we present a multi-channel smishing detection model that combines country-specific semantic tagging, structural pattern tagging, character-level stylistic cues, and contextual phrase embeddings. We curated and relabeled over 84,000 messages across five datasets, including 24,086 smishing samples. Our unified architecture achieves 97.89% accuracy, an F1 score of 0.963, and an AUC of 99.73%, outperforming single-stream models by capturing diverse linguistic and structural cues. This work demonstrates the effectiveness of multi-signal learning in robust and region-aware phishing.

Title: A Robust PPO-optimized Tabular Transformer Framework for Intrusion Detection in Industrial IoT Systems

Authors: Yuanya She
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18234
Pdf URL: https://arxiv.org/pdf/2505.18234
Copy Paste: [[2505.18234]] A Robust PPO-optimized Tabular Transformer Framework for Intrusion Detection in Industrial IoT Systems(https://arxiv.org/abs/2505.18234)
Keywords: attack, robust, transformer
Abstract: In this paper, we propose a robust and reinforcement-learning-enhanced network intrusion detection system (NIDS) designed for class-imbalanced and few-shot attack scenarios in Industrial Internet of Things (IIoT) environments. Our model integrates a TabTransformer for effective tabular feature representation with Proximal Policy Optimization (PPO) to optimize classification decisions via policy learning. Evaluated on the TON\textunderscore IoT benchmark, our method achieves a macro F1-score of 97.73\% and accuracy of 98.85\%. Remarkably, even on extremely rare classes like man-in-the-middle (MITM), our model achieves an F1-score of 88.79\%, showcasing strong robustness and few-shot detection capabilities. Extensive ablation experiments confirm the complementary roles of TabTransformer and PPO in mitigating class imbalance and improving generalization. These results highlight the potential of combining transformer-based tabular learning with reinforcement learning for real-world NIDS applications.

Title: The Origins of Representation Manifolds in Large Language Models

Authors: Alexander Modell, Patrick Rubin-Delanchy, Nick Whiteley
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18235
Pdf URL: https://arxiv.org/pdf/2505.18235
Copy Paste: [[2505.18235]] The Origins of Representation Manifolds in Large Language Models(https://arxiv.org/abs/2505.18235)
Keywords: interpretability, large language model
Abstract: There is a large ongoing scientific effort in mechanistic interpretability to map embeddings and internal representations of AI systems into human-understandable concepts. A key element of this effort is the linear representation hypothesis, which posits that neural representations are sparse linear combinations of `almost-orthogonal' direction vectors, reflecting the presence or absence of different features. This model underpins the use of sparse autoencoders to recover features from representations. Moving towards a fuller model of features, in which neural representations could encode not just the presence but also a potentially continuous and multidimensional value for a feature, has been a subject of intense recent discourse. We describe why and how a feature might be represented as a manifold, demonstrating in particular that cosine similarity in representation space may encode the intrinsic geometry of a feature through shortest, on-manifold paths, potentially answering the question of how distance in representation space and relatedness in concept space could be connected. The critical assumptions and predictions of the theory are validated on text embeddings and token activations of large language models.

Title: Think or Not? Exploring Thinking Efficiency in Large Reasoning Models via an Information-Theoretic Lens

Authors: Xixian Yong, Xiao Zhou, Yingying Zhang, Jinlin Li, Yefeng Zheng, Xian Wu
Subjects: cs.CL, cs.AI, cs.IT
Abstract URL: https://arxiv.org/abs/2505.18237
Pdf URL: https://arxiv.org/pdf/2505.18237
Copy Paste: [[2505.18237]] Think or Not? Exploring Thinking Efficiency in Large Reasoning Models via an Information-Theoretic Lens(https://arxiv.org/abs/2505.18237)
Keywords: large language model
Abstract: The recent rise of Large Reasoning Models (LRMs) has significantly improved multi-step reasoning performance, but often at the cost of generating excessively long reasoning chains. This paper revisits the efficiency of such reasoning processes through an information-theoretic lens, revealing a fundamental trade-off between reasoning length and semantic efficiency. We propose two metrics, InfoBias and InfoGain, to quantify divergence from ideal reasoning paths and stepwise information contribution, respectively. Empirical analyses show that longer reasoning chains tend to exhibit higher information bias and diminishing information gain, especially for incorrect answers. Motivated by these findings, we introduce an entropy-based Adaptive Think strategy that dynamically halts reasoning once confidence is sufficiently high, improving efficiency while maintaining competitive accuracy. Compared to the Vanilla Think approach (default mode), our strategy yields a 1.10% improvement in average accuracy and a 50.80% reduction in token usage on QwQ-32B across six benchmark tasks spanning diverse reasoning types and difficulty levels, demonstrating superior efficiency and reasoning performance. These results underscore the promise of entropy-based methods for enhancing both accuracy and cost-effiiciency in large language model deployment.

Title: Taming LLMs with Negative Samples: A Reference-Free Framework to Evaluate Presentation Content with Actionable Feedback

Authors: Ananth Muppidi, Tarak Das, Sambaran Bandyopadhyay, Tripti Shukla, Dharun D A
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18240
Pdf URL: https://arxiv.org/pdf/2505.18240
Copy Paste: [[2505.18240]] Taming LLMs with Negative Samples: A Reference-Free Framework to Evaluate Presentation Content with Actionable Feedback(https://arxiv.org/abs/2505.18240)
Keywords: generative, large language model
Abstract: The generation of presentation slides automatically is an important problem in the era of generative AI. This paper focuses on evaluating multimodal content in presentation slides that can effectively summarize a document and convey concepts to a broad audience. We introduce a benchmark dataset, RefSlides, consisting of human-made high-quality presentations that span various topics. Next, we propose a set of metrics to characterize different intrinsic properties of the content of a presentation and present REFLEX, an evaluation approach that generates scores and actionable feedback for these metrics. We achieve this by generating negative presentation samples with different degrees of metric-specific perturbations and use them to fine-tune LLMs. This reference-free evaluation technique does not require ground truth presentations during inference. Our extensive automated and human experiments demonstrate that our evaluation approach outperforms classical heuristic-based and state-of-the-art large language model-based evaluations in generating scores and explanations.

Title: Privacy-Preserving Bathroom Monitoring for Elderly Emergencies Using PIR and LiDAR Sensors

Authors: Youssouf Sidibé, Julia Gersey
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.18242
Pdf URL: https://arxiv.org/pdf/2505.18242
Copy Paste: [[2505.18242]] Privacy-Preserving Bathroom Monitoring for Elderly Emergencies Using PIR and LiDAR Sensors(https://arxiv.org/abs/2505.18242)
Keywords: privacy
Abstract: In-home elderly monitoring requires systems that can detect emergency events - such as falls or prolonged inactivity - while preserving privacy and requiring no user input. These systems must be embedded into the surrounding environment, capable of capturing activity, and responding promptly. This paper presents a low-cost, privacy-preserving solution using Passive Infrared (PIR) and Light Detection and Ranging (LiDAR) sensors to track entries, sitting, exits, and emergency scenarios within a home bathroom setting. We developed and evaluated a rule-based detection system through five real-world experiments simulating elderly behavior. Annotated time-series graphs demonstrate the system's ability to detect dangerous states, such as motionless collapses, while maintaining privacy through non-visual sensing.

Title: Multi-Scale Probabilistic Generation Theory: A Hierarchical Framework for Interpreting Large Language Models

Authors: Yukin Zhang, Qi Dong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18244
Pdf URL: https://arxiv.org/pdf/2505.18244
Copy Paste: [[2505.18244]] Multi-Scale Probabilistic Generation Theory: A Hierarchical Framework for Interpreting Large Language Models(https://arxiv.org/abs/2505.18244)
Keywords: extraction, interpretability, transformer, large language model
Abstract: Large Transformer based language models achieve remarkable performance but remain opaque in how they plan, structure, and realize text. We introduce Multi_Scale Probabilistic Generation Theory (MSPGT), a hierarchical framework that factorizes generation into three semantic scales_global context, intermediate structure, and local word choices and aligns each scale with specific layer ranges in Transformer architectures. To identify scale boundaries, we propose two complementary metrics: attention span thresholds and inter layer mutual information peaks. Across four representative models (GPT-2, BERT, RoBERTa, and T5), these metrics yield stable local/intermediate/global partitions, corroborated by probing tasks and causal interventions. We find that decoder_only models allocate more layers to intermediate and global processing while encoder_only models emphasize local feature extraction. Through targeted interventions, we demonstrate that local scale manipulations primarily influence lexical diversity, intermediate-scale modifications affect sentence structure and length, and global_scale perturbations impact discourse coherence all with statistically significant effects. MSPGT thus offers a unified, architecture-agnostic method for interpreting, diagnosing, and controlling large language models, bridging the gap between mechanistic interpretability and emergent capabilities.

Title: Decomposition of Water Demand Patterns Using Skewed Gaussian Distributions for Behavioral Insights and Operational Planning

Authors: Roy Elkayam
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18245
Pdf URL: https://arxiv.org/pdf/2505.18245
Copy Paste: [[2505.18245]] Decomposition of Water Demand Patterns Using Skewed Gaussian Distributions for Behavioral Insights and Operational Planning(https://arxiv.org/abs/2505.18245)
Keywords: interpretability
Abstract: This study presents a novel approach for decomposing urban water demand patterns using Skewed Gaussian Distributions (SGD) to derive behavioral insights and support operational planning. Hourly demand profiles contain critical information for both long-term infrastructure design and daily operations, influencing network pressures, water quality, energy consumption, and overall reliability. By breaking down each daily demand curve into a baseline component and distinct peak components, the proposed SGD method characterizes each peak with interpretable parameters, including peak amplitude, timing (mean), spread (duration), and skewness (asymmetry), thereby reconstructing the observed pattern and uncovering latent usage dynamics. This detailed peak-level decomposition enables both operational applications, e.g. anomaly and leakage detection, real-time demand management, and strategic analyses, e.g. identifying behavioral shifts, seasonal influences, or policy impacts on consumption patterns. Unlike traditional symmetric Gaussian or purely statistical time-series models, SGDs explicitly capture asymmetric peak shapes such as sharp morning surges followed by gradual declines, improving the fidelity of synthetic pattern generation and enhancing the detection of irregular consumption behavior. The method is demonstrated on several real-world datasets, showing that SGD outperforms symmetric Gaussian models in reconstruction accuracy, reducing root-mean-square error by over 50% on average, while maintaining physical interpretability. The SGD framework can also be used to construct synthetic demand scenarios by designing daily peak profiles with chosen characteristics. All implementation code is publicly available at: this https URL

Title: MetaGen Blended RAG: Higher Accuracy for Domain-Specific Q&A Without Fine-Tuning

Authors: Kunal Sawarkar, Shivam R. Solanki, Abhilasha Mangal
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18247
Pdf URL: https://arxiv.org/pdf/2505.18247
Copy Paste: [[2505.18247]] MetaGen Blended RAG: Higher Accuracy for Domain-Specific Q&A Without Fine-Tuning(https://arxiv.org/abs/2505.18247)
Keywords: robust
Abstract: Despite the widespread exploration of Retrieval-Augmented Generation (RAG), its deployment in enterprises for domain-specific datasets remains limited due to poor answer accuracy. These corpora, often shielded behind firewalls in private enterprise knowledge bases, having complex, domain-specific terminology, rarely seen by LLMs during pre-training; exhibit significant semantic variability across domains (like networking, military, or legal, etc.), or even within a single domain like medicine, and thus result in poor context precision for RAG systems. Currently, in such situations, fine-tuning or RAG with fine-tuning is attempted, but these approaches are slow, expensive, and lack generalization for accuracy as the new domain-specific data emerges. We propose an approach for Enterprise Search that focuses on enhancing the retriever for a domain-specific corpus through hybrid query indexes and metadata enrichment. This 'MetaGen Blended RAG' method constructs a metadata generation pipeline using key concepts, topics, and acronyms, and then creates a metadata-enriched hybrid index with boosted search queries. This approach avoids overfitting and generalizes effectively across domains. On the PubMedQA benchmark for the biomedical domain, the proposed method achieves 82% retrieval accuracy and 77% RAG accuracy, surpassing all previous RAG accuracy results without fine-tuning and sets a new benchmark for zero-shot results while outperforming much larger models like GPT3.5. The results are even comparable to the best fine-tuned models on this dataset, and we further demonstrate the robustness and scalability of the approach by evaluating it on other Q&A datasets like SQuAD, NQ etc.

Title: Uncovering a Universal Abstract Algorithm for Modular Addition in Neural Networks

Authors: Gavin McCracken, Gabriela Moisescu-Pareja, Vincent Letourneau, Doina Precup, Jonathan Love
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18266
Pdf URL: https://arxiv.org/pdf/2505.18266
Copy Paste: [[2505.18266]] Uncovering a Universal Abstract Algorithm for Modular Addition in Neural Networks(https://arxiv.org/abs/2505.18266)
Keywords: interpretability, transformer
Abstract: We propose a testable universality hypothesis, asserting that seemingly disparate neural network solutions observed in the simple task of modular addition are unified under a common abstract algorithm. While prior work interpreted variations in neuron-level representations as evidence for distinct algorithms, we demonstrate - through multi-level analyses spanning neurons, neuron clusters, and entire networks - that multilayer perceptrons and transformers universally implement the abstract algorithm we call the approximate Chinese Remainder Theorem. Crucially, we introduce approximate cosets and show that neurons activate exclusively on them. Furthermore, our theory works for deep neural networks (DNNs). It predicts that universally learned solutions in DNNs with trainable embeddings or more than one hidden layer require only O(log n) features, a result we empirically confirm. This work thus provides the first theory-backed interpretation of multilayer networks solving modular addition. It advances generalizable interpretability and opens a testable universality hypothesis for group multiplication beyond modular addition.

Title: TAGS: A Test-Time Generalist-Specialist Framework with Retrieval-Augmented Reasoning and Verification

Authors: Jianghao Wu, Feilong Tang, Yulong Li, Ming Hu, Haochen Xue, Shoaib Jameel, Yutong Xie, Imran Razzak
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2505.18283
Pdf URL: https://arxiv.org/pdf/2505.18283
Copy Paste: [[2505.18283]] TAGS: A Test-Time Generalist-Specialist Framework with Retrieval-Augmented Reasoning and Verification(https://arxiv.org/abs/2505.18283)
Keywords: large language model
Abstract: Recent advances such as Chain-of-Thought prompting have significantly improved large language models (LLMs) in zero-shot medical reasoning. However, prompting-based methods often remain shallow and unstable, while fine-tuned medical LLMs suffer from poor generalization under distribution shifts and limited adaptability to unseen clinical scenarios. To address these limitations, we present TAGS, a test-time framework that combines a broadly capable generalist with a domain-specific specialist to offer complementary perspectives without any model fine-tuning or parameter updates. To support this generalist-specialist reasoning process, we introduce two auxiliary modules: a hierarchical retrieval mechanism that provides multi-scale exemplars by selecting examples based on both semantic and rationale-level similarity, and a reliability scorer that evaluates reasoning consistency to guide final answer aggregation. TAGS achieves strong performance across nine MedQA benchmarks, boosting GPT-4o accuracy by 13.8%, DeepSeek-R1 by 16.8%, and improving a vanilla 7B model from 14.1% to 23.9%. These results surpass several fine-tuned medical LLMs, without any parameter updates. The code will be available at this https URL.

Title: Convexified Message-Passing Graph Neural Networks

Authors: Saar Cohen, Noa Agmon, Uri Shaham
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18289
Pdf URL: https://arxiv.org/pdf/2505.18289
Copy Paste: [[2505.18289]] Convexified Message-Passing Graph Neural Networks(https://arxiv.org/abs/2505.18289)
Keywords: robust
Abstract: Graph Neural Networks (GNNs) have become prominent methods for graph representation learning, demonstrating strong empirical results on diverse graph prediction tasks. In this paper, we introduce Convexified Message Passing Graph Neural Networks (CGNNs), a novel and general framework that combines the power of message-passing GNNs with the tractability of convex optimization. By mapping their nonlinear filters into a reproducing kernel Hilbert space, CGNNs transform training into a convex optimization problem, which can be solved efficiently and optimally by projected gradient methods. This convexity further allows the statistical properties of CGNNs to be analyzed accurately and rigorously. For two-layer CGNNs, we establish rigorous generalization guarantees, showing convergence to the performance of the optimal GNN. To scale to deeper architectures, we adopt a principled layer-wise training strategy. Experiments on benchmark datasets show that CGNNs significantly exceed the performance of leading GNN models, achieving 10 to 40 percent higher accuracy in most cases, underscoring their promise as a powerful and principled method with strong theoretical foundations. In rare cases where improvements are not quantitatively substantial, the convex models either slightly exceed or match the baselines, stressing their robustness and wide applicability. Though over-parameterization is often employed to enhance performance in nonconvex models, we show that our CGNNs framework yields shallow convex models that can surpass these models in both accuracy and resource efficiency.

Title: InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning

Authors: Zifu Wan, Yaqi Xie, Ce Zhang, Zhiqiu Lin, Zihan Wang, Simon Stepputtis, Deva Ramanan, Katia Sycara
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2505.18291
Pdf URL: https://arxiv.org/pdf/2505.18291
Copy Paste: [[2505.18291]] InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning(https://arxiv.org/abs/2505.18291)
Keywords: segmentation
Abstract: Large multimodal foundation models, particularly in the domains of language and vision, have significantly advanced various tasks, including robotics, autonomous driving, information retrieval, and grounding. However, many of these models perceive objects as indivisible, overlooking the components that constitute them. Understanding these components and their associated affordances provides valuable insights into an object's functionality, which is fundamental for performing a wide range of tasks. In this work, we introduce a novel real-world benchmark, InstructPart, comprising hand-labeled part segmentation annotations and task-oriented instructions to evaluate the performance of current models in understanding and executing part-level tasks within everyday contexts. Through our experiments, we demonstrate that task-oriented part segmentation remains a challenging problem, even for state-of-the-art Vision-Language Models (VLMs). In addition to our benchmark, we introduce a simple baseline that achieves a twofold performance improvement through fine-tuning with our dataset. With our dataset and benchmark, we aim to facilitate research on task-oriented part segmentation and enhance the applicability of VLMs across various domains, including robotics, virtual reality, information retrieval, and other related fields. Project website: this https URL.

Title: Thinking Fast and Right: Balancing Accuracy and Reasoning Length with Adaptive Rewards

Authors: Jinyan Su, Claire Cardie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18298
Pdf URL: https://arxiv.org/pdf/2505.18298
Copy Paste: [[2505.18298]] Thinking Fast and Right: Balancing Accuracy and Reasoning Length with Adaptive Rewards(https://arxiv.org/abs/2505.18298)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated strong reasoning abilities in mathematical tasks, often enhanced through reinforcement learning (RL). However, RL-trained models frequently produce unnecessarily long reasoning traces -- even for simple queries -- leading to increased inference costs and latency. While recent approaches attempt to control verbosity by adding length penalties to the reward function, these methods rely on fixed penalty terms that are hard to tune and cannot adapt as the model's reasoning capability evolves, limiting their effectiveness. In this work, we propose an adaptive reward-shaping method that enables LLMs to "think fast and right" -- producing concise outputs without sacrificing correctness. Our method dynamically adjusts the reward trade-off between accuracy and response length based on model performance: when accuracy is high, the length penalty increases to encourage faster length reduction; when accuracy drops, the penalty is relaxed to preserve correctness. This adaptive reward accelerates early-stage length reduction while avoiding over-compression in later stages. Experiments across multiple datasets show that our approach consistently and dramatically reduces reasoning length while largely maintaining accuracy, offering a new direction for cost-efficient adaptive reasoning in large-scale language models.

Title: PLUMAGE: Probabilistic Low rank Unbiased Min Variance Gradient Estimator for Efficient Large Model Training

Authors: Matan Haroush, Daniel Soudry
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.18313
Pdf URL: https://arxiv.org/pdf/2505.18313
Copy Paste: [[2505.18313]] PLUMAGE: Probabilistic Low rank Unbiased Min Variance Gradient Estimator for Efficient Large Model Training(https://arxiv.org/abs/2505.18313)
Keywords: large language model
Abstract: Accelerator memory and networking constraints have emerged as dominant bottlenecks when training large language models LLMs with billions of parameters. Existing low rank gradient estimators such as GaLoRE and FLORA compress gradients and optimizer tensors by projecting weight gradients onto a rank r subspace, enabling LLM training on consumer hardware. Yet, these methods are either biased or subject to high estimator variance. Moreover, the optimizer state based on the first and second moments estimates expressed in the previous subspace becomes misaligned whenever the projection is updated, leading to instabilities during training. We propose PLUMAGE: Probabilistic Low rank Unbiased Minimum vAriance Gradient Estimator. PLUMAGE is a drop in replacement for existing low rank gradient estimators. It does not introduce new hyperparameters beyond the chosen rank r and the update interval. In addition, we resolve optimizer state misalignment issues to prevent spurious weight updates and enhance training stability. We empirically demonstrate that PLUMAGE shrinks the full rank optimization's gap over the pre training evaluation loss by 33% on average across models and the average training loss across the GLUE benchmark by 28% within a similar computational and memory footprint as GaloRE.

Title: COLORA: Efficient Fine-Tuning for Convolutional Models with a Study Case on Optical Coherence Tomography Image Classification

Authors: Mariano Rivera, Angello Hoyos
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18315
Pdf URL: https://arxiv.org/pdf/2505.18315
Copy Paste: [[2505.18315]] COLORA: Efficient Fine-Tuning for Convolutional Models with a Study Case on Optical Coherence Tomography Image Classification(https://arxiv.org/abs/2505.18315)
Keywords: transformer
Abstract: We introduce the Convolutional Low-Rank Adaptation (CoLoRA) method, designed explicitly to overcome the inefficiencies found in current CNN fine-tuning methods. CoLoRA can be seen as a natural extension of the convolutional architectures of the Low-Rank Adaptation (LoRA) technique. We demonstrate the capabilities of our method by developing and evaluating models using the widely adopted CNN backbone pre-trained on ImageNet. We observed that this strategy results in a stable and accurate coarse-tuning procedure. Moreover, this strategy is computationally efficient and significantly reduces the number of parameters required for fine-tuning compared to traditional methods. Furthermore, our method substantially improves the speed and stability of training. Our case study focuses on classifying retinal diseases from optical coherence tomography (OCT) images, specifically using the OCTMNIST dataset. Experimental results demonstrate that a CNN backbone fine-tuned with CoLoRA surpasses nearly 1\% in accuracy. Such a performance is comparable to the Vision Transformer, State-space discrete, and Kolmogorov-Arnold network models.

Title: Is It Bad to Work All the Time? Cross-Cultural Evaluation of Social Norm Biases in GPT-4

Authors: Zhuozhuo Joy Liu, Farhan Samir, Mehar Bhatia, Laura K. Nelson, Vered Shwartz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18322
Pdf URL: https://arxiv.org/pdf/2505.18322
Copy Paste: [[2505.18322]] Is It Bad to Work All the Time? Cross-Cultural Evaluation of Social Norm Biases in GPT-4(https://arxiv.org/abs/2505.18322)
Keywords: fair
Abstract: LLMs have been demonstrated to align with the values of Western or North American cultures. Prior work predominantly showed this effect through leveraging surveys that directly ask (originally people and now also LLMs) about their values. However, it is hard to believe that LLMs would consistently apply those values in real-world scenarios. To address that, we take a bottom-up approach, asking LLMs to reason about cultural norms in narratives from different cultures. We find that GPT-4 tends to generate norms that, while not necessarily incorrect, are significantly less culture-specific. In addition, while it avoids overtly generating stereotypes, the stereotypical representations of certain cultures are merely hidden rather than suppressed in the model, and such stereotypes can be easily recovered. Addressing these challenges is a crucial step towards developing LLMs that fairly serve their diverse user base.

Title: Architectural Backdoors for Within-Batch Data Stealing and Model Inference Manipulation

Authors: Nicolas Küchler, Ivan Petrov, Conrad Grobler, Ilia Shumailov
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18323
Pdf URL: https://arxiv.org/pdf/2505.18323
Copy Paste: [[2505.18323]] Architectural Backdoors for Within-Batch Data Stealing and Model Inference Manipulation(https://arxiv.org/abs/2505.18323)
Keywords: privacy, attack, steal, large language model
Abstract: For nearly a decade the academic community has investigated backdoors in neural networks, primarily focusing on classification tasks where adversaries manipulate the model prediction. While demonstrably malicious, the immediate real-world impact of such prediction-altering attacks has remained unclear. In this paper we introduce a novel and significantly more potent class of backdoors that builds upon recent advancements in architectural backdoors. We demonstrate how these backdoors can be specifically engineered to exploit batched inference, a common technique for hardware utilization, enabling large-scale user data manipulation and theft. By targeting the batching process, these architectural backdoors facilitate information leakage between concurrent user requests and allow attackers to fully control model responses directed at other users within the same batch. In other words, an attacker who can change the model architecture can set and steal model inputs and outputs of other users within the same batch. We show that such attacks are not only feasible but also alarmingly effective, can be readily injected into prevalent model architectures, and represent a truly malicious threat to user privacy and system integrity. Critically, to counteract this new class of vulnerabilities, we propose a deterministic mitigation strategy that provides formal guarantees against this new attack vector, unlike prior work that relied on Large Language Models to find the backdoors. Our mitigation strategy employs a novel Information Flow Control mechanism that analyzes the model graph and proves non-interference between different user inputs within the same batch. Using our mitigation strategy we perform a large scale analysis of models hosted through Hugging Face and find over 200 models that introduce (unintended) information leakage between batch entries due to the use of dynamic quantization.

Title: PerMedCQA: Benchmarking Large Language Models on Medical Consumer Question Answering in Persian Language

Authors: Naghmeh Jamali, Milad Mohammadi, Danial Baledi, Zahra Rezvani, Hesham Faili
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18331
Pdf URL: https://arxiv.org/pdf/2505.18331
Copy Paste: [[2505.18331]] PerMedCQA: Benchmarking Large Language Models on Medical Consumer Question Answering in Persian Language(https://arxiv.org/abs/2505.18331)
Keywords: large language model
Abstract: Medical consumer question answering (CQA) is crucial for empowering patients by providing personalized and reliable health information. Despite recent advances in large language models (LLMs) for medical QA, consumer-oriented and multilingual resources, particularly in low-resource languages like Persian, remain sparse. To bridge this gap, we present PerMedCQA, the first Persian-language benchmark for evaluating LLMs on real-world, consumer-generated medical questions. Curated from a large medical QA forum, PerMedCQA contains 68,138 question-answer pairs, refined through careful data cleaning from an initial set of 87,780 raw entries. We evaluate several state-of-the-art multilingual and instruction-tuned LLMs, utilizing MedJudge, a novel rubric-based evaluation framework driven by an LLM grader, validated against expert human annotators. Our results highlight key challenges in multilingual medical QA and provide valuable insights for developing more accurate and context-aware medical assistance systems. The data is publicly available on this https URL

Title: An Attack to Break Permutation-Based Private Third-Party Inference Schemes for LLMs

Authors: Rahul Thomas, Louai Zahran, Erica Choi, Akilesh Potti, Micah Goldblum, Arka Pal
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18332
Pdf URL: https://arxiv.org/pdf/2505.18332
Copy Paste: [[2505.18332]] An Attack to Break Permutation-Based Private Third-Party Inference Schemes for LLMs(https://arxiv.org/abs/2505.18332)
Keywords: secure, security, privacy, attack, large language model
Abstract: Recent advances in Large Language Models (LLMs) have led to the widespread adoption of third-party inference services, raising critical privacy concerns. Existing methods of performing private third-party inference, such as Secure Multiparty Computation (SMPC), often rely on cryptographic methods. However, these methods are thousands of times slower than standard unencrypted inference, and fail to scale to large modern LLMs. Therefore, recent lines of work have explored the replacement of expensive encrypted nonlinear computations in SMPC with statistical obfuscation methods - in particular, revealing permuted hidden states to the third parties, with accompanying strong claims of the difficulty of reversal into the unpermuted states. In this work, we begin by introducing a novel reconstruction technique that can recover original prompts from hidden states with nearly perfect accuracy across multiple state-of-the-art LLMs. We then show that extensions of our attack are nearly perfectly effective in reversing permuted hidden states of LLMs, demonstrating the insecurity of three recently proposed privacy schemes. We further dissect the shortcomings of prior theoretical `proofs' of permuation security which allow our attack to succeed. Our findings highlight the importance of rigorous security analysis in privacy-preserving LLM inference.

Title: A Critical Evaluation of Defenses against Prompt Injection Attacks

Authors: Yuqi Jia, Zedian Shao, Yupei Liu, Jinyuan Jia, Dawn Song, Neil Zhenqiang Gong
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18333
Pdf URL: https://arxiv.org/pdf/2505.18333
Copy Paste: [[2505.18333]] A Critical Evaluation of Defenses against Prompt Injection Attacks(https://arxiv.org/abs/2505.18333)
Keywords: defense, attack, large language model
Abstract: Large Language Models (LLMs) are vulnerable to prompt injection attacks, and several defenses have recently been proposed, often claiming to mitigate these attacks successfully. However, we argue that existing studies lack a principled approach to evaluating these defenses. In this paper, we argue the need to assess defenses across two critical dimensions: (1) effectiveness, measured against both existing and adaptive prompt injection attacks involving diverse target and injected prompts, and (2) general-purpose utility, ensuring that the defense does not compromise the foundational capabilities of the LLM. Our critical evaluation reveals that prior studies have not followed such a comprehensive evaluation methodology. When assessed using this principled approach, we show that existing defenses are not as successful as previously reported. This work provides a foundation for evaluating future defenses and guiding their development. Our code and data are available at: this https URL.

Title: Model Editing with Graph-Based External Memory

Authors: Yash Kumar Atri, Ahmed Alaa, Thomas Hartvigsen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18343
Pdf URL: https://arxiv.org/pdf/2505.18343
Copy Paste: [[2505.18343]] Model Editing with Graph-Based External Memory(https://arxiv.org/abs/2505.18343)
Keywords: large language model
Abstract: Large language models (LLMs) have revolutionized natural language processing, yet their practical utility is often limited by persistent issues of hallucinations and outdated parametric knowledge. Although post-training model editing offers a pathway for dynamic updates, existing methods frequently suffer from overfitting and catastrophic forgetting. To tackle these challenges, we propose a novel framework that leverages hyperbolic geometry and graph neural networks for precise and stable model edits. We introduce HYPE (HYperbolic Parameter Editing), which comprises three key components: (i) Hyperbolic Graph Construction, which uses Poincaré embeddings to represent knowledge triples in hyperbolic space, preserving hierarchical relationships and preventing unintended side effects by ensuring that edits to parent concepts do not inadvertently affect child concepts; (ii) Möbius-Transformed Updates, which apply hyperbolic addition to propagate edits while maintaining structural consistency within the hyperbolic manifold, unlike conventional Euclidean updates that distort relational distances; and (iii) Dual Stabilization, which combines gradient masking and periodic GNN parameter resetting to prevent catastrophic forgetting by focusing updates on critical parameters and preserving long-term knowledge. Experiments on CounterFact, CounterFact+, and MQuAKE with GPT-J and GPT2-XL demonstrate that HYPE significantly enhances edit stability, factual accuracy, and multi-hop reasoning.

Title: Sample Complexity of Diffusion Model Training Without Empirical Risk Minimizer Access

Authors: Mudit Gaur, Prashant Trivedi, Sasidhar Kunapuli, Amrit Singh Bedi, Vaneet Aggarwal
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2505.18344
Pdf URL: https://arxiv.org/pdf/2505.18344
Copy Paste: [[2505.18344]] Sample Complexity of Diffusion Model Training Without Empirical Risk Minimizer Access(https://arxiv.org/abs/2505.18344)
Keywords: diffusion
Abstract: Diffusion models have demonstrated state-of-the-art performance across vision, language, and scientific domains. Despite their empirical success, prior theoretical analyses of the sample complexity suffer from poor scaling with input data dimension or rely on unrealistic assumptions such as access to exact empirical risk minimizers. In this work, we provide a principled analysis of score estimation, establishing a sample complexity bound of $\widetilde{\mathcal{O}}(\epsilon^{-6})$. Our approach leverages a structured decomposition of the score estimation error into statistical, approximation, and optimization errors, enabling us to eliminate the exponential dependence on neural network parameters that arises in prior analyses. It is the first such result which achieves sample complexity bounds without assuming access to the empirical risk minimizer of score function estimation loss.

Title: Diffusion Self-Weighted Guidance for Offline Reinforcement Learning

Authors: Augusto Tagle, Javier Ruiz-del-Solar, Felipe Tobar
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.18345
Pdf URL: https://arxiv.org/pdf/2505.18345
Copy Paste: [[2505.18345]] Diffusion Self-Weighted Guidance for Offline Reinforcement Learning(https://arxiv.org/abs/2505.18345)
Keywords: diffusion
Abstract: Offline reinforcement learning (RL) recovers the optimal policy $\pi$ given historical observations of an agent. In practice, $\pi$ is modeled as a weighted version of the agent's behavior policy $\mu$, using a weight function $w$ working as a critic of the agent's behavior. Though recent approaches to offline RL based on diffusion models have exhibited promising results, the computation of the required scores is challenging due to their dependence on the unknown $w$. In this work, we alleviate this issue by constructing a diffusion over both the actions and the weights. With the proposed setting, the required scores are directly obtained from the diffusion model without learning extra networks. Our main conceptual contribution is a novel guidance method, where guidance (which is a function of $w$) comes from the same diffusion model, therefore, our proposal is termed Self-Weighted Guidance (SWG). We show that SWG generates samples from the desired distribution on toy examples and performs on par with state-of-the-art methods on D4RL's challenging environments, while maintaining a streamlined training pipeline. We further validate SWG through ablation studies on weight formulations and scalability.

Title: Task Specific Pruning with LLM-Sieve: How Many Parameters Does Your Task Really Need?

Authors: Waleed Reda, Abhinav Jangda, Krishna Chintalapudi
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.18350
Pdf URL: https://arxiv.org/pdf/2505.18350
Copy Paste: [[2505.18350]] Task Specific Pruning with LLM-Sieve: How Many Parameters Does Your Task Really Need?(https://arxiv.org/abs/2505.18350)
Keywords: robust, large language model
Abstract: As Large Language Models (LLMs) are increasingly being adopted for narrow tasks - such as medical question answering or sentiment analysis - and deployed in resource-constrained settings, a key question arises: how many parameters does a task actually need? In this work, we present LLM-Sieve, the first comprehensive framework for task-specific pruning of LLMs that achieves 20-75% parameter reduction with only 1-5% accuracy degradation across diverse domains. Unlike prior methods that apply uniform pruning or rely on low-rank approximations of weight matrices or inputs in isolation, LLM-Sieve (i) learns task-aware joint projections to better approximate output behavior, and (ii) employs a Genetic Algorithm to discover differentiated pruning levels for each matrix. LLM-Sieve is fully compatible with LoRA fine-tuning and quantization, and uniquely demonstrates strong generalization across datasets within the same task domain. Together, these results establish a practical and robust mechanism to generate smaller performant task-specific models.

Title: The Unreasonable Effectiveness of Model Merging for Cross-Lingual Transfer in LLMs

Authors: Lucas Bandarkar, Nanyun Peng
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18356
Pdf URL: https://arxiv.org/pdf/2505.18356
Copy Paste: [[2505.18356]] The Unreasonable Effectiveness of Model Merging for Cross-Lingual Transfer in LLMs(https://arxiv.org/abs/2505.18356)
Keywords: large language model
Abstract: Large language models (LLMs) still struggle across tasks outside of high-resource languages. In this work, we investigate cross-lingual transfer to lower-resource languages where task-specific post-training data is scarce. Building on prior work, we first validate that the subsets of model parameters that matter most for mathematical reasoning and multilingual capabilities are distinctly non-overlapping. To exploit this implicit separability between task and target language parameterization, we develop and analyze numerous modular frameworks to improve the composition of the two during fine-tuning. These methods generally employ freezing parameters or post hoc model merging to assign math and language improvement to different key parts of the LLM. In the absence of in-language math data, we demonstrate that the modular approaches successfully improve upon baselines across three languages, four models, and two fine-tuning paradigms (full and LoRA). Furthermore, we identify the most consistently successful modular method to be fine-tuning separate language and math experts and model merging via Layer-Swapping, somewhat surprisingly. We offer possible explanations for this result via recent works on the linearity of task vectors. We further explain this by empirically showing that reverting less useful fine-tuning updates after training often outperforms freezing them from the start.

Title: CONCORD: Concept-Informed Diffusion for Dataset Distillation

Authors: Jianyang Gu, Haonan Wang, Ruoxi Jia, Saeed Vahidian, Vyacheslav Kungurtsev, Wei Jiang, Yiran Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18358
Pdf URL: https://arxiv.org/pdf/2505.18358
Copy Paste: [[2505.18358]] CONCORD: Concept-Informed Diffusion for Dataset Distillation(https://arxiv.org/abs/2505.18358)
Keywords: interpretability, diffusion, generative, large language model
Abstract: Dataset distillation (DD) has witnessed significant progress in creating small datasets that encapsulate rich information from large original ones. Particularly, methods based on generative priors show promising performance, while maintaining computational efficiency and cross-architecture generalization. However, the generation process lacks explicit controllability for each sample. Previous distillation methods primarily match the real distribution from the perspective of the entire dataset, whereas overlooking concept completeness at the instance level. The missing or incorrectly represented object details cannot be efficiently compensated due to the constrained sample amount typical in DD settings. To this end, we propose incorporating the concept understanding of large language models (LLMs) to perform Concept-Informed Diffusion (CONCORD) for dataset distillation. Specifically, distinguishable and fine-grained concepts are retrieved based on category labels to inform the denoising process and refine essential object details. By integrating these concepts, the proposed method significantly enhances both the controllability and interpretability of the distilled image generation, without relying on pre-trained classifiers. We demonstrate the efficacy of CONCORD by achieving state-of-the-art performance on ImageNet-1K and its subsets. The code implementation is released in this https URL.

Title: SchemaGraphSQL: Efficient Schema Linking with Pathfinding Graph Algorithms for Text-to-SQL on Large-Scale Databases

Authors: AmirHossein Safdarian, Milad Mohammadi, Ehsan Jahanbakhsh, Mona Shahamat Naderi, Heshaam Faili
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2505.18363
Pdf URL: https://arxiv.org/pdf/2505.18363
Copy Paste: [[2505.18363]] SchemaGraphSQL: Efficient Schema Linking with Pathfinding Graph Algorithms for Text-to-SQL on Large-Scale Databases(https://arxiv.org/abs/2505.18363)
Keywords: large language model
Abstract: Text-to-SQL systems translate natural language questions into executable SQL queries, and recent progress with large language models (LLMs) has driven substantial improvements in this task. Schema linking remains a critical component in Text-to-SQL systems, reducing prompt size for models with narrow context windows and sharpening model focus even when the entire schema fits. We present a zero-shot, training-free schema linking approach that first constructs a schema graph based on foreign key relations, then uses a single prompt to Gemini 2.5 Flash to extract source and destination tables from the user query, followed by applying classical path-finding algorithms and post-processing to identify the optimal sequence of tables and columns that should be joined, enabling the LLM to generate more accurate SQL queries. Despite being simple, cost-effective, and highly scalable, our method achieves state-of-the-art results on the BIRD benchmark, outperforming previous specialized, fine-tuned, and complex multi-step LLM-based approaches. We conduct detailed ablation studies to examine the precision-recall trade-off in our framework. Additionally, we evaluate the execution accuracy of our schema filtering method compared to other approaches across various model sizes.

Title: Weakly-supervised Mamba-Based Mastoidectomy Shape Prediction for Cochlear Implant Surgery Using 3D T-Distribution Loss

Authors: Yike Zhang, Jack H. Noble
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18368
Pdf URL: https://arxiv.org/pdf/2505.18368
Copy Paste: [[2505.18368]] Weakly-supervised Mamba-Based Mastoidectomy Shape Prediction for Cochlear Implant Surgery Using 3D T-Distribution Loss(https://arxiv.org/abs/2505.18368)
Keywords: robust, segmentation
Abstract: Cochlear implant surgery is a treatment for individuals with severe hearing loss. It involves inserting an array of electrodes inside the cochlea to electrically stimulate the auditory nerve and restore hearing sensation. A crucial step in this procedure is mastoidectomy, a surgical intervention that removes part of the mastoid region of the temporal bone, providing a critical pathway to the cochlea for electrode placement. Accurate prediction of the mastoidectomy region from preoperative imaging assists presurgical planning, reduces surgical risks, and improves surgical outcomes. In previous work, a self-supervised network was introduced to predict the mastoidectomy region using only preoperative CT scans. While promising, the method suffered from suboptimal robustness, limiting its practical application. To address this limitation, we propose a novel weakly-supervised Mamba-based framework to predict accurate mastoidectomy regions directly from preoperative CT scans. Our approach utilizes a 3D T-Distribution loss function inspired by the Student-t distribution, which effectively handles the complex geometric variability inherent in mastoidectomy shapes. Weak supervision is achieved using the segmentation results from the prior self-supervised network to eliminate the need for manual data cleaning or labeling throughout the training process. The proposed method is extensively evaluated against state-of-the-art approaches, demonstrating superior performance in predicting accurate and clinically relevant mastoidectomy regions. Our findings highlight the robustness and efficiency of the weakly-supervised learning framework with the proposed novel 3D T-Distribution loss.

Title: Small Models, Smarter Learning: The Power of Joint Task Training

Authors: Csaba Both, Benjamin Hoover, Hendrik Strobelt, Dmitry Krotov, Daniel Karl I. Weidele, Mauro Martino, Nima Dehmamy
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18369
Pdf URL: https://arxiv.org/pdf/2505.18369
Copy Paste: [[2505.18369]] Small Models, Smarter Learning: The Power of Joint Task Training(https://arxiv.org/abs/2505.18369)
Keywords: transformer
Abstract: The ability of a model to learn a task depends strongly on both the task difficulty and the model size. We aim to understand how task difficulty relates to the minimum number of parameters required for learning specific tasks in small transformer models. Our study focuses on the ListOps dataset, which consists of nested mathematical operations. We gradually increase task difficulty by introducing new operations or combinations of operations into the training data. We observe that sum modulo n is the hardest to learn. Curiously, when combined with other operations such as maximum and median, the sum operation becomes easier to learn and requires fewer parameters. We show that joint training not only improves performance but also leads to qualitatively different model behavior. We show evidence that models trained only on SUM might be memorizing and fail to capture the number structure in the embeddings. In contrast, models trained on a mixture of SUM and other operations exhibit number-like representations in the embedding space, and a strong ability to distinguish parity. Furthermore, the SUM-only model relies more heavily on its feedforward layers, while the jointly trained model activates the attention mechanism more. Finally, we show that learning pure SUM can be induced in models below the learning threshold of pure SUM, by pretraining them on MAX+MED. Our findings indicate that emergent abilities in language models depend not only on model size, but also the training curriculum.

Title: NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities

Authors: Abdellah El Mekki, Houdaifa Atou, Omer Nacar, Shady Shehata, Muhammad Abdul-Mageed
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18383
Pdf URL: https://arxiv.org/pdf/2505.18383
Copy Paste: [[2505.18383]] NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities(https://arxiv.org/abs/2505.18383)
Keywords: large language model
Abstract: Enhancing the linguistic capabilities of Large Language Models (LLMs) to include low-resource languages is a critical research area. Current research directions predominantly rely on synthetic data generated by translating English corpora, which, while demonstrating promising linguistic understanding and translation abilities, often results in models aligned with source language culture. These models frequently fail to represent the cultural heritage and values of local communities. This work proposes a methodology to create both synthetic and retrieval-based pre-training data tailored to a specific community, considering its (i) language, (ii) cultural heritage, and (iii) cultural values. We demonstrate our methodology using Egyptian and Moroccan dialects as testbeds, chosen for their linguistic and cultural richness and current underrepresentation in LLMs. As a proof-of-concept, we develop NileChat, a 3B parameter LLM adapted for Egyptian and Moroccan communities, incorporating their language, cultural heritage, and values. Our results on various understanding, translation, and cultural and values alignment benchmarks show that NileChat outperforms existing Arabic-aware LLMs of similar size and performs on par with larger models. We share our methods, data, and models with the community to promote the inclusion and coverage of more diverse communities in LLM development.

Title: Dynamic Risk Assessments for Offensive Cybersecurity Agents

Authors: Boyi Wei, Benedikt Stroebl, Jiacen Xu, Joie Zhang, Zhou Li, Peter Henderson
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18384
Pdf URL: https://arxiv.org/pdf/2505.18384
Copy Paste: [[2505.18384]] Dynamic Risk Assessments for Offensive Cybersecurity Agents(https://arxiv.org/abs/2505.18384)
Keywords: security
Abstract: Foundation models are increasingly becoming better autonomous programmers, raising the prospect that they could also automate dangerous offensive cyber-operations. Current frontier model audits probe the cybersecurity risks of such agents, but most fail to account for the degrees of freedom available to adversaries in the real world. In particular, with strong verifiers and financial incentives, agents for offensive cybersecurity are amenable to iterative improvement by would-be adversaries. We argue that assessments should take into account an expanded threat model in the context of cybersecurity, emphasizing the varying degrees of freedom that an adversary may possess in stateful and non-stateful environments within a fixed compute budget. We show that even with a relatively small compute budget (8 H100 GPU Hours in our study), adversaries can improve an agent's cybersecurity capability on InterCode CTF by more than 40\% relative to the baseline -- without any external assistance. These results highlight the need to evaluate agents' cybersecurity risk in a dynamic manner, painting a more representative picture of risk.

Title: Modeling interdependent privacy threats

Authors: Shuaishuai Liu, Gergely Biczók
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.18386
Pdf URL: https://arxiv.org/pdf/2505.18386
Copy Paste: [[2505.18386]] Modeling interdependent privacy threats(https://arxiv.org/abs/2505.18386)
Keywords: privacy
Abstract: The rise of online social networks, user-gene-rated content, and third-party apps made data sharing an inevitable trend, driven by both user behavior and the commercial value of personal information. As service providers amass vast amounts of data, safeguarding individual privacy has become increasingly challenging. Privacy threat modeling has emerged as a critical tool for identifying and mitigating risks, with methodologies such as LINDDUN, xCOMPASS, and PANOPTIC offering systematic approaches. However, these frameworks primarily focus on threats arising from interactions between a single user and system components, often overlooking interdependent privacy (IDP); the phenomenon where one user's actions affect the privacy of other users and even non-users. IDP risks are particularly pronounced in third-party applications, where platform permissions, APIs, and user behavior can lead to unintended and unconsented data sharing, such as in the Cambridge Analytica case. We argue that existing threat modeling approaches are limited in exposing IDP-related threats, potentially underestimating privacy risks. To bridge this gap, we propose a specialized methodology that explicitly focuses on interdependent privacy. Our contributions are threefold: (i) we identify IDP-specific challenges and limitations in current threat modeling frameworks, (ii) we create IDPA, a threat modeling approach tailored to IDP threats, and (iii) we validate our approach through a case study on WeChat. We believe that IDPA can operate effectively on systems other than third-party apps and may motivate further research on specialized threat modeling.

Title: Applications of Modular Co-Design for De Novo 3D Molecule Generation

Authors: Danny Reidenbach, Filipp Nikitin, Olexandr Isayev, Saee Paliwal
Subjects: cs.LG, cs.AI, q-bio.QM
Abstract URL: https://arxiv.org/abs/2505.18392
Pdf URL: https://arxiv.org/pdf/2505.18392
Copy Paste: [[2505.18392]] Applications of Modular Co-Design for De Novo 3D Molecule Generation(https://arxiv.org/abs/2505.18392)
Keywords: diffusion, transformer, generative
Abstract: De novo 3D molecule generation is a pivotal task in drug discovery. However, many recent geometric generative models struggle to produce high-quality 3D structures, even if they maintain 2D validity and topological stability. To tackle this issue and enhance the learning of effective molecular generation dynamics, we present Megalodon-a family of scalable transformer models. These models are enhanced with basic equivariant layers and trained using a joint continuous and discrete denoising co-design objective. We assess Megalodon's performance on established molecule generation benchmarks and introduce new 3D structure benchmarks that evaluate a model's capability to generate realistic molecular structures, particularly focusing on energetics. We show that Megalodon achieves state-of-the-art results in 3D molecule generation, conditional structure generation, and structure energy benchmarks using diffusion and flow matching. Furthermore, doubling the number of parameters in Megalodon to 40M significantly enhances its performance, generating up to 49x more valid large molecules and achieving energy levels that are 2-10x lower than those of the best prior generative models.

Title: Towards Anonymous Neural Network Inference

Authors: Liao Peiyuan
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18398
Pdf URL: https://arxiv.org/pdf/2505.18398
Copy Paste: [[2505.18398]] Towards Anonymous Neural Network Inference(https://arxiv.org/abs/2505.18398)
Keywords: security, privacy
Abstract: We introduce funion, a system providing end-to-end sender-receiver unlinkability for neural network inference. By leveraging the Pigeonhole storage protocol and BACAP (blinding-and-capability) scheme from the Echomix anonymity system, funion inherits the provable security guarantees of modern mixnets. Users can anonymously store input tensors in pseudorandom storage locations, commission compute services to process them via the neural network, and retrieve results with no traceable connection between input and output parties. This store-compute-store paradigm masks both network traffic patterns and computational workload characteristics, while quantizing execution timing into public latency buckets. Our security analysis demonstrates that funion inherits the strong metadata privacy guarantees of Echomix under largely the same trust assumptions, while introducing acceptable overhead for production-scale workloads. Our work paves the way towards an accessible platform where users can submit fully anonymized inference queries to cloud services.

Title: Taming Diffusion for Dataset Distillation with High Representativeness

Authors: Lin Zhao, Yushu Wu, Xinru Jiang, Jianyang Gu, Yanzhi Wang, Xiaolin Xu, Pu Zhao, Xue Lin
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18399
Pdf URL: https://arxiv.org/pdf/2505.18399
Copy Paste: [[2505.18399]] Taming Diffusion for Dataset Distillation with High Representativeness(https://arxiv.org/abs/2505.18399)
Keywords: diffusion
Abstract: Recent deep learning models demand larger datasets, driving the need for dataset distillation to create compact, cost-efficient datasets while maintaining performance. Due to the powerful image generation capability of diffusion, it has been introduced to this field for generating distilled images. In this paper, we systematically investigate issues present in current diffusion-based dataset distillation methods, including inaccurate distribution matching, distribution deviation with random noise, and separate sampling. Building on this, we propose D^3HR, a novel diffusion-based framework to generate distilled datasets with high representativeness. Specifically, we adopt DDIM inversion to map the latents of the full dataset from a low-normality latent domain to a high-normality Gaussian domain, preserving information and ensuring structural consistency to generate representative latents for the distilled dataset. Furthermore, we propose an efficient sampling scheme to better align the representative latents with the high-normality Gaussian distribution. Our comprehensive experiments demonstrate that D^3HR can achieve higher accuracy across different model architectures compared with state-of-the-art baselines in dataset distillation. Source code: this https URL.

Title: AI/ML for 5G and Beyond Cybersecurity

Authors: Sandeep Pirbhulal, Habtamu Abie, Martin Jullum, Didrik Nielsen, Anders Løland
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.18402
Pdf URL: https://arxiv.org/pdf/2505.18402
Copy Paste: [[2505.18402]] AI/ML for 5G and Beyond Cybersecurity(https://arxiv.org/abs/2505.18402)
Keywords: secure, security, attack
Abstract: The advancements in communication technology (5G and beyond) and global connectivity Internet of Things (IoT) also come with new security problems that will need to be addressed in the next few years. The threats and vulnerabilities introduced by AI/ML based 5G and beyond IoT systems need to be investigated to avoid the amplification of attack vectors on AI/ML. AI/ML techniques are playing a vital role in numerous applications of cybersecurity. Despite the ongoing success, there are significant challenges in ensuring the trustworthiness of AI/ML systems. However, further research is needed to define what is considered an AI/ML threat and how it differs from threats to traditional systems, as currently there is no common understanding of what constitutes an attack on AI/ML based systems, nor how it might be created, hosted and propagated [ETSI, 2020]. Therefore, there is a need for studying the AI/ML approach to ensure safe and secure development, deployment, and operation of AI/ML based 5G and beyond IoT systems. For 5G and beyond, it is essential to continuously monitor and analyze any changing environment in real-time to identify and reduce intentional and unintentional risks. In this study, we will review the role of the AI/ML technique for 5G and beyond security. Furthermore, we will provide our perspective for predicting and mitigating 5G and beyond security using AI/ML techniques.

Title: Thought calibration: Efficient and confident test-time scaling

Authors: Menghua Wu, Cai Zhou, Stephen Bates, Tommi Jaakkola
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18404
Pdf URL: https://arxiv.org/pdf/2505.18404
Copy Paste: [[2505.18404]] Thought calibration: Efficient and confident test-time scaling(https://arxiv.org/abs/2505.18404)
Keywords: large language model
Abstract: Reasoning large language models achieve impressive test-time scaling by thinking for longer, but this performance gain comes at significant compute cost. Directly limiting test-time budget hurts overall performance, but not all problems are equally difficult. We propose thought calibration to decide dynamically when thinking can be terminated. To calibrate our decision rule, we view a language model's growing body of thoughts as a nested sequence of reasoning trees, where the goal is to identify the point at which novel reasoning plateaus. We realize this framework through lightweight probes that operate on top of the language model's hidden representations, which are informative of both the reasoning structure and overall consistency of response. Based on three reasoning language models and four datasets, thought calibration preserves model performance with up to a 60% reduction in thinking tokens on in-distribution data, and up to 20% in out-of-distribution data.

Title: RaDeR: Reasoning-aware Dense Retrieval Models

Authors: Debrup Das, Sam O' Nuallain, Razieh Rahimi
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.18405
Pdf URL: https://arxiv.org/pdf/2505.18405
Copy Paste: [[2505.18405]] RaDeR: Reasoning-aware Dense Retrieval Models(https://arxiv.org/abs/2505.18405)
Keywords: large language model
Abstract: We propose RaDeR, a set of reasoning-based dense retrieval models trained with data derived from mathematical problem solving using large language models (LLMs). Our method leverages retrieval-augmented reasoning trajectories of an LLM and self-reflective relevance evaluation, enabling the creation of both diverse and hard-negative samples for reasoning-intensive relevance. RaDeR retrievers, trained for mathematical reasoning, effectively generalize to diverse reasoning tasks in the BRIGHT and RAR-b benchmarks, consistently outperforming strong baselines in overall this http URL, RaDeR achieves significantly higher performance than baselines on the Math and Coding splits. In addition, RaDeR presents the first dense retriever that outperforms BM25 when queries are Chain-of-Thought reasoning steps, underscoring the critical role of reasoning-based retrieval to augment reasoning language models. Furthermore, RaDeR achieves comparable or superior performance while using only 2.5% of the training data used by the concurrent work REASONIR, highlighting the quality of our synthesized training data.

Title: KL-regularization Itself is Differentially Private in Bandits and RLHF

Authors: Yizhou Zhang, Kishan Panaganti, Laixi Shi, Juba Ziani, Adam Wierman
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18407
Pdf URL: https://arxiv.org/pdf/2505.18407
Copy Paste: [[2505.18407]] KL-regularization Itself is Differentially Private in Bandits and RLHF(https://arxiv.org/abs/2505.18407)
Keywords: privacy
Abstract: Differential Privacy (DP) provides a rigorous framework for privacy, ensuring the outputs of data-driven algorithms remain statistically indistinguishable across datasets that differ in a single entry. While guaranteeing DP generally requires explicitly injecting noise either to the algorithm itself or to its outputs, the intrinsic randomness of existing algorithms presents an opportunity to achieve DP ``for free''. In this work, we explore the role of regularization in achieving DP across three different decision-making problems: multi-armed bandits, linear contextual bandits, and reinforcement learning from human feedback (RLHF), in offline data settings. We show that adding KL-regularization to the learning objective (a common approach in optimization algorithms) makes the action sampled from the resulting stochastic policy itself differentially private. This offers a new route to privacy guarantees without additional noise injection, while also preserving the inherent advantage of regularization in enhancing performance.

Title: DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding

Authors: Yue Jiang, Jichu Li, Yang Liu, Dingkang Yang, Feng Zhou, Quyu Kong
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18411
Pdf URL: https://arxiv.org/pdf/2505.18411
Copy Paste: [[2505.18411]] DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding(https://arxiv.org/abs/2505.18411)
Keywords: large language model
Abstract: We introduce DanmakuTPPBench, a comprehensive benchmark designed to advance multi-modal Temporal Point Process (TPP) modeling in the era of Large Language Models (LLMs). While TPPs have been widely studied for modeling temporal event sequences, existing datasets are predominantly unimodal, hindering progress in models that require joint reasoning over temporal, textual, and visual information. To address this gap, DanmakuTPPBench comprises two complementary components: (1) DanmakuTPP-Events, a novel dataset derived from the Bilibili video platform, where user-generated bullet comments (Danmaku) naturally form multi-modal events annotated with precise timestamps, rich textual content, and corresponding video frames; (2) DanmakuTPP-QA, a challenging question-answering dataset constructed via a novel multi-agent pipeline powered by state-of-the-art LLMs and multi-modal LLMs (MLLMs), targeting complex temporal-textual-visual reasoning. We conduct extensive evaluations using both classical TPP models and recent MLLMs, revealing significant performance gaps and limitations in current methods' ability to model multi-modal event dynamics. Our benchmark establishes strong baselines and calls for further integration of TPP modeling into the multi-modal language modeling landscape. The code and dataset have been released at this https URL

Title: Rehabilitation Exercise Quality Assessment and Feedback Generation Using Large Language Models with Prompt Engineering

Authors: Jessica Tang, Ali Abedi, Tracey J.F. Colella, Shehroz S. Khan
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2505.18412
Pdf URL: https://arxiv.org/pdf/2505.18412
Copy Paste: [[2505.18412]] Rehabilitation Exercise Quality Assessment and Feedback Generation Using Large Language Models with Prompt Engineering(https://arxiv.org/abs/2505.18412)
Keywords: large language model
Abstract: Exercise-based rehabilitation improves quality of life and reduces morbidity, mortality, and rehospitalization, though transportation constraints and staff shortages lead to high dropout rates from rehabilitation programs. Virtual platforms enable patients to complete prescribed exercises at home, while AI algorithms analyze performance, deliver feedback, and update clinicians. Although many studies have developed machine learning and deep learning models for exercise quality assessment, few have explored the use of large language models (LLMs) for feedback and are limited by the lack of rehabilitation datasets containing textual feedback. In this paper, we propose a new method in which exercise-specific features are extracted from the skeletal joints of patients performing rehabilitation exercises and fed into pre-trained LLMs. Using a range of prompting techniques, such as zero-shot, few-shot, chain-of-thought, and role-play prompting, LLMs are leveraged to evaluate exercise quality and provide feedback in natural language to help patients improve their movements. The method was evaluated through extensive experiments on two publicly available rehabilitation exercise assessment datasets (UI-PRMD and REHAB24-6) and showed promising results in exercise assessment, reasoning, and feedback generation. This approach can be integrated into virtual rehabilitation platforms to help patients perform exercises correctly, support recovery, and improve health outcomes.

Title: LatentLLM: Attention-Aware Joint Tensor Compression

Authors: Toshiaki Koike-Akino, Xiangyu Chen, Jing Liu, Ye Wang, Pu (Perry)Wang, Matthew Brand
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.18413
Pdf URL: https://arxiv.org/pdf/2505.18413
Copy Paste: [[2505.18413]] LatentLLM: Attention-Aware Joint Tensor Compression(https://arxiv.org/abs/2505.18413)
Keywords: large language model
Abstract: Modern foundation models such as large language models (LLMs) and large multi-modal models (LMMs) require a massive amount of computational and memory resources. We propose a new framework to convert such LLMs/LMMs into a reduced-dimension latent structure. Our method extends a local activation-aware tensor decomposition to a global attention-aware joint tensor de-composition. Our framework can significantly improve the model accuracy over the existing model compression methods when reducing the latent dimension to realize computationally/memory-efficient LLMs/LLMs. We show the benefit on several benchmark including multi-modal reasoning tasks.

Title: A Dual Basis Approach for Structured Robust Euclidean Distance Geometry

Authors: Chandra Kundu, Abiy Tasissa, HanQin Cai
Subjects: cs.LG, cs.IT, math.OC, stat.ML
Abstract URL: https://arxiv.org/abs/2505.18414
Pdf URL: https://arxiv.org/pdf/2505.18414
Copy Paste: [[2505.18414]] A Dual Basis Approach for Structured Robust Euclidean Distance Geometry(https://arxiv.org/abs/2505.18414)
Keywords: robust
Abstract: Euclidean Distance Matrix (EDM), which consists of pairwise squared Euclidean distances of a given point configuration, finds many applications in modern machine learning. This paper considers the setting where only a set of anchor nodes is used to collect the distances between themselves and the rest. In the presence of potential outliers, it results in a structured partial observation on EDM with partial corruptions. Note that an EDM can be connected to a positive semi-definite Gram matrix via a non-orthogonal dual basis. Inspired by recent development of non-orthogonal dual basis in optimization, we propose a novel algorithmic framework, dubbed Robust Euclidean Distance Geometry via Dual Basis (RoDEoDB), for recovering the Euclidean distance geometry, i.e., the underlying point configuration. The exact recovery guarantees have been established in terms of both the Gram matrix and point configuration, under some mild conditions. Empirical experiments show superior performance of RoDEoDB on sensor localization and molecular conformation datasets.

Title: CENet: Context Enhancement Network for Medical Image Segmentation

Authors: Afshin Bozorgpour, Sina Ghorbani Kolahi, Reza Azad, Ilker Hacihaliloglu, Dorit Merhof
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18423
Pdf URL: https://arxiv.org/pdf/2505.18423
Copy Paste: [[2505.18423]] CENet: Context Enhancement Network for Medical Image Segmentation(https://arxiv.org/abs/2505.18423)
Keywords: robust, segmentation
Abstract: Medical image segmentation, particularly in multi-domain scenarios, requires precise preservation of anatomical structures across diverse representations. While deep learning has advanced this field, existing models often struggle with accurate boundary representation, variability in organ morphology, and information loss during downsampling, limiting their accuracy and robustness. To address these challenges, we propose the Context Enhancement Network (CENet), a novel segmentation framework featuring two key innovations. First, the Dual Selective Enhancement Block (DSEB) integrated into skip connections enhances boundary details and improves the detection of smaller organs in a context-aware manner. Second, the Context Feature Attention Module (CFAM) in the decoder employs a multi-scale design to maintain spatial integrity, reduce feature redundancy, and mitigate overly enhanced representations. Extensive evaluations on both radiology and dermoscopic datasets demonstrate that CENet outperforms state-of-the-art (SOTA) methods in multi-organ segmentation and boundary detail preservation, offering a robust and accurate solution for complex medical image analysis tasks. The code is publicly available at this https URL.

Title: Retrieval Augmented Generation-based Large Language Models for Bridging Transportation Cybersecurity Legal Knowledge Gaps

Authors: Khandakar Ashrafi Akbar, Md Nahiyan Uddin, Latifur Khan, Trayce Hockstad, Mizanur Rahman, Mashrur Chowdhury, Bhavani Thuraisingham
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18426
Pdf URL: https://arxiv.org/pdf/2505.18426
Copy Paste: [[2505.18426]] Retrieval Augmented Generation-based Large Language Models for Bridging Transportation Cybersecurity Legal Knowledge Gaps(https://arxiv.org/abs/2505.18426)
Keywords: security, privacy, large language model
Abstract: As connected and automated transportation systems evolve, there is a growing need for federal and state authorities to revise existing laws and develop new statutes to address emerging cybersecurity and data privacy challenges. This study introduces a Retrieval-Augmented Generation (RAG) based Large Language Model (LLM) framework designed to support policymakers by extracting relevant legal content and generating accurate, inquiry-specific responses. The framework focuses on reducing hallucinations in LLMs by using a curated set of domain-specific questions to guide response generation. By incorporating retrieval mechanisms, the system enhances the factual grounding and specificity of its outputs. Our analysis shows that the proposed RAG-based LLM outperforms leading commercial LLMs across four evaluation metrics: AlignScore, ParaScore, BERTScore, and ROUGE, demonstrating its effectiveness in producing reliable and context-aware legal insights. This approach offers a scalable, AI-driven method for legislative analysis, supporting efforts to update legal frameworks in line with advancements in transportation technologies.

Title: TNG-CLIP:Training-Time Negation Data Generation for Negation Awareness of CLIP

Authors: Yuliang Cai, Jesse Thomason, Mohammad Rostami
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18434
Pdf URL: https://arxiv.org/pdf/2505.18434
Copy Paste: [[2505.18434]] TNG-CLIP:Training-Time Negation Data Generation for Negation Awareness of CLIP(https://arxiv.org/abs/2505.18434)
Keywords: large language model
Abstract: Vision-language models (VLMs), such as CLIP, have demonstrated strong performance across a range of downstream tasks. However, CLIP is still limited in negation understanding: the ability to recognize the absence or exclusion of a concept. Existing methods address the problem by using a large language model (LLM) to generate large-scale data of image captions containing negation for further fine-tuning CLIP. However, these methods are both time- and compute-intensive, and their evaluations are typically restricted to image-text matching tasks. To expand the horizon, we (1) introduce a training-time negation data generation pipeline such that negation captions are generated during the training stage, which only increases 2.5% extra training time, and (2) we propose the first benchmark, Neg-TtoI, for evaluating text-to-image generation models on prompts containing negation, assessing model's ability to produce semantically accurate images. We show that our proposed method, TNG-CLIP, achieves SOTA performance on diverse negation benchmarks of image-to-text matching, text-to-image retrieval, and image generation.

Title: DB-KSVD: Scalable Alternating Optimization for Disentangling High-Dimensional Embedding Spaces

Authors: Romeo Valentin, Sydney M. Katz, Vincent Vanhoucke, Mykel J. Kochenderfer
Subjects: cs.LG, cs.MS, stat.AP
Abstract URL: https://arxiv.org/abs/2505.18441
Pdf URL: https://arxiv.org/pdf/2505.18441
Copy Paste: [[2505.18441]] DB-KSVD: Scalable Alternating Optimization for Disentangling High-Dimensional Embedding Spaces(https://arxiv.org/abs/2505.18441)
Keywords: interpretability, transformer
Abstract: Dictionary learning has recently emerged as a promising approach for mechanistic interpretability of large transformer models. Disentangling high-dimensional transformer embeddings, however, requires algorithms that scale to high-dimensional data with large sample sizes. Recent work has explored sparse autoencoders (SAEs) for this problem. However, SAEs use a simple linear encoder to solve the sparse encoding subproblem, which is known to be NP-hard. It is therefore interesting to understand whether this structure is sufficient to find good solutions to the dictionary learning problem or if a more sophisticated algorithm could find better solutions. In this work, we propose Double-Batch KSVD (DB-KSVD), a scalable dictionary learning algorithm that adapts the classic KSVD algorithm. DB-KSVD is informed by the rich theoretical foundations of KSVD but scales to datasets with millions of samples and thousands of dimensions. We demonstrate the efficacy of DB-KSVD by disentangling embeddings of the Gemma-2-2B model and evaluating on six metrics from the SAEBench benchmark, where we achieve competitive results when compared to established approaches based on SAEs. By matching SAE performance with an entirely different optimization approach, our results suggest that (i) SAEs do find strong solutions to the dictionary learning problem and (ii) that traditional optimization approaches can be scaled to the required problem sizes, offering a promising avenue for further research. We provide an implementation of DB-KSVD at this https URL.

Title: OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data

Authors: Yiren Song, Cheng Liu, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18445
Pdf URL: https://arxiv.org/pdf/2505.18445
Copy Paste: [[2505.18445]] OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data(https://arxiv.org/abs/2505.18445)
Keywords: robust, diffusion, transformer
Abstract: Diffusion models have advanced image stylization significantly, yet two core challenges persist: (1) maintaining consistent stylization in complex scenes, particularly identity, composition, and fine details, and (2) preventing style degradation in image-to-image pipelines with style LoRAs. GPT-4o's exceptional stylization consistency highlights the performance gap between open-source methods and proprietary models. To bridge this gap, we propose \textbf{OmniConsistency}, a universal consistency plugin leveraging large-scale Diffusion Transformers (DiTs). OmniConsistency contributes: (1) an in-context consistency learning framework trained on aligned image pairs for robust generalization; (2) a two-stage progressive learning strategy decoupling style learning from consistency preservation to mitigate style degradation; and (3) a fully plug-and-play design compatible with arbitrary style LoRAs under the Flux framework. Extensive experiments show that OmniConsistency significantly enhances visual coherence and aesthetic quality, achieving performance comparable to commercial state-of-the-art model GPT-4o.

Title: Mitigating Context Bias in Domain Adaptation for Object Detection using Mask Pooling

Authors: Hojun Son, Asma Almutairi, Arpan Kusari
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18446
Pdf URL: https://arxiv.org/pdf/2505.18446
Copy Paste: [[2505.18446]] Mitigating Context Bias in Domain Adaptation for Object Detection using Mask Pooling(https://arxiv.org/abs/2505.18446)
Keywords: robust
Abstract: Context bias refers to the association between the foreground objects and background during the object detection training process. Various methods have been proposed to minimize the context bias when applying the trained model to an unseen domain, known as domain adaptation for object detection (DAOD). But a principled approach to understand why the context bias occurs and how to remove it has been missing. In this work, we provide a causal view of the context bias, pointing towards the pooling operation in the convolution network architecture as the possible source of this bias. We present an alternative, Mask Pooling, which uses an additional input of foreground masks, to separate the pooling process in the respective foreground and background regions and show that this process leads the trained model to detect objects in a more robust manner under different domains. We also provide a benchmark designed to create an ultimate test for DAOD, using foregrounds in the presence of absolute random backgrounds, to analyze the robustness of the intended trained models. Through these experiments, we hope to provide a principled approach for minimizing context bias under domain shift.

Title: Pessimism Principle Can Be Effective: Towards a Framework for Zero-Shot Transfer Reinforcement Learning

Authors: Chi Zhang, Ziying Jia, George K. Atia, Sihong He, Yue Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18447
Pdf URL: https://arxiv.org/pdf/2505.18447
Copy Paste: [[2505.18447]] Pessimism Principle Can Be Effective: Towards a Framework for Zero-Shot Transfer Reinforcement Learning(https://arxiv.org/abs/2505.18447)
Keywords: robust
Abstract: Transfer reinforcement learning aims to derive a near-optimal policy for a target environment with limited data by leveraging abundant data from related source domains. However, it faces two key challenges: the lack of performance guarantees for the transferred policy, which can lead to undesired actions, and the risk of negative transfer when multiple source domains are involved. We propose a novel framework based on the pessimism principle, which constructs and optimizes a conservative estimation of the target domain's performance. Our framework effectively addresses the two challenges by providing an optimized lower bound on target performance, ensuring safe and reliable decisions, and by exhibiting monotonic improvement with respect to the quality of the source domains, thereby avoiding negative transfer. We construct two types of conservative estimations, rigorously characterize their effectiveness, and develop efficient distributed algorithms with convergence guarantees. Our framework provides a theoretically sound and practically robust solution for transfer learning in reinforcement learning.

Title: BRIT: Bidirectional Retrieval over Unified Image-Text Graph

Authors: Ainulla Khan, Yamada Moyuru, Srinidhi Akella
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18450
Pdf URL: https://arxiv.org/pdf/2505.18450
Copy Paste: [[2505.18450]] BRIT: Bidirectional Retrieval over Unified Image-Text Graph(https://arxiv.org/abs/2505.18450)
Keywords: large language model
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a promising technique to enhance the quality and relevance of responses generated by large language models. While recent advancements have mainly focused on improving RAG for text-based queries, RAG on multi-modal documents containing both texts and images has not been fully explored. Especially when fine-tuning does not work. This paper proposes BRIT, a novel multi-modal RAG framework that effectively unifies various text-image connections in the document into a multi-modal graph and retrieves the texts and images as a query-specific sub-graph. By traversing both image-to-text and text-to-image paths in the graph, BRIT retrieve not only directly query-relevant images and texts but also further relevant contents to answering complex cross-modal multi-hop questions. To evaluate the effectiveness of BRIT, we introduce MM-RAG test set specifically designed for multi-modal question answering tasks that require to understand the text-image relations. Our comprehensive experiments demonstrate the superiority of BRIT, highlighting its ability to handle cross-modal questions on the multi-modal documents.

Title: MedScore: Factuality Evaluation of Free-Form Medical Answers

Authors: Heyuan Huang, Alexandra DeLucia, Vijay Murari Tiyyala, Mark Dredze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18452
Pdf URL: https://arxiv.org/pdf/2505.18452
Copy Paste: [[2505.18452]] MedScore: Factuality Evaluation of Free-Form Medical Answers(https://arxiv.org/abs/2505.18452)
Keywords: large language model
Abstract: While Large Language Models (LLMs) can generate fluent and convincing responses, they are not necessarily correct. This is especially apparent in the popular decompose-then-verify factuality evaluation pipeline, where LLMs evaluate generations by decomposing the generations into individual, valid claims. Factuality evaluation is especially important for medical answers, since incorrect medical information could seriously harm the patient. However, existing factuality systems are a poor match for the medical domain, as they are typically only evaluated on objective, entity-centric, formulaic texts such as biographies and historical topics. This differs from condition-dependent, conversational, hypothetical, sentence-structure diverse, and subjective medical answers, which makes decomposition into valid facts challenging. We propose MedScore, a new approach to decomposing medical answers into condition-aware valid facts. Our method extracts up to three times more valid facts than existing methods, reducing hallucination and vague references, and retaining condition-dependency in facts. The resulting factuality score significantly varies by decomposition method, verification corpus, and used backbone LLM, highlighting the importance of customizing each step for reliable factuality evaluation.

Title: Hybrid Latent Reasoning via Reinforcement Learning

Authors: Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu Shang, Jiawei Han, Dong Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18454
Pdf URL: https://arxiv.org/pdf/2505.18454
Copy Paste: [[2505.18454]] Hybrid Latent Reasoning via Reinforcement Learning(https://arxiv.org/abs/2505.18454)
Keywords: generative, large language model
Abstract: Recent advances in large language models (LLMs) have introduced latent reasoning as a promising alternative to autoregressive reasoning. By performing internal computation with hidden states from previous steps, latent reasoning benefit from more informative features rather than sampling a discrete chain-of-thought (CoT) path. Yet latent reasoning approaches are often incompatible with LLMs, as their continuous paradigm conflicts with the discrete nature of autoregressive generation. Moreover, these methods rely on CoT traces for training and thus fail to exploit the inherent reasoning patterns of LLMs. In this work, we explore latent reasoning by leveraging the intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid latent reasoning approach that (1) integrates prior hidden states into sampled tokens with a learnable gating mechanism, and (2) initializes training with predominantly token embeddings while progressively incorporating more hidden features. This design maintains LLMs' generative capabilities and incentivizes hybrid reasoning using both discrete and continuous representations. In addition, the hybrid HRPO introduces stochasticity into latent reasoning via token sampling, thereby enabling RL-based optimization without requiring CoT trajectories. Extensive evaluations across diverse benchmarks show that HRPO outperforms prior methods in both knowledge- and reasoning-intensive tasks. Furthermore, HRPO-trained LLMs remain interpretable and exhibit intriguing behaviors like cross-lingual patterns and shorter completion lengths, highlighting the potential of our RL-based approach and offer insights for future work in latent reasoning.

Title: Anchored Diffusion Language Model

Authors: Litu Rout, Constantine Caramanis, Sanjay Shakkottai
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18456
Pdf URL: https://arxiv.org/pdf/2505.18456
Copy Paste: [[2505.18456]] Anchored Diffusion Language Model(https://arxiv.org/abs/2505.18456)
Keywords: diffusion
Abstract: Diffusion Language Models (DLMs) promise parallel generation and bidirectional context, yet they underperform autoregressive (AR) models in both likelihood modeling and generated text quality. We identify that this performance gap arises when important tokens (e.g., key words or low-frequency words that anchor a sentence) are masked early in the forward process, limiting contextual information for accurate reconstruction. To address this, we introduce the Anchored Diffusion Language Model (ADLM), a novel two-stage framework that first predicts distributions over important tokens via an anchor network, and then predicts the likelihoods of missing tokens conditioned on the anchored predictions. ADLM significantly improves test perplexity on LM1B and OpenWebText, achieving up to 25.4% gains over prior DLMs, and narrows the gap with strong AR baselines. It also achieves state-of-the-art performance in zero-shot generalization across seven benchmarks and surpasses AR models in MAUVE score, which marks the first time a DLM generates better human-like text than an AR model. Theoretically, we derive an Anchored Negative Evidence Lower Bound (ANELBO) objective and show that anchoring improves sample complexity and likelihood modeling. Beyond diffusion, anchoring boosts performance in AR models and enhances reasoning in math and logic tasks, outperforming existing chain-of-thought approaches

Title: Measuring South Asian Biases in Large Language Models

Authors: Mamnuya Rinki, Chahat Raj, Anjishnu Mukherjee, Ziwei Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18466
Pdf URL: https://arxiv.org/pdf/2505.18466
Copy Paste: [[2505.18466]] Measuring South Asian Biases in Large Language Models(https://arxiv.org/abs/2505.18466)
Keywords: generative, large language model
Abstract: Evaluations of Large Language Models (LLMs) often overlook intersectional and culturally specific biases, particularly in underrepresented multilingual regions like South Asia. This work addresses these gaps by conducting a multilingual and intersectional analysis of LLM outputs across 10 Indo-Aryan and Dravidian languages, identifying how cultural stigmas influenced by purdah and patriarchy are reinforced in generative tasks. We construct a culturally grounded bias lexicon capturing previously unexplored intersectional dimensions including gender, religion, marital status, and number of children. We use our lexicon to quantify intersectional bias and the effectiveness of self-debiasing in open-ended generations (e.g., storytelling, hobbies, and to-do lists), where bias manifests subtly and remains largely unexamined in multilingual contexts. Finally, we evaluate two self-debiasing strategies (simple and complex prompts) to measure their effectiveness in reducing culturally specific bias in Indo-Aryan and Dravidian languages. Our approach offers a nuanced lens into cultural bias by introducing a novel bias lexicon and evaluation framework that extends beyond Eurocentric or small-scale multilingual settings.

Title: HonestFace: Towards Honest Face Restoration with One-Step Diffusion Model

Authors: Jingkai Wang, Wu Miao, Jue Gong, Zheng Chen, Xing Liu, Hong Gu, Yutong Liu, Yulun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18469
Pdf URL: https://arxiv.org/pdf/2505.18469
Copy Paste: [[2505.18469]] HonestFace: Towards Honest Face Restoration with One-Step Diffusion Model(https://arxiv.org/abs/2505.18469)
Keywords: diffusion
Abstract: Face restoration has achieved remarkable advancements through the years of development. However, ensuring that restored facial images exhibit high fidelity, preserve authentic features, and avoid introducing artifacts or biases remains a significant challenge. This highlights the need for models that are more "honest" in their reconstruction from low-quality inputs, accurately reflecting original characteristics. In this work, we propose HonestFace, a novel approach designed to restore faces with a strong emphasis on such honesty, particularly concerning identity consistency and texture realism. To achieve this, HonestFace incorporates several key components. First, we propose an identity embedder to effectively capture and preserve crucial identity features from both the low-quality input and multiple reference faces. Second, a masked face alignment method is presented to enhance fine-grained details and textural authenticity, thereby preventing the generation of patterned or overly synthetic textures and improving overall clarity. Furthermore, we present a new landmark-based evaluation metric. Based on affine transformation principles, this metric improves the accuracy compared to conventional L2 distance calculations for facial feature alignment. Leveraging these contributions within a one-step diffusion model framework, HonestFace delivers exceptional restoration results in terms of facial fidelity and realism. Extensive experiments demonstrate that our approach surpasses existing state-of-the-art methods, achieving superior performance in both visual quality and quantitative assessments. The code and pre-trained models will be made publicly available at this https URL .

Title: Invisible Tokens, Visible Bills: The Urgent Need to Audit Hidden Operations in Opaque LLM Services

Authors: Guoheng Sun, Ziyao Wang, Xuandong Zhao, Bowei Tian, Zheyu Shen, Yexiao He, Jinming Xing, Ang Li
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18471
Pdf URL: https://arxiv.org/pdf/2505.18471
Copy Paste: [[2505.18471]] Invisible Tokens, Visible Bills: The Urgent Need to Audit Hidden Operations in Opaque LLM Services(https://arxiv.org/abs/2505.18471)
Keywords: secure, watermark, large language model
Abstract: Modern large language model (LLM) services increasingly rely on complex, often abstract operations, such as multi-step reasoning and multi-agent collaboration, to generate high-quality outputs. While users are billed based on token consumption and API usage, these internal steps are typically not visible. We refer to such systems as Commercial Opaque LLM Services (COLS). This position paper highlights emerging accountability challenges in COLS: users are billed for operations they cannot observe, verify, or contest. We formalize two key risks: \textit{quantity inflation}, where token and call counts may be artificially inflated, and \textit{quality downgrade}, where providers might quietly substitute lower-cost models or tools. Addressing these risks requires a diverse set of auditing strategies, including commitment-based, predictive, behavioral, and signature-based methods. We further explore the potential of complementary mechanisms such as watermarking and trusted execution environments to enhance verifiability without compromising provider confidentiality. We also propose a modular three-layer auditing framework for COLS and users that enables trustworthy verification across execution, secure logging, and user-facing auditability without exposing proprietary internals. Our aim is to encourage further research and policy development toward transparency, auditability, and accountability in commercial LLM services.

Title: Using Large Language Models to Tackle Fundamental Challenges in Graph Learning: A Comprehensive Survey

Authors: Mengran Li, Pengyu Zhang, Wenbin Xing, Yijia Zheng, Klim Zaporojets, Junzhou Chen, Ronghui Zhang, Yong Zhang, Siyuan Gong, Jia Hu, Xiaolei Ma, Zhiyuan Liu, Paul Groth, Marcel Worring
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18475
Pdf URL: https://arxiv.org/pdf/2505.18475
Copy Paste: [[2505.18475]] Using Large Language Models to Tackle Fundamental Challenges in Graph Learning: A Comprehensive Survey(https://arxiv.org/abs/2505.18475)
Keywords: large language model
Abstract: Graphs are a widely used paradigm for representing non-Euclidean data, with applications ranging from social network analysis to biomolecular prediction. Conventional graph learning approaches typically rely on fixed structural assumptions or fully observed data, limiting their effectiveness in more complex, noisy, or evolving settings. Consequently, real-world graph data often violates the assumptions of traditional graph learning methods, in particular, it leads to four fundamental challenges: (1) Incompleteness, real-world graphs have missing nodes, edges, or attributes; (2) Imbalance, the distribution of the labels of nodes or edges and their structures for real-world graphs are highly skewed; (3) Cross-domain Heterogeneity, graphs from different domains exhibit incompatible feature spaces or structural patterns; and (4) Dynamic Instability, graphs evolve over time in unpredictable ways. Recent advances in Large Language Models (LLMs) offer the potential to tackle these challenges by leveraging rich semantic reasoning and external knowledge. This survey provides a comprehensive review of how LLMs can be integrated with graph learning to address the aforementioned challenges. For each challenge, we review both traditional solutions and modern LLM-driven approaches, highlighting how LLMs contribute unique advantages. Finally, we discuss open research questions and promising future directions in this emerging interdisciplinary field. To support further exploration, we have curated a repository of recent advances on graph learning challenges: this https URL.

Title: Syn3DTxt: Embedding 3D Cues for Scene Text Generation

Authors: Li-Syun Hsiung, Jun-Kai Tu, Kuan-Wu Chu, Yu-Hsuan Chiu, Yan-Tsung Peng, Sheng-Luen Chung, Gee-Sern Jison Hsu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18479
Pdf URL: https://arxiv.org/pdf/2505.18479
Copy Paste: [[2505.18479]] Syn3DTxt: Embedding 3D Cues for Scene Text Generation(https://arxiv.org/abs/2505.18479)
Keywords: robust, diffusion
Abstract: This study aims to investigate the challenge of insufficient three-dimensional context in synthetic datasets for scene text rendering. Although recent advances in diffusion models and related techniques have improved certain aspects of scene text generation, most existing approaches continue to rely on 2D data, sourcing authentic training examples from movie posters and book covers, which limits their ability to capture the complex interactions among spatial layout and visual effects in real-world scenes. In particular, traditional 2D datasets do not provide the necessary geometric cues for accurately embedding text into diverse backgrounds. To address this limitation, we propose a novel standard for constructing synthetic datasets that incorporates surface normals to enrich three-dimensional scene characteristic. By adding surface normals to conventional 2D data, our approach aims to enhance the representation of spatial relationships and provide a more robust foundation for future scene text rendering methods. Extensive experiments demonstrate that datasets built under this new standard offer improved geometric context, facilitating further advancements in text rendering under complex 3D-spatial conditions.

Title: The Prompt is Mightier than the Example

Authors: Shengzhe Xu, Nikhil Muralidhar, Naren Ramakrishnan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18485
Pdf URL: https://arxiv.org/pdf/2505.18485
Copy Paste: [[2505.18485]] The Prompt is Mightier than the Example(https://arxiv.org/abs/2505.18485)
Keywords: large language model
Abstract: Numerous recent prompt optimization approaches like chain-of-thought, have been demonstrated to significantly improve the quality of content generated by large language models (LLMs). In-context learning (ICL), a recent paradigm where a few representative examples guide content generation has also led to strong improvements in generation quality of LLM generated content. This idea has been applied to great effect in synthetic tabular data generation, where LLMs, through effective use of ICL and prompt optimization, can generate data that approximate samples from complex, heterogeneous distributions based on representative examples. However, ensuring high-fidelity synthetic data often requires a very large number of ICL examples which may be unavailable or costly to obtain. At the same time, as LLMs get larger and larger, their in-built prior knowledge becomes vast and can potentially substitute for specific data examples. In this paper, we introduce Knowledge-Guided Prompting (KGP) as a new knob in prompt optimization and explore the ability of KGP-based prompt optimization to offset the cost of ICL. Specifically, we explore the question `how many examples can a prompt substitute for?' and explore knowledge-guided prompting (KGP) where domain knowledge, either inferred or available, is explicitly injected into the prompt, reducing dependence on ICL examples. Our experiments systematically explore the trade-off between ICL and KGP, revealing an empirical scaling law that quantifies how quality of generated synthetic data varies with increasing domain knowledge and decreasing example count. Our results demonstrate that knowledge-guided prompting can be a scalable alternative, or addition, to in-context examples, unlocking new approaches to synthetic data generation.

Title: Investigating AI Rater Effects of Large Language Models: GPT, Claude, Gemini, and DeepSeek

Authors: Hong Jiao, Dan Song, Won-Chan Lee
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18486
Pdf URL: https://arxiv.org/pdf/2505.18486
Copy Paste: [[2505.18486]] Investigating AI Rater Effects of Large Language Models: GPT, Claude, Gemini, and DeepSeek(https://arxiv.org/abs/2505.18486)
Keywords: large language model
Abstract: Large language models (LLMs) have been widely explored for automated scoring in low-stakes assessment to facilitate learning and instruction. Empirical evidence related to which LLM produces the most reliable scores and induces least rater effects needs to be collected before the use of LLMs for automated scoring in practice. This study compared ten LLMs (ChatGPT 3.5, ChatGPT 4, ChatGPT 4o, OpenAI o1, Claude 3.5 Sonnet, Gemini 1.5, Gemini 1.5 Pro, Gemini 2.0, as well as DeepSeek V3, and DeepSeek R1) with human expert raters in scoring two types of writing tasks. The accuracy of the holistic and analytic scores from LLMs compared with human raters was evaluated in terms of Quadratic Weighted Kappa. Intra-rater consistency across prompts was compared in terms of Cronbach Alpha. Rater effects of LLMs were evaluated and compared with human raters using the Many-Facet Rasch model. The results in general supported the use of ChatGPT 4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet with high scoring accuracy, better rater reliability, and less rater effects.

Title: Synthesizing and Adapting Error Correction Data for Mobile Large Language Model Applications

Authors: Yanxiang Zhang, Zheng Xu, Shanshan Wu, Yuanbo Zhang, Daniel Ramage
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.18488
Pdf URL: https://arxiv.org/pdf/2505.18488
Copy Paste: [[2505.18488]] Synthesizing and Adapting Error Correction Data for Mobile Large Language Model Applications(https://arxiv.org/abs/2505.18488)
Keywords: privacy, large language model
Abstract: Error correction is an important capability when applying large language models (LLMs) to facilitate user typing on mobile devices. In this paper, we use LLMs to synthesize a high-quality dataset of error correction pairs to evaluate and improve LLMs for mobile applications. We first prompt LLMs with error correction domain knowledge to build a scalable and reliable addition to the existing data synthesis pipeline. We then adapt the synthetic data distribution to match the mobile application domain by reweighting the samples. The reweighting model is learnt by predicting (a handful of) live A/B test metrics when deploying LLMs in production, given the LLM performance on offline evaluation data and scores from a small privacy-preserving on-device language model. Finally, we present best practices for mixing our synthetic data with other data sources to improve model performance on error correction in both offline evaluation and production live A/B testing.

Title: FedHL: Federated Learning for Heterogeneous Low-Rank Adaptation via Unbiased Aggregation

Authors: Zihao Peng, Jiandian Zeng, Boyuan Li, Guo Li, Shengbo Chen, Tian Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18494
Pdf URL: https://arxiv.org/pdf/2505.18494
Copy Paste: [[2505.18494]] FedHL: Federated Learning for Heterogeneous Low-Rank Adaptation via Unbiased Aggregation(https://arxiv.org/abs/2505.18494)
Keywords: federate
Abstract: Federated Learning (FL) facilitates the fine-tuning of Foundation Models (FMs) using distributed data sources, with Low-Rank Adaptation (LoRA) gaining popularity due to its low communication costs and strong performance. While recent work acknowledges the benefits of heterogeneous LoRA in FL and introduces flexible algorithms to support its implementation, our theoretical analysis reveals a critical gap: existing methods lack formal convergence guarantees due to parameter truncation and biased gradient updates. Specifically, adapting client-specific LoRA ranks necessitates truncating global parameters, which introduces inherent truncation errors and leads to subsequent inaccurate gradient updates that accumulate over training rounds, ultimately degrading performance. To address the above issues, we propose \textbf{FedHL}, a simple yet effective \textbf{Fed}erated Learning framework tailored for \textbf{H}eterogeneous \textbf{L}oRA. By leveraging the full-rank global model as a calibrated aggregation basis, FedHL eliminates the direct truncation bias from initial alignment with client-specific ranks. Furthermore, we derive the theoretically optimal aggregation weights by minimizing the gradient drift term in the convergence upper bound. Our analysis shows that FedHL guarantees $\mathcal{O}(1/\sqrt{T})$ convergence rate, and experiments on multiple real-world datasets demonstrate a 1-3\% improvement over several state-of-the-art methods.

Title: Beyond Masked and Unmasked: Discrete Diffusion Models via Partial Masking

Authors: Chen-Hao Chao, Wei-Fang Sun, Hanwen Liang, Chun-Yi Lee, Rahul G. Krishnan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18495
Pdf URL: https://arxiv.org/pdf/2505.18495
Copy Paste: [[2505.18495]] Beyond Masked and Unmasked: Discrete Diffusion Models via Partial Masking(https://arxiv.org/abs/2505.18495)
Keywords: diffusion, generative
Abstract: Masked diffusion models (MDM) are powerful generative models for discrete data that generate samples by progressively unmasking tokens in a sequence. Each token can take one of two states: masked or unmasked. We observe that token sequences often remain unchanged between consecutive sampling steps; consequently, the model repeatedly processes identical inputs, leading to redundant computation. To address this inefficiency, we propose the Partial masking scheme (Prime), which augments MDM by allowing tokens to take intermediate states interpolated between the masked and unmasked states. This design enables the model to make predictions based on partially observed token information, and facilitates a fine-grained denoising process. We derive a variational training objective and introduce a simple architectural design to accommodate intermediate-state inputs. Our method demonstrates superior performance across a diverse set of generative modeling tasks. On text data, it achieves a perplexity of 15.36 on OpenWebText, outperforming previous MDM (21.52), autoregressive models (17.54), and their hybrid variants (17.58), without relying on an autoregressive formulation. On image data, it attains competitive FID scores of 3.26 on CIFAR-10 and 6.98 on ImageNet-32, comparable to leading continuous generative models.

Title: The Pragmatic Mind of Machines: Tracing the Emergence of Pragmatic Competence in Large Language Models

Authors: Kefan Yu, Qingcheng Zeng, Weihao Xuan, Wanxin Li, Jingyi Wu, Rob Voigt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18497
Pdf URL: https://arxiv.org/pdf/2505.18497
Copy Paste: [[2505.18497]] The Pragmatic Mind of Machines: Tracing the Emergence of Pragmatic Competence in Large Language Models(https://arxiv.org/abs/2505.18497)
Keywords: large language model
Abstract: Current large language models (LLMs) have demonstrated emerging capabilities in social intelligence tasks, including implicature resolution (Sravanthi et al. (2024)) and theory-of-mind reasoning (Shapira et al. (2024)), both of which require substantial pragmatic understanding. However, how LLMs acquire this competence throughout the training process remains poorly understood. In this work, we introduce ALTPRAG, a dataset grounded in the pragmatic concept of alternatives, designed to evaluate whether LLMs at different training stages can accurately infer nuanced speaker intentions. Each instance pairs two contextually appropriate but pragmatically distinct continuations, enabling fine-grained assessment of both pragmatic interpretation and contrastive reasoning. We systematically evaluate 22 LLMs across key training stages: pre-training, supervised fine-tuning (SFT), and preference optimization, to examine the development of pragmatic competence. Our results show that even base models exhibit notable sensitivity to pragmatic cues, which improves consistently with increases in model and data scale. Additionally, SFT and RLHF contribute further gains, particularly in cognitive-pragmatic reasoning. These findings highlight pragmatic competence as an emergent and compositional property of LLM training and offer new insights for aligning models with human communicative norms.

Title: G1: Teaching LLMs to Reason on Graphs with Reinforcement Learning

Authors: Xiaojun Guo, Ang Li, Yifei Wang, Stefanie Jegelka, Yisen Wang
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2505.18499
Pdf URL: https://arxiv.org/pdf/2505.18499
Copy Paste: [[2505.18499]] G1: Teaching LLMs to Reason on Graphs with Reinforcement Learning(https://arxiv.org/abs/2505.18499)
Keywords: large language model
Abstract: Although Large Language Models (LLMs) have demonstrated remarkable progress, their proficiency in graph-related tasks remains notably limited, hindering the development of truly general-purpose models. Previous attempts, including pretraining graph foundation models or employing supervised fine-tuning, often face challenges such as the scarcity of large-scale, universally represented graph data. We introduce G1, a simple yet effective approach demonstrating that Reinforcement Learning (RL) on synthetic graph-theoretic tasks can significantly scale LLMs' graph reasoning abilities. To enable RL training, we curate Erdõs, the largest graph reasoning dataset to date comprising 50 diverse graph-theoretic tasks of varying difficulty levels, 100k training data and 5k test data, all drived from real-world graphs. With RL on Erdõs, G1 obtains substantial improvements in graph reasoning, where our finetuned 3B model even outperforms Qwen2.5-72B-Instruct (24x size). RL-trained models also show strong zero-shot generalization to unseen tasks, domains, and graph encoding schemes, including other graph-theoretic benchmarks as well as real-world node classification and link prediction tasks, without compromising general reasoning abilities. Our findings offer an efficient, scalable path for building strong graph reasoners by finetuning LLMs with RL on graph-theoretic tasks, which combines the strengths of pretrained LLM capabilities with abundant, automatically generated synthetic data, suggesting that LLMs possess graph understanding abilities that RL can elicit successfully.

Title: Enhancing Training Data Attribution with Representational Optimization

Authors: Weiwei Sun, Haokun Liu, Nikhil Kandpal, Colin Raffel, Yiming Yang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18513
Pdf URL: https://arxiv.org/pdf/2505.18513
Copy Paste: [[2505.18513]] Enhancing Training Data Attribution with Representational Optimization(https://arxiv.org/abs/2505.18513)
Keywords: robust
Abstract: Training data attribution (TDA) methods aim to measure how training data impacts a model's predictions. While gradient-based attribution methods, such as influence functions, offer theoretical grounding, their computational costs make them impractical for large-scale applications. Representation-based approaches are far more scalable, but typically rely on heuristic embeddings that are not optimized for attribution, limiting their fidelity. To address these challenges, we propose AirRep, a scalable, representation-based approach that closes this gap by learning task-specific and model-aligned representations optimized explicitly for TDA. AirRep introduces two key innovations: a trainable encoder tuned for attribution quality, and an attention-based pooling mechanism that enables accurate estimation of group-wise influence. We train AirRep using a ranking objective over automatically constructed training subsets labeled by their empirical effect on target predictions. Experiments on instruction-tuned LLMs demonstrate that AirRep achieves performance on par with state-of-the-art gradient-based approaches while being nearly two orders of magnitude more efficient at inference time. Further analysis highlights its robustness and generalization across tasks and models. Our code is available at this https URL.

Title: A Study of Semi-Fungible Token based Wi-Fi Access Control

Authors: Litao Ye, Bin Chen, Chen Sun, Shuo Wang, Peichang Zhang, Shengli Zhang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.18518
Pdf URL: https://arxiv.org/pdf/2505.18518
Copy Paste: [[2505.18518]] A Study of Semi-Fungible Token based Wi-Fi Access Control(https://arxiv.org/abs/2505.18518)
Keywords: secure, security, privacy
Abstract: Current Wi-Fi authentication methods face issues such as insufficient security, user privacy leakage, high management costs, and difficulty in billing. To address these challenges, a Wi-Fi access control solution based on blockchain smart contracts is proposed. Firstly, semi-fungible Wi-Fi tokens (SFWTs) are designed using the ERC1155 token standard as credentials for users to access Wi-Fi. Secondly, a Wi-Fi access control system based on SFWTs is developed to securely verify and manage the access rights of Wi-Fi users. Experimental results demonstrate that SFWTs, designed based on the ERC1155 standard, along with the SFWT access right verification process, can significantly reduce Wi-Fi operating costs and authentication time, effectively meeting users' needs for safe and convenient Wi-Fi access.

Title: Improved Immiscible Diffusion: Accelerate Diffusion Training by Reducing Its Miscibility

Authors: Yiheng Li, Feng Liang, Dan Kondratyuk, Masayoshi Tomizuka, Kurt Keutzer, Chenfeng Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18521
Pdf URL: https://arxiv.org/pdf/2505.18521
Copy Paste: [[2505.18521]] Improved Immiscible Diffusion: Accelerate Diffusion Training by Reducing Its Miscibility(https://arxiv.org/abs/2505.18521)
Keywords: diffusion, generative
Abstract: The substantial training cost of diffusion models hinders their deployment. Immiscible Diffusion recently showed that reducing diffusion trajectory mixing in the noise space via linear assignment accelerates training by simplifying denoising. To extend immiscible diffusion beyond the inefficient linear assignment under high batch sizes and high dimensions, we refine this concept to a broader miscibility reduction at any layer and by any implementation. Specifically, we empirically demonstrate the bijective nature of the denoising process with respect to immiscible diffusion, ensuring its preservation of generative diversity. Moreover, we provide thorough analysis and show step-by-step how immiscibility eases denoising and improves efficiency. Extending beyond linear assignment, we propose a family of implementations including K-nearest neighbor (KNN) noise selection and image scaling to reduce miscibility, achieving up to >4x faster training across diverse models and tasks including unconditional/conditional generation, image editing, and robotics planning. Furthermore, our analysis of immiscibility offers a novel perspective on how optimal transport (OT) enhances diffusion training. By identifying trajectory miscibility as a fundamental bottleneck, we believe this work establishes a potentially new direction for future research into high-efficiency diffusion training. The code is available at this https URL.

Title: How Does Sequence Modeling Architecture Influence Base Capabilities of Pre-trained Language Models? Exploring Key Architecture Design Principles to Avoid Base Capabilities Degradation

Authors: Xin Lu, Yanyan Zhao, Si Wei, Shijin Wang, Bing Qin, Ting Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18522
Pdf URL: https://arxiv.org/pdf/2505.18522
Copy Paste: [[2505.18522]] How Does Sequence Modeling Architecture Influence Base Capabilities of Pre-trained Language Models? Exploring Key Architecture Design Principles to Avoid Base Capabilities Degradation(https://arxiv.org/abs/2505.18522)
Keywords: transformer
Abstract: Pre-trained language models represented by the Transformer have been proven to possess strong base capabilities, and the representative self-attention mechanism in the Transformer has become a classic in sequence modeling architectures. Different from the work of proposing sequence modeling architecture to improve the efficiency of attention mechanism, this work focuses on the impact of sequence modeling architectures on base capabilities. Specifically, our concern is: How exactly do sequence modeling architectures affect the base capabilities of pre-trained language models? In this work, we first point out that the mixed domain pre-training setting commonly adopted in existing architecture design works fails to adequately reveal the differences in base capabilities among various architectures. To address this, we propose a limited domain pre-training setting with out-of-distribution testing, which successfully uncovers significant differences in base capabilities among architectures at an early stage. Next, we analyze the base capabilities of stateful sequence modeling architectures, and find that they exhibit significant degradation in base capabilities compared to the Transformer. Then, through a series of architecture component analysis, we summarize a key architecture design principle: A sequence modeling architecture need possess full-sequence arbitrary selection capability to avoid degradation in base capabilities. Finally, we empirically validate this principle using an extremely simple Top-1 element selection architecture and further generalize it to a more practical Top-1 chunk selection architecture. Experimental results demonstrate our proposed sequence modeling architecture design principle and suggest that our work can serve as a valuable reference for future architecture improvements and novel designs.

Title: metaTextGrad: Automatically optimizing language model optimizers

Authors: Guowei Xu, Mert Yuksekgonul, Carlos Guestrin, James Zou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18524
Pdf URL: https://arxiv.org/pdf/2505.18524
Copy Paste: [[2505.18524]] metaTextGrad: Automatically optimizing language model optimizers(https://arxiv.org/abs/2505.18524)
Keywords: large language model
Abstract: Large language models (LLMs) are increasingly used in learning algorithms, evaluations, and optimization tasks. Recent studies have shown that using LLM-based optimizers to automatically optimize model prompts, demonstrations, predictions themselves, or other components can significantly enhance the performance of AI systems, as demonstrated by frameworks such as DSPy and TextGrad. However, optimizers built on language models themselves are usually designed by humans with manual design choices; optimizers themselves are not optimized. Moreover, these optimizers are general purpose by design, to be useful to a broad audience, and are not tailored for specific tasks. To address these challenges, we propose metaTextGrad, which focuses on designing a meta-optimizer to further enhance existing optimizers and align them to be good optimizers for a given task. Our approach consists of two key components: a meta prompt optimizer and a meta structure optimizer. The combination of these two significantly improves performance across multiple benchmarks, achieving an average absolute performance improvement of up to 6% compared to the best baseline.

Title: TK-Mamba: Marrying KAN with Mamba for Text-Driven 3D Medical Image Segmentation

Authors: Haoyu Yang, Yuxiang Cai, Jintao Chen, Xuhong Zhang, Wenhui Lei, Xiaoming Shi, Jianwei Yin, Yankai Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18525
Pdf URL: https://arxiv.org/pdf/2505.18525
Copy Paste: [[2505.18525]] TK-Mamba: Marrying KAN with Mamba for Text-Driven 3D Medical Image Segmentation(https://arxiv.org/abs/2505.18525)
Keywords: transformer, segmentation
Abstract: 3D medical image segmentation is vital for clinical diagnosis and treatment but is challenged by high-dimensional data and complex spatial dependencies. Traditional single-modality networks, such as CNNs and Transformers, are often limited by computational inefficiency and constrained contextual modeling in 3D settings. We introduce a novel multimodal framework that leverages Mamba and Kolmogorov-Arnold Networks (KAN) as an efficient backbone for long-sequence modeling. Our approach features three key innovations: First, an EGSC (Enhanced Gated Spatial Convolution) module captures spatial information when unfolding 3D images into 1D sequences. Second, we extend Group-Rational KAN (GR-KAN), a Kolmogorov-Arnold Networks variant with rational basis functions, into 3D-Group-Rational KAN (3D-GR-KAN) for 3D medical imaging - its first application in this domain - enabling superior feature representation tailored to volumetric data. Third, a dual-branch text-driven strategy leverages CLIP's text embeddings: one branch swaps one-hot labels for semantic vectors to preserve inter-organ semantic relationships, while the other aligns images with detailed organ descriptions to enhance semantic alignment. Experiments on the Medical Segmentation Decathlon (MSD) and KiTS23 datasets show our method achieving state-of-the-art performance, surpassing existing approaches in accuracy and efficiency. This work highlights the power of combining advanced sequence modeling, extended network architectures, and vision-language synergy to push forward 3D medical image segmentation, delivering a scalable solution for clinical use. The source code is openly available at this https URL.

Title: CLaDMoP: Learning Transferrable Models from Successful Clinical Trials via LLMs

Authors: Yiqing Zhang, Xiaozhong Liu, Fabricio Murai
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18527
Pdf URL: https://arxiv.org/pdf/2505.18527
Copy Paste: [[2505.18527]] CLaDMoP: Learning Transferrable Models from Successful Clinical Trials via LLMs(https://arxiv.org/abs/2505.18527)
Keywords: large language model
Abstract: Many existing models for clinical trial outcome prediction are optimized using task-specific loss functions on trial phase-specific data. While this scheme may boost prediction for common diseases and drugs, it can hinder learning of generalizable representations, leading to more false positives/negatives. To address this limitation, we introduce CLaDMoP, a new pre-training approach for clinical trial outcome prediction, alongside the Successful Clinical Trials dataset(SCT), specifically designed for this task. CLaDMoP leverages a Large Language Model-to encode trials' eligibility criteria-linked to a lightweight Drug-Molecule branch through a novel multi-level fusion technique. To efficiently fuse long embeddings across levels, we incorporate a grouping block, drastically reducing computational overhead. CLaDMoP avoids reliance on task-specific objectives by pre-training on a "pair matching" proxy task. Compared to established zero-shot and few-shot baselines, our method significantly improves both PR-AUC and ROC-AUC, especially for phase I and phase II trials. We further evaluate and perform ablation on CLaDMoP after Parameter-Efficient Fine-Tuning, comparing it to state-of-the-art supervised baselines, including MEXA-CTP, on the Trial Outcome Prediction(TOP) benchmark. CLaDMoP achieves up to 10.5% improvement in PR-AUC and 3.6% in ROC-AUC, while attaining comparable F1 score to MEXA-CTP, highlighting its potential for clinical trial outcome prediction. Code and SCT dataset can be downloaded from this https URL.

Title: Preserving AUC Fairness in Learning with Noisy Protected Groups

Authors: Mingyang Wu, Li Lin, Wenbin Zhang, Xin Wang, Zhenhuan Yang, Shu Hu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18532
Pdf URL: https://arxiv.org/pdf/2505.18532
Copy Paste: [[2505.18532]] Preserving AUC Fairness in Learning with Noisy Protected Groups(https://arxiv.org/abs/2505.18532)
Keywords: protect, robust, fair
Abstract: The Area Under the ROC Curve (AUC) is a key metric for classification, especially under class imbalance, with growing research focus on optimizing AUC over accuracy in applications like medical image analysis and deepfake detection. This leads to fairness in AUC optimization becoming crucial as biases can impact protected groups. While various fairness mitigation techniques exist, fairness considerations in AUC optimization remain in their early stages, with most research focusing on improving AUC fairness under the assumption of clean protected groups. However, these studies often overlook the impact of noisy protected groups, leading to fairness violations in practice. To address this, we propose the first robust AUC fairness approach under noisy protected groups with fairness theoretical guarantees using distributionally robust optimization. Extensive experiments on tabular and image datasets show that our method outperforms state-of-the-art approaches in preserving AUC fairness. The code is in this https URL.

Title: Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal Large Language Models

Authors: Haoyuan Sun, Jiaqi Wu, Bo Xia, Yifu Luo, Yifei Zhao, Kai Qin, Xufei Lv, Tiantian Zhang, Yongzhe Chang, Xueqian Wang
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.18536
Pdf URL: https://arxiv.org/pdf/2505.18536
Copy Paste: [[2505.18536]] Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal Large Language Models(https://arxiv.org/abs/2505.18536)
Keywords: large language model
Abstract: Standing in 2025, at a critical juncture in the pursuit of Artificial General Intelligence (AGI), reinforcement fine-tuning (RFT) has demonstrated significant potential in enhancing the reasoning capability of large language models (LLMs) and has led to the development of cutting-edge AI models such as OpenAI-o1 and DeepSeek-R1. Moreover, the efficient application of RFT to enhance the reasoning capability of multimodal large language models (MLLMs) has attracted widespread attention from the community. In this position paper, we argue that reinforcement fine-tuning powers the reasoning capability of multimodal large language models. To begin with, we provide a detailed introduction to the fundamental background knowledge that researchers interested in this field should be familiar with. Furthermore, we meticulously summarize the improvements of RFT in powering reasoning capability of MLLMs into five key points: diverse modalities, diverse tasks and domains, better training algorithms, abundant benchmarks and thriving engineering frameworks. Finally, we propose five promising directions for future research that the community might consider. We hope that this position paper will provide valuable insights to the community at this pivotal stage in the advancement toward AGI. Summary of works done on RFT for MLLMs is available at this https URL.

Title: Business as \textit{Rule}sual: A Benchmark and Framework for Business Rule Flow Modeling with LLMs

Authors: Chen Yang, Ruping Xu, Ruizhe Li, Bin Cao, Jing Fan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18542
Pdf URL: https://arxiv.org/pdf/2505.18542
Copy Paste: [[2505.18542]] Business as \textit{Rule}sual: A Benchmark and Framework for Business Rule Flow Modeling with LLMs(https://arxiv.org/abs/2505.18542)
Keywords: extraction, large language model
Abstract: Process mining aims to discover, monitor and optimize the actual behaviors of real processes. While prior work has mainly focused on extracting procedural action flows from instructional texts, rule flows embedded in business documents remain underexplored. To this end, we introduce a novel annotated Chinese dataset, \textbf{BPRF}, which contains 50 business process documents with 326 explicitly labeled business rules across multiple domains. Each rule is represented as a pair, and we annotate logical dependencies between rules (sequential, conditional, or parallel). We also propose \textbf{ExIde}, a framework for automatic business rule extraction and dependency relationship identification using large language models (LLMs). We evaluate ExIde using 12 state-of-the-art (SOTA) LLMs on the BPRF dataset, benchmarking performance on both rule extraction and dependency classification tasks of current LLMs. Our results demonstrate the effectiveness of ExIde in extracting structured business rules and analyzing their interdependencies for current SOTA LLMs, paving the way for more automated and interpretable business process automation.

Title: Benchmarking Poisoning Attacks against Retrieval-Augmented Generation

Authors: Baolei Zhang, Haoran Xin, Jiatong Li, Dongzhe Zhang, Minghong Fang, Zhuqing Liu, Lihai Nie, Zheli Liu
Subjects: cs.CR, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18543
Pdf URL: https://arxiv.org/pdf/2505.18543
Copy Paste: [[2505.18543]] Benchmarking Poisoning Attacks against Retrieval-Augmented Generation(https://arxiv.org/abs/2505.18543)
Keywords: security, protect, defense, attack, robust, large language model
Abstract: Retrieval-Augmented Generation (RAG) has proven effective in mitigating hallucinations in large language models by incorporating external knowledge during inference. However, this integration introduces new security vulnerabilities, particularly to poisoning attacks. Although prior work has explored various poisoning strategies, a thorough assessment of their practical threat to RAG systems remains missing. To address this gap, we propose the first comprehensive benchmark framework for evaluating poisoning attacks on RAG. Our benchmark covers 5 standard question answering (QA) datasets and 10 expanded variants, along with 13 poisoning attack methods and 7 defense mechanisms, representing a broad spectrum of existing techniques. Using this benchmark, we conduct a comprehensive evaluation of all included attacks and defenses across the full dataset spectrum. Our findings show that while existing attacks perform well on standard QA datasets, their effectiveness drops significantly on the expanded versions. Moreover, our results demonstrate that various advanced RAG architectures, such as sequential, branching, conditional, and loop RAG, as well as multi-turn conversational RAG, multimodal RAG systems, and RAG-based LLM agent systems, remain susceptible to poisoning attacks. Notably, current defense techniques fail to provide robust protection, underscoring the pressing need for more resilient and generalizable defense strategies.

Title: B-score: Detecting biases in large language models using response history

Authors: An Vo, Mohammad Reza Taesiri, Daeyoung Kim, Anh Totti Nguyen
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2505.18545
Pdf URL: https://arxiv.org/pdf/2505.18545
Copy Paste: [[2505.18545]] B-score: Detecting biases in large language models using response history(https://arxiv.org/abs/2505.18545)
Keywords: large language model
Abstract: Large language models (LLMs) often exhibit strong biases, e.g, against women or in favor of the number 7. We investigate whether LLMs would be able to output less biased answers when allowed to observe their prior answers to the same question in a multi-turn conversation. To understand which types of questions invite more biased answers, we test LLMs on our proposed set of questions that span 9 topics and belong to three types: (1) Subjective; (2) Random; and (3) Objective. Interestingly, LLMs are able to "de-bias" themselves in a multi-turn conversation in response to questions that seek an Random, unbiased answer. Furthermore, we propose B-score, a novel metric that is effective in detecting biases to Subjective, Random, Easy, and Hard questions. On MMLU, HLE, and CSQA, leveraging B-score substantially improves the verification accuracy of LLM answers (i.e, accepting LLM correct answers and rejecting incorrect ones) compared to using verbalized confidence scores or the frequency of single-turn answers alone. Code and data are available at: this https URL.

Title: Composable Cross-prompt Essay Scoring by Merging Models

Authors: Sanwoo Lee, Kun Liang, Yunfang Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18548
Pdf URL: https://arxiv.org/pdf/2505.18548
Copy Paste: [[2505.18548]] Composable Cross-prompt Essay Scoring by Merging Models(https://arxiv.org/abs/2505.18548)
Keywords: privacy, robust
Abstract: Recent advances in cross-prompt automated essay scoring (AES) typically train models jointly on all source prompts, often requiring additional access to unlabeled target prompt essays simultaneously. However, using all sources is suboptimal in our pilot study, and re-accessing source datasets during adaptation raises privacy concerns. We propose a source-free adaptation approach that selectively merges individually trained source models' parameters instead of datasets. In particular, we simulate joint training through linear combinations of task vectors -- the parameter updates from fine-tuning. To optimize the combination's coefficients, we propose Prior-encoded Information Maximization (PIM), an unsupervised objective which promotes the model's score discriminability regularized by priors pre-computed from the sources. We employ Bayesian optimization as an efficient optimizer of PIM. Experimental results with LLMs on in-dataset and cross-dataset adaptation show that our method (1) consistently outperforms training jointly on all sources, (2) maintains superior robustness compared to other merging methods, (3) excels under severe distribution shifts where recent leading cross-prompt methods struggle, all while retaining computational efficiency.

Title: MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLMs as Math Tutors

Authors: Baraa Hikal, Mohamed Basem, Islam Oshallah, Ali Hamdi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18549
Pdf URL: https://arxiv.org/pdf/2505.18549
Copy Paste: [[2505.18549]] MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLMs as Math Tutors(https://arxiv.org/abs/2505.18549)
Keywords: robust
Abstract: We present MSA-MathEval, our submission to the BEA 2025 Shared Task on evaluating AI tutor responses across four instructional dimensions: Mistake Identification, Mistake Location, Providing Guidance, and Actionability. Our approach uses a unified training pipeline to fine-tune a single instruction-tuned language model across all tracks, without any task-specific architectural changes. To improve prediction reliability, we introduce a disagreement-aware ensemble inference strategy that enhances coverage of minority labels. Our system achieves strong performance across all tracks, ranking 1st in Providing Guidance, 3rd in Actionability, and 4th in both Mistake Identification and Mistake Location. These results demonstrate the effectiveness of scalable instruction tuning and disagreement-driven modeling for robust, multi-dimensional evaluation of LLMs as educational tutors.

Title: LAMDA: A Longitudinal Android Malware Benchmark for Concept Drift Analysis

Authors: Md Ahsanul Haque, Ismail Hossain, Md Mahmuduzzaman Kamol, Md Jahangir Alam, Suresh Kumar Amalapuram, Sajedul Talukder, Mohammad Saidur Rahman
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18551
Pdf URL: https://arxiv.org/pdf/2505.18551
Copy Paste: [[2505.18551]] LAMDA: A Longitudinal Android Malware Benchmark for Concept Drift Analysis(https://arxiv.org/abs/2505.18551)
Keywords: explainability
Abstract: Machine learning (ML)-based malware detection systems often fail to account for the dynamic nature of real-world training and test data distributions. In practice, these distributions evolve due to frequent changes in the Android ecosystem, adversarial development of new malware families, and the continuous emergence of both benign and malicious applications. Prior studies have shown that such concept drift -- distributional shifts in benign and malicious samples, leads to significant degradation in detection performance over time. Despite the practical importance of this issue, existing datasets are often outdated and limited in temporal scope, diversity of malware families, and sample scale, making them insufficient for the systematic evaluation of concept drift in malware detection. To address this gap, we present LAMDA, the largest and most temporally diverse Android malware benchmark to date, designed specifically for concept drift analysis. LAMDA spans 12 years (2013-2025, excluding 2015), includes over 1 million samples (approximately 37% labeled as malware), and covers 1,380 malware families and 150,000 singleton samples, reflecting the natural distribution and evolution of real-world Android applications. We empirically demonstrate LAMDA's utility by quantifying the performance degradation of standard ML models over time and analyzing feature stability across years. As the most comprehensive Android malware dataset to date, LAMDA enables in-depth research into temporal drift, generalization, explainability, and evolving detection challenges. The dataset and code are available at: this https URL.

Title: Unraveling Misinformation Propagation in LLM Reasoning

Authors: Yiyang Feng, Yichen Wang, Shaobo Cui, Boi Faltings, Mina Lee, Jiawei Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18555
Pdf URL: https://arxiv.org/pdf/2505.18555
Copy Paste: [[2505.18555]] Unraveling Misinformation Propagation in LLM Reasoning(https://arxiv.org/abs/2505.18555)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in reasoning, positioning them as promising tools for supporting human problem-solving. However, what happens when their performance is affected by misinformation, i.e., incorrect inputs introduced by users due to oversights or gaps in knowledge? Such misinformation is prevalent in real-world interactions with LLMs, yet how it propagates within LLMs' reasoning process remains underexplored. Focusing on mathematical reasoning, we present a comprehensive analysis of how misinformation affects intermediate reasoning steps and final answers. We also examine how effectively LLMs can correct misinformation when explicitly instructed to do so. Even with explicit instructions, LLMs succeed less than half the time in rectifying misinformation, despite possessing correct internal knowledge, leading to significant accuracy drops (10.02% - 72.20%). Further analysis shows that applying factual corrections early in the reasoning process most effectively reduces misinformation propagation, and fine-tuning on synthesized data with early-stage corrections significantly improves reasoning factuality. Our work offers a practical approach to mitigating misinformation propagation.

Title: Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation

Authors: Jun Zhuang, Haibo Jin, Ye Zhang, Zhengjian Kang, Wenbin Zhang, Gaby G. Dagher, Haohan Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18556
Pdf URL: https://arxiv.org/pdf/2505.18556
Copy Paste: [[2505.18556]] Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation(https://arxiv.org/abs/2505.18556)
Keywords: defense, attack, robust, large language model
Abstract: Intent detection, a core component of natural language understanding, has considerably evolved as a crucial mechanism in safeguarding large language models (LLMs). While prior work has applied intent detection to enhance LLMs' moderation guardrails, showing a significant success against content-level jailbreaks, the robustness of these intent-aware guardrails under malicious manipulations remains under-explored. In this work, we investigate the vulnerability of intent-aware guardrails and demonstrate that LLMs exhibit implicit intent detection capabilities. We propose a two-stage intent-based prompt-refinement framework, IntentPrompt, that first transforms harmful inquiries into structured outlines and further reframes them into declarative-style narratives by iteratively optimizing prompts via feedback loops to enhance jailbreak success for red-teaming purposes. Extensive experiments across four public benchmarks and various black-box LLMs indicate that our framework consistently outperforms several cutting-edge jailbreak methods and evades even advanced Intent Analysis (IA) and Chain-of-Thought (CoT)-based defenses. Specifically, our "FSTR+SPIN" variant achieves attack success rates ranging from 88.25% to 96.54% against CoT-based defenses on the o1 model, and from 86.75% to 97.12% on the GPT-4o model under IA-based defenses. These findings highlight a critical weakness in LLMs' safety mechanisms and suggest that intent manipulation poses a growing challenge to content moderation guardrails.

Title: TAG-INSTRUCT: Controlled Instruction Complexity Enhancement through Structure-based Augmentation

Authors: He Zhu, Zhiwen Ruan, Junyou Su, Xingwei He, Wenjia Zhang, Yun Chen, Guanhua Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18557
Pdf URL: https://arxiv.org/pdf/2505.18557
Copy Paste: [[2505.18557]] TAG-INSTRUCT: Controlled Instruction Complexity Enhancement through Structure-based Augmentation(https://arxiv.org/abs/2505.18557)
Keywords: large language model
Abstract: High-quality instruction data is crucial for developing large language models (LLMs), yet existing approaches struggle to effectively control instruction complexity. We present TAG-INSTRUCT, a novel framework that enhances instruction complexity through structured semantic compression and controlled difficulty augmentation. Unlike previous prompt-based methods operating on raw text, TAG-INSTRUCT compresses instructions into a compact tag space and systematically enhances complexity through RL-guided tag expansion. Through extensive experiments, we show that TAG-INSTRUCT outperforms existing instruction complexity augmentation approaches. Our analysis reveals that operating in tag space provides superior controllability and stability across different instruction synthesis frameworks.

Title: Joint-stochastic-approximation Autoencoders with Application to Semi-supervised Learning

Authors: Wenbo He, Zhijian Ou
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.18558
Pdf URL: https://arxiv.org/pdf/2505.18558
Copy Paste: [[2505.18558]] Joint-stochastic-approximation Autoencoders with Application to Semi-supervised Learning(https://arxiv.org/abs/2505.18558)
Keywords: robust, generative
Abstract: Our examination of existing deep generative models (DGMs), including VAEs and GANs, reveals two problems. First, their capability in handling discrete observations and latent codes is unsatisfactory, though there are interesting efforts. Second, both VAEs and GANs optimize some criteria that are indirectly related to the data likelihood. To address these problems, we formally present Joint-stochastic-approximation (JSA) autoencoders - a new family of algorithms for building deep directed generative models, with application to semi-supervised learning. The JSA learning algorithm directly maximizes the data log-likelihood and simultaneously minimizes the inclusive KL divergence the between the posteriori and the inference model. We provide theoretical results and conduct a series of experiments to show its superiority such as being robust to structure mismatch between encoder and decoder, consistent handling of both discrete and continuous variables. Particularly we empirically show that JSA autoencoders with discrete latent space achieve comparable performance to other state-of-the-art DGMs with continuous latent space in semi-supervised tasks over the widely adopted datasets - MNIST and SVHN. To the best of our knowledge, this is the first demonstration that discrete latent variable models are successfully applied in the challenging semi-supervised tasks.

Title: ThinkVideo: High-Quality Reasoning Video Segmentation with Chain of Thoughts

Authors: Shiu-hong Kao, Yu-Wing Tai, Chi-Keung Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18561
Pdf URL: https://arxiv.org/pdf/2505.18561
Copy Paste: [[2505.18561]] ThinkVideo: High-Quality Reasoning Video Segmentation with Chain of Thoughts(https://arxiv.org/abs/2505.18561)
Keywords: large language model, segmentation
Abstract: Reasoning Video Object Segmentation is a challenging task, which generates a mask sequence from an input video and an implicit, complex text query. Existing works probe into the problem by finetuning Multimodal Large Language Models (MLLM) for segmentation-based output, while still falling short in difficult cases on videos given temporally-sensitive queries, primarily due to the failure to integrate temporal and spatial information. In this paper, we propose ThinkVideo, a novel framework which leverages the zero-shot Chain-of-Thought (CoT) capability of MLLM to address these challenges. Specifically, ThinkVideo utilizes the CoT prompts to extract object selectivities associated with particular keyframes, then bridging the reasoning image segmentation model and SAM2 video processor to output mask sequences. The ThinkVideo framework is training-free and compatible with closed-source MLLMs, which can be applied to Reasoning Video Instance Segmentation. We further extend the framework for online video streams, where the CoT is used to update the object of interest when a better target starts to emerge and becomes visible. We conduct extensive experiments on video object segmentation with explicit and implicit queries. The results show that ThinkVideo significantly outperforms previous works in both cases, qualitatively and quantitatively.

Title: From Word to World: Evaluate and Mitigate Culture Bias via Word Association Test

Authors: Xunlian Dai, Li Zhou, Benyou Wang, Haizhou Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18562
Pdf URL: https://arxiv.org/pdf/2505.18562
Copy Paste: [[2505.18562]] From Word to World: Evaluate and Mitigate Culture Bias via Word Association Test(https://arxiv.org/abs/2505.18562)
Keywords: large language model
Abstract: The human-centered word association test (WAT) serves as a cognitive proxy, revealing sociocultural variations through lexical-semantic patterns. We extend this test into an LLM-adaptive, free-relation task to assess the alignment of large language models (LLMs) with cross-cultural cognition. To mitigate the culture preference, we propose CultureSteer, an innovative approach that integrates a culture-aware steering mechanism to guide semantic representations toward culturally specific spaces. Experiments show that current LLMs exhibit significant bias toward Western cultural (notably in American) schemas at the word association level. In contrast, our model substantially improves cross-cultural alignment, surpassing prompt-based methods in capturing diverse semantic associations. Further validation on culture-sensitive downstream tasks confirms its efficacy in fostering cognitive alignment across cultures. This work contributes a novel methodological paradigm for enhancing cultural awareness in LLMs, advancing the development of more inclusive language technologies.

Title: Learning without Isolation: Pathway Protection for Continual Learning

Authors: Zhikang Chen, Abudukelimu Wuerkaixi, Sen Cui, Haoxuan Li, Ding Li, Jingfeng Zhang, Bo Han, Gang Niu, Houfang Liu, Yi Yang, Sifan Yang, Changshui Zhang, Tianling Ren
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.18568
Pdf URL: https://arxiv.org/pdf/2505.18568
Copy Paste: [[2505.18568]] Learning without Isolation: Pathway Protection for Continual Learning(https://arxiv.org/abs/2505.18568)
Keywords: protect
Abstract: Deep networks are prone to catastrophic forgetting during sequential task learning, i.e., losing the knowledge about old tasks upon learning new tasks. To this end, continual learning(CL) has emerged, whose existing methods focus mostly on regulating or protecting the parameters associated with the previous tasks. However, parameter protection is often impractical, since the size of parameters for storing the old-task knowledge increases linearly with the number of tasks, otherwise it is hard to preserve the parameters related to the old-task knowledge. In this work, we bring a dual opinion from neuroscience and physics to CL: in the whole networks, the pathways matter more than the parameters when concerning the knowledge acquired from the old tasks. Following this opinion, we propose a novel CL framework, learning without isolation(LwI), where model fusion is formulated as graph matching and the pathways occupied by the old tasks are protected without being isolated. Thanks to the sparsity of activation channels in a deep network, LwI can adaptively allocate available pathways for a new task, realizing pathway protection and addressing catastrophic forgetting in a parameter-efficient manner. Experiments on popular benchmark datasets demonstrate the superiority of the proposed LwI.

Title: Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs

Authors: Mengqi Liao, Xiangyu Xi, Ruinian Chen, Jia Leng, Yangen Hu, Ke Zeng, Shuai Liu, Huaiyu Wan
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2505.18573
Pdf URL: https://arxiv.org/pdf/2505.18573
Copy Paste: [[2505.18573]] Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs(https://arxiv.org/abs/2505.18573)
Keywords: large language model
Abstract: Reasoning large language models (LLMs) excel in complex tasks, which has drawn significant attention to reinforcement learning (RL) for LLMs. However, existing approaches allocate an equal number of rollouts to all questions during the RL process, which is inefficient. This inefficiency stems from the fact that training on simple questions yields limited gains, whereas more rollouts are needed for challenging questions to sample correct answers. Furthermore, while RL improves response precision, it limits the model's exploration ability, potentially resulting in a performance cap below that of the base model prior to RL. To address these issues, we propose a mechanism for dynamically allocating rollout budgets based on the difficulty of the problems, enabling more efficient RL training. Additionally, we introduce an adaptive dynamic temperature adjustment strategy to maintain the entropy at a stable level, thereby encouraging sufficient exploration. This enables LLMs to improve response precision while preserving their exploratory ability to uncover potential correct pathways. The code and data is available on: this https URL

Title: Removal of Hallucination on Hallucination: Debate-Augmented RAG

Authors: Wentao Hu, Wengyu Zhang, Yiyang Jiang, Chen Jason Zhang, Xiaoyong Wei, Qing Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18581
Pdf URL: https://arxiv.org/pdf/2505.18581
Copy Paste: [[2505.18581]] Removal of Hallucination on Hallucination: Debate-Augmented RAG(https://arxiv.org/abs/2505.18581)
Keywords: robust
Abstract: Retrieval-Augmented Generation (RAG) enhances factual accuracy by integrating external knowledge, yet it introduces a critical issue: erroneous or biased retrieval can mislead generation, compounding hallucinations, a phenomenon we term Hallucination on Hallucination. To address this, we propose Debate-Augmented RAG (DRAG), a training-free framework that integrates Multi-Agent Debate (MAD) mechanisms into both retrieval and generation stages. In retrieval, DRAG employs structured debates among proponents, opponents, and judges to refine retrieval quality and ensure factual reliability. In generation, DRAG introduces asymmetric information roles and adversarial debates, enhancing reasoning robustness and mitigating factual inconsistencies. Evaluations across multiple tasks demonstrate that DRAG improves retrieval reliability, reduces RAG-induced hallucinations, and significantly enhances overall factual accuracy. Our code is available at this https URL.

Title: On Denoising Walking Videos for Gait Recognition

Authors: Dongyang Jin, Chao Fan, Jingzhe Ma, Jingkai Zhou, Weihua Chen, Shiqi Yu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18582
Pdf URL: https://arxiv.org/pdf/2505.18582
Copy Paste: [[2505.18582]] On Denoising Walking Videos for Gait Recognition(https://arxiv.org/abs/2505.18582)
Keywords: diffusion, generative
Abstract: To capture individual gait patterns, excluding identity-irrelevant cues in walking videos, such as clothing texture and color, remains a persistent challenge for vision-based gait recognition. Traditional silhouette- and pose-based methods, though theoretically effective at removing such distractions, often fall short of high accuracy due to their sparse and less informative inputs. Emerging end-to-end methods address this by directly denoising RGB videos using human priors. Building on this trend, we propose DenoisingGait, a novel gait denoising method. Inspired by the philosophy that "what I cannot create, I do not understand", we turn to generative diffusion models, uncovering how they partially filter out irrelevant factors for gait understanding. Additionally, we introduce a geometry-driven Feature Matching module, which, combined with background removal via human silhouettes, condenses the multi-channel diffusion features at each foreground pixel into a two-channel direction vector. Specifically, the proposed within- and cross-frame matching respectively capture the local vectorized structures of gait appearance and motion, producing a novel flow-like gait representation termed Gait Feature Field, which further reduces residual noise in diffusion features. Experiments on the CCPG, CASIA-B*, and SUSTech1K datasets demonstrate that DenoisingGait achieves a new SoTA performance in most cases for both within- and cross-domain evaluations. Code is available at this https URL.

Title: Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations

Authors: Chaofan Gan, Yuanpeng Tu, Xi Chen, Tieyuan Chen, Yuxi Li, Mehrtash Harandi, Weiyao Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18584
Pdf URL: https://arxiv.org/pdf/2505.18584
Copy Paste: [[2505.18584]] Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations(https://arxiv.org/abs/2505.18584)
Keywords: diffusion, transformer
Abstract: Pre-trained stable diffusion models (SD) have shown great advances in visual correspondence. In this paper, we investigate the capabilities of Diffusion Transformers (DiTs) for accurate dense correspondence. Distinct from SD, DiTs exhibit a critical phenomenon in which very few feature activations exhibit significantly larger values than others, known as \textit{massive activations}, leading to uninformative representations and significant performance degradation for DiTs. The massive activations consistently concentrate at very few fixed dimensions across all image patch tokens, holding little local information. We trace these dimension-concentrated massive activations and find that such concentration can be effectively localized by the zero-initialized Adaptive Layer Norm (AdaLN-zero). Building on these findings, we propose Diffusion Transformer Feature (DiTF), a training-free framework designed to extract semantic-discriminative features from DiTs. Specifically, DiTF employs AdaLN to adaptively localize and normalize massive activations with channel-wise modulation. In addition, we develop a channel discard strategy to further eliminate the negative impacts from massive activations. Experimental results demonstrate that our DiTF outperforms both DINO and SD-based models and establishes a new state-of-the-art performance for DiTs in different visual correspondence tasks (\eg, with +9.4\% on Spair-71k and +4.4\% on AP-10K-C.S.).

Title: Guiding the Experts: Semantic Priors for Efficient and Focused MoE Routing

Authors: Chengxi Min, Wei Wang, Yahui Liu, Weixin Ye, Enver Sangineto, Qi Wang, Yao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18586
Pdf URL: https://arxiv.org/pdf/2505.18586
Copy Paste: [[2505.18586]] Guiding the Experts: Semantic Priors for Efficient and Focused MoE Routing(https://arxiv.org/abs/2505.18586)
Keywords: segmentation
Abstract: Mixture-of-Experts (MoE) models have emerged as a promising direction for scaling vision architectures efficiently. Among them, Soft MoE improves training stability by assigning each token to all experts via continuous dispatch weights. However, current designs overlook the semantic structure which is implicitly encoded in these weights, resulting in suboptimal expert routing. In this paper, we discover that dispatch weights in Soft MoE inherently exhibit segmentation-like patterns but are not explicitly aligned with semantic regions. Motivated by this observation, we propose a foreground-guided enhancement strategy. Specifically, we introduce a spatially aware auxiliary loss that encourages expert activation to align with semantic foreground regions. To further reinforce this supervision, we integrate a lightweight LayerScale mechanism that improves information flow and stabilizes optimization in skip connections. Our method necessitates only minor architectural adjustments and can be seamlessly integrated into prevailing Soft MoE frameworks. Comprehensive experiments on ImageNet-1K and multiple smaller-scale classification benchmarks not only showcase consistent performance enhancements but also reveal more interpretable expert routing mechanisms.

Title: HyperFake: Hyperspectral Reconstruction and Attention-Guided Analysis for Advanced Deepfake Detection

Authors: Pavan C Shekar, Pawan Soni, Vivek Kanhangad
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18587
Pdf URL: https://arxiv.org/pdf/2505.18587
Copy Paste: [[2505.18587]] HyperFake: Hyperspectral Reconstruction and Attention-Guided Analysis for Advanced Deepfake Detection(https://arxiv.org/abs/2505.18587)
Keywords: security, transformer
Abstract: Deepfakes pose a significant threat to digital media security, with current detection methods struggling to generalize across different manipulation techniques and datasets. While recent approaches combine CNN-based architectures with Vision Transformers or leverage multi-modal learning, they remain limited by the inherent constraints of RGB data. We introduce HyperFake, a novel deepfake detection pipeline that reconstructs 31-channel hyperspectral data from standard RGB videos, revealing hidden manipulation traces invisible to conventional methods. Using an improved MST++ architecture, HyperFake enhances hyperspectral reconstruction, while a spectral attention mechanism selects the most critical spectral features for deepfake detection. The refined spectral data is then processed by an EfficientNet-based classifier optimized for spectral analysis, enabling more accurate and generalizable detection across different deepfake styles and datasets, all without the need for expensive hyperspectral cameras. To the best of our knowledge, this is the first approach to leverage hyperspectral imaging reconstruction for deepfake detection, opening new possibilities for detecting increasingly sophisticated manipulations.

Title: Safety Alignment via Constrained Knowledge Unlearning

Authors: Zesheng Shi, Yucheng Zhou, Jing Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18588
Pdf URL: https://arxiv.org/pdf/2505.18588
Copy Paste: [[2505.18588]] Safety Alignment via Constrained Knowledge Unlearning(https://arxiv.org/abs/2505.18588)
Keywords: defense, attack, large language model
Abstract: Despite significant progress in safety alignment, large language models (LLMs) remain susceptible to jailbreak attacks. Existing defense mechanisms have not fully deleted harmful knowledge in LLMs, which allows such attacks to bypass safeguards and produce harmful outputs. To address this challenge, we propose a novel safety alignment strategy, Constrained Knowledge Unlearning (CKU), which focuses on two primary objectives: knowledge localization and retention, and unlearning harmful knowledge. CKU works by scoring neurons in specific multilayer perceptron (MLP) layers to identify a subset U of neurons associated with useful knowledge. During the unlearning process, CKU prunes the gradients of neurons in U to preserve valuable knowledge while effectively mitigating harmful content. Experimental results demonstrate that CKU significantly enhances model safety without compromising overall performance, offering a superior balance between safety and utility compared to existing methods. Additionally, our analysis of neuron knowledge sensitivity across various MLP layers provides valuable insights into the mechanics of safety alignment and model knowledge editing.

Title: EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models

Authors: GuangHao Meng, Sunan He, Jinpeng Wang, Tao Dai, Letian Zhang, Jieming Zhu, Qing Li, Gang Wang, Rui Zhang, Yong Jiang
Subjects: cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2505.18594
Pdf URL: https://arxiv.org/pdf/2505.18594
Copy Paste: [[2505.18594]] EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models(https://arxiv.org/abs/2505.18594)
Keywords: generative, large language model
Abstract: Vision-language retrieval (VLR) has attracted significant attention in both academia and industry, which involves using text (or images) as queries to retrieve corresponding images (or text). However, existing methods often neglect the rich visual semantics knowledge of entities, thus leading to incorrect retrieval results. To address this problem, we propose the Entity Visual Description enhanced CLIP (EvdCLIP), designed to leverage the visual knowledge of entities to enrich queries. Specifically, since humans recognize entities through visual cues, we employ a large language model (LLM) to generate Entity Visual Descriptions (EVDs) as alignment cues to complement textual data. These EVDs are then integrated into raw queries to create visually-rich, EVD-enhanced queries. Furthermore, recognizing that EVD-enhanced queries may introduce noise or low-quality expansions, we develop a novel, trainable EVD-aware Rewriter (EaRW) for vision-language retrieval tasks. EaRW utilizes EVD knowledge and the generative capabilities of the language model to effectively rewrite queries. With our specialized training strategy, EaRW can generate high-quality and low-noise EVD-enhanced queries. Extensive quantitative and qualitative experiments on image-text retrieval benchmarks validate the superiority of EvdCLIP on vision-language retrieval tasks.

Title: MisoDICE: Multi-Agent Imitation from Unlabeled Mixed-Quality Demonstrations

Authors: The Viet Bui, Tien Mai, Hong Thanh Nguyen
Subjects: cs.LG, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2505.18595
Pdf URL: https://arxiv.org/pdf/2505.18595
Copy Paste: [[2505.18595]] MisoDICE: Multi-Agent Imitation from Unlabeled Mixed-Quality Demonstrations(https://arxiv.org/abs/2505.18595)
Keywords: robust, large language model
Abstract: We study offline imitation learning (IL) in cooperative multi-agent settings, where demonstrations have unlabeled mixed quality - containing both expert and suboptimal trajectories. Our proposed solution is structured in two stages: trajectory labeling and multi-agent imitation learning, designed jointly to enable effective learning from heterogeneous, unlabeled data. In the first stage, we combine advances in large language models and preference-based reinforcement learning to construct a progressive labeling pipeline that distinguishes expert-quality trajectories. In the second stage, we introduce MisoDICE, a novel multi-agent IL algorithm that leverages these labels to learn robust policies while addressing the computational complexity of large joint state-action spaces. By extending the popular single-agent DICE framework to multi-agent settings with a new value decomposition and mixing architecture, our method yields a convex policy optimization objective and ensures consistency between global and local policies. We evaluate MisoDICE on multiple standard multi-agent RL benchmarks and demonstrate superior performance, especially when expert data is scarce.

Title: Debate-to-Detect: Reformulating Misinformation Detection as a Real-World Debate with Large Language Models

Authors: Chen Han, Wenzhen Zheng, Xijin Tang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18596
Pdf URL: https://arxiv.org/pdf/2505.18596
Copy Paste: [[2505.18596]] Debate-to-Detect: Reformulating Misinformation Detection as a Real-World Debate with Large Language Models(https://arxiv.org/abs/2505.18596)
Keywords: robust, large language model
Abstract: The proliferation of misinformation in digital platforms reveals the limitations of traditional detection methods, which mostly rely on static classification and fail to capture the intricate process of real-world fact-checking. Despite advancements in Large Language Models (LLMs) that enhance automated reasoning, their application to misinformation detection remains hindered by issues of logical inconsistency and superficial verification. In response, we introduce Debate-to-Detect (D2D), a novel Multi-Agent Debate (MAD) framework that reformulates misinformation detection as a structured adversarial debate. Inspired by fact-checking workflows, D2D assigns domain-specific profiles to each agent and orchestrates a five-stage debate process, including Opening Statement, Rebuttal, Free Debate, Closing Statement, and Judgment. To transcend traditional binary classification, D2D introduces a multi-dimensional evaluation mechanism that assesses each claim across five distinct dimensions: Factuality, Source Reliability, Reasoning Quality, Clarity, and Ethics. Experiments with GPT-4o on two fakenews datasets demonstrate significant improvements over baseline methods, and the case study highlight D2D's capability to iteratively refine evidence while improving decision transparency, representing a substantial advancement towards robust and interpretable misinformation detection. The code will be open-sourced in a future release.

Title: Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment

Authors: Bryan Sangwoo Kim, Jeongsol Kim, Jong Chul Ye
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18600
Pdf URL: https://arxiv.org/pdf/2505.18600
Copy Paste: [[2505.18600]] Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment(https://arxiv.org/abs/2505.18600)
Keywords: diffusion
Abstract: Modern single-image super-resolution (SISR) models deliver photo-realistic results at the scale factors on which they are trained, but collapse when asked to magnify far beyond that regime. We address this scalability bottleneck with Chain-of-Zoom (CoZ), a model-agnostic framework that factorizes SISR into an autoregressive chain of intermediate scale-states with multi-scale-aware prompts. CoZ repeatedly re-uses a backbone SR model, decomposing the conditional probability into tractable sub-problems to achieve extreme resolutions without additional training. Because visual cues diminish at high magnifications, we augment each zoom step with multi-scale-aware text prompts generated by a vision-language model (VLM). The prompt extractor itself is fine-tuned using Generalized Reward Policy Optimization (GRPO) with a critic VLM, aligning text guidance towards human preference. Experiments show that a standard 4x diffusion SR model wrapped in CoZ attains beyond 256x enlargement with high perceptual quality and fidelity.

Title: Flex-Judge: Think Once, Judge Anywhere

Authors: Jongwoo Ko, Sungnyun Kim, Sungwoo Cho, Se-Young Yun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18601
Pdf URL: https://arxiv.org/pdf/2505.18601
Copy Paste: [[2505.18601]] Flex-Judge: Think Once, Judge Anywhere(https://arxiv.org/abs/2505.18601)
Keywords: robust, generative, large language model
Abstract: Human-generated reward signals are critical for aligning generative models with human preferences, guiding both training and inference-time evaluations. While large language models (LLMs) employed as proxy evaluators, i.e., LLM-as-a-Judge, significantly reduce the costs associated with manual annotations, they typically require extensive modality-specific training data and fail to generalize well across diverse multimodal tasks. In this paper, we propose Flex-Judge, a reasoning-guided multimodal judge model that leverages minimal textual reasoning data to robustly generalize across multiple modalities and evaluation formats. Our core intuition is that structured textual reasoning explanations inherently encode generalizable decision-making patterns, enabling an effective transfer to multimodal judgments, e.g., with images or videos. Empirical results demonstrate that Flex-Judge, despite being trained on significantly fewer text data, achieves competitive or superior performance compared to state-of-the-art commercial APIs and extensively trained multimodal evaluators. Notably, Flex-Judge presents broad impact in modalities like molecule, where comprehensive evaluation benchmarks are scarce, underscoring its practical value in resource-constrained domains. Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable multimodal model-as-a-judge.

Title: Rethinking Causal Mask Attention for Vision-Language Inference

Authors: Xiaohuan Pei, Tao Huang, YanXiang Ma, Chang Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18605
Pdf URL: https://arxiv.org/pdf/2505.18605
Copy Paste: [[2505.18605]] Rethinking Causal Mask Attention for Vision-Language Inference(https://arxiv.org/abs/2505.18605)
Keywords: generative, large language model
Abstract: Causal attention has become a foundational mechanism in autoregressive vision-language models (VLMs), unifying textual and visual inputs under a single generative framework. However, existing causal mask-based strategies are inherited from large language models (LLMs) where they are tailored for text-only decoding, and their adaptation to vision tokens is insufficiently addressed in the prefill stage. Strictly masking future positions for vision queries introduces overly rigid constraints, which hinder the model's ability to leverage future context that often contains essential semantic cues for accurate inference. In this work, we empirically investigate how different causal masking strategies affect vision-language inference and then propose a family of future-aware attentions tailored for this setting. We first empirically analyze the effect of previewing future tokens for vision queries and demonstrate that rigid masking undermines the model's capacity to capture useful contextual semantic representations. Based on these findings, we propose a lightweight attention family that aggregates future visual context into past representations via pooling, effectively preserving the autoregressive structure while enhancing cross-token dependencies. We evaluate a range of causal masks across diverse vision-language inference settings and show that selectively compressing future semantic context into past representations benefits the inference.

Title: Spiking Transformers Need High Frequency Information

Authors: Yuetong Fang, Deming Zhou, Ziqing Wang, Hongwei Ren, ZeCui Zeng, Lusong Li, Shibo Zhou, Renjing Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18608
Pdf URL: https://arxiv.org/pdf/2505.18608
Copy Paste: [[2505.18608]] Spiking Transformers Need High Frequency Information(https://arxiv.org/abs/2505.18608)
Keywords: transformer
Abstract: Spiking Transformers offer an energy-efficient alternative to conventional deep learning by transmitting information solely through binary (0/1) spikes. However, there remains a substantial performance gap compared to artificial neural networks. A common belief is that their binary and sparse activation transmission leads to information loss, thus degrading feature representation and accuracy. In this work, however, we reveal for the first time that spiking neurons preferentially propagate low-frequency information. We hypothesize that the rapid dissipation of high-frequency components is the primary cause of performance degradation. For example, on Cifar-100, adopting Avg-Pooling (low-pass) for token mixing lowers performance to 76.73%; interestingly, replacing it with Max-Pooling (high-pass) pushes the top-1 accuracy to 79.12%, surpassing the well-tuned Spikformer baseline by 0.97%. Accordingly, we introduce Max-Former that restores high-frequency signals through two frequency-enhancing operators: extra Max-Pooling in patch embedding and Depth-Wise Convolution in place of self-attention. Notably, our Max-Former (63.99 M) hits the top-1 accuracy of 82.39% on ImageNet, showing a +7.58% improvement over Spikformer with comparable model size (74.81%, 66.34 M). We hope this simple yet effective solution inspires future research to explore the distinctive nature of spiking neural networks, beyond the established practice in standard deep learning.

Title: PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs

Authors: Tengxuan Liu, Shiyao Li, Jiayi Yang, Tianchen Zhao, Feng Zhou, Xiaohui Song, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18610
Pdf URL: https://arxiv.org/pdf/2505.18610
Copy Paste: [[2505.18610]] PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs(https://arxiv.org/abs/2505.18610)
Keywords: transformer, large language model
Abstract: Recently, significant progress has been made in developing reasoning-capable Large Language Models (LLMs) through long Chain-of-Thought (CoT) techniques. However, this long-CoT reasoning process imposes substantial memory overhead due to the large Key-Value (KV) Cache memory overhead. Post-training KV Cache quantization has emerged as a promising compression technique and has been extensively studied in short-context scenarios. However, directly applying existing methods to long-CoT LLMs causes significant performance degradation due to the following two reasons: (1) Large cumulative error: Existing methods fail to adequately leverage available memory, and they directly quantize the KV Cache during each decoding step, leading to large cumulative quantization error. (2) Short-context calibration: Due to Rotary Positional Embedding (RoPE), the use of short-context data during calibration fails to account for the distribution of less frequent channels in the Key Cache, resulting in performance loss. We propose Progressive Mixed-Precision KV Cache Quantization (PM-KVQ) for long-CoT LLMs to address the above issues in two folds: (1) To reduce cumulative error, we design a progressive quantization strategy to gradually lower the bit-width of KV Cache in each block. Then, we propose block-wise memory allocation to assign a higher bit-width to more sensitive transformer blocks. (2) To increase the calibration length without additional overhead, we propose a new calibration strategy with positional interpolation that leverages short calibration data with positional interpolation to approximate the data distribution of long-context data. Extensive experiments on 7B-70B long-CoT LLMs show that PM-KVQ improves reasoning benchmark performance by up to 8% over SOTA baselines under the same memory budget. Our code is available at this https URL.

Title: Mod-Adapter: Tuning-Free and Versatile Multi-concept Personalization via Modulation Adapter

Authors: Weizhi Zhong, Huan Yang, Zheng Liu, Huiguo He, Zijian He, Xuesong Niu, Di Zhang, Guanbin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18612
Pdf URL: https://arxiv.org/pdf/2505.18612
Copy Paste: [[2505.18612]] Mod-Adapter: Tuning-Free and Versatile Multi-concept Personalization via Modulation Adapter(https://arxiv.org/abs/2505.18612)
Keywords: diffusion, transformer
Abstract: Personalized text-to-image generation aims to synthesize images of user-provided concepts in diverse contexts. Despite recent progress in multi-concept personalization, most are limited to object concepts and struggle to customize abstract concepts (e.g., pose, lighting). Some methods have begun exploring multi-concept personalization supporting abstract concepts, but they require test-time fine-tuning for each new concept, which is time-consuming and prone to overfitting on limited training images. In this work, we propose a novel tuning-free method for multi-concept personalization that can effectively customize both object and abstract concepts without test-time fine-tuning. Our method builds upon the modulation mechanism in pretrained Diffusion Transformers (DiTs) model, leveraging the localized and semantically meaningful properties of the modulation space. Specifically, we propose a novel module, Mod-Adapter, to predict concept-specific modulation direction for the modulation process of concept-related text tokens. It incorporates vision-language cross-attention for extracting concept visual features, and Mixture-of-Experts (MoE) layers that adaptively map the concept features into the modulation space. Furthermore, to mitigate the training difficulty caused by the large gap between the concept image space and the modulation space, we introduce a VLM-guided pretraining strategy that leverages the strong image understanding capabilities of vision-language models to provide semantic supervision signals. For a comprehensive comparison, we extend a standard benchmark by incorporating abstract concepts. Our method achieves state-of-the-art performance in multi-concept personalization, supported by quantitative, qualitative, and human evaluations.

Title: MLRan: A Behavioural Dataset for Ransomware Analysis and Detection

Authors: Faithful Chiagoziem Onwuegbuche, Adelodun Olaoluwa, Anca Delia Jurcut, Liliana Pasquale
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18613
Pdf URL: https://arxiv.org/pdf/2505.18613
Copy Paste: [[2505.18613]] MLRan: A Behavioural Dataset for Ransomware Analysis and Detection(https://arxiv.org/abs/2505.18613)
Keywords: security, extraction
Abstract: Ransomware remains a critical threat to cybersecurity, yet publicly available datasets for training machine learning-based ransomware detection models are scarce and often have limited sample size, diversity, and reproducibility. In this paper, we introduce MLRan, a behavioural ransomware dataset, comprising over 4,800 samples across 64 ransomware families and a balanced set of goodware samples. The samples span from 2006 to 2024 and encompass the four major types of ransomware: locker, crypto, ransomware-as-a-service, and modern variants. We also propose guidelines (GUIDE-MLRan), inspired by previous work, for constructing high-quality behavioural ransomware datasets, which informed the curation of our dataset. We evaluated the ransomware detection performance of several machine learning (ML) models using MLRan. For this purpose, we performed feature selection by conducting mutual information filtering to reduce the initial 6.4 million features to 24,162, followed by recursive feature elimination, yielding 483 highly informative features. The ML models achieved an accuracy, precision and recall of up to 98.7%, 98.9%, 98.5%, respectively. Using SHAP and LIME, we identified critical indicators of malicious behaviour, including registry tampering, strings, and API misuse. The dataset and source code for feature extraction, selection, ML training, and evaluation are available publicly to support replicability and encourage future research, which can be found at this https URL.

Title: Anonymity-washing

Authors: Szivia Lestyán, William Letrone, Ludovica Robustelli, Gergely Biczók
Subjects: cs.CR, cs.DB
Abstract URL: https://arxiv.org/abs/2505.18627
Pdf URL: https://arxiv.org/pdf/2505.18627
Copy Paste: [[2505.18627]] Anonymity-washing(https://arxiv.org/abs/2505.18627)
Keywords: privacy, protect
Abstract: Anonymization is a foundational principle of data privacy regulation, yet its practical application remains riddled with ambiguity and inconsistency. This paper introduces the concept of anonymity-washing -- the misrepresentation of the anonymity level of ``sanitized'' personal data -- as a critical privacy concern. While both legal and technical critiques of anonymization exist, they tend to address isolated aspects of the problem. In contrast, this paper offers a comprehensive overview of the conditions that enable anonymity-washing. It synthesizes fragmented legal interpretations, technical misunderstandings, and outdated regulatory guidance and complements them with a systematic review of national and international resources, including legal cases, data protection authority guidelines, and technical documentation. Our findings reveal a lack of coherent support for practitioners, contributing to the persistent misuse of pseudonymization and obsolete anonymization techniques. We conclude by recommending targeted education, clearer technical guidance, and closer cooperation between regulators, researchers, and industry to bridge the gap between legal norms and technical reality.

Title: Think Before You Accept: Semantic Reflective Verification for Faster Speculative Decoding

Authors: Yixuan Wang, Yijun Liu, Shiyu ji, Yuzhuang Xu, Yang Xu, Qingfu Zhu, Wanxiang Che
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18629
Pdf URL: https://arxiv.org/pdf/2505.18629
Copy Paste: [[2505.18629]] Think Before You Accept: Semantic Reflective Verification for Faster Speculative Decoding(https://arxiv.org/abs/2505.18629)
Keywords: large language model
Abstract: Large language models (LLMs) suffer from high inference latency due to the auto-regressive decoding process. Speculative decoding accelerates inference by generating multiple draft tokens using a lightweight model and verifying them in parallel. However, existing verification methods rely heavily on distributional consistency while overlooking semantic correctness, thereby limiting the potential speedup of speculative decoding. While some methods employ additional models for relaxed verification of draft tokens, they often fail to generalize effectively to more diverse or open-domain settings. In this work, we propose Reflective Verification, a training-free and semantics-aware approach that achieves a better trade-off between correctness and efficiency. Specifically, we leverage the inherent reflective capacity of LLMs to semantically assess the correctness of draft tokens in parallel during verification. Using prompt-based probing, we obtain both the original and reflective distributions of draft tokens in a single forward pass. The fusion of these distributions enables semantic-level verification of draft tokens that incorporates both consistency and correctness. Experiments across multiple domain benchmarks and model scales demonstrate that our method significantly increases the acceptance length of draft tokens without compromising model performance. Furthermore, we find that the proposed Reflective Verification is orthogonal to existing statistical verification methods, and their combination yields additional 5$\sim$15\% improvements in decoding speed.

Title: DDO: Dual-Decision Optimization via Multi-Agent Collaboration for LLM-Based Medical Consultation

Authors: Zhihao Jia, Mingyi Jia, Junwen Duan, Jianxin Wang
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2505.18630
Pdf URL: https://arxiv.org/pdf/2505.18630
Copy Paste: [[2505.18630]] DDO: Dual-Decision Optimization via Multi-Agent Collaboration for LLM-Based Medical Consultation(https://arxiv.org/abs/2505.18630)
Keywords: large language model
Abstract: Large Language Models (LLMs) demonstrate strong generalization and reasoning abilities, making them well-suited for complex decision-making tasks such as medical consultation (MC). However, existing LLM-based methods often fail to capture the dual nature of MC, which entails two distinct sub-tasks: symptom inquiry, a sequential decision-making process, and disease diagnosis, a classification problem. This mismatch often results in ineffective symptom inquiry and unreliable disease diagnosis. To address this, we propose \textbf{DDO}, a novel LLM-based framework that performs \textbf{D}ual-\textbf{D}ecision \textbf{O}ptimization by decoupling and independently optimizing the the two sub-tasks through a collaborative multi-agent workflow. Experiments on three real-world MC datasets show that DDO consistently outperforms existing LLM-based approaches and achieves competitive performance with state-of-the-art generation-based methods, demonstrating its effectiveness in the MC task.

Title: SerendibCoins: Exploring The Sri Lankan Coins Dataset

Authors: NH Wanigasingha, ES Sithpahan, MKA Ariyaratne, PRS De Silva
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18634
Pdf URL: https://arxiv.org/pdf/2505.18634
Copy Paste: [[2505.18634]] SerendibCoins: Exploring The Sri Lankan Coins Dataset(https://arxiv.org/abs/2505.18634)
Keywords: robust
Abstract: The recognition and classification of coins are essential in numerous financial and automated systems. This study introduces a comprehensive Sri Lankan coin image dataset and evaluates its impact on machine learning model accuracy for coin classification. We experiment with traditional machine learning classifiers K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Random Forest as well as a custom Convolutional Neural Network (CNN) to benchmark performance at different levels of classification. Our results show that SVM outperforms KNN and Random Forest in traditional classification approaches, while the CNN model achieves near-perfect classification accuracy with minimal misclassifications. The dataset demonstrates significant potential in enhancing automated coin recognition systems, offering a robust foundation for future research in regional currency classification and deep learning applications.

Title: Multilingual Question Answering in Low-Resource Settings: A Dzongkha-English Benchmark for Foundation Models

Authors: Md. Tanzib Hosain, Rajan Das Gupta, Md. Kishor Morol
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18638
Pdf URL: https://arxiv.org/pdf/2505.18638
Copy Paste: [[2505.18638]] Multilingual Question Answering in Low-Resource Settings: A Dzongkha-English Benchmark for Foundation Models(https://arxiv.org/abs/2505.18638)
Keywords: large language model
Abstract: In this work, we provide DZEN, a dataset of parallel Dzongkha and English test questions for Bhutanese middle and high school students. The over 5K questions in our collection span a variety of scientific topics and include factual, application, and reasoning-based questions. We use our parallel dataset to test a number of Large Language Models (LLMs) and find a significant performance difference between the models in English and Dzongkha. We also look at different prompting strategies and discover that Chain-of-Thought (CoT) prompting works well for reasoning questions but less well for factual ones. We also find that adding English translations enhances the precision of Dzongkha question responses. Our results point to exciting avenues for further study to improve LLM performance in Dzongkha and, more generally, in low-resource languages. We release the dataset at: this https URL.

Title: ThanoRA: Task Heterogeneity-Aware Multi-Task Low-Rank Adaptation

Authors: Jian Liang, Wenke Huang, Xianda Guo, Guancheng Wan, Bo Du, Mang Ye
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18640
Pdf URL: https://arxiv.org/pdf/2505.18640
Copy Paste: [[2505.18640]] ThanoRA: Task Heterogeneity-Aware Multi-Task Low-Rank Adaptation(https://arxiv.org/abs/2505.18640)
Keywords: robust
Abstract: Low-Rank Adaptation (LoRA) is widely adopted for downstream fine-tuning of foundation models due to its efficiency and zero additional inference cost. Many real-world applications require foundation models to specialize in multiple tasks simultaneously, motivating the need for efficient multi-task adaptation. While recent approaches integrate LoRA with mixture-of-experts (MoE) to address this, the use of routers prevents parameter mergeability, which increases inference overhead and hinders unified multi-task adaptation, thereby limiting deployment practicality. In this work, we propose ThanoRA, a Task Heterogeneity-Aware Multi-Task Low-Rank Adaptation framework that enables multi-task adaptation while preserving the inference efficiency of LoRA. ThanoRA jointly models task heterogeneity and mitigates subspace interference throughout training. Specifically, motivated by inherent differences in complexity and heterogeneity across tasks, ThanoRA constructs task-specific LoRA subspaces at initialization, enabling fine-grained knowledge injection aligned with task heterogeneity. Furthermore, to prevent task interference and subspace collapse during multi-task training, ThanoRA introduces a subspace-preserving regularization that maintains the independence of task-specific representations. With the synergy of both components, ThanoRA enables efficient and unified multi-task adaptation. Extensive experiments across multimodal and text-only benchmarks under varying multi-task mixtures demonstrate that ThanoRA consistently achieves robust and superior performance over strong baselines without introducing additional inference overhead. Our code is publicly available at: this https URL.

Title: Skip-Thinking: Chunk-wise Chain-of-Thought Distillation Enable Smaller Language Models to Reason Better and Faster

Authors: Xiao Chen, Sihang Zhou, Ke Liang, Xiaoyu Sun, Xinwang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18642
Pdf URL: https://arxiv.org/pdf/2505.18642
Copy Paste: [[2505.18642]] Skip-Thinking: Chunk-wise Chain-of-Thought Distillation Enable Smaller Language Models to Reason Better and Faster(https://arxiv.org/abs/2505.18642)
Keywords: large language model
Abstract: Chain-of-thought (CoT) distillation allows a large language model (LLM) to guide a small language model (SLM) in reasoning tasks. Existing methods train the SLM to learn the long rationale in one iteration, resulting in two issues: 1) Long rationales lead to a large token-level batch size during training, making gradients of core reasoning tokens (i.e., the token will directly affect the correctness of subsequent reasoning) over-smoothed as they contribute a tiny fraction of the rationale. As a result, the SLM converges to sharp minima where it fails to grasp the reasoning logic. 2) The response is slow, as the SLM must generate a long rationale before reaching the answer. Therefore, we propose chunk-wise training (CWT), which uses a heuristic search to divide the rationale into internal semantically coherent chunks and focuses SLM on learning from only one chunk per iteration. In this way, CWT naturally isolates non-reasoning chunks that do not involve the core reasoning token (e.g., summary and transitional chunks) from the SLM learning for reasoning chunks, making the fraction of the core reasoning token increase in the corresponding iteration. Based on CWT, skip-thinking training (STT) is proposed. STT makes the SLM automatically skip non-reasoning medium chunks to reach the answer, improving reasoning speed while maintaining accuracy. We validate our approach on a variety of SLMs and multiple reasoning tasks.

Title: Flow Matching for Geometric Trajectory Simulation

Authors: Kiet Bennema ten Brinke, Koen Minartz, Vlado Menkovski
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18647
Pdf URL: https://arxiv.org/pdf/2505.18647
Copy Paste: [[2505.18647]] Flow Matching for Geometric Trajectory Simulation(https://arxiv.org/abs/2505.18647)
Keywords: generative
Abstract: The simulation of N-body systems is a fundamental problem with applications in a wide range of fields, such as molecular dynamics, biochemistry, and pedestrian dynamics. Machine learning has become an invaluable tool for scaling physics-based simulators and developing models directly from experimental data. In particular, recent advances based on deep generative modeling and geometric deep learning have enabled probabilistic simulation by modeling complex distributions over trajectories while respecting the permutation symmetry that is fundamental to N-body systems. However, to generate realistic trajectories, existing methods must learn complex transformations starting from uninformed noise and do not allow for the exploitation of domain-informed priors. In this work, we propose STFlow to address this limitation. By leveraging flow matching and data-dependent couplings, STFlow facilitates physics-informed simulation of geometric trajectories without sacrificing model expressivity or scalability. Our evaluation on N-body dynamical systems, molecular dynamics, and pedestrian dynamics benchmarks shows that STFlow produces significantly lower prediction errors while enabling more efficient inference, highlighting the benefits of employing physics-informed prior distributions in probabilistic geometric trajectory modeling.

Title: ProphetDWM: A Driving World Model for Rolling Out Future Actions and Videos

Authors: Xiaodong Wang, Peixi Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18650
Pdf URL: https://arxiv.org/pdf/2505.18650
Copy Paste: [[2505.18650]] ProphetDWM: A Driving World Model for Rolling Out Future Actions and Videos(https://arxiv.org/abs/2505.18650)
Keywords: diffusion
Abstract: Real-world driving requires people to observe the current environment, anticipate the future, and make appropriate driving decisions. This requirement is aligned well with the capabilities of world models, which understand the environment and predict the future. However, recent world models in autonomous driving are built explicitly, where they could predict the future by controllable driving video generation. We argue that driving world models should have two additional abilities: action control and action prediction. Following this line, previous methods are limited because they predict the video requires given actions of the same length as the video and ignore the dynamical action laws. To address these issues, we propose ProphetDWM, a novel end-to-end driving world model that jointly predicts future videos and actions. Our world model has an action module to learn latent action from the present to the future period by giving the action sequence and observations. And a diffusion-model-based transition module to learn the state distribution. The model is jointly trained by learning latent actions given finite states and predicting action and video. The joint learning connects the action dynamics and states and enables long-term future prediction. We evaluate our method in video generation and action prediction tasks on the Nuscenes dataset. Compared to the state-of-the-art methods, our method achieves the best video consistency and best action prediction accuracy, while also enabling high-quality long-term video and action generation.

Title: On the Emergence of Linear Analogies in Word Embeddings

Authors: Daniel J. Korchinski, Dhruva Karkada, Yasaman Bahri, Matthieu Wyart
Subjects: cs.CL, cond-mat.dis-nn, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18651
Pdf URL: https://arxiv.org/pdf/2505.18651
Copy Paste: [[2505.18651]] On the Emergence of Linear Analogies in Word Embeddings(https://arxiv.org/abs/2505.18651)
Keywords: robust, generative
Abstract: Models such as Word2Vec and GloVe construct word embeddings based on the co-occurrence probability $P(i,j)$ of words $i$ and $j$ in text corpora. The resulting vectors $W_i$ not only group semantically similar words but also exhibit a striking linear analogy structure -- for example, $W_{\text{king}} - W_{\text{man}} + W_{\text{woman}} \approx W_{\text{queen}}$ -- whose theoretical origin remains unclear. Previous observations indicate that this analogy structure: (i) already emerges in the top eigenvectors of the matrix $M(i,j) = P(i,j)/P(i)P(j)$, (ii) strengthens and then saturates as more eigenvectors of $M (i, j)$, which controls the dimension of the embeddings, are included, (iii) is enhanced when using $\log M(i,j)$ rather than $M(i,j)$, and (iv) persists even when all word pairs involved in a specific analogy relation (e.g., king-queen, man-woman) are removed from the corpus. To explain these phenomena, we introduce a theoretical generative model in which words are defined by binary semantic attributes, and co-occurrence probabilities are derived from attribute-based interactions. This model analytically reproduces the emergence of linear analogy structure and naturally accounts for properties (i)-(iv). It can be viewed as giving fine-grained resolution into the role of each additional embedding dimension. It is robust to various forms of noise and agrees well with co-occurrence statistics measured on Wikipedia and the analogy benchmark introduced by Mikolov et al.

Title: Why Not Replace? Sustaining Long-Term Visual Localization via Handcrafted-Learned Feature Collaboration on CPU

Authors: Yicheng Lin, Yunlong Jiang, Xujia Jiao, Bin Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18652
Pdf URL: https://arxiv.org/pdf/2505.18652
Copy Paste: [[2505.18652]] Why Not Replace? Sustaining Long-Term Visual Localization via Handcrafted-Learned Feature Collaboration on CPU(https://arxiv.org/abs/2505.18652)
Keywords: robust, extraction
Abstract: Robust long-term visual localization in complex industrial environments is critical for mobile robotic systems. Existing approaches face limitations: handcrafted features are illumination-sensitive, learned features are computationally intensive, and semantic- or marker-based methods are environmentally constrained. Handcrafted and learned features share similar representations but differ functionally. Handcrafted features are optimized for continuous tracking, while learned features excel in wide-baseline matching. Their complementarity calls for integration rather than replacement. Building on this, we propose a hierarchical localization framework. It leverages real-time handcrafted feature extraction for relative pose estimation. In parallel, it employs selective learned keypoint detection on optimized keyframes for absolute positioning. This design enables CPU-efficient, long-term visual localization. Experiments systematically progress through three validation phases: Initially establishing feature complementarity through comparative analysis, followed by computational latency profiling across algorithm stages on CPU platforms. Final evaluation under photometric variations (including seasonal transitions and diurnal cycles) demonstrates 47% average error reduction with significantly improved localization consistency. The code implementation is publicly available at this https URL.

Title: Climate-Eval: A Comprehensive Benchmark for NLP Tasks Related to Climate Change

Authors: Murathan Kurfalı, Shorouq Zahra, Joakim Nivre, Gabriele Messori
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18653
Pdf URL: https://arxiv.org/pdf/2505.18653
Copy Paste: [[2505.18653]] Climate-Eval: A Comprehensive Benchmark for NLP Tasks Related to Climate Change(https://arxiv.org/abs/2505.18653)
Keywords: extraction, large language model
Abstract: Climate-Eval is a comprehensive benchmark designed to evaluate natural language processing models across a broad range of tasks related to climate change. Climate-Eval aggregates existing datasets along with a newly developed news classification dataset, created specifically for this release. This results in a benchmark of 25 tasks based on 13 datasets, covering key aspects of climate discourse, including text classification, question answering, and information extraction. Our benchmark provides a standardized evaluation suite for systematically assessing the performance of large language models (LLMs) on these tasks. Additionally, we conduct an extensive evaluation of open-source LLMs (ranging from 2B to 70B parameters) in both zero-shot and few-shot settings, analyzing their strengths and limitations in the domain of climate change.

Title: LLM-QFL: Distilling Large Language Model for Quantum Federated Learning

Authors: Dev Gurung, Shiva Raj Pokhrel
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18656
Pdf URL: https://arxiv.org/pdf/2505.18656
Copy Paste: [[2505.18656]] LLM-QFL: Distilling Large Language Model for Quantum Federated Learning(https://arxiv.org/abs/2505.18656)
Keywords: privacy, federate, large language model
Abstract: Inspired by the power of large language models (LLMs), our research adapts them to quantum federated learning (QFL) to boost efficiency and performance. We propose a federated fine-tuning method that distills an LLM within QFL, allowing each client to locally adapt the model to its own data while preserving privacy and reducing unnecessary global updates. The fine-tuned LLM also acts as a reinforcement agent, optimizing QFL by adjusting optimizer steps, cutting down communication rounds, and intelligently selecting clients. Experiments show significant efficiency gains. We pioneer a synergy between LLM and QFL, offering: i) practical efficiency: Reduced communication costs and faster convergence. ii) theoretical rigor: Provable guarantees for adaptive federated optimization. iii) scalability: PEFT methods (LoRA, QLoRA) enable deployment on resource-constrained quantum devices. Code implementation is available here 1.

Title: Robustness in Large Language Models: A Survey of Mitigation Strategies and Evaluation Metrics

Authors: Pankaj Kumar, Subhankar Mishra
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18658
Pdf URL: https://arxiv.org/pdf/2505.18658
Copy Paste: [[2505.18658]] Robustness in Large Language Models: A Survey of Mitigation Strategies and Evaluation Metrics(https://arxiv.org/abs/2505.18658)
Keywords: robust, large language model
Abstract: Large Language Models (LLMs) have emerged as a promising cornerstone for the development of natural language processing (NLP) and artificial intelligence (AI). However, ensuring the robustness of LLMs remains a critical challenge. To address these challenges and advance the field, this survey provides a comprehensive overview of current studies in this area. First, we systematically examine the nature of robustness in LLMs, including its conceptual foundations, the importance of consistent performance across diverse inputs, and the implications of failure modes in real-world applications. Next, we analyze the sources of non-robustness, categorizing intrinsic model limitations, data-driven vulnerabilities, and external adversarial factors that compromise reliability. Following this, we review state-of-the-art mitigation strategies, and then we discuss widely adopted benchmarks, emerging metrics, and persistent gaps in assessing real-world reliability. Finally, we synthesize findings from existing surveys and interdisciplinary studies to highlight trends, unresolved issues, and pathways for future research.

Title: So-Fake: Benchmarking and Explaining Social Media Image Forgery Detection

Authors: Zhenglin Huang, Tianxiao Li, Xiangtai Li, Haiquan Wen, Yiwei He, Jiangning Zhang, Hao Fei, Xi Yang, Xiaowei Huang, Bei Peng, Guangliang Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18660
Pdf URL: https://arxiv.org/pdf/2505.18660
Copy Paste: [[2505.18660]] So-Fake: Benchmarking and Explaining Social Media Image Forgery Detection(https://arxiv.org/abs/2505.18660)
Keywords: robust, generative
Abstract: Recent advances in AI-powered generative models have enabled the creation of increasingly realistic synthetic images, posing significant risks to information integrity and public trust on social media platforms. While robust detection frameworks and diverse, large-scale datasets are essential to mitigate these risks, existing academic efforts remain limited in scope: current datasets lack the diversity, scale, and realism required for social media contexts, while detection methods struggle with generalization to unseen generative technologies. To bridge this gap, we introduce So-Fake-Set, a comprehensive social media-oriented dataset with over 2 million high-quality images, diverse generative sources, and photorealistic imagery synthesized using 35 state-of-the-art generative models. To rigorously evaluate cross-domain robustness, we establish a novel and large-scale (100K) out-of-domain benchmark (So-Fake-OOD) featuring synthetic imagery from commercial models explicitly excluded from the training distribution, creating a realistic testbed for evaluating real-world performance. Leveraging these resources, we present So-Fake-R1, an advanced vision-language framework that employs reinforcement learning for highly accurate forgery detection, precise localization, and explainable inference through interpretable visual rationales. Extensive experiments show that So-Fake-R1 outperforms the second-best method, with a 1.3% gain in detection accuracy and a 4.5% increase in localization IoU. By integrating a scalable dataset, a challenging OOD benchmark, and an advanced detection framework, this work establishes a new foundation for social media-centric forgery detection research. The code, models, and datasets will be released publicly.

Title: DVD-Quant: Data-free Video Diffusion Transformers Quantization

Authors: Zhiteng Li, Hanxuan Li, Junyi Wu, Kai Liu, Linghe Kong, Guihai Chen, Yulun Zhang, Xiaokang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18663
Pdf URL: https://arxiv.org/pdf/2505.18663
Copy Paste: [[2505.18663]] DVD-Quant: Data-free Video Diffusion Transformers Quantization(https://arxiv.org/abs/2505.18663)
Keywords: diffusion, data-free, transformer
Abstract: Diffusion Transformers (DiTs) have emerged as the state-of-the-art architecture for video generation, yet their computational and memory demands hinder practical deployment. While post-training quantization (PTQ) presents a promising approach to accelerate Video DiT models, existing methods suffer from two critical limitations: (1) dependence on lengthy, computation-heavy calibration procedures, and (2) considerable performance deterioration after quantization. To address these challenges, we propose DVD-Quant, a novel Data-free quantization framework for Video DiTs. Our approach integrates three key innovations: (1) Progressive Bounded Quantization (PBQ) and (2) Auto-scaling Rotated Quantization (ARQ) for calibration data-free quantization error reduction, as well as (3) $\delta$-Guided Bit Switching ($\delta$-GBS) for adaptive bit-width allocation. Extensive experiments across multiple video generation benchmarks demonstrate that DVD-Quant achieves an approximately 2$\times$ speedup over full-precision baselines on HunyuanVideo while maintaining visual fidelity. Notably, DVD-Quant is the first to enable W4A4 PTQ for Video DiTs without compromising video quality. Code and models will be available at this https URL.

Title: Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment?

Authors: Hongzheng Yang, Yongqiang Chen, Zeyu Qin, Tongliang Liu, Chaowei Xiao, Kun Zhang, Bo Han
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.18672
Pdf URL: https://arxiv.org/pdf/2505.18672
Copy Paste: [[2505.18672]] Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment?(https://arxiv.org/abs/2505.18672)
Keywords: robust, large language model
Abstract: Representation intervention aims to locate and modify the representations that encode the underlying concepts in Large Language Models (LLMs) to elicit the aligned and expected behaviors. Despite the empirical success, it has never been examined whether one could locate the faithful concepts for intervention. In this work, we explore the question in safety alignment. If the interventions are faithful, the intervened LLMs should erase the harmful concepts and be robust to both in-distribution adversarial prompts and the out-of-distribution (OOD) jailbreaks. While it is feasible to erase harmful concepts without degrading the benign functionalities of LLMs in linear settings, we show that it is infeasible in the general non-linear setting. To tackle the issue, we propose Concept Concentration (COCA). Instead of identifying the faithful locations to intervene, COCA refractors the training data with an explicit reasoning process, which firstly identifies the potential unsafe concepts and then decides the responses. Essentially, COCA simplifies the decision boundary between harmful and benign representations, enabling more effective linear erasure. Extensive experiments with multiple representation intervention methods and model architectures demonstrate that COCA significantly reduces both in-distribution and OOD jailbreak success rates, and meanwhile maintaining strong performance on regular tasks such as math and code generation.

Title: Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models

Authors: Zixiang Xu, Yanbo Wang, Yue Huang, Xiuying Chen, Jieyu Zhao, Meng Jiang, Xiangliang Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18673
Pdf URL: https://arxiv.org/pdf/2505.18673
Copy Paste: [[2505.18673]] Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models(https://arxiv.org/abs/2505.18673)
Keywords: large language model
Abstract: Large Language Models (LLMs) have achieved remarkable success in Natural Language Processing (NLP), yet their cross-lingual performance consistency remains a significant challenge. This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in LLMs. Our approach leverages beam search and LLM-based simulation to generate bilingual question pairs that expose performance discrepancies between English and target languages. We construct a new dataset of over 6,000 bilingual pairs across 16 languages using this methodology, demonstrating its effectiveness in revealing weaknesses even in state-of-the-art models. The extensive experiments demonstrate that our method precisely and cost-effectively pinpoints cross-lingual weaknesses, consistently revealing over 50\% accuracy drops in target languages across a wide range of models. Moreover, further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns and benefit from targeted post-training. Code is available at this https URL.

Title: Restoring Real-World Images with an Internal Detail Enhancement Diffusion Model

Authors: Peng Xiao, Hongbo Zhao, Yijun Wang, Jianxin Lin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18674
Pdf URL: https://arxiv.org/pdf/2505.18674
Copy Paste: [[2505.18674]] Restoring Real-World Images with an Internal Detail Enhancement Diffusion Model(https://arxiv.org/abs/2505.18674)
Keywords: diffusion, generative
Abstract: Restoring real-world degraded images, such as old photographs or low-resolution images, presents a significant challenge due to the complex, mixed degradations they exhibit, such as scratches, color fading, and noise. Recent data-driven approaches have struggled with two main challenges: achieving high-fidelity restoration and providing object-level control over colorization. While diffusion models have shown promise in generating high-quality images with specific controls, they often fail to fully preserve image details during restoration. In this work, we propose an internal detail-preserving diffusion model for high-fidelity restoration of real-world degraded images. Our method utilizes a pre-trained Stable Diffusion model as a generative prior, eliminating the need to train a model from scratch. Central to our approach is the Internal Image Detail Enhancement (IIDE) technique, which directs the diffusion model to preserve essential structural and textural information while mitigating degradation effects. The process starts by mapping the input image into a latent space, where we inject the diffusion denoising process with degradation operations that simulate the effects of various degradation factors. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art models in both qualitative assessments and perceptual quantitative evaluations. Additionally, our approach supports text-guided restoration, enabling object-level colorization control that mimics the expertise of professional photo editing.

Title: Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps

Authors: Sicheng Feng, Song Wang, Shuyi Ouyang, Lingdong Kong, Zikai Song, Jianke Zhu, Huan Wang, Xinchao Wang
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.18675
Pdf URL: https://arxiv.org/pdf/2505.18675
Copy Paste: [[2505.18675]] Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps(https://arxiv.org/abs/2505.18675)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) have recently achieved significant progress in visual tasks, including semantic scene understanding and text-image alignment, with reasoning variants enhancing performance on complex tasks involving mathematics and logic. However, their capacity for reasoning tasks involving fine-grained visual understanding remains insufficiently evaluated. To address this gap, we introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs. ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates. Furthermore, we design a two-level evaluation pipeline that properly assesses answer correctness and quality. Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern: among open-source models, base models outperform reasoning ones, while the opposite trend is observed in closed-source models. Additionally, performance generally degrades when visual inputs are masked, indicating that while MLLMs can leverage prior knowledge to answer some questions, fine-grained visual reasoning tasks still require genuine visual perception for strong performance. Our benchmark study offers new insights into visual reasoning and contributes to investigating the gap between open-source and closed-source models.

Title: $PD^3F$: A Pluggable and Dynamic DoS-Defense Framework Against Resource Consumption Attacks Targeting Large Language Models

Authors: Yuanhe Zhang, Xinyue Wang, Haoran Gao, Zhenhong Zhou, Fanyu Meng, Yuyao Zhang, Sen Su
Subjects: cs.CR, cs.CL
Abstract URL: https://arxiv.org/abs/2505.18680
Pdf URL: https://arxiv.org/pdf/2505.18680
Copy Paste: [[2505.18680]] $PD^3F$: A Pluggable and Dynamic DoS-Defense Framework Against Resource Consumption Attacks Targeting Large Language Models(https://arxiv.org/abs/2505.18680)
Keywords: security, defense, attack, large language model
Abstract: Large Language Models (LLMs), due to substantial computational requirements, are vulnerable to resource consumption attacks, which can severely degrade server performance or even cause crashes, as demonstrated by denial-of-service (DoS) attacks designed for LLMs. However, existing works lack mitigation strategies against such threats, resulting in unresolved security risks for real-world LLM deployments. To this end, we propose the Pluggable and Dynamic DoS-Defense Framework ($PD^3F$), which employs a two-stage approach to defend against resource consumption attacks from both the input and output sides. On the input side, we propose the Resource Index to guide Dynamic Request Polling Scheduling, thereby reducing resource usage induced by malicious attacks under high-concurrency scenarios. On the output side, we introduce the Adaptive End-Based Suppression mechanism, which terminates excessive malicious generation early. Experiments across six models demonstrate that $PD^3F$ significantly mitigates resource consumption attacks, improving users' access capacity by up to 500% during adversarial load. $PD^3F$ represents a step toward the resilient and resource-aware deployment of LLMs against resource consumption attacks.

Title: TULUN: Transparent and Adaptable Low-resource Machine Translation

Authors: Raphaël Merx, Hanna Suominen, Lois Hong, Nick Thieberger, Trevor Cohn, Ekaterina Vylomova
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18683
Pdf URL: https://arxiv.org/pdf/2505.18683
Copy Paste: [[2505.18683]] TULUN: Transparent and Adaptable Low-resource Machine Translation(https://arxiv.org/abs/2505.18683)
Keywords: large language model
Abstract: Machine translation (MT) systems that support low-resource languages often struggle on specialized domains. While researchers have proposed various techniques for domain adaptation, these approaches typically require model fine-tuning, making them impractical for non-technical users and small organizations. To address this gap, we propose Tulun, a versatile solution for terminology-aware translation, combining neural MT with large language model (LLM)-based post-editing guided by existing glossaries and translation memories. Our open-source web-based platform enables users to easily create, edit, and leverage terminology resources, fostering a collaborative human-machine translation process that respects and incorporates domain expertise while increasing MT accuracy. Evaluations show effectiveness in both real-world and benchmark scenarios: on medical and disaster relief translation tasks for Tetun and Bislama, our system achieves improvements of 16.90-22.41 ChrF++ points over baseline MT systems. Across six low-resource languages on the FLORES dataset, Tulun outperforms both standalone MT and LLM approaches, achieving an average improvement of 2.8 ChrF points over NLLB-54B.

Title: From Generation to Detection: A Multimodal Multi-Task Dataset for Benchmarking Health Misinformation

Authors: Zhihao Zhang, Yiran Zhang, Xiyue Zhou, Liting Huang, Imran Razzak, Preslav Nakov, Usman Naseem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18685
Pdf URL: https://arxiv.org/pdf/2505.18685
Copy Paste: [[2505.18685]] From Generation to Detection: A Multimodal Multi-Task Dataset for Benchmarking Health Misinformation(https://arxiv.org/abs/2505.18685)
Keywords: generative
Abstract: Infodemics and health misinformation have significant negative impact on individuals and society, exacerbating confusion and increasing hesitancy in adopting recommended health measures. Recent advancements in generative AI, capable of producing realistic, human like text and images, have significantly accelerated the spread and expanded the reach of health misinformation, resulting in an alarming surge in its dissemination. To combat the infodemics, most existing work has focused on developing misinformation datasets from social media and fact checking platforms, but has faced limitations in topical coverage, inclusion of AI generation, and accessibility of raw content. To address these issues, we present MM Health, a large scale multimodal misinformation dataset in the health domain consisting of 34,746 news article encompassing both textual and visual information. MM Health includes human-generated multimodal information (5,776 articles) and AI generated multimodal information (28,880 articles) from various SOTA generative AI models. Additionally, We benchmarked our dataset against three tasks (reliability checks, originality checks, and fine-grained AI detection) demonstrating that existing SOTA models struggle to accurately distinguish the reliability and origin of information. Our dataset aims to support the development of misinformation detection across various health scenarios, facilitating the detection of human and machine generated content at multimodal levels.

Title: WeakMCN: Multi-task Collaborative Network for Weakly Supervised Referring Expression Comprehension and Segmentation

Authors: Yang Liu, Silin Cheng, Xinwei He, Sebastien Ourselin, Lei Tan, Gen Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18686
Pdf URL: https://arxiv.org/pdf/2505.18686
Copy Paste: [[2505.18686]] WeakMCN: Multi-task Collaborative Network for Weakly Supervised Referring Expression Comprehension and Segmentation(https://arxiv.org/abs/2505.18686)
Keywords: segmentation
Abstract: Weakly supervised referring expression comprehension(WREC) and segmentation(WRES) aim to learn object grounding based on a given expression using weak supervision signals like image-text pairs. While these tasks have traditionally been modeled separately, we argue that they can benefit from joint learning in a multi-task framework. To this end, we propose WeakMCN, a novel multi-task collaborative network that effectively combines WREC and WRES with a dual-branch architecture. Specifically, the WREC branch is formulated as anchor-based contrastive learning, which also acts as a teacher to supervise the WRES branch. In WeakMCN, we propose two innovative designs to facilitate multi-task collaboration, namely Dynamic Visual Feature Enhancement(DVFE) and Collaborative Consistency Module(CCM). DVFE dynamically combines various pre-trained visual knowledge to meet different task requirements, while CCM promotes cross-task consistency from the perspective of optimization. Extensive experimental results on three popular REC and RES benchmarks, i.e., RefCOCO, RefCOCO+, and RefCOCOg, consistently demonstrate performance gains of WeakMCN over state-of-the-art single-task alternatives, e.g., up to 3.91% and 13.11% on RefCOCO for WREC and WRES tasks, respectively. Furthermore, experiments also validate the strong generalization ability of WeakMCN in both semi-supervised REC and RES settings against existing methods, e.g., +8.94% for semi-REC and +7.71% for semi-RES on 1% RefCOCO. The code is publicly available at this https URL.

Title: Large Language Models in the Task of Automatic Validation of Text Classifier Predictions

Authors: Aleksandr Tsymbalov
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18688
Pdf URL: https://arxiv.org/pdf/2505.18688
Copy Paste: [[2505.18688]] Large Language Models in the Task of Automatic Validation of Text Classifier Predictions(https://arxiv.org/abs/2505.18688)
Keywords: large language model
Abstract: Machine learning models for text classification are trained to predict a class for a given text. To do this, training and validation samples must be prepared: a set of texts is collected, and each text is assigned a class. These classes are usually assigned by human annotators with different expertise levels, depending on the specific classification task. Collecting such samples from scratch is labor-intensive because it requires finding specialists and compensating them for their work; moreover, the number of available specialists is limited, and their productivity is constrained by human factors. While it may not be too resource-intensive to collect samples once, the ongoing need to retrain models (especially in incremental learning pipelines) to address data drift (also called model drift) makes the data collection process crucial and costly over the model's entire lifecycle. This paper proposes several approaches to replace human annotators with Large Language Models (LLMs) to test classifier predictions for correctness, helping ensure model quality and support high-quality incremental learning.

Title: Benchmarking and Rethinking Knowledge Editing for Large Language Models

Authors: Guoxiu He, Xin Song, Futing Wang, Aixin Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18690
Pdf URL: https://arxiv.org/pdf/2505.18690
Copy Paste: [[2505.18690]] Benchmarking and Rethinking Knowledge Editing for Large Language Models(https://arxiv.org/abs/2505.18690)
Keywords: robust, large language model
Abstract: Knowledge editing aims to update the embedded knowledge within Large Language Models (LLMs). However, existing approaches, whether through parameter modification or external memory integration, often suffer from inconsistent evaluation objectives and experimental setups. To address this gap, we conduct a comprehensive benchmarking study. In addition to fact-level datasets, we introduce more complex event-based datasets and general-purpose datasets drawn from other tasks. Our evaluation covers both instruction-tuned and reasoning-oriented LLMs, under a realistic autoregressive inference setting rather than teacher-forced decoding. Beyond single-edit assessments, we also evaluate multi-edit scenarios to better reflect practical demands. We employ four evaluation dimensions, including portability, and compare all recent methods against a simple and straightforward baseline named Selective Contextual Reasoning (SCR). Empirical results reveal that parameter-based editing methods perform poorly under realistic conditions. In contrast, SCR consistently outperforms them across all settings. This study offers new insights into the limitations of current knowledge editing methods and highlights the potential of context-based reasoning as a more robust alternative.

Title: Can LLMs Alleviate Catastrophic Forgetting in Graph Continual Learning? A Systematic Study

Authors: Ziyang Cheng, Zhixun Li, Yuhan Li, Yixin Song, Kangyi Zhao, Dawei Cheng, Jia Li, Jeffrey Xu Yu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18697
Pdf URL: https://arxiv.org/pdf/2505.18697
Copy Paste: [[2505.18697]] Can LLMs Alleviate Catastrophic Forgetting in Graph Continual Learning? A Systematic Study(https://arxiv.org/abs/2505.18697)
Keywords: large language model
Abstract: Nowadays, real-world data, including graph-structure data, often arrives in a streaming manner, which means that learning systems need to continuously acquire new knowledge without forgetting previously learned information. Although substantial existing works attempt to address catastrophic forgetting in graph machine learning, they are all based on training from scratch with streaming data. With the rise of pretrained models, an increasing number of studies have leveraged their strong generalization ability for continual learning. Therefore, in this work, we attempt to answer whether large language models (LLMs) can mitigate catastrophic forgetting in Graph Continual Learning (GCL). We first point out that current experimental setups for GCL have significant flaws, as the evaluation stage may lead to task ID leakage. Then, we evaluate the performance of LLMs in more realistic scenarios and find that even minor modifications can lead to outstanding results. Finally, based on extensive experiments, we propose a simple-yet-effective method, Simple Graph Continual Learning (SimGCL), that surpasses the previous state-of-the-art GNN-based baseline by around 20% under the rehearsal-free constraint. To facilitate reproducibility, we have developed an easy-to-use benchmark LLM4GCL for training and evaluating existing GCL methods. The code is available at: this https URL.

Title: MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention

Authors: Can Yaras, Alec S. Xu, Pierre Abillama, Changwoo Lee, Laura Balzano
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18698
Pdf URL: https://arxiv.org/pdf/2505.18698
Copy Paste: [[2505.18698]] MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention(https://arxiv.org/abs/2505.18698)
Keywords: transformer
Abstract: Transformers have achieved state-of-the-art performance across various tasks, but suffer from a notable quadratic complexity in sequence length due to the attention mechanism. In this work, we propose MonarchAttention -- a novel approach to sub-quadratic attention approximation via Monarch matrices, an expressive class of structured matrices. Based on the variational form of softmax, we describe an efficient optimization-based algorithm to compute an approximate projection of softmax attention onto the class of Monarch matrices with $\Theta(N\sqrt{N} d)$ computational complexity and $\Theta(Nd)$ memory/IO complexity. Unlike previous approaches, MonarchAttention is both (1) transferable, yielding minimal performance loss with no additional training, even when replacing every attention layer of the transformer, and (2) hardware-efficient, utilizing the highest-throughput tensor core units on modern GPUs. With optimized kernels, MonarchAttention achieves substantial speed-ups in wall-time over FlashAttention-2: $1.4\times$ for shorter sequences $(N=256)$, $4.5\times$ for medium-length sequences $(N=4K)$, and $8.2\times$ for longer sequences $(N=16K)$. We demonstrate the quality of MonarchAttention on diverse tasks and architectures in vision and language problems, showing that it flexibly and accurately approximates softmax attention in a variety of contexts. Our code is available at this https URL.

Title: GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains

Authors: Chun Wang, Xiaoran Pan, Zihao Pan, Haofan Wang, Yiren Song
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18700
Pdf URL: https://arxiv.org/pdf/2505.18700
Copy Paste: [[2505.18700]] GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains(https://arxiv.org/abs/2505.18700)
Keywords: robust, extraction, explainability
Abstract: Recent advances in Visual Language Models (VLMs) have demonstrated exceptional performance in visual reasoning tasks. However, geo-localization presents unique challenges, requiring the extraction of multigranular visual cues from images and their integration with external world knowledge for systematic reasoning. Current approaches to geo-localization tasks often lack robust reasoning mechanisms and explainability, limiting their effectiveness. To address these limitations, we propose the Geo Reason Enhancement (GRE) Suite, a novel framework that augments VLMs with structured reasoning chains for accurate and interpretable location inference. The GRE Suite is systematically developed across three key dimensions: dataset, model, and benchmark. First, we introduce GRE30K, a high-quality geo-localization reasoning dataset designed to facilitate fine-grained visual and contextual analysis. Next, we present the GRE model, which employs a multi-stage reasoning strategy to progressively infer scene attributes, local details, and semantic features, thereby narrowing down potential geographic regions with enhanced precision. Finally, we construct the Geo Reason Evaluation Benchmark (GREval-Bench), a comprehensive evaluation framework that assesses VLMs across diverse urban, natural, and landmark scenes to measure both coarse-grained (e.g., country, continent) and fine-grained (e.g., city, street) localization performance. Experimental results demonstrate that GRE significantly outperforms existing methods across all granularities of geo-localization tasks, underscoring the efficacy of reasoning-augmented VLMs in complex geographic inference. Code and data will be released at this https URL.

Title: Towards Semantic Integration of Opinions: Unified Opinion Concepts Ontology and Extraction Task

Authors: Gaurav Negi, Dhairya Dalal, Omnia Zayed, Paul Buitelaar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18703
Pdf URL: https://arxiv.org/pdf/2505.18703
Copy Paste: [[2505.18703]] Towards Semantic Integration of Opinions: Unified Opinion Concepts Ontology and Extraction Task(https://arxiv.org/abs/2505.18703)
Keywords: extraction, generative
Abstract: This paper introduces the Unified Opinion Concepts (UOC) ontology to integrate opinions within their semantic context. The UOC ontology bridges the gap between the semantic representation of opinion across different formulations. It is a unified conceptualisation based on the facets of opinions studied extensively in NLP and semantic structures described through symbolic descriptions. We further propose the Unified Opinion Concept Extraction (UOCE) task of extracting opinions from the text with enhanced expressivity. Additionally, we provide a manually extended and re-annotated evaluation dataset for this task and tailored evaluation metrics to assess the adherence of extracted opinions to UOC semantics. Finally, we establish baseline performance for the UOCE task using state-of-the-art generative models.

Title: Neural Parameter Search for Slimmer Fine-Tuned Models and Better Transfer

Authors: Guodong Du, Zitao Fang, Jing Li, Junlin Li, Runhua Jiang, Shuyang Yu, Yifei Guo, Yangneng Chen, Sim Kuan Goh, Ho-Kin Tang, Daojing He, Honghai Liu, Min Zhang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.18713
Pdf URL: https://arxiv.org/pdf/2505.18713
Copy Paste: [[2505.18713]] Neural Parameter Search for Slimmer Fine-Tuned Models and Better Transfer(https://arxiv.org/abs/2505.18713)
Keywords: robust
Abstract: Foundation models and their checkpoints have significantly advanced deep learning, boosting performance across various applications. However, fine-tuned models often struggle outside their specific domains and exhibit considerable redundancy. Recent studies suggest that combining a pruned fine-tuned model with the original pre-trained model can mitigate forgetting, reduce interference when merging model parameters across tasks, and improve compression efficiency. In this context, developing an effective pruning strategy for fine-tuned models is crucial. Leveraging the advantages of the task vector mechanism, we preprocess fine-tuned models by calculating the differences between them and the original model. Recognizing that different task vector subspaces contribute variably to model performance, we introduce a novel method called Neural Parameter Search (NPS-Pruning) for slimming down fine-tuned models. This method enhances pruning efficiency by searching through neural parameters of task vectors within low-rank subspaces. Our method has three key applications: enhancing knowledge transfer through pairwise model interpolation, facilitating effective knowledge fusion via model merging, and enabling the deployment of compressed models that retain near-original performance while significantly reducing storage costs. Extensive experiments across vision, NLP, and multi-modal benchmarks demonstrate the effectiveness and robustness of our approach, resulting in substantial performance gains. The code is publicly available at: this https URL.

Title: Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization

Authors: Meng Li, Guangda Huzhang, Haibo Zhang, Xiting Wang, Anxiang Zeng
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18720
Pdf URL: https://arxiv.org/pdf/2505.18720
Copy Paste: [[2505.18720]] Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization(https://arxiv.org/abs/2505.18720)
Keywords: interpretability, large language model
Abstract: Direct Preference Optimization (DPO) has emerged as a promising framework for aligning Large Language Models (LLMs) with human preferences by directly optimizing the log-likelihood difference between chosen and rejected responses. However, existing methods assign equal importance to all tokens in the response, while humans focus on more meaningful parts. This leads to suboptimal preference optimization, as irrelevant or noisy tokens disproportionately influence DPO loss. To address this limitation, we propose \textbf{O}ptimal \textbf{T}ransport-based token weighting scheme for enhancing direct \textbf{P}reference \textbf{O}ptimization (OTPO). By emphasizing semantically meaningful token pairs and de-emphasizing less relevant ones, our method introduces a context-aware token weighting scheme that yields a more contrastive reward difference estimate. This adaptive weighting enhances reward stability, improves interpretability, and ensures that preference optimization focuses on meaningful differences between responses. Extensive experiments have validated OTPO's effectiveness in improving instruction-following ability across various settings\footnote{Code is available at this https URL.}.

Title: LoTA-QAF: Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning

Authors: Junyu Chen, Junzhuo Li, Zhen Peng, Wenjie Wang, Yuxiang Ren, Long Shi, Xuming Hu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18724
Pdf URL: https://arxiv.org/pdf/2505.18724
Copy Paste: [[2505.18724]] LoTA-QAF: Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning(https://arxiv.org/abs/2505.18724)
Keywords: large language model
Abstract: Quantization and fine-tuning are crucial for deploying large language models (LLMs) on resource-constrained edge devices. However, fine-tuning quantized models presents significant challenges, primarily stemming from: First, the mismatch in data types between the low-precision quantized weights (e.g., 4-bit) and the high-precision adaptation weights (e.g., 16-bit). This mismatch limits the computational efficiency advantage offered by quantized weights during inference. Second, potential accuracy degradation when merging these high-precision adaptation weights into the low-precision quantized weights, as the adaptation weights often necessitate approximation or truncation. Third, as far as we know, no existing methods support the lossless merging of adaptation while adjusting all quantized weights. To address these challenges, we introduce lossless ternary adaptation for quantization-aware fine-tuning (LoTA-QAF). This is a novel fine-tuning method specifically designed for quantized LLMs, enabling the lossless merging of ternary adaptation weights into quantized weights and the adjustment of all quantized weights. LoTA-QAF operates through a combination of: i) A custom-designed ternary adaptation (TA) that aligns ternary weights with the quantization grid and uses these ternary weights to adjust quantized weights. ii) A TA-based mechanism that enables the lossless merging of adaptation weights. iii) Ternary signed gradient descent (t-SignSGD) for updating the TA weights. We apply LoTA-QAF to Llama-3.1/3.3 and Qwen-2.5 model families and validate its effectiveness on several downstream tasks. On the MMLU benchmark, our method effectively recovers performance for quantized models, surpassing 16-bit LoRA by up to 5.14\%. For task-specific fine-tuning, 16-bit LoRA achieves superior results, but LoTA-QAF still outperforms other methods.

Title: FusionTrack: End-to-End Multi-Object Tracking in Arbitrary Multi-View Environment

Authors: Xiaohe Li, Pengfei Li, Zide Fan, Ying Geng, Fangli Mou, Haohua Wu, Yunping Ge
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18727
Pdf URL: https://arxiv.org/pdf/2505.18727
Copy Paste: [[2505.18727]] FusionTrack: End-to-End Multi-Object Tracking in Arbitrary Multi-View Environment(https://arxiv.org/abs/2505.18727)
Keywords: robust
Abstract: Multi-view multi-object tracking (MVMOT) has found widespread applications in intelligent transportation, surveillance systems, and urban management. However, existing studies rarely address genuinely free-viewpoint MVMOT systems, which could significantly enhance the flexibility and scalability of cooperative tracking systems. To bridge this gap, we first construct the Multi-Drone Multi-Object Tracking (MDMOT) dataset, captured by mobile drone swarms across diverse real-world scenarios, initially establishing the first benchmark for multi-object tracking in arbitrary multi-view environment. Building upon this foundation, we propose \textbf{FusionTrack}, an end-to-end framework that reasonably integrates tracking and re-identification to leverage multi-view information for robust trajectory association. Extensive experiments on our MDMOT and other benchmark datasets demonstrate that FusionTrack achieves state-of-the-art performance in both single-view and multi-view tracking.

Title: Align Beyond Prompts: Evaluating World Knowledge Alignment in Text-to-Image Generation

Authors: Wenchao Zhang, Jiahe Tian, Runze He, Jizhong Han, Jiao Dai, Miaomiao Feng, Wei Mi, Xiaodan Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18730
Pdf URL: https://arxiv.org/pdf/2505.18730
Copy Paste: [[2505.18730]] Align Beyond Prompts: Evaluating World Knowledge Alignment in Text-to-Image Generation(https://arxiv.org/abs/2505.18730)
Keywords: large language model
Abstract: Recent text-to-image (T2I) generation models have advanced significantly, enabling the creation of high-fidelity images from textual prompts. However, existing evaluation benchmarks primarily focus on the explicit alignment between generated images and prompts, neglecting the alignment with real-world knowledge beyond prompts. To address this gap, we introduce Align Beyond Prompts (ABP), a comprehensive benchmark designed to measure the alignment of generated images with real-world knowledge that extends beyond the explicit user prompts. ABP comprises over 2,000 meticulously crafted prompts, covering real-world knowledge across six distinct scenarios. We further introduce ABPScore, a metric that utilizes existing Multimodal Large Language Models (MLLMs) to assess the alignment between generated images and world knowledge beyond prompts, which demonstrates strong correlations with human judgments. Through a comprehensive evaluation of 8 popular T2I models using ABP, we find that even state-of-the-art models, such as GPT-4o, face limitations in integrating simple real-world knowledge into generated images. To mitigate this issue, we introduce a training-free strategy within ABP, named Inference-Time Knowledge Injection (ITKI). By applying this strategy to optimize 200 challenging samples, we achieved an improvement of approximately 43% in ABPScore. The dataset and code are available in this https URL.

Title: Rethinking Direct Preference Optimization in Diffusion Models

Authors: Junyong Kang, Seohyun Lim, Kyungjune Baek, Hyunjung Shim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18736
Pdf URL: https://arxiv.org/pdf/2505.18736
Copy Paste: [[2505.18736]] Rethinking Direct Preference Optimization in Diffusion Models(https://arxiv.org/abs/2505.18736)
Keywords: diffusion, large language model
Abstract: Aligning text-to-image (T2I) diffusion models with human preferences has emerged as a critical research challenge. While recent advances in this area have extended preference optimization techniques from large language models (LLMs) to the diffusion setting, they often struggle with limited exploration. In this work, we propose a novel and orthogonal approach to enhancing diffusion-based preference optimization. First, we introduce a stable reference model update strategy that relaxes the frozen reference model, encouraging exploration while maintaining a stable optimization anchor through reference model regularization. Second, we present a timestep-aware training strategy that mitigates the reward scale imbalance problem across timesteps. Our method can be integrated into various preference optimization algorithms. Experimental results show that our approach improves the performance of state-of-the-art methods on human preference evaluation benchmarks.

Title: AuroRA: Breaking Low-Rank Bottleneck of LoRA with Nonlinear Mapping

Authors: Haonan Dong, Wenhao Zhu, Guojie Song, Liang Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18738
Pdf URL: https://arxiv.org/pdf/2505.18738
Copy Paste: [[2505.18738]] AuroRA: Breaking Low-Rank Bottleneck of LoRA with Nonlinear Mapping(https://arxiv.org/abs/2505.18738)
Keywords: robust
Abstract: Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning (PEFT) method validated across NLP and CV domains. However, LoRA faces an inherent low-rank bottleneck: narrowing its performance gap with full finetuning requires increasing the rank of its parameter matrix, resulting in significant parameter overhead. Recent linear LoRA variants have attempted to enhance expressiveness by introducing additional linear mappings; however, their composition remains inherently linear and fails to fundamentally improve LoRA's representational capacity. To address this limitation, we propose AuroRA, which incorporates an Adaptive Nonlinear Layer (ANL) between two linear projectors to capture fixed and learnable nonlinearities. This combination forms an MLP-like structure with a compressed rank, enabling flexible and precise approximation of diverse target functions while theoretically guaranteeing lower approximation errors and bounded gradients. Extensive experiments on 22 datasets and 6 pretrained models demonstrate that AuroRA: (I) not only matches or surpasses full fine-tuning performance with only 6.18% ~ 25% of LoRA's parameters but also (II) outperforms state-of-the-art PEFT methods by up to 10.88% in both NLP and CV tasks, and (III) exhibits robust performance across various rank configurations.

Title: LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Multi-Domain Reasoning Challenges

Authors: Tao Liu, Hongying Zan, Yifan Li, Dixuan Zhang, Lulu Kong, Haixin Liu, Jiaming Hou, Aoze Zheng, Rui Li, Yiming Qiao, Zewei Luo, Qi Wang, Zhiqiang Zhang, Jiaxi Li, Supeng Liu, Kunli Zhang, Min Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18744
Pdf URL: https://arxiv.org/pdf/2505.18744
Copy Paste: [[2505.18744]] LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Multi-Domain Reasoning Challenges(https://arxiv.org/abs/2505.18744)
Keywords: robust
Abstract: Text-to-SQL is a fundamental task in natural language processing that seeks to translate natural language questions into meaningful and executable SQL queries. While existing datasets are extensive and primarily focus on business scenarios and operational logic, they frequently lack coverage of domain-specific knowledge and complex mathematical reasoning. To address this gap, we present a novel dataset tailored for complex reasoning and chain-of-thought analysis in SQL inference, encompassing physical, arithmetic, commonsense, and hypothetical reasoning. The dataset consists of 4,038 English questions, each paired with a unique SQL query and accompanied by 12,114 step-by-step reasoning annotations, spanning 45 databases across diverse domains. Experimental results demonstrate that LogicCat substantially increases the difficulty for state-of-the-art models, with the highest execution accuracy reaching only 14.96%. Incorporating our chain-of-thought annotations boosts performance to 33.96%. Benchmarking leading public methods on Spider and BIRD further underscores the unique challenges presented by LogicCat, highlighting the significant opportunities for advancing research in robust, reasoning-driven text-to-SQL systems. We have released our dataset code at this https URL.

Title: Unifying Attention Heads and Task Vectors via Hidden State Geometry in In-Context Learning

Authors: Haolin Yang, Hakaze Cho, Yiqiao Zhong, Naoya Inoue
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18752
Pdf URL: https://arxiv.org/pdf/2505.18752
Copy Paste: [[2505.18752]] Unifying Attention Heads and Task Vectors via Hidden State Geometry in In-Context Learning(https://arxiv.org/abs/2505.18752)
Keywords: large language model
Abstract: The unusual properties of in-context learning (ICL) have prompted investigations into the internal mechanisms of large language models. Prior work typically focuses on either special attention heads or task vectors at specific layers, but lacks a unified framework linking these components to the evolution of hidden states across layers that ultimately produce the model's output. In this paper, we propose such a framework for ICL in classification tasks by analyzing two geometric factors that govern performance: the separability and alignment of query hidden states. A fine-grained analysis of layer-wise dynamics reveals a striking two-stage mechanism: separability emerges in early layers, while alignment develops in later layers. Ablation studies further show that Previous Token Heads drive separability, while Induction Heads and task vectors enhance alignment. Our findings thus bridge the gap between attention heads and task vectors, offering a unified account of ICL's underlying mechanisms.

Title: Few-Shot Optimization for Sensor Data Using Large Language Models: A Case Study on Fatigue Detection

Authors: Elsen Ronando, Sozo Inoue
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18754
Pdf URL: https://arxiv.org/pdf/2505.18754
Copy Paste: [[2505.18754]] Few-Shot Optimization for Sensor Data Using Large Language Models: A Case Study on Fatigue Detection(https://arxiv.org/abs/2505.18754)
Keywords: robust, large language model
Abstract: In this paper, we propose a novel few-shot optimization with HED-LM (Hybrid Euclidean Distance with Large Language Models) to improve example selection for sensor-based classification tasks. While few-shot prompting enables efficient inference with limited labeled data, its performance largely depends on the quality of selected examples. HED-LM addresses this challenge through a hybrid selection pipeline that filters candidate examples based on Euclidean distance and re-ranks them using contextual relevance scored by large language models (LLMs). To validate its effectiveness, we apply HED-LM to a fatigue detection task using accelerometer data characterized by overlapping patterns and high inter-subject variability. Unlike simpler tasks such as activity recognition, fatigue detection demands more nuanced example selection due to subtle differences in physiological signals. Our experiments show that HED-LM achieves a mean macro F1-score of 69.13$\pm$10.71%, outperforming both random selection (59.30$\pm$10.13%) and distance-only filtering (67.61$\pm$11.39%). These represent relative improvements of 16.6% and 2.3%, respectively. The results confirm that combining numerical similarity with contextual relevance improves the robustness of few-shot prompting. Overall, HED-LM offers a practical solution to improve performance in real-world sensor-based learning tasks and shows potential for broader applications in healthcare monitoring, human activity recognition, and industrial safety scenarios.

Title: Smart Energy Guardian: A Hybrid Deep Learning Model for Detecting Fraudulent PV Generation

Authors: Xiaolu Chen, Chenghao Huang, Yanru Zhang, Hao Wang
Subjects: cs.LG, cs.AI, eess.SP
Abstract URL: https://arxiv.org/abs/2505.18755
Pdf URL: https://arxiv.org/pdf/2505.18755
Copy Paste: [[2505.18755]] Smart Energy Guardian: A Hybrid Deep Learning Model for Detecting Fraudulent PV Generation(https://arxiv.org/abs/2505.18755)
Keywords: attack, robust, fair, transformer
Abstract: With the proliferation of smart grids, smart cities face growing challenges due to cyber-attacks and sophisticated electricity theft behaviors, particularly in residential photovoltaic (PV) generation systems. Traditional Electricity Theft Detection (ETD) methods often struggle to capture complex temporal dependencies and integrating multi-source data, limiting their effectiveness. In this work, we propose an efficient ETD method that accurately identifies fraudulent behaviors in residential PV generation, thus ensuring the supply-demand balance in smart cities. Our hybrid deep learning model, combining multi-scale Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Transformer, excels in capturing both short-term and long-term temporal dependencies. Additionally, we introduce a data embedding technique that seamlessly integrates time-series data with discrete temperature variables, enhancing detection robustness. Extensive simulation experiments using real-world data validate the effectiveness of our approach, demonstrating significant improvements in the accuracy of detecting sophisticated energy theft activities, thereby contributing to the stability and fairness of energy systems in smart cities.

Title: ToDRE: Visual Token Pruning via Diversity and Task Awareness for Efficient Large Vision-Language Models

Authors: Duo Li, Zuhao Yang, Shijian Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18757
Pdf URL: https://arxiv.org/pdf/2505.18757
Copy Paste: [[2505.18757]] ToDRE: Visual Token Pruning via Diversity and Task Awareness for Efficient Large Vision-Language Models(https://arxiv.org/abs/2505.18757)
Keywords: robust, large language model
Abstract: The representation of visual inputs of large vision-language models (LVLMs) usually involves substantially more tokens than that of textual inputs, leading to significant computational overhead. Several recent studies strive to mitigate this issue by either conducting token compression to prune redundant visual tokens or guiding them to bypass certain computational stages. While most existing work exploits token importance as the redundancy indicator, our study reveals that two largely neglected factors, namely, the diversity of retained visual tokens and their task relevance, often offer more robust criteria in token pruning. To this end, we design ToDRE, a two-stage and training-free token compression framework that achieves superior performance by pruning Tokens based on token Diversity and token-task RElevance. Instead of pruning redundant tokens, ToDRE introduces a greedy k-center algorithm to select and retain a small subset of diverse visual tokens after the vision encoder. Additionally, ToDRE addresses the "information migration" by further eliminating task-irrelevant visual tokens within the decoder of large language model (LLM). Extensive experiments show that ToDRE effectively reduces 90% of visual tokens after vision encoder and adaptively prunes all visual tokens within certain LLM's decoder layers, leading to a 2.6x speed-up in total inference time while maintaining 95.1% of model performance and excellent compatibility with efficient attention operators.

Title: ARMS: A Vision for Actor Reputation Metric Systems in the Open-Source Software Supply Chain

Authors: Kelechi G. Kalu, Sofia Okorafor, Betül Durak, Kim Laine, Radames C. Moreno, Santiago Torres-Arias, James C. Davis
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2505.18760
Pdf URL: https://arxiv.org/pdf/2505.18760
Copy Paste: [[2505.18760]] ARMS: A Vision for Actor Reputation Metric Systems in the Open-Source Software Supply Chain(https://arxiv.org/abs/2505.18760)
Keywords: security
Abstract: Many critical information technology and cyber-physical systems rely on a supply chain of open-source software projects. OSS project maintainers often integrate contributions from external actors. While maintainers can assess the correctness of a change request, assessing a change request's cybersecurity implications is challenging. To help maintainers make this decision, we propose that the open-source ecosystem should incorporate Actor Reputation Metrics (ARMS). This capability would enable OSS maintainers to assess a prospective contributor's cybersecurity reputation. To support the future instantiation of ARMS, we identify seven generic security signals from industry standards; map concrete metrics from prior work and available security tools, describe study designs to refine and assess the utility of ARMS, and finally weigh its pros and cons.

Title: How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark

Authors: Minglai Yang, Ethan Huang, Liang Zhang, Mihai Surdeanu, William Wang, Liangming Pan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18761
Pdf URL: https://arxiv.org/pdf/2505.18761
Copy Paste: [[2505.18761]] How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark(https://arxiv.org/abs/2505.18761)
Keywords: robust, large language model
Abstract: We introduce Grade School Math with Distracting Context (GSM-DC), a synthetic benchmark to evaluate Large Language Models' (LLMs) reasoning robustness against systematically controlled irrelevant context (IC). GSM-DC constructs symbolic reasoning graphs with precise distractor injections, enabling rigorous, reproducible evaluation. Our experiments demonstrate that LLMs are significantly sensitive to IC, affecting both reasoning path selection and arithmetic accuracy. Additionally, training models with strong distractors improves performance in both in-distribution and out-of-distribution scenarios. We further propose a stepwise tree search guided by a process reward model, which notably enhances robustness in out-of-distribution conditions.

Title: GenPO: Generative Diffusion Models Meet On-Policy Reinforcement Learning

Authors: Shutong Ding, Ke Hu, Shan Zhong, Haoyang Luo, Weinan Zhang, Jingya Wang, Jun Wang, Ye Shi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18763
Pdf URL: https://arxiv.org/pdf/2505.18763
Copy Paste: [[2505.18763]] GenPO: Generative Diffusion Models Meet On-Policy Reinforcement Learning(https://arxiv.org/abs/2505.18763)
Keywords: diffusion, generative
Abstract: Recent advances in reinforcement learning (RL) have demonstrated the powerful exploration capabilities and multimodality of generative diffusion-based policies. While substantial progress has been made in offline RL and off-policy RL settings, integrating diffusion policies into on-policy frameworks like PPO remains underexplored. This gap is particularly significant given the widespread use of large-scale parallel GPU-accelerated simulators, such as IsaacLab, which are optimized for on-policy RL algorithms and enable rapid training of complex robotic tasks. A key challenge lies in computing state-action log-likelihoods under diffusion policies, which is straightforward for Gaussian policies but intractable for flow-based models due to irreversible forward-reverse processes and discretization errors (e.g., Euler-Maruyama approximations). To bridge this gap, we propose GenPO, a generative policy optimization framework that leverages exact diffusion inversion to construct invertible action mappings. GenPO introduces a novel doubled dummy action mechanism that enables invertibility via alternating updates, resolving log-likelihood computation barriers. Furthermore, we also use the action log-likelihood for unbiased entropy and KL divergence estimation, enabling KL-adaptive learning rates and entropy regularization in on-policy updates. Extensive experiments on eight IsaacLab benchmarks, including legged locomotion (Ant, Humanoid, Anymal-D, Unitree H1, Go2), dexterous manipulation (Shadow Hand), aerial control (Quadcopter), and robotic arm tasks (Franka), demonstrate GenPO's superiority over existing RL baselines. Notably, GenPO is the first method to successfully integrate diffusion policies into on-policy RL, unlocking their potential for large-scale parallelized training and real-world robotic deployment.

Title: Multiple Wasserstein Gradient Descent Algorithm for Multi-Objective Distributional Optimization

Authors: Dai Hai Nguyen, Hiroshi Mamitsuka, Atsuyoshi Nakamura
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.18765
Pdf URL: https://arxiv.org/pdf/2505.18765
Copy Paste: [[2505.18765]] Multiple Wasserstein Gradient Descent Algorithm for Multi-Objective Distributional Optimization(https://arxiv.org/abs/2505.18765)
Keywords: generative
Abstract: We address the optimization problem of simultaneously minimizing multiple objective functionals over a family of probability distributions. This type of Multi-Objective Distributional Optimization commonly arises in machine learning and statistics, with applications in areas such as multiple target sampling, multi-task learning, and multi-objective generative modeling. To solve this problem, we propose an iterative particle-based algorithm, which we call Muliple Wasserstein Gradient Descent (MWGraD), which constructs a flow of intermediate empirical distributions, each being represented by a set of particles, which gradually minimize the multiple objective functionals simultaneously. Specifically, MWGraD consists of two key steps at each iteration. First, it estimates the Wasserstein gradient for each objective functional based on the current particles. Then, it aggregates these gradients into a single Wasserstein gradient using dynamically adjusted weights and updates the particles accordingly. In addition, we provide theoretical analysis and present experimental results on both synthetic and real-world datasets, demonstrating the effectiveness of MWGraD.

Title: StyleGuard: Preventing Text-to-Image-Model-based Style Mimicry Attacks by Style Perturbations

Authors: Yanjie Li, Wenxuan Zhang, Xinqi Lyu, Yihao Liu, Bin Xiao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18766
Pdf URL: https://arxiv.org/pdf/2505.18766
Copy Paste: [[2505.18766]] StyleGuard: Preventing Text-to-Image-Model-based Style Mimicry Attacks by Style Perturbations(https://arxiv.org/abs/2505.18766)
Keywords: protect, defense, attack, robust, diffusion
Abstract: Recently, text-to-image diffusion models have been widely used for style mimicry and personalized customization through methods such as DreamBooth and Textual Inversion. This has raised concerns about intellectual property protection and the generation of deceptive content. Recent studies, such as Glaze and Anti-DreamBooth, have proposed using adversarial noise to protect images from these attacks. However, recent purification-based methods, such as DiffPure and Noise Upscaling, have successfully attacked these latest defenses, showing the vulnerabilities of these methods. Moreover, present methods show limited transferability across models, making them less effective against unknown text-to-image models. To address these issues, we propose a novel anti-mimicry method, StyleGuard. We propose a novel style loss that optimizes the style-related features in the latent space to make it deviate from the original image, which improves model-agnostic transferability. Additionally, to enhance the perturbation's ability to bypass diffusion-based purification, we designed a novel upscale loss that involves ensemble purifiers and upscalers during training. Extensive experiments on the WikiArt and CelebA datasets demonstrate that StyleGuard outperforms existing methods in robustness against various transformations and purifications, effectively countering style mimicry in various models. Moreover, StyleGuard is effective on different style mimicry methods, including DreamBooth and Textual Inversion.

Title: Dual-Path Stable Soft Prompt Generation for Domain Generalization

Authors: Yuedi Zhang, Shuanghao Bai, Wanqi Zhou, Zhirong Luan, Badong Chen
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18770
Pdf URL: https://arxiv.org/pdf/2505.18770
Copy Paste: [[2505.18770]] Dual-Path Stable Soft Prompt Generation for Domain Generalization(https://arxiv.org/abs/2505.18770)
Keywords: robust, transformer
Abstract: Domain generalization (DG) aims to learn a model using data from one or multiple related but distinct source domains that can generalize well to unseen out-of-distribution target domains. Inspired by the success of large pre-trained vision-language models (VLMs), prompt tuning has emerged as an effective generalization strategy. However, it often struggles to capture domain-specific features due to its reliance on manually or fixed prompt inputs. Recently, some prompt generation methods have addressed this limitation by dynamically generating instance-specific and domain-specific prompts for each input, enriching domain information and demonstrating potential for enhanced generalization. Through further investigation, we identify a notable issue in existing prompt generation methods: the same input often yields significantly different and suboptimal prompts across different random seeds, a phenomenon we term Prompt Variability. To address this, we introduce negative learning into the prompt generation process and propose Dual-Path Stable Soft Prompt Generation (DPSPG), a transformer-based framework designed to improve both the stability and generalization of prompts. Specifically, DPSPG incorporates a complementary prompt generator to produce negative prompts, thereby reducing the risk of introducing misleading information. Both theoretical and empirical analyses demonstrate that negative learning leads to more robust and effective prompts by increasing the effective margin and reducing the upper bound of the gradient norm. Extensive experiments on five DG benchmark datasets show that DPSPG consistently outperforms state-of-the-art methods while maintaining prompt stability.

Title: Strong Membership Inference Attacks on Massive Datasets and (Moderately) Large Language Models

Authors: Jamie Hayes, Ilia Shumailov, Christopher A. Choquette-Choo, Matthew Jagielski, George Kaissis, Katherine Lee, Milad Nasr, Sahra Ghalebikesabi, Niloofar Mireshghallah, Meenatchi Sundaram Mutu Selva Annamalai, Igor Shilov, Matthieu Meeus, Yves-Alexandre de Montjoye, Franziska Boenisch, Adam Dziedzic, A. Feder Cooper
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18773
Pdf URL: https://arxiv.org/pdf/2505.18773
Copy Paste: [[2505.18773]] Strong Membership Inference Attacks on Massive Datasets and (Moderately) Large Language Models(https://arxiv.org/abs/2505.18773)
Keywords: privacy, attack, membership infer, large language model
Abstract: State-of-the-art membership inference attacks (MIAs) typically require training many reference models, making it difficult to scale these attacks to large pre-trained language models (LLMs). As a result, prior research has either relied on weaker attacks that avoid training reference models (e.g., fine-tuning attacks), or on stronger attacks applied to small-scale models and datasets. However, weaker attacks have been shown to be brittle - achieving close-to-arbitrary success - and insights from strong attacks in simplified settings do not translate to today's LLMs. These challenges have prompted an important question: are the limitations observed in prior work due to attack design choices, or are MIAs fundamentally ineffective on LLMs? We address this question by scaling LiRA - one of the strongest MIAs - to GPT-2 architectures ranging from 10M to 1B parameters, training reference models on over 20B tokens from the C4 dataset. Our results advance the understanding of MIAs on LLMs in three key ways: (1) strong MIAs can succeed on pre-trained LLMs; (2) their effectiveness, however, remains limited (e.g., AUC<0.7) in practical settings; and, (3) the relationship between MIA success and related privacy metrics is not as straightforward as prior work has suggested.

Title: Disentangling Knowledge Representations for Large Language Model Editing

Authors: Mengqi Zhang, Zisheng Zhou, Xiaotian Ye, Qiang Liu, Zhaochun Ren, Zhumin Chen, Pengjie Ren
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18774
Pdf URL: https://arxiv.org/pdf/2505.18774
Copy Paste: [[2505.18774]] Disentangling Knowledge Representations for Large Language Model Editing(https://arxiv.org/abs/2505.18774)
Keywords: large language model
Abstract: Knowledge Editing has emerged as a promising solution for efficiently updating embedded knowledge in large language models (LLMs). While existing approaches demonstrate effectiveness in integrating new knowledge and preserving the original capabilities of LLMs, they fail to maintain fine-grained irrelevant knowledge facts that share the same subject as edited knowledge but differ in relation and object. This challenge arises because subject representations inherently encode multiple attributes, causing the target and fine-grained irrelevant knowledge to become entangled in the representation space, and thus vulnerable to unintended alterations during editing. To address this, we propose DiKE, a novel approach that Disentangles Knowledge representations for LLM Editing (DiKE). DiKE consists of two key components: a Knowledge Representation Disentanglement (KRD) module that decomposes the subject representation into target-knowledgerelated and -unrelated components, and a Disentanglement-based Knowledge Edit (DKE) module that updates only the target-related component while explicitly preserving the unrelated one. We further derive a closed-form, rank-one parameter update based on matrix theory to enable efficient and minimally invasive edits. To rigorously evaluate fine-grained irrelevant knowledge preservation, we construct FINE-KED, a new benchmark comprising fine-grained irrelevant knowledge at different levels of relational similarity to the edited knowledge. Extensive experiments across multiple LLMs demonstrate that DiKE substantially improves fine-grained irrelevant knowledge preservation while maintaining competitive general editing performance.

Title: OmniGenBench: A Benchmark for Omnipotent Multimodal Generation across 50+ Tasks

Authors: Jiayu Wang, Yang Jiao, Yue Yu, Tianwen Qian, Shaoxiang Chen, Jingjing Chen, Yu-Gang Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18775
Pdf URL: https://arxiv.org/pdf/2505.18775
Copy Paste: [[2505.18775]] OmniGenBench: A Benchmark for Omnipotent Multimodal Generation across 50+ Tasks(https://arxiv.org/abs/2505.18775)
Keywords: generative
Abstract: Recent breakthroughs in large multimodal models (LMMs), such as the impressive GPT-4o-Native, have demonstrated remarkable proficiency in following general-purpose instructions for image generation. However, current benchmarks often lack the necessary breadth and depth to fully evaluate the diverse capabilities of these models. To overcome this limitation, we introduce OmniGenBench, a novel and comprehensive benchmark meticulously designed to assess the instruction-following abilities of state-of-the-art LMMs across both perception-centric and cognition-centric dimensions. Our OmniGenBench includes 57 diverse sub-tasks grounded in real-world scenarios, systematically categorized according to the specific model capabilities they demand. For rigorous evaluation, we further employ a dual-mode protocol. This protocol utilizes off-the-shelf visual parsing tools for perception-centric tasks and a powerful LLM-based judger for cognition-centric tasks to assess the alignment between generated images and user instructions. Using OmniGenBench, we evaluate mainstream generative models, including prevalent models like GPT-4o, Gemini-2.0-Flash, and Seedream, and provide in-depth comparisons and analyses of their this http URL and data are available at this https URL.

Title: HD-PiSSA: High-Rank Distributed Orthogonal Adaptation

Authors: Yiding Wang, Fauxu meng, Xuefeng Zhang, Fan Jiang, Pingzhi Tang, Muhan Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18777
Pdf URL: https://arxiv.org/pdf/2505.18777
Copy Paste: [[2505.18777]] HD-PiSSA: High-Rank Distributed Orthogonal Adaptation(https://arxiv.org/abs/2505.18777)
Keywords: large language model
Abstract: Existing parameter-efficient fine-tuning (PEFT) methods for large language models (LLMs), such as LoRA and PiSSA, constrain model updates to low-rank subspaces, limiting their expressiveness and leading to suboptimal performance on complex tasks. To address this, we introduce High-rank Distributed PiSSA (HD-PiSSA), a distributed PEFT approach that initializes orthogonal adapters across different devices and aggregates their delta updates collectively on W for fine-tuning. Unlike Data Parallel LoRA or PiSSA, which maintain identical adapters across all devices, HD-PiSSA assigns different principal components of the pre-trained weights to each GPU, significantly expanding the range of update directions. This results in over 16x higher effective updated ranks than data-parallel LoRA or PiSSA when fine-tuning on 8 GPUs with the same per-device adapter rank. Empirically, we evaluate HD-PiSSA across various challenging downstream tasks, including mathematics, code generation, and multi-task learning. In the multi-task setting, HD-PiSSA achieves average gains of 10.0 absolute points (14.63%) over LoRA and 4.98 points (6.60%) over PiSSA across 12 benchmarks, demonstrating its benefits from the extra optimization flexibility.

Title: Geometry Aware Operator Transformer as an Efficient and Accurate Neural Surrogate for PDEs on Arbitrary Domains

Authors: Shizheng Wen, Arsh Kumbhat, Levi Lingsch, Sepehr Mousavi, Praveen Chandrashekar, Siddhartha Mishra
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18781
Pdf URL: https://arxiv.org/pdf/2505.18781
Copy Paste: [[2505.18781]] Geometry Aware Operator Transformer as an Efficient and Accurate Neural Surrogate for PDEs on Arbitrary Domains(https://arxiv.org/abs/2505.18781)
Keywords: robust, transformer
Abstract: The very challenging task of learning solution operators of PDEs on arbitrary domains accurately and efficiently is of vital importance to engineering and industrial simulations. Despite the existence of many operator learning algorithms to approximate such PDEs, we find that accurate models are not necessarily computationally efficient and vice versa. We address this issue by proposing a geometry aware operator transformer (GAOT) for learning PDEs on arbitrary domains. GAOT combines novel multiscale attentional graph neural operator encoders and decoders, together with geometry embeddings and (vision) transformer processors to accurately map information about the domain and the inputs into a robust approximation of the PDE solution. Multiple innovations in the implementation of GAOT also ensure computational efficiency and scalability. We demonstrate this significant gain in both accuracy and efficiency of GAOT over several baselines on a large number of learning tasks from a diverse set of PDEs, including achieving state of the art performance on a large scale three-dimensional industrial CFD dataset.

Title: Soft Weighted Machine Unlearning

Authors: Xinbao Qiao, Ningning Ding, Yushi Cheng, Meng Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18783
Pdf URL: https://arxiv.org/pdf/2505.18783
Copy Paste: [[2505.18783]] Soft Weighted Machine Unlearning(https://arxiv.org/abs/2505.18783)
Keywords: privacy, robust, fair
Abstract: Machine unlearning, as a post-hoc processing technique, has gained widespread adoption in addressing challenges like bias mitigation and robustness enhancement, colloquially, machine unlearning for fairness and robustness. However, existing non-privacy unlearning-based solutions persist in using binary data removal framework designed for privacy-driven motivation, leading to significant information loss, a phenomenon known as over-unlearning. While over-unlearning has been largely described in many studies as primarily causing utility degradation, we investigate its fundamental causes and provide deeper insights in this work through counterfactual leave-one-out analysis. In this paper, we introduce a weighted influence function that assigns tailored weights to each sample by solving a convex quadratic programming problem analytically. Building on this, we propose a soft-weighted framework enabling fine-grained model adjustments to address the over-unlearning challenge. We demonstrate that the proposed soft-weighted scheme is versatile and can be seamlessly integrated into most existing unlearning algorithms. Extensive experiments show that in fairness- and robustness-driven tasks, the soft-weighted scheme significantly outperforms hard-weighted schemes in fairness/robustness metrics and alleviates the decline in utility metric, thereby enhancing machine unlearning algorithm as an effective correction solution.

Title: Leveraging Per-Instance Privacy for Machine Unlearning

Authors: Nazanin Mohammadi Sepahvand, Anvith Thudi, Berivan Isik, Ashmita Bhattacharyya, Nicolas Papernot, Eleni Triantafillou, Daniel M. Roy, Gintare Karolina Dziugaite
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18786
Pdf URL: https://arxiv.org/pdf/2505.18786
Copy Paste: [[2505.18786]] Leveraging Per-Instance Privacy for Machine Unlearning(https://arxiv.org/abs/2505.18786)
Keywords: privacy
Abstract: We present a principled, per-instance approach to quantifying the difficulty of unlearning via fine-tuning. We begin by sharpening an analysis of noisy gradient descent for unlearning (Chien et al., 2024), obtaining a better utility-unlearning tradeoff by replacing worst-case privacy loss bounds with per-instance privacy losses (Thudi et al., 2024), each of which bounds the (Renyi) divergence to retraining without an individual data point. To demonstrate the practical applicability of our theory, we present empirical results showing that our theoretical predictions are born out both for Stochastic Gradient Langevin Dynamics (SGLD) as well as for standard fine-tuning without explicit noise. We further demonstrate that per-instance privacy losses correlate well with several existing data difficulty metrics, while also identifying harder groups of data points, and introduce novel evaluation methods based on loss barriers. All together, our findings provide a foundation for more efficient and adaptive unlearning strategies tailored to the unique properties of individual data points.

Title: ALPS: Attention Localization and Pruning Strategy for Efficient Alignment of Large Language Models

Authors: Hao Chen, Haoze Li, Zhiqing Xiao, Lirong Gao, Qi Zhang, Xiaomeng Hu, Ningtao Wang, Xing Fu, Junbo Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18799
Pdf URL: https://arxiv.org/pdf/2505.18799
Copy Paste: [[2505.18799]] ALPS: Attention Localization and Pruning Strategy for Efficient Alignment of Large Language Models(https://arxiv.org/abs/2505.18799)
Keywords: large language model
Abstract: Aligning general-purpose large language models (LLMs) to downstream tasks often incurs significant costs, including constructing task-specific instruction pairs and extensive training adjustments. Prior research has explored various avenues to enhance alignment efficiency, primarily through minimal-data training or data-driven activations to identify key attention heads. However, these approaches inherently introduce data dependency, which hinders generalization and reusability. To address this issue and enhance model alignment efficiency, we propose the \textit{\textbf{A}ttention \textbf{L}ocalization and \textbf{P}runing \textbf{S}trategy (\textbf{ALPS})}, an efficient algorithm that localizes the most task-sensitive attention heads and prunes by restricting attention training updates to these heads, thereby reducing alignment costs. Experimental results demonstrate that our method activates only \textbf{10\%} of attention parameters during fine-tuning while achieving a \textbf{2\%} performance improvement over baselines on three tasks. Moreover, the identified task-specific heads are transferable across datasets and mitigate knowledge forgetting. Our work and findings provide a novel perspective on efficient LLM alignment.

Title: Mal-D2GAN: Double-Detector based GAN for Malware Generation

Authors: Nam Hoang Thanh, Trung Pham Duy, Lam Bui Thu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.18806
Pdf URL: https://arxiv.org/pdf/2505.18806
Copy Paste: [[2505.18806]] Mal-D2GAN: Double-Detector based GAN for Malware Generation(https://arxiv.org/abs/2505.18806)
Keywords: attack, robust
Abstract: Machine learning (ML) has been developed to detect malware in recent years. Most researchers focused their efforts on improving the detection performance but ignored the robustness of the ML models. In addition, many machine learning algorithms are very vulnerable to intentional attacks. To solve these problems, adversarial malware examples are generated by GANs to enhance the robustness of the malware detector. However, since current GAN models suffer from limitations such as unstable training and weak adversarial examples, we propose the Mal-D2GAN model to address these problems. Specifically, the Mal-D2GAN architecture was designed with double-detector and a least square loss function and tested on a dataset of 20,000 samples. The results show that the Mal-D2GAN model reduced the detection accuracy (true positive rate) in 8 malware detectors. The performance was then compared with that of the existing MalGAN and Mal- LSGAN models.

Title: VORTA: Efficient Video Diffusion via Routing Sparse Attention

Authors: Wenhao Sun, Rong-Cheng Tu, Yifu Ding, Zhao Jin, Jingyi Liao, Shunyu Liu, Dacheng Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18809
Pdf URL: https://arxiv.org/pdf/2505.18809
Copy Paste: [[2505.18809]] VORTA: Efficient Video Diffusion via Routing Sparse Attention(https://arxiv.org/abs/2505.18809)
Keywords: diffusion, transformer
Abstract: Video Diffusion Transformers (VDiTs) have achieved remarkable progress in high-quality video generation, but remain computationally expensive due to the quadratic complexity of attention over high-dimensional video sequences. Recent attention acceleration methods leverage the sparsity of attention patterns to improve efficiency; however, they often overlook inefficiencies of redundant long-range interactions. To address this problem, we propose \textbf{VORTA}, an acceleration framework with two novel components: 1) a sparse attention mechanism that efficiently captures long-range dependencies, and 2) a routing strategy that adaptively replaces full 3D attention with specialized sparse attention variants throughout the sampling process. It achieves a $1.76\times$ end-to-end speedup without quality loss on VBench. Furthermore, VORTA can seamlessly integrate with various other acceleration methods, such as caching and step distillation, reaching up to $14.41\times$ speedup with negligible performance degradation. VORTA demonstrates its efficiency and enhances the practicality of VDiTs in real-world settings.

Title: SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models

Authors: Ye Sun, Hao Zhang, Henghui Ding, Tiehua Zhang, Xingjun Ma, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18812
Pdf URL: https://arxiv.org/pdf/2505.18812
Copy Paste: [[2505.18812]] SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models(https://arxiv.org/abs/2505.18812)
Keywords: large language model
Abstract: Achieving fine-grained spatio-temporal understanding in videos remains a major challenge for current Video Large Multimodal Models (Video LMMs). Addressing this challenge requires mastering two core capabilities: video referring understanding, which captures the semantics of video regions, and video grounding, which segments object regions based on natural language descriptions. However, most existing approaches tackle these tasks in isolation, limiting progress toward unified, referentially grounded video interaction. We identify a key bottleneck in the lack of high-quality, unified video instruction data and a comprehensive benchmark for evaluating referentially grounded video chat. To address these challenges, we contribute in three core aspects: dataset, model, and benchmark. First, we introduce SAMA-239K, a large-scale dataset comprising 15K videos specifically curated to enable joint learning of video referring understanding, grounding, and multi-turn video chat. Second, we propose the SAMA model, which incorporates a versatile spatio-temporal context aggregator and a Segment Anything Model to jointly enhance fine-grained video comprehension and precise grounding capabilities. Finally, we establish SAMA-Bench, a meticulously designed benchmark consisting of 5,067 questions from 522 videos, to comprehensively evaluate the integrated capabilities of Video LMMs in multi-turn, spatio-temporal referring understanding and grounded dialogue. Extensive experiments and benchmarking results show that SAMA not only achieves strong performance on SAMA-Bench but also sets a new state-of-the-art on general grounding benchmarks, while maintaining highly competitive performance on standard visual understanding benchmarks.

Title: Usability of Token-based and Remote Electronic Signatures: A User Experience Study

Authors: Omer Ege, Mustafa Cagal, Kemal Bicakci
Subjects: cs.CR, cs.HC
Abstract URL: https://arxiv.org/abs/2505.18814
Pdf URL: https://arxiv.org/pdf/2505.18814
Copy Paste: [[2505.18814]] Usability of Token-based and Remote Electronic Signatures: A User Experience Study(https://arxiv.org/abs/2505.18814)
Keywords: secure, security, protect
Abstract: As electronic signatures (e-signatures) become increasingly integral to secure digital transactions, understanding their usability and security perception from an end-user perspective has become crucial. This study empirically evaluates and compares two major e-signature systems -- token-based and remote signatures -- through a controlled user experience study with 20 participants. Participants completed tasks involving acquisition, installation, and document signing using both methods, followed by structured surveys and qualitative feedback. Statistical analyses revealed that remote e-signatures were perceived as significantly more usable than token-based ones, due to their minimal setup and platform-independent accessibility. In contrast, token-based signatures were rated as significantly more secure, highlighting users' trust in hardware-based protection. Although more participants preferred remote e-signatures for document signing, the preference did not reach statistical significance, indicating a trend toward favoring convenience in real-world scenarios. These findings underline the fundamental trade-off between usability and perceived security in digital signing systems. By bridging the gap between theoretical frameworks and real user experience, this study contributes valuable insights to the design and policymaking of qualified electronic signature solutions.

Title: Reasoning Segmentation for Images and Videos: A Survey

Authors: Yiqing Shen, Chenjia Li, Fei Xiong, Jeong-O Jeong, Tianpeng Wang, Michael Latman, Mathias Unberath
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18816
Pdf URL: https://arxiv.org/pdf/2505.18816
Copy Paste: [[2505.18816]] Reasoning Segmentation for Images and Videos: A Survey(https://arxiv.org/abs/2505.18816)
Keywords: segmentation
Abstract: Reasoning Segmentation (RS) aims to delineate objects based on implicit text queries, the interpretation of which requires reasoning and knowledge integration. Unlike the traditional formulation of segmentation problems that relies on fixed semantic categories or explicit prompting, RS bridges the gap between visual perception and human-like reasoning capabilities, facilitating more intuitive human-AI interaction through natural language. Our work presents the first comprehensive survey of RS for image and video processing, examining 26 state-of-the-art methods together with a review of the corresponding evaluation metrics, as well as 29 datasets and benchmarks. We also explore existing applications of RS across diverse domains and identify their potential extensions. Finally, we identify current research gaps and highlight promising future directions.

Title: MSLAU-Net: A Hybird CNN-Transformer Network for Medical Image Segmentation

Authors: Libin Lan, Yanxin Li, Xiaojuan Liu, Juan Zhou, Jianxun Zhang, Nannan Huang, Yudong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18823
Pdf URL: https://arxiv.org/pdf/2505.18823
Copy Paste: [[2505.18823]] MSLAU-Net: A Hybird CNN-Transformer Network for Medical Image Segmentation(https://arxiv.org/abs/2505.18823)
Keywords: robust, transformer, segmentation
Abstract: Both CNN-based and Transformer-based methods have achieved remarkable success in medical image segmentation tasks. However, CNN-based methods struggle to effectively capture global contextual information due to the inherent limitations of convolution operations. Meanwhile, Transformer-based methods suffer from insufficient local feature modeling and face challenges related to the high computational complexity caused by the self-attention mechanism. To address these limitations, we propose a novel hybrid CNN-Transformer architecture, named MSLAU-Net, which integrates the strengths of both paradigms. The proposed MSLAU-Net incorporates two key ideas. First, it introduces Multi-Scale Linear Attention, designed to efficiently extract multi-scale features from medical images while modeling long-range dependencies with low computational complexity. Second, it adopts a top-down feature aggregation mechanism, which performs multi-level feature aggregation and restores spatial resolution using a lightweight structure. Extensive experiments conducted on benchmark datasets covering three imaging modalities demonstrate that the proposed MSLAU-Net outperforms other state-of-the-art methods on nearly all evaluation metrics, validating the superiority, effectiveness, and robustness of our approach. Our code is available at this https URL.

Title: How to build a consistency model: Learning flow maps via self-distillation

Authors: Nicholas M. Boffi, Michael S. Albergo, Eric Vanden-Eijnden
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.18825
Pdf URL: https://arxiv.org/pdf/2505.18825
Copy Paste: [[2505.18825]] How to build a consistency model: Learning flow maps via self-distillation(https://arxiv.org/abs/2505.18825)
Keywords: diffusion, generative
Abstract: Building on the framework proposed in Boffi et al. (2024), we present a systematic approach for learning flow maps associated with flow and diffusion models. Flow map-based models, commonly known as consistency models, encompass recent efforts to improve the efficiency of generative models based on solutions to differential equations. By exploiting a relationship between the velocity field underlying a continuous-time flow and the instantaneous rate of change of the flow map, we show how to convert existing distillation schemes into direct training algorithms via self-distillation, eliminating the need for pre-trained models. We empirically evaluate several instantiations of our framework, finding that high-dimensional tasks like image synthesis benefit from objective functions that avoid temporal and spatial derivatives of the flow map, while lower-dimensional tasks can benefit from objectives incorporating higher-order derivatives to capture sharp features.

Title: On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization

Authors: Wenlong Deng, Yi Ren, Muchen Li, Danica J. Sutherland, Xiaoxiao Li, Christos Thrampoulidis
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2505.18830
Pdf URL: https://arxiv.org/pdf/2505.18830
Copy Paste: [[2505.18830]] On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization(https://arxiv.org/abs/2505.18830)
Keywords: large language model
Abstract: Reinforcement learning (RL) has become popular in enhancing the reasoning capabilities of large language models (LLMs), with Group Relative Policy Optimization (GRPO) emerging as a widely used algorithm in recent systems. Despite GRPO's widespread adoption, we identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses marginally increases or even decreases during training. This behavior mirrors a recently discovered misalignment issue in Direct Preference Optimization (DPO), attributed to the influence of negative gradients. We provide a theoretical analysis of GRPO's learning dynamic, identifying the source of LLD as the naive penalization of all tokens in incorrect responses with the same strength. To address this, we develop a method called NTHR, which downweights penalties on tokens contributing to the LLD. Unlike prior DPO-based approaches, NTHR takes advantage of GRPO's group-based structure, using correct responses as anchors to identify influential tokens. Experiments on math reasoning benchmarks demonstrate that NTHR effectively mitigates LLD, yielding consistent performance gains across models ranging from 0.5B to 3B parameters.

Title: Localizing Knowledge in Diffusion Transformers

Authors: Arman Zarei, Samyadeep Basu, Keivan Rezaei, Zihao Lin, Sayan Nag, Soheil Feizi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18832
Pdf URL: https://arxiv.org/pdf/2505.18832
Copy Paste: [[2505.18832]] Localizing Knowledge in Diffusion Transformers(https://arxiv.org/abs/2505.18832)
Keywords: interpretability, diffusion, transformer, generative
Abstract: Understanding how knowledge is distributed across the layers of generative models is crucial for improving interpretability, controllability, and adaptation. While prior work has explored knowledge localization in UNet-based architectures, Diffusion Transformer (DiT)-based models remain underexplored in this context. In this paper, we propose a model- and knowledge-agnostic method to localize where specific types of knowledge are encoded within the DiT blocks. We evaluate our method on state-of-the-art DiT-based models, including PixArt-alpha, FLUX, and SANA, across six diverse knowledge categories. We show that the identified blocks are both interpretable and causally linked to the expression of knowledge in generated outputs. Building on these insights, we apply our localization framework to two key applications: model personalization and knowledge unlearning. In both settings, our localized fine-tuning approach enables efficient and targeted updates, reducing computational cost, improving task-specific performance, and better preserving general model behavior with minimal interference to unrelated or surrounding content. Overall, our findings offer new insights into the internal structure of DiTs and introduce a practical pathway for more interpretable, efficient, and controllable model editing.

Title: Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation

Authors: Jiwan Chung, Junhyeok Kim, Siyeol Kim, Jaeyoung Lee, Min Soo Kim, Youngjae Yu
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.18842
Pdf URL: https://arxiv.org/pdf/2505.18842
Copy Paste: [[2505.18842]] Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation(https://arxiv.org/abs/2505.18842)
Keywords: large language model
Abstract: We present v1, a lightweight extension to Multimodal Large Language Models (MLLMs) that enables selective visual revisitation during inference. While current MLLMs typically consume visual input only once and reason purely over internal memory, v1 introduces a simple point-and-copy mechanism that allows the model to dynamically retrieve relevant image regions throughout the reasoning process. This mechanism augments existing architectures with minimal modifications, enabling contextual access to visual tokens based on the model's evolving hypotheses. To train this capability, we construct v1g, a dataset of 300K multimodal reasoning traces with interleaved visual grounding annotations. Experiments on three multimodal mathematical reasoning benchmarks -- MathVista, MathVision, and MathVerse -- demonstrate that v1 consistently improves performance over comparable baselines, particularly on tasks requiring fine-grained visual reference and multi-step reasoning. Our results suggest that dynamic visual access is a promising direction for enhancing grounded multimodal reasoning. Code, models, and data will be released to support future research.

Title: Multi-Party Conversational Agents: A Survey

Authors: Sagar Sapkota, Mohammad Saqib Hasan, Mubarak Shah, Santu Karmaker
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18845
Pdf URL: https://arxiv.org/pdf/2505.18845
Copy Paste: [[2505.18845]] Multi-Party Conversational Agents: A Survey(https://arxiv.org/abs/2505.18845)
Keywords: large language model
Abstract: Multi-party Conversational Agents (MPCAs) are systems designed to engage in dialogue with more than two participants simultaneously. Unlike traditional two-party agents, designing MPCAs faces additional challenges due to the need to interpret both utterance semantics and social dynamics. This survey explores recent progress in MPCAs by addressing three key questions: 1) Can agents model each participants' mental states? (State of Mind Modeling); 2) Can they properly understand the dialogue content? (Semantic Understanding); and 3) Can they reason about and predict future conversation flow? (Agent Action Modeling). We review methods ranging from classical machine learning to Large Language Models (LLMs) and multi-modal systems. Our analysis underscores Theory of Mind (ToM) as essential for building intelligent MPCAs and highlights multi-modal understanding as a promising yet underexplored direction. Finally, this survey offers guidance to future researchers on developing more capable MPCAs.

Title: LLM-Driven APT Detection for 6G Wireless Networks: A Systematic Review and Taxonomy

Authors: Muhammed Golec, Yaser Khamayseh, Suhib Bani Melhem, Abdulmalik Alwarafy
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.18846
Pdf URL: https://arxiv.org/pdf/2505.18846
Copy Paste: [[2505.18846]] LLM-Driven APT Detection for 6G Wireless Networks: A Systematic Review and Taxonomy(https://arxiv.org/abs/2505.18846)
Keywords: steal, explainability, large language model
Abstract: Sixth Generation (6G) wireless networks, which are expected to be deployed in the 2030s, have already created great excitement in academia and the private sector with their extremely high communication speed and low latency rates. However, despite the ultra-low latency, high throughput, and AI-assisted orchestration capabilities they promise, they are vulnerable to stealthy and long-term Advanced Persistent Threats (APTs). Large Language Models (LLMs) stand out as an ideal candidate to fill this gap with their high success in semantic reasoning and threat intelligence. In this paper, we present a comprehensive systematic review and taxonomy study for LLM-assisted APT detection in 6G networks. We address five research questions, namely, semantic merging of fragmented logs, encrypted traffic analysis, edge distribution constraints, dataset/modeling techniques, and reproducibility trends, by leveraging most recent studies on the intersection of LLMs, APTs, and 6G wireless networks. We identify open challenges such as explainability gaps, data scarcity, edge hardware limitations, and the need for real-time slicing-aware adaptation by presenting various taxonomies such as granularity, deployment models, and kill chain stages. We then conclude the paper by providing several research gaps in 6G infrastructures for future researchers. To the best of our knowledge, this paper is the first comprehensive systematic review and classification study on LLM-based APT detection in 6G networks.

Title: Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation

Authors: Alexander Shabalin, Viacheslav Meshchaninov, Dmitry Vetrov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18853
Pdf URL: https://arxiv.org/pdf/2505.18853
Copy Paste: [[2505.18853]] Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation(https://arxiv.org/abs/2505.18853)
Keywords: diffusion
Abstract: Diffusion models have achieved state-of-the-art performance in generating images, audio, and video, but their adaptation to text remains challenging due to its discrete nature. Prior approaches either apply Gaussian diffusion in continuous latent spaces, which inherits semantic structure but struggles with token decoding, or operate in categorical simplex space, which respect discreteness but disregard semantic relation between tokens. In this paper, we propose Smoothing Diffusion on Token Embeddings (Smoothie), a novel diffusion method that combines the strengths of both approaches by progressively smoothing token embeddings based on semantic similarity. This technique enables gradual information removal while maintaining a natural decoding process. Experimental results on several sequence-to-sequence generation tasks demonstrate that Smoothie outperforms existing diffusion-based models in generation quality. Furthermore, ablation studies show that our proposed diffusion space yields better performance than both the standard embedding space and the categorical simplex. Our code is available at this https URL.

Title: Writing Like the Best: Exemplar-Based Expository Text Generation

Authors: Yuxiang Liu, Kevin Chen-Chuan Chang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18859
Pdf URL: https://arxiv.org/pdf/2505.18859
Copy Paste: [[2505.18859]] Writing Like the Best: Exemplar-Based Expository Text Generation(https://arxiv.org/abs/2505.18859)
Keywords: large language model
Abstract: We introduce the Exemplar-Based Expository Text Generation task, aiming to generate an expository text on a new topic using an exemplar on a similar topic. Current methods fall short due to their reliance on extensive exemplar data, difficulty in adapting topic-specific content, and issues with long-text coherence. To address these challenges, we propose the concept of Adaptive Imitation and present a novel Recurrent Plan-then-Adapt (RePA) framework. RePA leverages large language models (LLMs) for effective adaptive imitation through a fine-grained plan-then-adapt process. RePA also enables recurrent segment-by-segment imitation, supported by two memory structures that enhance input clarity and output coherence. We also develop task-specific evaluation metrics--imitativeness, adaptiveness, and adaptive-imitativeness--using LLMs as evaluators. Experimental results across our collected three diverse datasets demonstrate that RePA surpasses existing baselines in producing factual, consistent, and relevant texts for this task.

Title: Securing Credit Inquiries: The Role of Real-Time User Approval in Preventing SSN Identity Theft

Authors: Gogulakrishnan Thiyagarajan, Vinay Bist, Prabhudarshi Nayak
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.18861
Pdf URL: https://arxiv.org/pdf/2505.18861
Copy Paste: [[2505.18861]] Securing Credit Inquiries: The Role of Real-Time User Approval in Preventing SSN Identity Theft(https://arxiv.org/abs/2505.18861)
Keywords: security, protect, fair
Abstract: Unauthorized credit inquiries are also a central entry point for identity theft, with Social Security Numbers (SSNs) being widely utilized in fraudulent cases. Traditional credit inquiry systems do not usually possess strict user authentication, making them vulnerable to unauthorized access. This paper proposes a real-time user authorization system to enhance security by enforcing explicit user approval before processing any credit inquiry. The system employs real-time verification and approval techniques. This ensures that the authorized user only approves or rejects a credit check request. It minimizes the risks of interference by third parties. Apart from enhancing security, this system complies with regulations like the General Data Protection Regulation (GDPR) and the Fair Credit Reporting Act (FCRA) while maintaining a seamless user experience. This article discusses the technical issues, scaling-up issues, and ways of implementing real-time user authorization in financial systems. Through this framework, financial institutions can drastically minimize the risk of identity theft, avert unauthorized credit checks, and increase customer trust in the credit verification system.

Title: Audio Jailbreak Attacks: Exposing Vulnerabilities in SpeechGPT in a White-Box Framework

Authors: Binhao Ma, Hanqing Guo, Zhengping Jay Luo, Rui Duan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18864
Pdf URL: https://arxiv.org/pdf/2505.18864
Copy Paste: [[2505.18864]] Audio Jailbreak Attacks: Exposing Vulnerabilities in SpeechGPT in a White-Box Framework(https://arxiv.org/abs/2505.18864)
Keywords: security, defense, attack, robust, large language model
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced the naturalness and flexibility of human computer interaction by enabling seamless understanding across text, vision, and audio modalities. Among these, voice enabled models such as SpeechGPT have demonstrated considerable improvements in usability, offering expressive, and emotionally responsive interactions that foster deeper connections in real world communication scenarios. However, the use of voice introduces new security risks, as attackers can exploit the unique characteristics of spoken language, such as timing, pronunciation variability, and speech to text translation, to craft inputs that bypass defenses in ways not seen in text-based systems. Despite substantial research on text based jailbreaks, the voice modality remains largely underexplored in terms of both attack strategies and defense mechanisms. In this work, we present an adversarial attack targeting the speech input of aligned MLLMs in a white box scenario. Specifically, we introduce a novel token level attack that leverages access to the model's speech tokenization to generate adversarial token sequences. These sequences are then synthesized into audio prompts, which effectively bypass alignment safeguards and to induce prohibited outputs. Evaluated on SpeechGPT, our approach achieves up to 89 percent attack success rate across multiple restricted tasks, significantly outperforming existing voice based jailbreak methods. Our findings shed light on the vulnerabilities of voice-enabled multimodal systems and to help guide the development of more robust next-generation MLLMs.

Title: Distribution-Aware Mobility-Assisted Decentralized Federated Learning

Authors: Md Farhamdur Reza, Reza Jahani, Richeng Jin, Huaiyu Dai
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18866
Pdf URL: https://arxiv.org/pdf/2505.18866
Copy Paste: [[2505.18866]] Distribution-Aware Mobility-Assisted Decentralized Federated Learning(https://arxiv.org/abs/2505.18866)
Keywords: federate
Abstract: Decentralized federated learning (DFL) has attracted significant attention due to its scalability and independence from a central server. In practice, some participating clients can be mobile, yet the impact of user mobility on DFL performance remains largely unexplored, despite its potential to facilitate communication and model convergence. In this work, we demonstrate that introducing a small fraction of mobile clients, even with random movement, can significantly improve the accuracy of DFL by facilitating information flow. To further enhance performance, we propose novel distribution-aware mobility patterns, where mobile clients strategically navigate the network, leveraging knowledge of data distributions and static client locations. The proposed moving strategies mitigate the impact of data heterogeneity and boost learning convergence. Extensive experiments validate the effectiveness of induced mobility in DFL and demonstrate the superiority of our proposed mobility patterns over random movement.

Title: Sci-LoRA: Mixture of Scientific LoRAs for Cross-Domain Lay Paraphrasing

Authors: Ming Cheng, Jiaying Gong, Hoda Eldardiry
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18867
Pdf URL: https://arxiv.org/pdf/2505.18867
Copy Paste: [[2505.18867]] Sci-LoRA: Mixture of Scientific LoRAs for Cross-Domain Lay Paraphrasing(https://arxiv.org/abs/2505.18867)
Keywords: large language model
Abstract: Lay paraphrasing aims to make scientific information accessible to audiences without technical backgrounds. However, most existing studies focus on a single domain, such as biomedicine. With the rise of interdisciplinary research, it is increasingly necessary to comprehend knowledge spanning multiple technical fields. To address this, we propose Sci-LoRA, a model that leverages a mixture of LoRAs fine-tuned on multiple scientific domains. In particular, Sci-LoRA dynamically generates and applies weights for each LoRA, enabling it to adjust the impact of different domains based on the input text, without requiring explicit domain labels. To balance domain-specific knowledge and generalization across various domains, Sci-LoRA integrates information at both the data and model levels. This dynamic fusion enhances the adaptability and performance across various domains. Experimental results across twelve domains on five public datasets show that Sci-LoRA significantly outperforms state-of-the-art large language models and demonstrates flexible generalization and adaptability in cross-domain lay paraphrasing.

Title: Eye-See-You: Reverse Pass-Through VR and Head Avatars

Authors: Ankan Dash, Jingyi Gu, Guiling Wang, Chen Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18869
Pdf URL: https://arxiv.org/pdf/2505.18869
Copy Paste: [[2505.18869]] Eye-See-You: Reverse Pass-Through VR and Head Avatars(https://arxiv.org/abs/2505.18869)
Keywords: robust, generative
Abstract: Virtual Reality (VR) headsets, while integral to the evolving digital ecosystem, present a critical challenge: the occlusion of users' eyes and portions of their faces, which hinders visual communication and may contribute to social isolation. To address this, we introduce RevAvatar, an innovative framework that leverages AI methodologies to enable reverse pass-through technology, fundamentally transforming VR headset design and interaction paradigms. RevAvatar integrates state-of-the-art generative models and multimodal AI techniques to reconstruct high-fidelity 2D facial images and generate accurate 3D head avatars from partially observed eye and lower-face regions. This framework represents a significant advancement in AI4Tech by enabling seamless interaction between virtual and physical environments, fostering immersive experiences such as VR meetings and social engagements. Additionally, we present VR-Face, a novel dataset comprising 200,000 samples designed to emulate diverse VR-specific conditions, including occlusions, lighting variations, and distortions. By addressing fundamental limitations in current VR systems, RevAvatar exemplifies the transformative synergy between AI and next-generation technologies, offering a robust platform for enhancing human connection and interaction in virtual environments.

Title: Understanding the Relationship Between Personal Data Privacy Literacy and Data Privacy Information Sharing by University Students

Authors: Brady D. Lund, Bryan Anderson, Ana Roeschley, Gahangir Hossain
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.18870
Pdf URL: https://arxiv.org/pdf/2505.18870
Copy Paste: [[2505.18870]] Understanding the Relationship Between Personal Data Privacy Literacy and Data Privacy Information Sharing by University Students(https://arxiv.org/abs/2505.18870)
Keywords: security, privacy
Abstract: With constant threats to the safety of personal data in the United States, privacy literacy has become an increasingly important competency among university students, one that ties intimately to the information sharing behavior of these students. This survey based study examines how university students in the United States perceive personal data privacy and how their privacy literacy influences their understanding and behaviors. Students responses to a privacy literacy scale were categorized into high and low privacy literacy groups, revealing that high literacy individuals demonstrate a broader range of privacy practices, including multi factor authentication, VPN usage, and phishing awareness, whereas low literacy individuals rely on more basic security measures. Statistical analyses suggest that high literacy respondents display greater diversity in recommendations and engagement in privacy discussions. These findings suggest the need for enhanced educational initiatives to improve data privacy awareness at the university level to create a better cyber safe population.

Title: Zero Trust Cybersecurity: Procedures and Considerations in Context

Authors: Brady D. Lund, Tae Hee Lee, Ziang Wang, Ting Wang, Nishith Reddy Mannuru
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.18872
Pdf URL: https://arxiv.org/pdf/2505.18872
Copy Paste: [[2505.18872]] Zero Trust Cybersecurity: Procedures and Considerations in Context(https://arxiv.org/abs/2505.18872)
Keywords: security
Abstract: In response to the increasing complexity and sophistication of cyber threats, particularly those enhanced by advancements in artificial intelligence, traditional security methods are proving insufficient. This paper explores the Zero Trust cybersecurity framework, which operates on the principle of never trust, always verify to mitigate vulnerabilities within organizations. Specifically, it examines the applicability of Zero Trust principles in environments where large volumes of information are exchanged, such as schools and libraries. The discussion highlights the importance of continuous authentication, least privilege access, and breach assumption. The findings underscore avenues for future research that may help preserve the security of these vulnerable organizations.

Title: Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation

Authors: Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, Jianfei Chen, Song Han, Kurt Keutzer, Ion Stoica
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18875
Pdf URL: https://arxiv.org/pdf/2505.18875
Copy Paste: [[2505.18875]] Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation(https://arxiv.org/abs/2505.18875)
Keywords: diffusion, transformer
Abstract: Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. By computing only critical tokens, sparse attention reduces computational costs and offers a promising acceleration approach. However, we identify that existing methods fail to approach optimal generation quality under the same computation budget for two reasons: (1) Inaccurate critical token identification: current methods cluster tokens based on position rather than semantics, leading to imprecise aggregated representations. (2) Excessive computation waste: critical tokens are scattered among non-critical ones, leading to wasted computation on GPUs, which are optimized for processing contiguous tokens. In this paper, we propose SVG2, a training-free framework that maximizes identification accuracy and minimizes computation waste, achieving a Pareto frontier trade-off between generation quality and efficiency. The core of SVG2 is semantic-aware permutation, which clusters and reorders tokens based on semantic similarity using k-means. This approach ensures both a precise cluster representation, improving identification accuracy, and a densified layout of critical tokens, enabling efficient computation without padding. Additionally, SVG2 integrates top-p dynamic budget control and customized kernel implementations, achieving up to 2.30x and 1.89x speedup while maintaining a PSNR of up to 30 and 26 on HunyuanVideo and Wan 2.1, respectively.

Title: RefLoRA: Refactored Low-Rank Adaptation for Efficient Fine-Tuning of Large Models

Authors: Yilang Zhang, Bingcong Li, Georgios B. Giannakis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18877
Pdf URL: https://arxiv.org/pdf/2505.18877
Copy Paste: [[2505.18877]] RefLoRA: Refactored Low-Rank Adaptation for Efficient Fine-Tuning of Large Models(https://arxiv.org/abs/2505.18877)
Keywords: large language model
Abstract: Low-Rank Adaptation (LoRA) lowers the computational and memory overhead of fine-tuning large models by updating a low-dimensional subspace of the pre-trained weight matrix. Albeit efficient, LoRA exhibits suboptimal convergence and noticeable performance degradation, due to inconsistent and imbalanced weight updates induced by its nonunique low-rank factorizations. To overcome these limitations, this article identifies the optimal low-rank factorization per step that minimizes an upper bound on the loss. The resultant refactored low-rank adaptation (RefLoRA) method promotes a flatter loss landscape, along with consistent and balanced weight updates, thus speeding up stable convergence. Extensive experiments evaluate RefLoRA on natural language understanding, and commonsense reasoning tasks with popular large language models including DeBERTaV3, LLaMA-7B, LLaMA2-7B and LLaMA3-8B. The numerical tests corroborate that RefLoRA converges faster, outperforms various benchmarks, and enjoys negligible computational overhead compared to state-of-the-art LoRA variants.

Title: CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions

Authors: Kung-Hsiang Huang, Akshara Prabhakar, Onkar Thorat, Divyansh Agarwal, Prafulla Kumar Choubey, Yixin Mao, Silvio Savarese, Caiming Xiong, Chien-Sheng Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18878
Pdf URL: https://arxiv.org/pdf/2505.18878
Copy Paste: [[2505.18878]] CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions(https://arxiv.org/abs/2505.18878)
Keywords: robust
Abstract: While AI agents hold transformative potential in business, effective performance benchmarking is hindered by the scarcity of public, realistic business data on widely used platforms. Existing benchmarks often lack fidelity in their environments, data, and agent-user interactions, with limited coverage of diverse business scenarios and industries. To address these gaps, we introduce CRMArena-Pro, a novel benchmark for holistic, realistic assessment of LLM agents in diverse professional settings. CRMArena-Pro expands on CRMArena with nineteen expert-validated tasks across sales, service, and 'configure, price, and quote' processes, for both Business-to-Business and Business-to-Customer scenarios. It distinctively incorporates multi-turn interactions guided by diverse personas and robust confidentiality awareness assessments. Experiments reveal leading LLM agents achieve only around 58% single-turn success on CRMArena-Pro, with performance dropping significantly to approximately 35% in multi-turn settings. While Workflow Execution proves more tractable for top agents (over 83% single-turn success), other evaluated business skills present greater challenges. Furthermore, agents exhibit near-zero inherent confidentiality awareness; though targeted prompting can improve this, it often compromises task performance. These findings highlight a substantial gap between current LLM capabilities and enterprise demands, underscoring the need for advancements in multi-turn reasoning, confidentiality adherence, and versatile skill acquisition.

Title: REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing

Authors: Weihan Xu, Yimeng Ma, Jingyue Huang, Yang Li, Wenye Ma, Taylor Berg-Kirkpatrick, Julian McAuley, Paul Pu Liang, Hao-Wen Dong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18880
Pdf URL: https://arxiv.org/pdf/2505.18880
Copy Paste: [[2505.18880]] REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing(https://arxiv.org/abs/2505.18880)
Keywords: large language model
Abstract: Short videos are an effective tool for promoting contents and improving knowledge accessibility. While existing extractive video summarization methods struggle to produce a coherent narrative, existing abstractive methods cannot `quote' from the input videos, i.e., inserting short video clips in their outputs. In this work, we explore novel video editing models for generating shorts that feature a coherent narrative with embedded video insertions extracted from a long input video. We propose a novel retrieval-embedded generation framework that allows a large language model to quote multimodal resources while maintaining a coherent narrative. Our proposed REGen system first generates the output story script with quote placeholders using a finetuned large language model, and then uses a novel retrieval model to replace the quote placeholders by selecting a video clip that best supports the narrative from a pool of candidate quotable video clips. We examine the proposed method on the task of documentary teaser generation, where short interview insertions are commonly used to support the narrative of a documentary. Our objective evaluations show that the proposed method can effectively insert short video clips while maintaining a coherent narrative. In a subjective survey, we show that our proposed method outperforms existing abstractive and extractive approaches in terms of coherence, alignment, and realism in teaser generation.

Title: Partition Generative Modeling: Masked Modeling Without Masks

Authors: Justin Deschenaux, Lan Tran, Caglar Gulcehre
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18883
Pdf URL: https://arxiv.org/pdf/2505.18883
Copy Paste: [[2505.18883]] Partition Generative Modeling: Masked Modeling Without Masks(https://arxiv.org/abs/2505.18883)
Keywords: diffusion, generative
Abstract: We introduce ``Partition Generative Models'' (PGMs), a novel approach to masked generative modeling (MGMs), particularly effective for masked diffusion language modeling (MDLMs). PGM divides tokens into two distinct groups and employs sparse attention patterns to prevent cross-group information exchange. Hence, the model is trained to predict tokens in one group based solely on information from the other group. This partitioning strategy eliminates the need for MASK tokens entirely. While traditional MGMs inefficiently process MASK tokens during generation, PGMs achieve greater computational efficiency by operating exclusively on unmasked tokens. Our experiments on OpenWebText with a context length of 1024 tokens demonstrate that PGMs deliver at least 5x improvements in both latency and throughput compared to MDLM when using the same number of sampling steps, while generating samples with better generative perplexity than MDLM. Finally, we show that PGMs can be distilled with Self-Distillation Through Time (SDTT), a method originally devised for MDLM, in order to achieve further inference gains.

Title: LORE: Lagrangian-Optimized Robust Embeddings for Visual Encoders

Authors: Borna Khodabandeh, Amirabbas Afzali, Amirhossein Afsharrad, Seyed Shahabeddin Mousavi, Sanjay Lall, Sajjad Amini, Seyed-Mohsen Moosavi-Dezfooli
Subjects: cs.LG, cs.AI, cs.CV, math.OC
Abstract URL: https://arxiv.org/abs/2505.18884
Pdf URL: https://arxiv.org/pdf/2505.18884
Copy Paste: [[2505.18884]] LORE: Lagrangian-Optimized Robust Embeddings for Visual Encoders(https://arxiv.org/abs/2505.18884)
Keywords: robust, interpretability
Abstract: Visual encoders have become fundamental components in modern computer vision pipelines. However, ensuring robustness against adversarial perturbations remains a critical challenge. Recent efforts have explored both supervised and unsupervised adversarial fine-tuning strategies. We identify two key limitations in these approaches: (i) they often suffer from instability, especially during the early stages of fine-tuning, resulting in suboptimal convergence and degraded performance on clean data, and (ii) they exhibit a suboptimal trade-off between robustness and clean data accuracy, hindering the simultaneous optimization of both objectives. To overcome these challenges, we propose Lagrangian-Optimized Robust Embeddings (LORE), a novel unsupervised adversarial fine-tuning framework. LORE utilizes constrained optimization, which offers a principled approach to balancing competing goals, such as improving robustness while preserving nominal performance. By enforcing embedding-space proximity constraints, LORE effectively maintains clean data performance throughout adversarial fine-tuning. Extensive experiments show that LORE significantly improves zero-shot adversarial robustness with minimal degradation in clean data accuracy. Furthermore, we demonstrate the effectiveness of the adversarially fine-tuned CLIP image encoder in out-of-distribution generalization and enhancing the interpretability of image embeddings.

Title: KerZOO: Kernel Function Informed Zeroth-Order Optimization for Accurate and Accelerated LLM Fine-Tuning

Authors: Zhendong Mi, Qitao Tan, Xiaodong Yu, Zining Zhu, Geng Yuan, Shaoyi Huang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18886
Pdf URL: https://arxiv.org/pdf/2505.18886
Copy Paste: [[2505.18886]] KerZOO: Kernel Function Informed Zeroth-Order Optimization for Accurate and Accelerated LLM Fine-Tuning(https://arxiv.org/abs/2505.18886)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated impressive capabilities across numerous NLP tasks. Nevertheless, conventional first-order fine-tuning techniques impose heavy memory demands, creating practical obstacles to real-world applications. Zeroth-order (ZO) optimization has recently emerged as a promising memory-efficient alternative, as it circumvents the need for backpropagation by estimating gradients solely through forward passes--making it particularly suitable for resource-limited environments. Despite its efficiency, ZO optimization suffers from gradient estimation bias, which significantly hinders convergence speed. To address this, we analytically identify and characterize the lower-order bias introduced during ZO-based gradient estimation in LLM fine-tuning. Motivated by tools in mathematical physics, we introduce a kernel-function-based ZO framework aimed at mitigating this bias and improving optimization stability. KerZOO achieves comparable or superior performance to existing ZO baselines in both full-parameter and parameter-efficient fine-tuning settings of LLMs, while significantly reducing the number of iterations required to reach convergence. For example, KerZOO reduces total GPU training hours by as much as 74% and 44% on WSC and MultiRC datasets in fine-tuning OPT-2.7B model and can exceed the MeZO baseline by 2.9% and 2.6% in accuracy. We show that the kernel function is an effective avenue for reducing estimation bias in ZO methods.

Title: Security Concerns for Large Language Models: A Survey

Authors: Miles Q. Li, Benjamin C. M. Fung
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18889
Pdf URL: https://arxiv.org/pdf/2505.18889
Copy Paste: [[2505.18889]] Security Concerns for Large Language Models: A Survey(https://arxiv.org/abs/2505.18889)
Keywords: security, defense, attack, robust, large language model
Abstract: Large Language Models (LLMs) such as GPT-4 (and its recent iterations like GPT-4o and the GPT-4.1 series), Google's Gemini, Anthropic's Claude 3 models, and xAI's Grok have caused a revolution in natural language processing, but their capabilities also introduce new security vulnerabilities. In this survey, we provide a comprehensive overview of the emerging security concerns around LLMs, categorizing threats into prompt injection and jailbreaking, adversarial attacks (including input perturbations and data poisoning), misuse by malicious actors (e.g., for disinformation, phishing, and malware generation), and worrisome risks inherent in autonomous LLM agents. A significant focus has been recently placed on the latter, exploring goal misalignment, emergent deception, self-preservation instincts, and the potential for LLMs to develop and pursue covert, misaligned objectives (scheming), which may even persist through safety training. We summarize recent academic and industrial studies (2022-2025) that exemplify each threat, analyze proposed defenses and their limitations, and identify open challenges in securing LLM-based applications. We conclude by emphasizing the importance of advancing robust, multi-layered security strategies to ensure LLMs are safe and beneficial.

Title: Conformal Prediction for Uncertainty Estimation in Drug-Target Interaction Prediction

Authors: Morteza Rakhshaninejad, Mira Jurgens, Nicolas Dewolf, Willem Waegeman
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18890
Pdf URL: https://arxiv.org/pdf/2505.18890
Copy Paste: [[2505.18890]] Conformal Prediction for Uncertainty Estimation in Drug-Target Interaction Prediction(https://arxiv.org/abs/2505.18890)
Keywords: robust
Abstract: Accurate drug-target interaction (DTI) prediction with machine learning models is essential for drug discovery. Such models should also provide a credible representation of their uncertainty, but applying classical marginal conformal prediction (CP) in DTI prediction often overlooks variability across drug and protein subgroups. In this work, we analyze three cluster-conditioned CP methods for DTI prediction, and compare them with marginal and group-conditioned CP. Clusterings are obtained via nonconformity scores, feature similarity, and nearest neighbors, respectively. Experiments on the KIBA dataset using four data-splitting strategies show that nonconformity-based clustering yields the tightest intervals and most reliable subgroup coverage, especially in random and fully unseen drug-protein splits. Group-conditioned CP works well when one entity is familiar, but residual-driven clustering provides robust uncertainty estimates even in sparse or novel scenarios. These results highlight the potential of cluster-based CP for improving DTI prediction under uncertainty.

Title: Beyond Domain Randomization: Event-Inspired Perception for Visually Robust Adversarial Imitation from Videos

Authors: Andrea Ramazzina, Vittorio Giammarino, Matteo El-Hariry, Mario Bijelic
Subjects: cs.CV, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2505.18899
Pdf URL: https://arxiv.org/pdf/2505.18899
Copy Paste: [[2505.18899]] Beyond Domain Randomization: Event-Inspired Perception for Visually Robust Adversarial Imitation from Videos(https://arxiv.org/abs/2505.18899)
Keywords: robust
Abstract: Imitation from videos often fails when expert demonstrations and learner environments exhibit domain shifts, such as discrepancies in lighting, color, or texture. While visual randomization partially addresses this problem by augmenting training data, it remains computationally intensive and inherently reactive, struggling with unseen scenarios. We propose a different approach: instead of randomizing appearances, we eliminate their influence entirely by rethinking the sensory representation itself. Inspired by biological vision systems that prioritize temporal transients (e.g., retinal ganglion cells) and by recent sensor advancements, we introduce event-inspired perception for visually robust imitation. Our method converts standard RGB videos into a sparse, event-based representation that encodes temporal intensity gradients, discarding static appearance features. This biologically grounded approach disentangles motion dynamics from visual style, enabling robust visual imitation from observations even in the presence of visual mismatches between expert and agent environments. By training policies on event streams, we achieve invariance to appearance-based distractors without requiring computationally expensive and environment-specific data augmentation techniques. Experiments across the DeepMind Control Suite and the Adroit platform for dynamic dexterous manipulation show the efficacy of our method. Our code is publicly available at Eb-LAIfO.

Title: PromptWise: Online Learning for Cost-Aware Prompt Assignment in Generative Models

Authors: Xiaoyan Hu, Lauren Pick, Ho-fung Leung, Farzan Farnia
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18901
Pdf URL: https://arxiv.org/pdf/2505.18901
Copy Paste: [[2505.18901]] PromptWise: Online Learning for Cost-Aware Prompt Assignment in Generative Models(https://arxiv.org/abs/2505.18901)
Keywords: generative, large language model
Abstract: The rapid advancement of generative AI models has provided users with numerous options to address their prompts. When selecting a generative AI model for a given prompt, users should consider not only the performance of the chosen model but also its associated service cost. The principle guiding such consideration is to select the least expensive model among the available satisfactory options. However, existing model-selection approaches typically prioritize performance, overlooking pricing differences between models. In this paper, we introduce PromptWise, an online learning framework designed to assign a sequence of prompts to a group of large language models (LLMs) in a cost-effective manner. PromptWise strategically queries cheaper models first, progressing to more expensive options only if the lower-cost models fail to adequately address a given prompt. Through numerical experiments, we demonstrate PromptWise's effectiveness across various tasks, including puzzles of varying complexity and code generation/translation tasks. The results highlight that PromptWise consistently outperforms cost-unaware baseline methods, emphasizing that directly assigning prompts to the most expensive models can lead to higher costs and potentially lower average performance.

Title: Federated Retrieval-Augmented Generation: A Systematic Mapping Study

Authors: Abhijit Chakraborty, Chahana Dahal, Vivek Gupta
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.18906
Pdf URL: https://arxiv.org/pdf/2505.18906
Copy Paste: [[2505.18906]] Federated Retrieval-Augmented Generation: A Systematic Mapping Study(https://arxiv.org/abs/2505.18906)
Keywords: secure, privacy, federate, large language model
Abstract: Federated Retrieval-Augmented Generation (Federated RAG) combines Federated Learning (FL), which enables distributed model training without exposing raw data, with Retrieval-Augmented Generation (RAG), which improves the factual accuracy of language models by grounding outputs in external knowledge. As large language models are increasingly deployed in privacy-sensitive domains such as healthcare, finance, and personalized assistance, Federated RAG offers a promising framework for secure, knowledge-intensive natural language processing (NLP). To the best of our knowledge, this paper presents the first systematic mapping study of Federated RAG, covering literature published between 2020 and 2025. Following Kitchenham's guidelines for evidence-based software engineering, we develop a structured classification of research focuses, contribution types, and application domains. We analyze architectural patterns, temporal trends, and key challenges, including privacy-preserving retrieval, cross-client heterogeneity, and evaluation limitations. Our findings synthesize a rapidly evolving body of research, identify recurring design patterns, and surface open questions, providing a foundation for future work at the intersection of RAG and federated systems.

Title: Behavior Injection: Preparing Language Models for Reinforcement Learning

Authors: Zhepeng Cen, Yihang Yao, William Han, Zuxin Liu, Ding Zhao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18917
Pdf URL: https://arxiv.org/pdf/2505.18917
Copy Paste: [[2505.18917]] Behavior Injection: Preparing Language Models for Reinforcement Learning(https://arxiv.org/abs/2505.18917)
Keywords: large language model
Abstract: Reinforcement fine-tuning (RFT) has emerged as a powerful post-training technique to incentivize the reasoning ability of large language models (LLMs). However, LLMs can respond very inconsistently to RFT: some show substantial performance gains, while others plateau or even degrade. To understand this divergence, we analyze the per-step influence of the RL objective and identify two key conditions for effective post-training: (1) RL-informative rollout accuracy, and (2) strong data co-influence, which quantifies how much the training data affects performance on other samples. Guided by these insights, we propose behavior injection, a task-agnostic data-augmentation scheme applied prior to RL. Behavior injection enriches the supervised finetuning (SFT) data by seeding exploratory and exploitative behaviors, effectively making the model more RL-ready. We evaluate our method across two reasoning benchmarks with multiple base models. The results demonstrate that our theoretically motivated augmentation can significantly increases the performance gain from RFT over the pre-RL model.

Title: Graph-Based Operator Learning from Limited Data on Irregular Domains

Authors: Yile Li, Shandian Zhe
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18923
Pdf URL: https://arxiv.org/pdf/2505.18923
Copy Paste: [[2505.18923]] Graph-Based Operator Learning from Limited Data on Irregular Domains(https://arxiv.org/abs/2505.18923)
Keywords: diffusion
Abstract: Operator learning seeks to approximate mappings from input functions to output solutions, particularly in the context of partial differential equations (PDEs). While recent advances such as DeepONet and Fourier Neural Operator (FNO) have demonstrated strong performance, they often rely on regular grid discretizations, limiting their applicability to complex or irregular domains. In this work, we propose a Graph-based Operator Learning with Attention (GOLA) framework that addresses this limitation by constructing graphs from irregularly sampled spatial points and leveraging attention-enhanced Graph Neural Netwoks (GNNs) to model spatial dependencies with global information. To improve the expressive capacity, we introduce a Fourier-based encoder that projects input functions into a frequency space using learnable complex coefficients, allowing for flexible embeddings even with sparse or nonuniform samples. We evaluated our approach across a range of 2D PDEs, including Darcy Flow, Advection, Eikonal, and Nonlinear Diffusion, under varying sampling densities. Our method consistently outperforms baselines, particularly in data-scarce regimes, demonstrating strong generalization and efficiency on irregular domains.

Title: LLM-Guided Taxonomy and Hierarchical Uncertainty for 3D Point CLoud Active Learning

Authors: Chenxi Li, Nuo Chen, Fengyun Tan, Yantong Chen, Bochun Yuan, Tianrui Li, Chongshou Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18924
Pdf URL: https://arxiv.org/pdf/2505.18924
Copy Paste: [[2505.18924]] LLM-Guided Taxonomy and Hierarchical Uncertainty for 3D Point CLoud Active Learning(https://arxiv.org/abs/2505.18924)
Keywords: large language model, segmentation
Abstract: We present a novel active learning framework for 3D point cloud semantic segmentation that, for the first time, integrates large language models (LLMs) to construct hierarchical label structures and guide uncertainty-based sample selection. Unlike prior methods that treat labels as flat and independent, our approach leverages LLM prompting to automatically generate multi-level semantic taxonomies and introduces a recursive uncertainty projection mechanism that propagates uncertainty across hierarchy levels. This enables spatially diverse, label-aware point selection that respects the inherent semantic structure of 3D scenes. Experiments on S3DIS and ScanNet v2 show that our method achieves up to 4% mIoU improvement under extremely low annotation budgets (e.g., 0.02%), substantially outperforming existing baselines. Our results highlight the untapped potential of LLMs as knowledge priors in 3D vision and establish hierarchical uncertainty modeling as a powerful paradigm for efficient point cloud annotation.

Title: Words as Geometric Features: Estimating Homography using Optical Character Recognition as Compressed Image Representation

Authors: Ross Greer, Alisha Ukani, Katherine Izhikevich, Earlence Fernandes, Stefan Savage, Alex C. Snoeren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18925
Pdf URL: https://arxiv.org/pdf/2505.18925
Copy Paste: [[2505.18925]] Words as Geometric Features: Estimating Homography using Optical Character Recognition as Compressed Image Representation(https://arxiv.org/abs/2505.18925)
Keywords: privacy, robust
Abstract: Document alignment and registration play a crucial role in numerous real-world applications, such as automated form processing, anomaly detection, and workflow automation. Traditional methods for document alignment rely on image-based features like keypoints, edges, and textures to estimate geometric transformations, such as homographies. However, these approaches often require access to the original document images, which may not always be available due to privacy, storage, or transmission constraints. This paper introduces a novel approach that leverages Optical Character Recognition (OCR) outputs as features for homography estimation. By utilizing the spatial positions and textual content of OCR-detected words, our method enables document alignment without relying on pixel-level image data. This technique is particularly valuable in scenarios where only OCR outputs are accessible. Furthermore, the method is robust to OCR noise, incorporating RANSAC to handle outliers and inaccuracies in the OCR data. On a set of test documents, we demonstrate that our OCR-based approach even performs more accurately than traditional image-based methods, offering a more efficient and scalable solution for document registration tasks. The proposed method facilitates applications in document processing, all while reducing reliance on high-dimensional image data.

Title: Hybrid Neural-MPM for Interactive Fluid Simulations in Real-Time

Authors: Jingxuan Xu, Hong Huang, Chuhang Zou, Manolis Savva, Yunchao Wei, Wuyang Chen
Subjects: cs.LG, physics.flu-dyn
Abstract URL: https://arxiv.org/abs/2505.18926
Pdf URL: https://arxiv.org/pdf/2505.18926
Copy Paste: [[2505.18926]] Hybrid Neural-MPM for Interactive Fluid Simulations in Real-Time(https://arxiv.org/abs/2505.18926)
Keywords: robust, diffusion, generative
Abstract: We propose a neural physics system for real-time, interactive fluid simulations. Traditional physics-based methods, while accurate, are computationally intensive and suffer from latency issues. Recent machine-learning methods reduce computational costs while preserving fidelity; yet most still fail to satisfy the latency constraints for real-time use and lack support for interactive applications. To bridge this gap, we introduce a novel hybrid method that integrates numerical simulation, neural physics, and generative control. Our neural physics jointly pursues low-latency simulation and high physical fidelity by employing a fallback safeguard to classical numerical solvers. Furthermore, we develop a diffusion-based controller that is trained using a reverse modeling strategy to generate external dynamic force fields for fluid manipulation. Our system demonstrates robust performance across diverse 2D/3D scenarios, material types, and obstacle interactions, achieving real-time simulations at high frame rates (11~29% latency) while enabling fluid control guided by user-friendly freehand sketches. We present a significant step towards practical, controllable, and physically plausible fluid simulations for real-time interactive applications. We promise to release both models and data upon acceptance.

Title: Benchmarking Large Language Models for Cyberbullying Detection in Real-World YouTube Comments

Authors: Amel Muminovic (International Balkan University)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18927
Pdf URL: https://arxiv.org/pdf/2505.18927
Copy Paste: [[2505.18927]] Benchmarking Large Language Models for Cyberbullying Detection in Real-World YouTube Comments(https://arxiv.org/abs/2505.18927)
Keywords: large language model
Abstract: As online platforms grow, comment sections increasingly host harassment that undermines user experience and well-being. This study benchmarks three leading large language models, OpenAI GPT-4.1, Google Gemini 1.5 Pro, and Anthropic Claude 3 Opus, on a corpus of 5,080 YouTube comments sampled from high-abuse threads in gaming, lifestyle, food vlog, and music channels. The dataset comprises 1,334 harmful and 3,746 non-harmful messages in English, Arabic, and Indonesian, annotated independently by two reviewers with substantial agreement (Cohen's kappa = 0.83). Using a unified prompt and deterministic settings, GPT-4.1 achieved the best overall balance with an F1 score of 0.863, precision of 0.887, and recall of 0.841. Gemini flagged the highest share of harmful posts (recall = 0.875) but its precision fell to 0.767 due to frequent false positives. Claude delivered the highest precision at 0.920 and the lowest false-positive rate of 0.022, yet its recall dropped to 0.720. Qualitative analysis showed that all three models struggle with sarcasm, coded insults, and mixed-language slang. These results underscore the need for moderation pipelines that combine complementary models, incorporate conversational context, and fine-tune for under-represented languages and implicit abuse. A de-identified version of the dataset and full prompts is publicly released to promote reproducibility and further progress in automated content moderation.

Title: MetaMind: Modeling Human Social Thoughts with Metacognitive Multi-Agent Systems

Authors: Xuanming Zhang, Yuxuan Chen, Min-Hsuan Yeh, Yixuan Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18943
Pdf URL: https://arxiv.org/pdf/2505.18943
Copy Paste: [[2505.18943]] MetaMind: Modeling Human Social Thoughts with Metacognitive Multi-Agent Systems(https://arxiv.org/abs/2505.18943)
Keywords: large language model
Abstract: Human social interactions depend on the ability to infer others' unspoken intentions, emotions, and beliefs-a cognitive skill grounded in the psychological concept of Theory of Mind (ToM). While large language models (LLMs) excel in semantic understanding tasks, they struggle with the ambiguity and contextual nuance inherent in human communication. To bridge this gap, we introduce MetaMind, a multi-agent framework inspired by psychological theories of metacognition, designed to emulate human-like social reasoning. MetaMind decomposes social understanding into three collaborative stages: (1) a Theory-of-Mind Agent generates hypotheses user mental states (e.g., intent, emotion), (2) a Domain Agent refines these hypotheses using cultural norms and ethical constraints, and (3) a Response Agent generates contextually appropriate responses while validating alignment with inferred intent. Our framework achieves state-of-the-art performance across three challenging benchmarks, with 35.7% improvement in real-world social scenarios and 6.2% gain in ToM reasoning. Notably, it enables LLMs to match human-level performance on key ToM tasks for the first time. Ablation studies confirm the necessity of all components, which showcase the framework's ability to balance contextual plausibility, social appropriateness, and user adaptation. This work advances AI systems toward human-like social intelligence, with applications in empathetic dialogue and culturally sensitive interactions. Code is available at this https URL.

Title: Exemplifying Emerging Phishing: QR-based Browser-in-The-Browser (BiTB) Attack

Authors: Muhammad Wahid Akram, Keshav Sood, Muneeb Ul Hassan, Basant Subba
Subjects: cs.CR, cs.ET
Abstract URL: https://arxiv.org/abs/2505.18944
Pdf URL: https://arxiv.org/pdf/2505.18944
Copy Paste: [[2505.18944]] Exemplifying Emerging Phishing: QR-based Browser-in-The-Browser (BiTB) Attack(https://arxiv.org/abs/2505.18944)
Keywords: attack, large language model
Abstract: Lately, cybercriminals constantly formulate productive approaches to exploit individuals. This article exemplifies an innovative attack, namely QR-based Browser-in-The-Browser (BiTB), using proficiencies of Large Language Model (LLM) i.e. Google Gemini. The presented attack is a fusion of two emerging attacks: BiTB and Quishing (QR code phishing). Our study underscores attack's simplistic implementation utilizing malicious prompts provided to Gemini-LLM. Moreover, we presented a case study to highlight a lucrative attack method, we also performed an experiment to comprehend the attack execution on victims' device. The findings of this work obligate the researchers' contributions in confronting this type of phishing attempts through LLMs.

Title: Echo Planning for Autonomous Driving: From Current Observations to Future Trajectories and Back

Authors: Jintao Sun, Hu Zhang, Gangyi Ding, Zhedong Zheng
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2505.18945
Pdf URL: https://arxiv.org/pdf/2505.18945
Copy Paste: [[2505.18945]] Echo Planning for Autonomous Driving: From Current Observations to Future Trajectories and Back(https://arxiv.org/abs/2505.18945)
Keywords: robust
Abstract: Modern end-to-end autonomous driving systems suffer from a critical limitation: their planners lack mechanisms to enforce temporal consistency between predicted trajectories and evolving scene dynamics. This absence of self-supervision allows early prediction errors to compound catastrophically over time. We introduce Echo Planning, a novel self-correcting framework that establishes a closed-loop Current - Future - Current (CFC) cycle to harmonize trajectory prediction with scene coherence. Our key insight is that plausible future trajectories must be bi-directionally consistent, ie, not only generated from current observations but also capable of reconstructing them. The CFC mechanism first predicts future trajectories from the Bird's-Eye-View (BEV) scene representation, then inversely maps these trajectories back to estimate the current BEV state. By enforcing consistency between the original and reconstructed BEV representations through a cycle loss, the framework intrinsically penalizes physically implausible or misaligned trajectories. Experiments on nuScenes demonstrate state-of-the-art performance, reducing L2 error by 0.04 m and collision rate by 0.12% compared to one-shot planners. Crucially, our method requires no additional supervision, leveraging the CFC cycle as an inductive bias for robust planning. This work offers a deployable solution for safety-critical autonomous systems.

Title: OpenHOI: Open-World Hand-Object Interaction Synthesis with Multimodal Large Language Model

Authors: Zhenhao Zhang, Ye Shi, Lingxiao Yang, Suting Ni, Qi Ye, Jingya Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18947
Pdf URL: https://arxiv.org/pdf/2505.18947
Copy Paste: [[2505.18947]] OpenHOI: Open-World Hand-Object Interaction Synthesis with Multimodal Large Language Model(https://arxiv.org/abs/2505.18947)
Keywords: diffusion, large language model
Abstract: Understanding and synthesizing realistic 3D hand-object interactions (HOI) is critical for applications ranging from immersive AR/VR to dexterous robotics. Existing methods struggle with generalization, performing well on closed-set objects and predefined tasks but failing to handle unseen objects or open-vocabulary instructions. We introduce OpenHOI, the first framework for open-world HOI synthesis, capable of generating long-horizon manipulation sequences for novel objects guided by free-form language commands. Our approach integrates a 3D Multimodal Large Language Model (MLLM) fine-tuned for joint affordance grounding and semantic task decomposition, enabling precise localization of interaction regions (e.g., handles, buttons) and breakdown of complex instructions (e.g., "Find a water bottle and take a sip") into executable sub-tasks. To synthesize physically plausible interactions, we propose an affordance-driven diffusion model paired with a training-free physics refinement stage that minimizes penetration and optimizes affordance alignment. Evaluations across diverse scenarios demonstrate OpenHOI's superiority over state-of-the-art methods in generalizing to novel object categories, multi-stage tasks, and complex language instructions. Our project page at \href{this https URL}

Title: Exact Expressive Power of Transformers with Padding

Authors: William Merrill, Ashish Sabharwal
Subjects: cs.LG, cs.CC, cs.FL
Abstract URL: https://arxiv.org/abs/2505.18948
Pdf URL: https://arxiv.org/pdf/2505.18948
Copy Paste: [[2505.18948]] Exact Expressive Power of Transformers with Padding(https://arxiv.org/abs/2505.18948)
Keywords: transformer, large language model
Abstract: Chain of thought is a natural inference-time method for increasing the computational power of transformer-based large language models (LLMs), but comes at the cost of sequential decoding. Are there more efficient alternatives to expand a transformer's expressive power without adding parameters? We consider transformers with padding tokens as a form of parallelizable test-time compute. We show that averaging-hard-attention, masked-pre-norm transformers with polynomial padding converge to precisely the class $\mathsf{TC}^0$ of extremely parallelizable problems. While the $\mathsf{TC}^0$ upper bound was known, proving a matching lower bound had been elusive. Further, our novel analysis reveals the precise expanded power of padded transformers when coupled with another form of inference-time compute, namely dynamically increasing depth via looping. Our core technical contribution is to show how padding helps bring the notions of complete problems and reductions, which have been a cornerstone of classical complexity theory, to the formal study of transformers. Armed with this new tool, we prove that padded transformers with $O(\log^d n)$ looping on inputs of length $n$ recognize exactly the class $\mathsf{TC}^d$ of moderately parallelizable problems. Thus, padding and looping together systematically expand transformers' expressive power: with polylogarithmic looping, padded transformers converge to the class $\mathsf{NC}$, the best that could be expected without losing parallelism (unless $\mathsf{NC} = \mathsf{P}$). Our results thus motivate further exploration of padding and looping as parallelizable alternatives to chain of thought.

Title: The Price of Format: Diversity Collapse in LLMs

Authors: Longfei Yun, Chenyang An, Zilong Wang, Letian Peng, Jingbo Shang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18949
Pdf URL: https://arxiv.org/pdf/2505.18949
Copy Paste: [[2505.18949]] The Price of Format: Diversity Collapse in LLMs(https://arxiv.org/abs/2505.18949)
Keywords: large language model
Abstract: Instruction-tuned large language models (LLMs) employ structured templates, such as role markers and special tokens, to enforce format consistency during inference. However, we identify a critical limitation of such formatting: it induces a phenomenon we term diversity collapse, where the model generates semantically similar outputs for open-ended inputs, undermining creativity and variability. We systematically evaluate this effect across tasks like story completion and free-form generation, finding that (1) diversity collapse persists even under high-temperature sampling, and (2) structural tokens in templates significantly constrain the model's output space. To contextualize these findings, we fine-tune the same model using a range of structured prompts and then evaluate them across three axes: downstream task performance, alignment behavior, and output diversity. Our analysis shows that format consistency between fine-tuning and inference is crucial for structure-sensitive tasks (e.g., GSM8K, IFEval), but has marginal influence on knowledge-heavy tasks (e.g., MMLU, WebQuestions). In contrast, output diversity is primarily governed by the presence or absence of structural tokens, with minimal formatting yielding the most diverse outputs. These findings reveal that current prompting conventions, while beneficial for alignment, may inadvertently suppress output diversity, underscoring the need for diversity-aware prompt design and instruction tuning.

Title: BnMMLU: Measuring Massive Multitask Language Understanding in Bengali

Authors: Saman Sarker Joy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18951
Pdf URL: https://arxiv.org/pdf/2505.18951
Copy Paste: [[2505.18951]] BnMMLU: Measuring Massive Multitask Language Understanding in Bengali(https://arxiv.org/abs/2505.18951)
Keywords: large language model
Abstract: The Massive Multitask Language Understanding (MMLU) benchmark has been widely used to evaluate language models across various domains. However, existing MMLU datasets primarily focus on high-resource languages such as English, which leaves low-resource languages like Bengali underrepresented. In this paper, we introduce BnMMLU, a benchmark to evaluate the multitask language understanding capabilities of Bengali in language models. The dataset spans 23 domains, including science, humanities, mathematics and general knowledge and is structured in a multiple-choice format to assess factual knowledge, application-based problem-solving and reasoning abilities of language models. It consists of 138,949 question-option pairs. We benchmark several proprietary and open-source large language models (LLMs) on the BnMMLU test set. Additionally, we annotate the test set with three cognitive categories-factual knowledge, procedural application and reasoning-to gain deeper insights into model strengths and weaknesses across various cognitive tasks. The results reveal significant performance gaps, highlighting the need for improved pre-training and fine-tuning strategies tailored to Bengali data. We release the dataset and benchmark results to facilitate further research in this area.

Title: Online Knowledge Distillation with Reward Guidance

Authors: Chen Jia
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18952
Pdf URL: https://arxiv.org/pdf/2505.18952
Copy Paste: [[2505.18952]] Online Knowledge Distillation with Reward Guidance(https://arxiv.org/abs/2505.18952)
Keywords: large language model
Abstract: This work studies knowledge distillation (KD) for large language models (LLMs) through preference optimization. We propose a reward-guided imitation learning framework for sequential KD, formulating a min-max optimization problem between the policy and reward model (RM) to minimize the performance gap between the student and teacher policies. Specifically, the reward optimization is constrained to achieve near-optimality within a confidence set for preference alignment. For preference data construction, we explore both offline and online preference-based KD. Additionally, we reformulate the RM using the $Q$-value function and extend the framework to white-box KD, where the teacher policy's predicted probabilities are accessible. Theoretical analysis and empirical results demonstrate the effectiveness of the proposed framework.

Title: How Do Images Align and Complement LiDAR? Towards a Harmonized Multi-modal 3D Panoptic Segmentation

Authors: Yining Pan, Qiongjie Cui, Xulei Yang, Na Zhao
Subjects: cs.CV, cs.AI, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2505.18956
Pdf URL: https://arxiv.org/pdf/2505.18956
Copy Paste: [[2505.18956]] How Do Images Align and Complement LiDAR? Towards a Harmonized Multi-modal 3D Panoptic Segmentation(https://arxiv.org/abs/2505.18956)
Keywords: transformer, segmentation
Abstract: LiDAR-based 3D panoptic segmentation often struggles with the inherent sparsity of data from LiDAR sensors, which makes it challenging to accurately recognize distant or small objects. Recently, a few studies have sought to overcome this challenge by integrating LiDAR inputs with camera images, leveraging the rich and dense texture information provided by the latter. While these approaches have shown promising results, they still face challenges, such as misalignment during data augmentation and the reliance on post-processing steps. To address these issues, we propose Image-Assists-LiDAR (IAL), a novel multi-modal 3D panoptic segmentation framework. In IAL, we first introduce a modality-synchronized data augmentation strategy, PieAug, to ensure alignment between LiDAR and image inputs from the start. Next, we adopt a transformer decoder to directly predict panoptic segmentation results. To effectively fuse LiDAR and image features into tokens for the decoder, we design a Geometric-guided Token Fusion (GTF) module. Additionally, we leverage the complementary strengths of each modality as priors for query initialization through a Prior-based Query Generation (PQG) module, enhancing the decoder's ability to generate accurate instance masks. Our IAL framework achieves state-of-the-art performance compared to previous multi-modal 3D panoptic segmentation methods on two widely used benchmarks. Code and models are publicly available at

Title: CDPDNet: Integrating Text Guidance with Hybrid Vision Encoders for Medical Image Segmentation

Authors: Jiong Wu, Yang Xing, Boxiao Yu, Wei Shao, Kuang Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18958
Pdf URL: https://arxiv.org/pdf/2505.18958
Copy Paste: [[2505.18958]] CDPDNet: Integrating Text Guidance with Hybrid Vision Encoders for Medical Image Segmentation(https://arxiv.org/abs/2505.18958)
Keywords: transformer, segmentation
Abstract: Most publicly available medical segmentation datasets are only partially labeled, with annotations provided for a subset of anatomical structures. When multiple datasets are combined for training, this incomplete annotation poses challenges, as it limits the model's ability to learn shared anatomical representations among datasets. Furthermore, vision-only frameworks often fail to capture complex anatomical relationships and task-specific distinctions, leading to reduced segmentation accuracy and poor generalizability to unseen datasets. In this study, we proposed a novel CLIP-DINO Prompt-Driven Segmentation Network (CDPDNet), which combined a self-supervised vision transformer with CLIP-based text embedding and introduced task-specific text prompts to tackle these challenges. Specifically, the framework was constructed upon a convolutional neural network (CNN) and incorporated DINOv2 to extract both fine-grained and global visual features, which were then fused using a multi-head cross-attention module to overcome the limited long-range modeling capability of CNNs. In addition, CLIP-derived text embeddings were projected into the visual space to help model complex relationships among organs and tumors. To further address the partial label challenge and enhance inter-task discriminative capability, a Text-based Task Prompt Generation (TTPG) module that generated task-specific prompts was designed to guide the segmentation. Extensive experiments on multiple medical imaging datasets demonstrated that CDPDNet consistently outperformed existing state-of-the-art segmentation methods. Code and pretrained model are available at: this https URL.

Title: System-1.5 Reasoning: Traversal in Language and Latent Spaces with Dynamic Shortcuts

Authors: Xiaoqiang Wang, Suyuchen Wang, Yun Zhu, Bang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18962
Pdf URL: https://arxiv.org/pdf/2505.18962
Copy Paste: [[2505.18962]] System-1.5 Reasoning: Traversal in Language and Latent Spaces with Dynamic Shortcuts(https://arxiv.org/abs/2505.18962)
Keywords: transformer, large language model
Abstract: Chain-of-thought (CoT) reasoning enables large language models (LLMs) to move beyond fast System-1 responses and engage in deliberative System-2 reasoning. However, this comes at the cost of significant inefficiency due to verbose intermediate output. Recent latent-space reasoning methods improve efficiency by operating on hidden states without decoding into language, yet they treat all steps uniformly, failing to distinguish critical deductions from auxiliary steps and resulting in suboptimal use of computational resources. In this paper, we propose System-1.5 Reasoning, an adaptive reasoning framework that dynamically allocates computation across reasoning steps through shortcut paths in latent this http URL, System-1.5 Reasoning introduces two types of dynamic shortcuts. The model depth shortcut (DS) adaptively reasons along the vertical depth by early exiting non-critical tokens through lightweight adapter branches, while allowing critical tokens to continue through deeper Transformer layers. The step shortcut (SS) reuses hidden states across the decoding steps to skip trivial steps and reason horizontally in latent space. Training System-1.5 Reasoning involves a two-stage self-distillation process: first distilling natural language CoT into latent-space continuous thought, and then distilling full-path System-2 latent reasoning into adaptive shortcut paths (System-1.5 Reasoning).Experiments on reasoning tasks demonstrate the superior performance of our method. For example, on GSM8K, System-1.5 Reasoning achieves reasoning performance comparable to traditional CoT fine-tuning methods while accelerating inference by over 20x and reducing token generation by 92.31% on average.

Title: MGD$^3$: Mode-Guided Dataset Distillation using Diffusion Models

Authors: Jeffrey A. Chan-Santiago, Praveen Tirupattur, Gaurav Kumar Nayak, Gaowen Liu, Mubarak Shah
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18963
Pdf URL: https://arxiv.org/pdf/2505.18963
Copy Paste: [[2505.18963]] MGD$^3$: Mode-Guided Dataset Distillation using Diffusion Models(https://arxiv.org/abs/2505.18963)
Keywords: diffusion, generative
Abstract: Dataset distillation has emerged as an effective strategy, significantly reducing training costs and facilitating more efficient model deployment. Recent advances have leveraged generative models to distill datasets by capturing the underlying data distribution. Unfortunately, existing methods require model fine-tuning with distillation losses to encourage diversity and representativeness. However, these methods do not guarantee sample diversity, limiting their performance. We propose a mode-guided diffusion model leveraging a pre-trained diffusion model without the need to fine-tune with distillation losses. Our approach addresses dataset diversity in three stages: Mode Discovery to identify distinct data modes, Mode Guidance to enhance intra-class diversity, and Stop Guidance to mitigate artifacts in synthetic samples that affect performance. Our approach outperforms state-of-the-art methods, achieving accuracy gains of 4.4%, 2.9%, 1.6%, and 1.6% on ImageNette, ImageIDC, ImageNet-100, and ImageNet-1K, respectively. Our method eliminates the need for fine-tuning diffusion models with distillation losses, significantly reducing computational costs. Our code is available on the project webpage: this https URL

Title: Protein Design with Dynamic Protein Vocabulary

Authors: Nuowei Liu, Jiahao Kuang, Yanting Liu, Changzhi Sun, Tao Ji, Yuanbin Wu, Man Lan
Subjects: cs.LG, cs.AI, q-bio.BM
Abstract URL: https://arxiv.org/abs/2505.18966
Pdf URL: https://arxiv.org/pdf/2505.18966
Copy Paste: [[2505.18966]] Protein Design with Dynamic Protein Vocabulary(https://arxiv.org/abs/2505.18966)
Keywords: generative
Abstract: Protein design is a fundamental challenge in biotechnology, aiming to design novel sequences with specific functions within the vast space of possible proteins. Recent advances in deep generative models have enabled function-based protein design from textual descriptions, yet struggle with structural plausibility. Inspired by classical protein design methods that leverage natural protein structures, we explore whether incorporating fragments from natural proteins can enhance foldability in generative models. Our empirical results show that even random incorporation of fragments improves foldability. Building on this insight, we introduce ProDVa, a novel protein design approach that integrates a text encoder for functional descriptions, a protein language model for designing proteins, and a fragment encoder to dynamically retrieve protein fragments based on textual functional descriptions. Experimental results demonstrate that our approach effectively designs protein sequences that are both functionally aligned and structurally plausible. Compared to state-of-the-art models, ProDVa achieves comparable function alignment using less than 0.04% of the training data, while designing significantly more well-folded proteins, with the proportion of proteins having pLDDT above 70 increasing by 7.38% and those with PAE below 10 increasing by 9.6%.

Title: Learning to Explain: Prototype-Based Surrogate Models for LLM Classification

Authors: Bowen Wei, Ziwei Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18970
Pdf URL: https://arxiv.org/pdf/2505.18970
Copy Paste: [[2505.18970]] Learning to Explain: Prototype-Based Surrogate Models for LLM Classification(https://arxiv.org/abs/2505.18970)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated impressive performance on natural language tasks, but their decision-making processes remain largely opaque. Existing explanation methods either suffer from limited faithfulness to the model's reasoning or produce explanations that humans find difficult to understand. To address these challenges, we propose \textbf{ProtoSurE}, a novel prototype-based surrogate framework that provides faithful and human-understandable explanations for LLMs. ProtoSurE trains an interpretable-by-design surrogate model that aligns with the target LLM while utilizing sentence-level prototypes as human-understandable concepts. Extensive experiments show that ProtoSurE consistently outperforms SOTA explanation methods across diverse LLMs and datasets. Importantly, ProtoSurE demonstrates strong data efficiency, requiring relatively few training examples to achieve good performance, making it practical for real-world applications.

Title: Is Architectural Complexity Overrated? Competitive and Interpretable Knowledge Graph Completion with RelatE

Authors: Abhijit Chakraborty, Chahana Dahal, Ashutosh Balasubramaniam, Tejas Anvekar, Vivek Gupta
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.18971
Pdf URL: https://arxiv.org/pdf/2505.18971
Copy Paste: [[2505.18971]] Is Architectural Complexity Overrated? Competitive and Interpretable Knowledge Graph Completion with RelatE(https://arxiv.org/abs/2505.18971)
Keywords: robust
Abstract: We revisit the efficacy of simple, real-valued embedding models for knowledge graph completion and introduce RelatE, an interpretable and modular method that efficiently integrates dual representations for entities and relations. RelatE employs a real-valued phase-modulus decomposition, leveraging sinusoidal phase alignments to encode relational patterns such as symmetry, inversion, and composition. In contrast to recent approaches based on complex-valued embeddings or deep neural architectures, RelatE preserves architectural simplicity while achieving competitive or superior performance on standard benchmarks. Empirically, RelatE outperforms prior methods across several datasets: on YAGO3-10, it achieves an MRR of 0.521 and Hit@10 of 0.680, surpassing all baselines. Additionally, RelatE offers significant efficiency gains, reducing training time by 24%, inference latency by 31%, and peak GPU memory usage by 22% compared to RotatE. Perturbation studies demonstrate improved robustness, with MRR degradation reduced by up to 61% relative to TransE and by up to 19% compared to RotatE under structural edits such as edge removals and relation swaps. Formal analysis further establishes the model's full expressiveness and its capacity to represent essential first-order logical inference patterns. These results position RelatE as a scalable and interpretable alternative to more complex architectures for knowledge graph completion.

Title: Hierarchical Mamba Meets Hyperbolic Geometry: A New Paradigm for Structured Language Embeddings

Authors: Sarang Patil, Ashish Parmanand Pandey, Ioannis Koutis, Mengjia Xu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18973
Pdf URL: https://arxiv.org/pdf/2505.18973
Copy Paste: [[2505.18973]] Hierarchical Mamba Meets Hyperbolic Geometry: A New Paradigm for Structured Language Embeddings(https://arxiv.org/abs/2505.18973)
Keywords: robust, large language model
Abstract: Selective state-space models have achieved great success in long-sequence modeling. However, their capacity for language representation, especially in complex hierarchical reasoning tasks, remains underexplored. Most large language models rely on flat Euclidean embeddings, limiting their ability to capture latent hierarchies. To address this limitation, we propose Hierarchical Mamba (HiM), integrating efficient Mamba2 with exponential growth and curved nature of hyperbolic geometry to learn hierarchy-aware language embeddings for deeper linguistic understanding. Mamba2-processed sequences are projected to the Poincare ball (via tangent-based mapping) or Lorentzian manifold (via cosine and sine-based mapping) with "learnable" curvature, optimized with a combined hyperbolic loss. Our HiM model facilitates the capture of relational distances across varying hierarchical levels, enabling effective long-range reasoning. This makes it well-suited for tasks like mixed-hop prediction and multi-hop inference in hierarchical classification. We evaluated our HiM with four linguistic and medical datasets for mixed-hop prediction and multi-hop inference tasks. Experimental results demonstrated that: 1) Both HiM models effectively capture hierarchical relationships for four ontological datasets, surpassing Euclidean baselines. 2) HiM-Poincare captures fine-grained semantic distinctions with higher h-norms, while HiM-Lorentz provides more stable, compact, and hierarchy-preserving embeddings favoring robustness over detail.

Title: AI4Math: A Native Spanish Benchmark for University-Level Mathematical Reasoning in Large Language Models

Authors: Miguel Angel Peñaloza Perez (1 and 2 and 5), Bruno Lopez Orozco (1 and 2 and 3), Jesus Tadeo Cruz Soto (1 and 2 and 4), Michelle Bruno Hernandez (1 and 2), Miguel Angel Alvarado Gonzalez (1 and 2), Sandra Malagon (1 and 2) ((1) Carreras con Impacto, (2) Aixo Lab, (3) Facultad de Ciencias UNAM Mexico, (4) Facultad de Matematicas Universidad Veracruzana Mexico, (5) Centro de Investigación Cientifica y de Educacion Superior de Ensenada Baja California Mexico)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18978
Pdf URL: https://arxiv.org/pdf/2505.18978
Copy Paste: [[2505.18978]] AI4Math: A Native Spanish Benchmark for University-Level Mathematical Reasoning in Large Language Models(https://arxiv.org/abs/2505.18978)
Keywords: large language model
Abstract: Existing mathematical reasoning benchmarks are predominantly English only or translation-based, which can introduce semantic drift and mask languagespecific reasoning errors. To address this, we present AI4Math, a benchmark of 105 original university level math problems natively authored in Spanish. The dataset spans seven advanced domains (Algebra, Calculus, Geometry, Probability, Number Theory, Combinatorics, and Logic), and each problem is accompanied by a step by step human solution. We evaluate six large language models GPT 4o, GPT 4o mini, o3 mini, LLaMA 3.3 70B, DeepSeek R1 685B, and DeepSeek V3 685B under four configurations: zero shot and chain of thought, each in Spanish and English. The top models (o3 mini, DeepSeek R1 685B, DeepSeek V3 685B) achieve over 70% accuracy, whereas LLaMA 3.3 70B and GPT-4o mini remain below 40%. Most models show no significant performance drop between languages, with GPT 4o even performing better on Spanish problems in the zero shot setting. Geometry, Combinatorics, and Probability questions remain persistently challenging for all models. These results highlight the need for native-language benchmarks and domain-specific evaluations to reveal reasoning failures not captured by standard metrics.

Title: GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization

Authors: Zixuan Chen, Hao Lin, Ke Xu, Xinghao Jiang, Tanfeng Sun
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18979
Pdf URL: https://arxiv.org/pdf/2505.18979
Copy Paste: [[2505.18979]] GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization(https://arxiv.org/abs/2505.18979)
Keywords: defense, attack, generative, large language model
Abstract: Text-to-image (T2I) generation models can inadvertently produce not-safe-for-work (NSFW) content, prompting the integration of text and image safety filters. Recent advances employ large language models (LLMs) for semantic-level detection, rendering traditional token-level perturbation attacks largely ineffective. However, our evaluation shows that existing jailbreak methods are ineffective against these modern filters. We introduce GhostPrompt, the first automated jailbreak framework that combines dynamic prompt optimization with multimodal feedback. It consists of two key components: (i) Dynamic Optimization, an iterative process that guides a large language model (LLM) using feedback from text safety filters and CLIP similarity scores to generate semantically aligned adversarial prompts; and (ii) Adaptive Safety Indicator Injection, which formulates the injection of benign visual cues as a reinforcement learning problem to bypass image-level filters. GhostPrompt achieves state-of-the-art performance, increasing the ShieldLM-7B bypass rate from 12.5\% (Sneakyprompt) to 99.0\%, improving CLIP score from 0.2637 to 0.2762, and reducing the time cost by $4.2 \times$. Moreover, it generalizes to unseen filters including GPT-4.1 and successfully jailbreaks DALLE 3 to generate NSFW images in our evaluation, revealing systemic vulnerabilities in current multimodal defenses. To support further research on AI safety and red-teaming, we will release code and adversarial prompts under a controlled-access protocol.

Title: FedSKC: Federated Learning with Non-IID Data via Structural Knowledge Collaboration

Authors: Huan Wang, Haoran Li, Huaming Chen, Jun Yan, Lijuan Wang, Jiahua Shi, Shiping Chen, Jun Shen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18981
Pdf URL: https://arxiv.org/pdf/2505.18981
Copy Paste: [[2505.18981]] FedSKC: Federated Learning with Non-IID Data via Structural Knowledge Collaboration(https://arxiv.org/abs/2505.18981)
Keywords: privacy, federate, fair
Abstract: With the advancement of edge computing, federated learning (FL) displays a bright promise as a privacy-preserving collaborative learning paradigm. However, one major challenge for FL is the data heterogeneity issue, which refers to the biased labeling preferences among multiple clients, negatively impacting convergence and model performance. Most previous FL methods attempt to tackle the data heterogeneity issue locally or globally, neglecting underlying class-wise structure information contained in each client. In this paper, we first study how data heterogeneity affects the divergence of the model and decompose it into local, global, and sampling drift sub-problems. To explore the potential of using intra-client class-wise structural knowledge in handling these drifts, we thus propose Federated Learning with Structural Knowledge Collaboration (FedSKC). The key idea of FedSKC is to extract and transfer domain preferences from inter-client data distributions, offering diverse class-relevant knowledge and a fair convergent signal. FedSKC comprises three components: i) local contrastive learning, to prevent weight divergence resulting from local training; ii) global discrepancy aggregation, which addresses the parameter deviation between the server and clients; iii) global period review, correcting for the sampling drift introduced by the server randomly selecting devices. We have theoretically analyzed FedSKC under non-convex objectives and empirically validated its superiority through extensive experimental results.

Title: AmorLIP: Efficient Language-Image Pretraining via Amortization

Authors: Haotian Sun, Yitong Li, Yuchen Zhuang, Niao He, Hanjun Dai, Bo Dai
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.18983
Pdf URL: https://arxiv.org/pdf/2505.18983
Copy Paste: [[2505.18983]] AmorLIP: Efficient Language-Image Pretraining via Amortization(https://arxiv.org/abs/2505.18983)
Keywords: robust
Abstract: Contrastive Language-Image Pretraining (CLIP) has demonstrated strong zero-shot performance across diverse downstream text-image tasks. Existing CLIP methods typically optimize a contrastive objective using negative samples drawn from each minibatch. To achieve robust representation learning, these methods require extremely large batch sizes and escalate computational demands to hundreds or even thousands of GPUs. Prior approaches to mitigate this issue often compromise downstream performance, prolong training duration, or face scalability challenges with very large datasets. To overcome these limitations, we propose AmorLIP, an efficient CLIP pretraining framework that amortizes expensive computations involved in contrastive learning through lightweight neural networks, which substantially improves training efficiency and performance. Leveraging insights from a spectral factorization of energy-based models, we introduce novel amortization objectives along with practical techniques to improve training stability. Extensive experiments across 38 downstream tasks demonstrate the superior zero-shot classification and retrieval capabilities of AmorLIP, consistently outperforming standard CLIP baselines with substantial relative improvements of up to 12.24%.

Title: STRICT: Stress Test of Rendering Images Containing Text

Authors: Tianyu Zhang, Xinyu Wang, Zhenghan Tai, Lu Li, Jijun Chi, Jingrui Tian, Hailin He, Suyuchen Wang
Subjects: cs.LG, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.18985
Pdf URL: https://arxiv.org/pdf/2505.18985
Copy Paste: [[2505.18985]] STRICT: Stress Test of Rendering Images Containing Text(https://arxiv.org/abs/2505.18985)
Keywords: diffusion, generative
Abstract: While diffusion models have revolutionized text-to-image generation with their ability to synthesize realistic and diverse scenes, they continue to struggle to generate consistent and legible text within images. This shortcoming is commonly attributed to the locality bias inherent in diffusion-based generation, which limits their ability to model long-range spatial dependencies. In this paper, we introduce $\textbf{STRICT}$, a benchmark designed to systematically stress-test the ability of diffusion models to render coherent and instruction-aligned text in images. Our benchmark evaluates models across multiple dimensions: (1) the maximum length of readable text that can be generated; (2) the correctness and legibility of the generated text, and (3) the ratio of not following instructions for generating text. We evaluate several state-of-the-art models, including proprietary and open-source variants, and reveal persistent limitations in long-range consistency and instruction-following capabilities. Our findings provide insights into architectural bottlenecks and motivate future research directions in multimodal generative modeling. We release our entire evaluation pipeline at this https URL.

Title: SPARS: Self-Play Adversarial Reinforcement Learning for Segmentation of Liver Tumours

Authors: Catalina Tan, Yipeng Hu, Shaheer U. Saeed
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18989
Pdf URL: https://arxiv.org/pdf/2505.18989
Copy Paste: [[2505.18989]] SPARS: Self-Play Adversarial Reinforcement Learning for Segmentation of Liver Tumours(https://arxiv.org/abs/2505.18989)
Keywords: segmentation
Abstract: Accurate tumour segmentation is vital for various targeted diagnostic and therapeutic procedures for cancer, e.g., planning biopsies or tumour ablations. Manual delineation is extremely labour-intensive, requiring substantial expert time. Fully-supervised machine learning models aim to automate such localisation tasks, but require a large number of costly and often subjective 3D voxel-level labels for training. The high-variance and subjectivity in such labels impacts model generalisability, even when large datasets are available. Histopathology labels may offer more objective labels but the infeasibility of acquiring pixel-level annotations to develop tumour localisation methods based on histology remains challenging in-vivo. In this work, we propose a novel weakly-supervised semantic segmentation framework called SPARS (Self-Play Adversarial Reinforcement Learning for Segmentation), which utilises an object presence classifier, trained on a small number of image-level binary cancer presence labels, to localise cancerous regions on CT scans. Such binary labels of patient-level cancer presence can be sourced more feasibly from biopsies and histopathology reports, enabling a more objective cancer localisation on medical images. Evaluating with real patient data, we observed that SPARS yielded a mean dice score of $77.3 \pm 9.4$, which outperformed other weakly-supervised methods by large margins. This performance was comparable with recent fully-supervised methods that require voxel-level annotations. Our results demonstrate the potential of using SPARS to reduce the need for extensive human-annotated labels to detect cancer in real-world healthcare settings.

Title: Kernel Space Diffusion Model for Efficient Remote Sensing Pansharpening

Authors: Hancong Jin, Zihan Cao, Liangjian Deng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18991
Pdf URL: https://arxiv.org/pdf/2505.18991
Copy Paste: [[2505.18991]] Kernel Space Diffusion Model for Efficient Remote Sensing Pansharpening(https://arxiv.org/abs/2505.18991)
Keywords: diffusion
Abstract: Pansharpening is a fundamental task in remote sensing that integrates high-resolution panchromatic imagery (PAN) with low-resolution multispectral imagery (LRMS) to produce an enhanced image with both high spatial and spectral resolution. Despite significant progress in deep learning-based approaches, existing methods often fail to capture the global priors inherent in remote sensing data distributions. Diffusion-based models have recently emerged as promising solutions due to their powerful distribution mapping capabilities; however, they suffer from significant inference latency, which limits their practical applicability. In this work, we propose the Kernel Space Diffusion Model (KSDiff), a novel approach that leverages diffusion processes in a latent space to generate convolutional kernels enriched with global contextual information, thereby improving pansharpening quality while enabling faster inference. Specifically, KSDiff constructs these kernels through the integration of a low-rank core tensor generator and a unified factor generator, orchestrated by a structure-aware multi-head attention mechanism. We further introduce a two-stage training strategy tailored for pansharpening, enabling KSDiff to serve as a framework for enhancing existing pansharpening architectures. Experiments on three widely used datasets, including WorldView-3, GaoFen-2, and QuickBird, demonstrate the superior performance of KSDiff both qualitatively and quantitatively. Code will be released upon possible acceptance.

Title: VPGS-SLAM: Voxel-based Progressive 3D Gaussian SLAM in Large-Scale Scenes

Authors: Tianchen Deng, Wenhua Wu, Junjie He, Yue Pan, Xirui Jiang, Shenghai Yuan, Danwei Wang, Hesheng Wang, Weidong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18992
Pdf URL: https://arxiv.org/pdf/2505.18992
Copy Paste: [[2505.18992]] VPGS-SLAM: Voxel-based Progressive 3D Gaussian SLAM in Large-Scale Scenes(https://arxiv.org/abs/2505.18992)
Keywords: robust
Abstract: 3D Gaussian Splatting has recently shown promising results in dense visual SLAM. However, existing 3DGS-based SLAM methods are all constrained to small-room scenarios and struggle with memory explosion in large-scale scenes and long sequences. To this end, we propose VPGS-SLAM, the first 3DGS-based large-scale RGBD SLAM framework for both indoor and outdoor scenarios. We design a novel voxel-based progressive 3D Gaussian mapping method with multiple submaps for compact and accurate scene representation in large-scale and long-sequence scenes. This allows us to scale up to arbitrary scenes and improves robustness (even under pose drifts). In addition, we propose a 2D-3D fusion camera tracking method to achieve robust and accurate camera tracking in both indoor and outdoor large-scale scenes. Furthermore, we design a 2D-3D Gaussian loop closure method to eliminate pose drift. We further propose a submap fusion method with online distillation to achieve global consistency in large-scale scenes when detecting a loop. Experiments on various indoor and outdoor datasets demonstrate the superiority and generalizability of the proposed framework. The code will be open source on this https URL.

Title: FiLLM -- A Filipino-optimized Large Language Model based on Southeast Asia Large Language Model (SEALLM)

Authors: Carlos Jude G. Maminta, Isaiah Job Enriquez, Deandre Nigel Nunez, Michael B. Dela Fuente (Institution College of Computer and Information Sciences, Polytechnic University of the Philippines, Sta. Mesa, Manila)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18995
Pdf URL: https://arxiv.org/pdf/2505.18995
Copy Paste: [[2505.18995]] FiLLM -- A Filipino-optimized Large Language Model based on Southeast Asia Large Language Model (SEALLM)(https://arxiv.org/abs/2505.18995)
Keywords: large language model
Abstract: This study presents FiLLM, a Filipino-optimized large language model, designed to enhance natural language processing (NLP) capabilities in the Filipino language. Built upon the SeaLLM-7B 2.5 model, FiLLM leverages Low-Rank Adaptation (LoRA) fine-tuning to optimize memory efficiency while maintaining task-specific performance. The model was trained and evaluated on diverse Filipino datasets to address key NLP tasks, including Named Entity Recognition (NER), Part-of-Speech (POS) tagging, Dependency Parsing, and Text Summarization. Performance comparisons with the CalamanCy model were conducted using F1 Score, Precision, Recall, Compression Rate, and Keyword Overlap metrics. Results indicate that Calamancy outperforms FILLM in several aspects, demonstrating its effectiveness in processing Filipino text with improved linguistic comprehension and adaptability. This research contributes to the advancement of Filipino NLP applications by providing an optimized, efficient, and scalable language model tailored for local linguistic needs.

Title: Automatic and Structure-Aware Sparsification of Hybrid Neural ODEs

Authors: Bob Junyi Zou, Lu Tian
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.18996
Pdf URL: https://arxiv.org/pdf/2505.18996
Copy Paste: [[2505.18996]] Automatic and Structure-Aware Sparsification of Hybrid Neural ODEs(https://arxiv.org/abs/2505.18996)
Keywords: robust
Abstract: Hybrid neural ordinary differential equations (neural ODEs) integrate mechanistic models with neural ODEs, offering strong inductive bias and flexibility, and are particularly advantageous in data-scarce healthcare settings. However, excessive latent states and interactions from mechanistic models can lead to training inefficiency and over-fitting, limiting practical effectiveness of hybrid neural ODEs. In response, we propose a new hybrid pipeline for automatic state selection and structure optimization in mechanistic neural ODEs, combining domain-informed graph modifications with data-driven regularization to sparsify the model for improving predictive performance and stability while retaining mechanistic plausibility. Experiments on synthetic and real-world data show improved predictive performance and robustness with desired sparsity, establishing an effective solution for hybrid model reduction in healthcare applications.

Title: VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Gudied Iterative Policy Optimization

Authors: Yunxin Li, Xinyu Chen, Zitao Li, Zhenyu Liu, Longyue Wang, Wenhan Luo, Baotian Hu, Min Zhang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.19000
Pdf URL: https://arxiv.org/pdf/2505.19000
Copy Paste: [[2505.19000]] VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Gudied Iterative Policy Optimization(https://arxiv.org/abs/2505.19000)
Keywords: large language model
Abstract: Applying Reinforcement Learning (RL) to Video Large Language Models (Video-LLMs) shows significant promise for complex video reasoning. However, popular Reinforcement Fine-Tuning (RFT) methods, such as outcome-based Group Relative Policy Optimization (GRPO), are limited by data preparation bottlenecks (e.g., noise or high cost) and exhibit unstable improvements in the quality of long chain-of-thoughts (CoTs) and downstream this http URL address these limitations, we propose VerIPO, a Verifier-guided Iterative Policy Optimization method designed to gradually improve video LLMs' capacity for generating deep, long-term reasoning chains. The core component is Rollout-Aware Verifier, positioned between the GRPO and Direct Preference Optimization (DPO) training phases to form the GRPO-Verifier-DPO training loop. This verifier leverages small LLMs as a judge to assess the reasoning logic of rollouts, enabling the construction of high-quality contrastive data, including reflective and contextually consistent CoTs. These curated preference samples drive the efficient DPO stage (7x faster than GRPO), leading to marked improvements in reasoning chain quality, especially in terms of length and contextual consistency. This training loop benefits from GRPO's expansive search and DPO's targeted optimization. Experimental results demonstrate: 1) Significantly faster and more effective optimization compared to standard GRPO variants, yielding superior performance; 2) Our trained models exceed the direct inference of large-scale instruction-tuned Video-LLMs, producing long and contextually consistent CoTs on diverse video reasoning tasks; and 3) Our model with one iteration outperforms powerful LMMs (e.g., Kimi-VL) and long reasoning models (e.g., Video-R1), highlighting its effectiveness and stability.

Title: Secure IVSHMEM: End-to-End Shared-Memory Protocol with Hypervisor-CA Handshake and In-Kernel Access Control

Authors: Hyunwoo Kim, Jaeseong Lee, Sunpyo Hong, Changmin Han
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.19004
Pdf URL: https://arxiv.org/pdf/2505.19004
Copy Paste: [[2505.19004]] Secure IVSHMEM: End-to-End Shared-Memory Protocol with Hypervisor-CA Handshake and In-Kernel Access Control(https://arxiv.org/abs/2505.19004)
Keywords: secure, security
Abstract: In-host shared memory (IVSHMEM) enables high-throughput, zero-copy communication between virtual machines, but today's implementations lack any security control, allowing any application to eavesdrop or tamper with the IVSHMEM region. This paper presents Secure IVSHMEM, a protocol that provides end-to-end mutual authentication and fine-grained access enforcement with negligible performance cost. We combine three techniques to ensure security: (1) channel separation and kernel module access control, (2)hypervisor-mediated handshake for end-to-end service authentication, and (3)application-level integration for abstraction and performance mitigation. In microbenchmarks, Secure IVSHMEM completes its one-time handshake in under 200ms and sustains data-plane round-trip latencies within 5\% of the unmodified baseline, with negligible bandwidth overhead. We believe this design is ideally suited for safety and latency-critical in-host domains, such as automotive systems, where both performance and security are paramount.

Title: A quantitative notion of economic security for smart contract compositions

Authors: Emily Priyadarshini, Massimo Bartoletti
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.19006
Pdf URL: https://arxiv.org/pdf/2505.19006
Copy Paste: [[2505.19006]] A quantitative notion of economic security for smart contract compositions(https://arxiv.org/abs/2505.19006)
Keywords: security, attack
Abstract: Decentralized applications are often composed of multiple interconnected smart contracts. This is especially evident in DeFi, where protocols are heavily intertwined and rely on a variety of basic building blocks such as tokens, decentralized exchanges and lending protocols. A crucial security challenge in this setting arises when adversaries target individual components to cause systemic economic losses. Existing security notions focus on determining the existence of these attacks, but fail to quantify the effect of manipulating individual components on the overall economic security of the system. In this paper, we introduce a quantitative security notion that measures how an attack on a single component can amplify economic losses of the overall system. We study the fundamental properties of this notion and apply it to assess the security of key compositions. In particular, we analyse under-collateralized loan attacks in systems made of lending protocols and decentralized exchanges.

Title: Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection

Authors: Md. Mithun Hossain, Md. Shakil Hossain, Sudipto Chaki, M. F. Mridha
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.19010
Pdf URL: https://arxiv.org/pdf/2505.19010
Copy Paste: [[2505.19010]] Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection(https://arxiv.org/abs/2505.19010)
Keywords: robust
Abstract: Multi-modal learning has become a critical research area because integrating text and image data can significantly improve performance in tasks such as classification, retrieval, and scene understanding. However, despite progress with pre-trained models, current approaches are limited by inadequate cross-modal interactions and static fusion strategies that do not fully exploit the complementary nature of different modalities. To address these shortcomings, we introduce a novel multi-modal Co-AttenDWG architecture that leverages dual-path encoding, co-attention with dimension-wise gating, and advanced expert fusion. Our approach begins by projecting text and image features into a common embedding space, where a dedicated co-attention mechanism enables simultaneous, fine-grained interactions between modalities. This mechanism is further enhanced by a dimension-wise gating network that adaptively regulates the feature contributions at the channel level, ensuring that only the most relevant information is emphasized. In parallel, dual-path encoders refine the representations by processing cross-modal information separately before an additional cross-attention layer further aligns modalities. The refined features are then aggregated via an expert fusion module that combines learned gating and self-attention to produce a robust, unified representation. We validate our approach on the MIMIC and SemEval Memotion 1.0, where experimental results demonstrate significant improvements in cross-modal alignment and state-of-the-art performance, underscoring the potential of our model for a wide range of multi-modal applications.

Title: Faithful Group Shapley Value

Authors: Kiljae Lee, Ziqi Liu, Weijing Tang, Yuan Zhang
Subjects: cs.LG, cs.AI, econ.GN, stat.ML
Abstract URL: https://arxiv.org/abs/2505.19013
Pdf URL: https://arxiv.org/pdf/2505.19013
Copy Paste: [[2505.19013]] Faithful Group Shapley Value(https://arxiv.org/abs/2505.19013)
Keywords: attack, fair
Abstract: Data Shapley is an important tool for data valuation, which quantifies the contribution of individual data points to machine learning models. In practice, group-level data valuation is desirable when data providers contribute data in batch. However, we identify that existing group-level extensions of Data Shapley are vulnerable to shell company attacks, where strategic group splitting can unfairly inflate valuations. We propose Faithful Group Shapley Value (FGSV) that uniquely defends against such attacks. Building on original mathematical insights, we develop a provably fast and accurate approximation algorithm for computing FGSV. Empirical experiments demonstrate that our algorithm significantly outperforms state-of-the-art methods in computational efficiency and approximation accuracy, while ensuring faithful group-level valuation.

Title: Tokenizing Electron Cloud in Protein-Ligand Interaction Learning

Authors: Haitao Lin, Odin Zhang, Jia Xu, Yunfan Liu, Zheng Cheng, Lirong Wu, Yufei Huang, Zhifeng Gao, Stan Z. Li
Subjects: cs.LG, physics.chem-ph, q-bio.QM
Abstract URL: https://arxiv.org/abs/2505.19014
Pdf URL: https://arxiv.org/pdf/2505.19014
Copy Paste: [[2505.19014]] Tokenizing Electron Cloud in Protein-Ligand Interaction Learning(https://arxiv.org/abs/2505.19014)
Keywords: transformer
Abstract: The affinity and specificity of protein-molecule binding directly impact functional outcomes, uncovering the mechanisms underlying biological regulation and signal transduction. Most deep-learning-based prediction approaches focus on structures of atoms or fragments. However, quantum chemical properties, such as electronic structures, are the key to unveiling interaction patterns but remain largely underexplored. To bridge this gap, we propose ECBind, a method for tokenizing electron cloud signals into quantized embeddings, enabling their integration into downstream tasks such as binding affinity prediction. By incorporating electron densities, ECBind helps uncover binding modes that cannot be fully represented by atom-level models. Specifically, to remove the redundancy inherent in electron cloud signals, a structure-aware transformer and hierarchical codebooks encode 3D binding sites enriched with electron structures into tokens. These tokenized codes are then used for specific tasks with labels. To extend its applicability to a wider range of scenarios, we utilize knowledge distillation to develop an electron-cloud-agnostic prediction model. Experimentally, ECBind demonstrates state-of-the-art performance across multiple tasks, achieving improvements of 6.42\% and 15.58\% in per-structure Pearson and Spearman correlation coefficients, respectively.

Title: Can Multimodal Large Language Models Understand Spatial Relations?

Authors: Jingping Liu, Ziyan Liu, Zhedong Cen, Yan Zhou, Yinan Zou, Weiyan Zhang, Haiyun Jiang, Tong Ruan
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2505.19015
Pdf URL: https://arxiv.org/pdf/2505.19015
Copy Paste: [[2505.19015]] Can Multimodal Large Language Models Understand Spatial Relations?(https://arxiv.org/abs/2505.19015)
Keywords: large language model
Abstract: Spatial relation reasoning is a crucial task for multimodal large language models (MLLMs) to understand the objective world. However, current benchmarks have issues like relying on bounding boxes, ignoring perspective substitutions, or allowing questions to be answered using only the model's prior knowledge without image understanding. To address these issues, we introduce SpatialMQA, a human-annotated spatial relation reasoning benchmark based on COCO2017, which enables MLLMs to focus more on understanding images in the objective world. To ensure data quality, we design a well-tailored annotation procedure, resulting in SpatialMQA consisting of 5,392 samples. Based on this benchmark, a series of closed- and open-source MLLMs are implemented and the results indicate that the current state-of-the-art MLLM achieves only 48.14% accuracy, far below the human-level accuracy of 98.40%. Extensive experimental analyses are also conducted, suggesting the future research directions. The benchmark and codes are available at this https URL.

Title: CrosGrpsABS: Cross-Attention over Syntactic and Semantic Graphs for Aspect-Based Sentiment Analysis in a Low-Resource Language

Authors: Md. Mithun Hossain, Md. Shakil Hossain, Sudipto Chaki, Md. Rajib Hossain, Md. Saifur Rahman, A. B. M. Shawkat Ali
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19018
Pdf URL: https://arxiv.org/pdf/2505.19018
Copy Paste: [[2505.19018]] CrosGrpsABS: Cross-Attention over Syntactic and Semantic Graphs for Aspect-Based Sentiment Analysis in a Low-Resource Language(https://arxiv.org/abs/2505.19018)
Keywords: transformer
Abstract: Aspect-Based Sentiment Analysis (ABSA) is a fundamental task in natural language processing, offering fine-grained insights into opinions expressed in text. While existing research has largely focused on resource-rich languages like English which leveraging large annotated datasets, pre-trained models, and language-specific tools. These resources are often unavailable for low-resource languages such as Bengali. The ABSA task in Bengali remains poorly explored and is further complicated by its unique linguistic characteristics and a lack of annotated data, pre-trained models, and optimized hyperparameters. To address these challenges, this research propose CrosGrpsABS, a novel hybrid framework that leverages bidirectional cross-attention between syntactic and semantic graphs to enhance aspect-level sentiment classification. The CrosGrpsABS combines transformerbased contextual embeddings with graph convolutional networks, built upon rule-based syntactic dependency parsing and semantic similarity computations. By employing bidirectional crossattention, the model effectively fuses local syntactic structure with global semantic context, resulting in improved sentiment classification performance across both low- and high-resource settings. We evaluate CrosGrpsABS on four low-resource Bengali ABSA datasets and the high-resource English SemEval 2014 Task 4 dataset. The CrosGrpsABS consistently outperforms existing approaches, achieving notable improvements, including a 0.93% F1-score increase for the Restaurant domain and a 1.06% gain for the Laptop domain in the SemEval 2014 Task 4 benchmark.

Title: Querying Kernel Methods Suffices for Reconstructing their Training Data

Authors: Daniel Barzilai, Yuval Margalit, Eitan Gronich, Gilad Yehudai, Meirav Galun, Ronen Basri
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19019
Pdf URL: https://arxiv.org/pdf/2505.19019
Copy Paste: [[2505.19019]] Querying Kernel Methods Suffices for Reconstructing their Training Data(https://arxiv.org/abs/2505.19019)
Keywords: privacy
Abstract: Over-parameterized models have raised concerns about their potential to memorize training data, even when achieving strong generalization. The privacy implications of such memorization are generally unclear, particularly in scenarios where only model outputs are accessible. We study this question in the context of kernel methods, and demonstrate both empirically and theoretically that querying kernel models at various points suffices to reconstruct their training data, even without access to model parameters. Our results hold for a range of kernel methods, including kernel regression, support vector machines, and kernel density estimation. Our hope is that this work can illuminate potential privacy concerns for such models.

Title: A Smart Healthcare System for Monkeypox Skin Lesion Detection and Tracking

Authors: Huda Alghoraibi, Nuha Alqurashi, Sarah Alotaibi, Renad Alkhudaydi, Bdoor Aldajani, Lubna Alqurashi, Jood Batweel, Maha A. Thafar
Subjects: cs.CV, cs.AI, cs.ET, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19023
Pdf URL: https://arxiv.org/pdf/2505.19023
Copy Paste: [[2505.19023]] A Smart Healthcare System for Monkeypox Skin Lesion Detection and Tracking(https://arxiv.org/abs/2505.19023)
Keywords: transformer
Abstract: Monkeypox is a viral disease characterized by distinctive skin lesions and has been reported in many countries. The recent global outbreak has emphasized the urgent need for scalable, accessible, and accurate diagnostic solutions to support public health responses. In this study, we developed ITMAINN, an intelligent, AI-driven healthcare system specifically designed to detect Monkeypox from skin lesion images using advanced deep learning techniques. Our system consists of three main components. First, we trained and evaluated several pretrained models using transfer learning on publicly available skin lesion datasets to identify the most effective models. For binary classification (Monkeypox vs. non-Monkeypox), the Vision Transformer, MobileViT, Transformer-in-Transformer, and VGG16 achieved the highest performance, each with an accuracy and F1-score of 97.8%. For multiclass classification, which contains images of patients with Monkeypox and five other classes (chickenpox, measles, hand-foot-mouth disease, cowpox, and healthy), ResNetViT and ViT Hybrid models achieved 92% accuracy, with F1 scores of 92.24% and 92.19%, respectively. The best-performing and most lightweight model, MobileViT, was deployed within the mobile application. The second component is a cross-platform smartphone application that enables users to detect Monkeypox through image analysis, track symptoms, and receive recommendations for nearby healthcare centers based on their location. The third component is a real-time monitoring dashboard designed for health authorities to support them in tracking cases, analyzing symptom trends, guiding public health interventions, and taking proactive measures. This system is fundamental in developing responsive healthcare infrastructure within smart cities. Our solution, ITMAINN, is part of revolutionizing public health management.

Title: InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts

Authors: Minzhi Lin, Tianchi Xie, Mengchen Liu, Yilin Ye, Changjian Chen, Shixia Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19028
Pdf URL: https://arxiv.org/pdf/2505.19028
Copy Paste: [[2505.19028]] InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts(https://arxiv.org/abs/2505.19028)
Keywords: large language model
Abstract: Understanding infographic charts with design-driven visual elements (e.g., pictograms, icons) requires both visual recognition and reasoning, posing challenges for multimodal large language models (MLLMs). However, existing visual-question answering benchmarks fall short in evaluating these capabilities of MLLMs due to the lack of paired plain charts and visual-element-based questions. To bridge this gap, we introduce InfoChartQA, a benchmark for evaluating MLLMs on infographic chart understanding. It includes 5,642 pairs of infographic and plain charts, each sharing the same underlying data but differing in visual presentations. We further design visual-element-based questions to capture their unique visual designs and communicative intent. Evaluation of 20 MLLMs reveals a substantial performance decline on infographic charts, particularly for visual-element-based questions related to metaphors. The paired infographic and plain charts enable fine-grained error analysis and ablation studies, which highlight new opportunities for advancing MLLMs in infographic chart understanding. We release InfoChartQA at this https URL.

Title: A Systematic Classification of Vulnerabilities in MoveEVM Smart Contracts (MWC)

Authors: Selçuk Topal
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.19047
Pdf URL: https://arxiv.org/pdf/2505.19047
Copy Paste: [[2505.19047]] A Systematic Classification of Vulnerabilities in MoveEVM Smart Contracts (MWC)(https://arxiv.org/abs/2505.19047)
Keywords: secure, security
Abstract: We introduce the MoveEVM Weakness Classification (MWC) system -- a dedicated vulnerability taxonomy for smart contracts built with Move and executed in EVM-compatible environments. While Move was originally designed to prevent common security flaws via linear resource types and strict ownership, its integration with EVM bytecode introduces novel hybrid vulnerabilities not captured by existing systems like the SWC registry. Our taxonomy spans 37 categorized vulnerability types (MWC-100 to MWC-136) across six semantic frames, addressing issues such as hybrid gas metering, capability misuse, meta-transaction spoofing, and AI-integrated logic. Through analysis of real-world contracts from Aptos and Sui, we demonstrate that current verification tools often miss these hybrid risks. We also explore how formal methods and LLM-based audit agents can operationalize this classification, enabling scalable, logic-aware smart contract auditing. MWC lays the foundation for more secure and verifiable contracts in next-generation blockchain systems. (Shortened Abstract)

Title: Efficient Data Selection at Scale via Influence Distillation

Authors: Mahdi Nikdan, Vincent Cohen-Addad, Dan Alistarh, Vahab Mirrokni
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19051
Pdf URL: https://arxiv.org/pdf/2505.19051
Copy Paste: [[2505.19051]] Efficient Data Selection at Scale via Influence Distillation(https://arxiv.org/abs/2505.19051)
Keywords: large language model
Abstract: Effective data selection is critical for efficient training of modern Large Language Models (LLMs). This paper introduces Influence Distillation, a novel, mathematically-justified framework for data selection that employs second-order information to optimally weight training samples. By distilling each sample's influence on a target distribution, our method assigns model-specific weights that are used to select training data for LLM fine-tuning, guiding it toward strong performance on the target domain. We derive these optimal weights for both Gradient Descent and Adam optimizers. To ensure scalability and reduce computational cost, we propose a $\textit{landmark-based approximation}$: influence is precisely computed for a small subset of "landmark" samples and then efficiently propagated to all other samples to determine their weights. We validate Influence Distillation by applying it to instruction tuning on the Tulu V2 dataset, targeting a range of tasks including GSM8k, SQuAD, and MMLU, across several models from the Llama and Qwen families. Experiments show that Influence Distillation matches or outperforms state-of-the-art performance while achieving up to $3.5\times$ faster selection.

Title: An Embarrassingly Simple Defense Against LLM Abliteration Attacks

Authors: Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, Bernard Ghanem, George Turkiyyah
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19056
Pdf URL: https://arxiv.org/pdf/2505.19056
Copy Paste: [[2505.19056]] An Embarrassingly Simple Defense Against LLM Abliteration Attacks(https://arxiv.org/abs/2505.19056)
Keywords: defense, attack, large language model
Abstract: Large language models (LLMs) are typically aligned to comply with safety guidelines by refusing harmful instructions. A recent attack, termed abliteration, isolates and suppresses the single latent direction most responsible for refusal behavior, enabling the model to generate unethical content. We propose a defense that modifies how models generate refusals. We construct an extended-refusal dataset that contains harmful prompts with a full response that justifies the reason for refusal. We then fine-tune Llama-2-7B-Chat and Qwen2.5-Instruct (1.5B and 3B parameters) on our extended-refusal dataset, and evaluate the resulting systems on a set of harmful prompts. In our experiments, extended-refusal models maintain high refusal rates, dropping at most by 10%, whereas baseline models' refusal rates drop by 70-80% after abliteration. A broad evaluation of safety and utility shows that extended-refusal fine-tuning neutralizes the abliteration attack while preserving general performance.

Title: Distributionally Robust Deep Q-Learning

Authors: Chung I Lu, Julian Sester, Aijia Zhang
Subjects: cs.LG, math.OC, q-fin.PM, stat.ML
Abstract URL: https://arxiv.org/abs/2505.19058
Pdf URL: https://arxiv.org/pdf/2505.19058
Copy Paste: [[2505.19058]] Distributionally Robust Deep Q-Learning(https://arxiv.org/abs/2505.19058)
Keywords: robust
Abstract: We propose a novel distributionally robust $Q$-learning algorithm for the non-tabular case accounting for continuous state spaces where the state transition of the underlying Markov decision process is subject to model uncertainty. The uncertainty is taken into account by considering the worst-case transition from a ball around a reference probability measure. To determine the optimal policy under the worst-case state transition, we solve the associated non-linear Bellman equation by dualising and regularising the Bellman operator with the Sinkhorn distance, which is then parameterized with deep neural networks. This approach allows us to modify the Deep Q-Network algorithm to optimise for the worst case state transition. We illustrate the tractability and effectiveness of our approach through several applications, including a portfolio optimisation task based on S\&{P}~500 data.

Title: UNCERTAINTY-LINE: Length-Invariant Estimation of Uncertainty for Large Language Models

Authors: Roman Vashurin, Maiya Goloburda, Preslav Nakov, Maxim Panov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19060
Pdf URL: https://arxiv.org/pdf/2505.19060
Copy Paste: [[2505.19060]] UNCERTAINTY-LINE: Length-Invariant Estimation of Uncertainty for Large Language Models(https://arxiv.org/abs/2505.19060)
Keywords: large language model
Abstract: Large Language Models (LLMs) have become indispensable tools across various applications, making it more important than ever to ensure the quality and the trustworthiness of their outputs. This has led to growing interest in uncertainty quantification (UQ) methods for assessing the reliability of LLM outputs. Many existing UQ techniques rely on token probabilities, which inadvertently introduces a bias with respect to the length of the output. While some methods attempt to account for this, we demonstrate that such biases persist even in length-normalized approaches. To address the problem, here we propose UNCERTAINTY-LINE: (Length-INvariant Estimation), a simple debiasing procedure that regresses uncertainty scores on output length and uses the residuals as corrected, length-invariant estimates. Our method is post-hoc, model-agnostic, and applicable to a range of UQ measures. Through extensive evaluation on machine translation, summarization, and question-answering tasks, we demonstrate that UNCERTAINTY-LINE: consistently improves over even nominally length-normalized UQ methods uncertainty estimates across multiple metrics and models.

Title: Training-free Stylized Text-to-Image Generation with Fast Inference

Authors: Xin Ma, Yaohui Wang, Xinyuan Chen, Tien-Tsin Wong, Cunjian Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19063
Pdf URL: https://arxiv.org/pdf/2505.19063
Copy Paste: [[2505.19063]] Training-free Stylized Text-to-Image Generation with Fast Inference(https://arxiv.org/abs/2505.19063)
Keywords: diffusion, generative
Abstract: Although diffusion models exhibit impressive generative capabilities, existing methods for stylized image generation based on these models often require textual inversion or fine-tuning with style images, which is time-consuming and limits the practical applicability of large-scale diffusion models. To address these challenges, we propose a novel stylized image generation method leveraging a pre-trained large-scale diffusion model without requiring fine-tuning or any additional optimization, termed as OmniPainter. Specifically, we exploit the self-consistency property of latent consistency models to extract the representative style statistics from reference style images to guide the stylization process. Additionally, we then introduce the norm mixture of self-attention, which enables the model to query the most relevant style patterns from these statistics for the intermediate output content features. This mechanism also ensures that the stylized results align closely with the distribution of the reference style images. Our qualitative and quantitative experimental results demonstrate that the proposed method outperforms state-of-the-art approaches.

Title: Towards Harmonized Uncertainty Estimation for Large Language Models

Authors: Rui Li, Jing Long, Muge Qi, Heming Xia, Lei Sha, Peiyi Wang, Zhifang Sui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19073
Pdf URL: https://arxiv.org/pdf/2505.19073
Copy Paste: [[2505.19073]] Towards Harmonized Uncertainty Estimation for Large Language Models(https://arxiv.org/abs/2505.19073)
Keywords: robust, large language model
Abstract: To facilitate robust and trustworthy deployment of large language models (LLMs), it is essential to quantify the reliability of their generations through uncertainty estimation. While recent efforts have made significant advancements by leveraging the internal logic and linguistic features of LLMs to estimate uncertainty scores, our empirical analysis highlights the pitfalls of these methods to strike a harmonized estimation between indication, balance, and calibration, which hinders their broader capability for accurate uncertainty estimation. To address this challenge, we propose CUE (Corrector for Uncertainty Estimation): A straightforward yet effective method that employs a lightweight model trained on data aligned with the target LLM's performance to adjust uncertainty scores. Comprehensive experiments across diverse models and tasks demonstrate its effectiveness, which achieves consistent improvements of up to 60% over existing methods.

Title: ChartSketcher: Reasoning with Multimodal Feedback and Reflection for Chart Understanding

Authors: Muye Huang, Lingling Zhang, Jie Ma, Han Lai, Fangzhi Xu, Yifei Li, Wenjun Wu, Yaqiang Wu, Jun Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19076
Pdf URL: https://arxiv.org/pdf/2505.19076
Copy Paste: [[2505.19076]] ChartSketcher: Reasoning with Multimodal Feedback and Reflection for Chart Understanding(https://arxiv.org/abs/2505.19076)
Keywords: extraction, large language model
Abstract: Charts are high-density visualization carriers for complex data, serving as a crucial medium for information extraction and analysis. Automated chart understanding poses significant challenges to existing multimodal large language models (MLLMs) due to the need for precise and complex visual reasoning. Current step-by-step reasoning models primarily focus on text-based logical reasoning for chart understanding. However, they struggle to refine or correct their reasoning when errors stem from flawed visual understanding, as they lack the ability to leverage multimodal interaction for deeper comprehension. Inspired by human cognitive behavior, we propose ChartSketcher, a multimodal feedback-driven step-by-step reasoning method designed to address these limitations. ChartSketcher is a chart understanding model that employs Sketch-CoT, enabling MLLMs to annotate intermediate reasoning steps directly onto charts using a programmatic sketching library, iteratively feeding these visual annotations back into the reasoning process. This mechanism enables the model to visually ground its reasoning and refine its understanding over multiple steps. We employ a two-stage training strategy: a cold start phase to learn sketch-based reasoning patterns, followed by off-policy reinforcement learning to enhance reflection and generalization. Experiments demonstrate that ChartSketcher achieves promising performance on chart understanding benchmarks and general vision tasks, providing an interactive and interpretable approach to chart comprehension.

Title: Towards Generalized Proactive Defense against Face Swappingwith Contour-Hybrid Watermark

Authors: Ruiyang Xia, Dawei Zhou, Decheng Liu, Lin Yuan, Jie Li, Nannan Wang, Xinbo Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19081
Pdf URL: https://arxiv.org/pdf/2505.19081
Copy Paste: [[2505.19081]] Towards Generalized Proactive Defense against Face Swappingwith Contour-Hybrid Watermark(https://arxiv.org/abs/2505.19081)
Keywords: security, privacy, defense, robust, watermark
Abstract: Face swapping, recognized as a privacy and security concern, has prompted considerable defensive research. With the advancements in AI-generated content, the discrepancies between the real and swapped faces have become nuanced. Considering the difficulty of forged traces detection, we shift the focus to the face swapping purpose and proactively embed elaborate watermarks against unknown face swapping techniques. Given that the constant purpose is to swap the original face identity while preserving the background, we concentrate on the regions surrounding the face to ensure robust watermark generation, while embedding the contour texture and face identity information to achieve progressive image determination. The watermark is located in the facial contour and contains hybrid messages, dubbed the contour-hybrid watermark (CMark). Our approach generalizes face swapping detection without requiring any swapping techniques during training and the storage of large-scale messages in advance. Experiments conducted across 8 face swapping techniques demonstrate the superiority of our approach compared with state-of-the-art passive and proactive detectors while achieving a favorable balance between the image quality and watermark robustness.

Title: Jodi: Unification of Visual Generation and Understanding via Joint Modeling

Authors: Yifeng Xu, Zhenliang He, Meina Kan, Shiguang Shan, Xilin Chen
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19084
Pdf URL: https://arxiv.org/pdf/2505.19084
Copy Paste: [[2505.19084]] Jodi: Unification of Visual Generation and Understanding via Joint Modeling(https://arxiv.org/abs/2505.19084)
Keywords: diffusion, transformer
Abstract: Visual generation and understanding are two deeply interconnected aspects of human intelligence, yet they have been traditionally treated as separate tasks in machine learning. In this paper, we propose Jodi, a diffusion framework that unifies visual generation and understanding by jointly modeling the image domain and multiple label domains. Specifically, Jodi is built upon a linear diffusion transformer along with a role switch mechanism, which enables it to perform three particular types of tasks: (1) joint generation, where the model simultaneously generates images and multiple labels; (2) controllable generation, where images are generated conditioned on any combination of labels; and (3) image perception, where multiple labels can be predicted at once from a given image. Furthermore, we present the Joint-1.6M dataset, which contains 200,000 high-quality images collected from public sources, automatic labels for 7 visual domains, and LLM-generated captions. Extensive experiments demonstrate that Jodi excels in both generation and understanding tasks and exhibits strong extensibility to a wider range of visual domains. Code is available at this https URL.

Title: Plug-and-Play Context Feature Reuse for Efficient Masked Generation

Authors: Xuejie Liu, Anji Liu, Guy Van den Broeck, Yitao Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19089
Pdf URL: https://arxiv.org/pdf/2505.19089
Copy Paste: [[2505.19089]] Plug-and-Play Context Feature Reuse for Efficient Masked Generation(https://arxiv.org/abs/2505.19089)
Keywords: generative
Abstract: Masked generative models (MGMs) have emerged as a powerful framework for image synthesis, combining parallel decoding with strong bidirectional context modeling. However, generating high-quality samples typically requires many iterative decoding steps, resulting in high inference costs. A straightforward way to speed up generation is by decoding more tokens in each step, thereby reducing the total number of steps. However, when many tokens are decoded simultaneously, the model can only estimate the univariate marginal distributions independently, failing to capture the dependency among them. As a result, reducing the number of steps significantly compromises generation fidelity. In this work, we introduce ReCAP (Reused Context-Aware Prediction), a plug-and-play module that accelerates inference in MGMs by constructing low-cost steps via reusing feature embeddings from previously decoded context tokens. ReCAP interleaves standard full evaluations with lightweight steps that cache and reuse context features, substantially reducing computation while preserving the benefits of fine-grained, iterative generation. We demonstrate its effectiveness on top of three representative MGMs (MaskGIT, MAGE, and MAR), including both discrete and continuous token spaces and covering diverse architectural designs. In particular, on ImageNet256 class-conditional generation, ReCAP achieves up to 2.4x faster inference than the base model with minimal performance drop, and consistently delivers better efficiency-fidelity trade-offs under various generation settings.

Title: CMoS: Rethinking Time Series Prediction Through the Lens of Chunk-wise Spatial Correlations

Authors: Haotian Si, Changhua Pei, Jianhui Li, Dan Pei, Gaogang Xie
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.19090
Pdf URL: https://arxiv.org/pdf/2505.19090
Copy Paste: [[2505.19090]] CMoS: Rethinking Time Series Prediction Through the Lens of Chunk-wise Spatial Correlations(https://arxiv.org/abs/2505.19090)
Keywords: interpretability
Abstract: Recent advances in lightweight time series forecasting models suggest the inherent simplicity of time series forecasting tasks. In this paper, we present CMoS, a super-lightweight time series forecasting model. Instead of learning the embedding of the shapes, CMoS directly models the spatial correlations between different time series chunks. Additionally, we introduce a Correlation Mixing technique that enables the model to capture diverse spatial correlations with minimal parameters, and an optional Periodicity Injection technique to ensure faster convergence. Despite utilizing as low as 1% of the lightweight model DLinear's parameters count, experimental results demonstrate that CMoS outperforms existing state-of-the-art models across multiple datasets. Furthermore, the learned weights of CMoS exhibit great interpretability, providing practitioners with valuable insights into temporal structures within specific application scenarios.

Title: Towards Robust Influence Functions with Flat Validation Minima

Authors: Xichen Ye, Yifan Wu, Weizhong Zhang, Cheng Jin, Yifan Chen
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.19097
Pdf URL: https://arxiv.org/pdf/2505.19097
Copy Paste: [[2505.19097]] Towards Robust Influence Functions with Flat Validation Minima(https://arxiv.org/abs/2505.19097)
Keywords: robust
Abstract: The Influence Function (IF) is a widely used technique for assessing the impact of individual training samples on model predictions. However, existing IF methods often fail to provide reliable influence estimates in deep neural networks, particularly when applied to noisy training data. This issue does not stem from inaccuracies in parameter change estimation, which has been the primary focus of prior research, but rather from deficiencies in loss change estimation, specifically due to the sharpness of validation risk. In this work, we establish a theoretical connection between influence estimation error, validation set risk, and its sharpness, underscoring the importance of flat validation minima for accurate influence estimation. Furthermore, we introduce a novel estimation form of Influence Function specifically designed for flat validation minima. Experimental results across various tasks validate the superiority of our approach.

Title: ASPO: Adaptive Sentence-Level Preference Optimization for Fine-Grained Multimodal Reasoning

Authors: Yeyuan Wang, Dehong Gao, Rujiao Long, Lei Yi, Linbo Jin, Libin Yang, Xiaoyan Cai
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.19100
Pdf URL: https://arxiv.org/pdf/2505.19100
Copy Paste: [[2505.19100]] ASPO: Adaptive Sentence-Level Preference Optimization for Fine-Grained Multimodal Reasoning(https://arxiv.org/abs/2505.19100)
Keywords: large language model
Abstract: Direct Preference Optimization (DPO) has gained significant attention for its simplicity and computational efficiency in aligning large language models (LLMs). Recent advancements have extended DPO to multimodal scenarios, achieving strong performance. However, traditional DPO relies on binary preference optimization, rewarding or penalizing entire responses without considering fine-grained segment correctness, leading to suboptimal solutions. The root of this issue lies in the absence of fine-grained supervision during the optimization process. To address this, we propose Adaptive Sentence-level Preference Optimization (ASPO), which evaluates individual sentences for more precise preference optimization. By dynamically calculating adaptive rewards at the sentence level based on model predictions, ASPO enhances response content assessment without additional models or parameters. This significantly improves the alignment of multimodal features. Extensive experiments show that ASPO substantially enhances the overall performance of multimodal models.

Title: Optimization-Inspired Few-Shot Adaptation for Large Language Models

Authors: Boyan Gao, Xin Wang, Yibo Yang, David Clifton
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19107
Pdf URL: https://arxiv.org/pdf/2505.19107
Copy Paste: [[2505.19107]] Optimization-Inspired Few-Shot Adaptation for Large Language Models(https://arxiv.org/abs/2505.19107)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance in real-world applications. However, adapting LLMs to novel tasks via fine-tuning often requires substantial training data and computational resources that are impractical in few-shot scenarios. Existing approaches, such as in-context learning and Parameter-Efficient Fine-Tuning (PEFT), face key limitations: in-context learning introduces additional inference computational overhead with limited performance gains, while PEFT models are prone to overfitting on the few demonstration examples. In this work, we reinterpret the forward pass of LLMs as an optimization process, a sequence of preconditioned gradient descent steps refining internal representations. Based on this connection, we propose Optimization-Inspired Few-Shot Adaptation (OFA), integrating a parameterization that learns preconditioners without introducing additional trainable parameters, and an objective that improves optimization efficiency by learning preconditioners based on a convergence bound, while simultaneously steering the optimization path toward the flat local minimum. Our method overcomes both issues of ICL-based and PEFT-based methods, and demonstrates superior performance over the existing methods on a variety of few-shot adaptation tasks in experiments.

Title: CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection in Large Language Models

Authors: Yongheng Zhang, Xu Liu, Ruoxi Zhou, Qiguang Chen, Hao Fei, Wenpeng Lu, Libo Qin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19108
Pdf URL: https://arxiv.org/pdf/2505.19108
Copy Paste: [[2505.19108]] CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection in Large Language Models(https://arxiv.org/abs/2505.19108)
Keywords: large language model
Abstract: Investigating hallucination issues in large language models (LLMs) within cross-lingual and cross-modal scenarios can greatly advance the large-scale deployment in real-world applications. Nevertheless, the current studies are limited to a single scenario, either cross-lingual or cross-modal, leaving a gap in the exploration of hallucinations in the joint cross-lingual and cross-modal scenarios. Motivated by this, we introduce a novel joint Cross-lingual and Cross-modal Hallucinations benchmark (CCHall) to fill this gap. Specifically, CCHall simultaneously incorporates both cross-lingual and cross-modal hallucination scenarios, which can be used to assess the cross-lingual and cross-modal capabilities of LLMs. Furthermore, we conduct a comprehensive evaluation on CCHall, exploring both mainstream open-source and closed-source LLMs. The experimental results highlight that current LLMs still struggle with CCHall. We hope CCHall can serve as a valuable resource to assess LLMs in joint cross-lingual and cross-modal scenarios.

Title: An Interpretable Representation Learning Approach for Diffusion Tensor Imaging

Authors: Vishwa Mohan Singh, Alberto Gaston Villagran Asiares, Luisa Sophie Schuhmacher, Kate Rendall, Simon Weißbrod, David Rügamer, Inga Körte
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19110
Pdf URL: https://arxiv.org/pdf/2505.19110
Copy Paste: [[2505.19110]] An Interpretable Representation Learning Approach for Diffusion Tensor Imaging(https://arxiv.org/abs/2505.19110)
Keywords: diffusion
Abstract: Diffusion Tensor Imaging (DTI) tractography offers detailed insights into the structural connectivity of the brain, but presents challenges in effective representation and interpretation in deep learning models. In this work, we propose a novel 2D representation of DTI tractography that encodes tract-level fractional anisotropy (FA) values into a 9x9 grayscale image. This representation is processed through a Beta-Total Correlation Variational Autoencoder with a Spatial Broadcast Decoder to learn a disentangled and interpretable latent embedding. We evaluate the quality of this embedding using supervised and unsupervised representation learning strategies, including auxiliary classification, triplet loss, and SimCLR-based contrastive learning. Compared to the 1D Group deep neural network (DNN) baselines, our approach improves the F1 score in a downstream sex classification task by 15.74% and shows a better disentanglement than the 3D representation.

Title: Self-Critique Guided Iterative Reasoning for Multi-hop Question Answering

Authors: Zheng Chu, Huiming Fan, Jingchang Chen, Qianyu Wang, Mingda Yang, Jiafeng Liang, Zhongjie Wang, Hao Li, Guo Tang, Ming Liu, Bing Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19112
Pdf URL: https://arxiv.org/pdf/2505.19112
Copy Paste: [[2505.19112]] Self-Critique Guided Iterative Reasoning for Multi-hop Question Answering(https://arxiv.org/abs/2505.19112)
Keywords: large language model
Abstract: Although large language models (LLMs) have demonstrated remarkable reasoning capabilities, they still face challenges in knowledge-intensive multi-hop reasoning. Recent work explores iterative retrieval to address complex problems. However, the lack of intermediate guidance often results in inaccurate retrieval and flawed intermediate reasoning, leading to incorrect reasoning. To address these, we propose Self-Critique Guided Iterative Reasoning (SiGIR), which uses self-critique feedback to guide the iterative reasoning process. Specifically, through end-to-end training, we enable the model to iteratively address complex problems via question decomposition. Additionally, the model is able to self-evaluate its intermediate reasoning steps. During iterative reasoning, the model engages in branching exploration and employs self-evaluation to guide the selection of promising reasoning trajectories. Extensive experiments on three multi-hop reasoning datasets demonstrate the effectiveness of our proposed method, surpassing the previous SOTA by $8.6\%$. Furthermore, our thorough analysis offers insights for future research. Our code, data, and models are available at Github: this https URL.

Title: CreatiDesign: A Unified Multi-Conditional Diffusion Transformer for Creative Graphic Design

Authors: Hui Zhang, Dexiang Hong, Maoke Yang, Yutao Chen, Zhao Zhang, Jie Shao, Xinglong Wu, Zuxuan Wu, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19114
Pdf URL: https://arxiv.org/pdf/2505.19114
Copy Paste: [[2505.19114]] CreatiDesign: A Unified Multi-Conditional Diffusion Transformer for Creative Graphic Design(https://arxiv.org/abs/2505.19114)
Keywords: diffusion, transformer
Abstract: Graphic design plays a vital role in visual communication across advertising, marketing, and multimedia entertainment. Prior work has explored automated graphic design generation using diffusion models, aiming to streamline creative workflows and democratize design capabilities. However, complex graphic design scenarios require accurately adhering to design intent specified by multiple heterogeneous user-provided elements (\eg images, layouts, and texts), which pose multi-condition control challenges for existing methods. Specifically, previous single-condition control models demonstrate effectiveness only within their specialized domains but fail to generalize to other conditions, while existing multi-condition methods often lack fine-grained control over each sub-condition and compromise overall compositional harmony. To address these limitations, we introduce CreatiDesign, a systematic solution for automated graphic design covering both model architecture and dataset construction. First, we design a unified multi-condition driven architecture that enables flexible and precise integration of heterogeneous design elements with minimal architectural modifications to the base diffusion model. Furthermore, to ensure that each condition precisely controls its designated image region and to avoid interference between conditions, we propose a multimodal attention mask mechanism. Additionally, we develop a fully automated pipeline for constructing graphic design datasets, and introduce a new dataset with 400K samples featuring multi-condition annotations, along with a comprehensive benchmark. Experimental results show that CreatiDesign outperforms existing models by a clear margin in faithfully adhering to user intent.

Title: FP4 All the Way: Fully Quantized Training of LLMs

Authors: Brian Chmiel, Maxim Fishman, Ron Banner, Daniel Soudry
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19115
Pdf URL: https://arxiv.org/pdf/2505.19115
Copy Paste: [[2505.19115]] FP4 All the Way: Fully Quantized Training of LLMs(https://arxiv.org/abs/2505.19115)
Keywords: large language model
Abstract: We demonstrate, for the first time, fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision for weights, activations, and gradients on datasets up to 200 billion tokens. We extensively investigate key design choices for FP4, including block sizes, scaling formats, and rounding methods. Our analysis shows that the NVFP4 format, where each block of 16 FP4 values (E2M1) shares a scale represented in E4M3, provides optimal results. We use stochastic rounding for backward and update passes and round-to-nearest for the forward pass to enhance stability. Additionally, we identify a theoretical and empirical threshold for effective quantized training: when the gradient norm falls below approximately $\sqrt{3}$ times the quantization noise, quantized training becomes less effective. Leveraging these insights, we successfully train a 7-billion-parameter model on 256 Intel Gaudi2 accelerators. The resulting FP4-trained model achieves downstream task performance comparable to a standard BF16 baseline, confirming that FP4 training is a practical and highly efficient approach for large-scale LLM training. A reference implementation is supplied in this https URL .

Title: Controlling Language Confusion in Multilingual LLMs

Authors: Nahyun Lee, Yeongseo Woo, Hyunwoo Ko, Guijin Son
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19116
Pdf URL: https://arxiv.org/pdf/2505.19116
Copy Paste: [[2505.19116]] Controlling Language Confusion in Multilingual LLMs(https://arxiv.org/abs/2505.19116)
Keywords: large language model
Abstract: Large language models often suffer from language confusion, a phenomenon where responses are partially or entirely generated in unintended languages. This can critically impact user experience in low-resource settings. We hypothesize that conventional supervised fine-tuning exacerbates this issue because the softmax objective focuses probability mass only on the single correct token but does not explicitly penalize cross-lingual mixing. Interestingly, by observing loss trajectories during the pretraining phase, we observe that models fail to learn to distinguish between monolingual and language-confused text. Additionally, we find that ORPO, which adds penalties for unwanted output styles to standard SFT, effectively suppresses language-confused generations even at high decoding temperatures without degrading overall model performance. Our findings suggest that incorporating appropriate penalty terms can mitigate language confusion in low-resource settings with limited data.

Title: Freqformer: Image-Demoiréing Transformer via Efficient Frequency Decomposition

Authors: Xiaoyang Liu, Bolin Qiu, Jiezhang Cao, Zheng Chen, Yulun Zhang, Xiaokang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19120
Pdf URL: https://arxiv.org/pdf/2505.19120
Copy Paste: [[2505.19120]] Freqformer: Image-Demoiréing Transformer via Efficient Frequency Decomposition(https://arxiv.org/abs/2505.19120)
Keywords: robust, transformer
Abstract: Image demoiréing remains a challenging task due to the complex interplay between texture corruption and color distortions caused by moiré patterns. Existing methods, especially those relying on direct image-to-image restoration, often fail to disentangle these intertwined artifacts effectively. While wavelet-based frequency-aware approaches offer a promising direction, their potential remains underexplored. In this paper, we present Freqformer, a Transformer-based framework specifically designed for image demoiréing through targeted frequency separation. Our method performs an effective frequency decomposition that explicitly splits moiré patterns into high-frequency spatially-localized textures and low-frequency scale-robust color distortions, which are then handled by a dual-branch architecture tailored to their distinct characteristics. We further propose a learnable Frequency Composition Transform (FCT) module to adaptively fuse the frequency-specific outputs, enabling consistent and high-fidelity reconstruction. To better aggregate the spatial dependencies and the inter-channel complementary information, we introduce a Spatial-Aware Channel Attention (SA-CA) module that refines moiré-sensitive regions without incurring high computational cost. Extensive experiments on various demoiréing benchmarks demonstrate that Freqformer achieves state-of-the-art performance with a compact model size. The code is publicly available at this https URL.

Title: Delving into Multilingual Ethical Bias: The MSQAD with Statistical Hypothesis Tests for Large Language Models

Authors: Seunguk Yu, Juhwan Choi, Youngbin Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19121
Pdf URL: https://arxiv.org/pdf/2505.19121
Copy Paste: [[2505.19121]] Delving into Multilingual Ethical Bias: The MSQAD with Statistical Hypothesis Tests for Large Language Models(https://arxiv.org/abs/2505.19121)
Keywords: large language model
Abstract: Despite the recent strides in large language models, studies have underscored the existence of social biases within these systems. In this paper, we delve into the validation and comparison of the ethical biases of LLMs concerning globally discussed and potentially sensitive topics, hypothesizing that these biases may arise from language-specific distinctions. Introducing the Multilingual Sensitive Questions & Answers Dataset (MSQAD), we collected news articles from Human Rights Watch covering 17 topics, and generated socially sensitive questions along with corresponding responses in multiple languages. We scrutinized the biases of these responses across languages and topics, employing two statistical hypothesis tests. The results showed that the null hypotheses were rejected in most cases, indicating biases arising from cross-language differences. It demonstrates that ethical biases in responses are widespread across various languages, and notably, these biases were prevalent even among different LLMs. By making the proposed MSQAD openly available, we aim to facilitate future research endeavors focused on examining cross-language biases in LLMs and their variant models.

Title: Exploring Magnitude Preservation and Rotation Modulation in Diffusion Transformers

Authors: Eric Tillman Bill, Cristian Perez Jensen, Sotiris Anagnostidis, Dimitri von Rütte
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19122
Pdf URL: https://arxiv.org/pdf/2505.19122
Copy Paste: [[2505.19122]] Exploring Magnitude Preservation and Rotation Modulation in Diffusion Transformers(https://arxiv.org/abs/2505.19122)
Keywords: diffusion, transformer, generative
Abstract: Denoising diffusion models exhibit remarkable generative capabilities, but remain challenging to train due to their inherent stochasticity, where high-variance gradient estimates lead to slow convergence. Previous works have shown that magnitude preservation helps with stabilizing training in the U-net architecture. This work explores whether this effect extends to the Diffusion Transformer (DiT) architecture. As such, we propose a magnitude-preserving design that stabilizes training without normalization layers. Motivated by the goal of maintaining activation magnitudes, we additionally introduce rotation modulation, which is a novel conditioning method using learned rotations instead of traditional scaling or shifting. Through empirical evaluations and ablation studies on small-scale models, we show that magnitude-preserving strategies significantly improve performance, notably reducing FID scores by $\sim$12.8%. Further, we show that rotation modulation combined with scaling is competitive with AdaLN, while requiring $\sim$5.4% fewer parameters. This work provides insights into conditioning strategies and magnitude control. We will publicly release the implementation of our method.

Title: MMATH: A Multilingual Benchmark for Mathematical Reasoning

Authors: Wenyang Luo, Wayne Xin Zhao, Jing Sha, Shijin Wang, Ji-Rong Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19126
Pdf URL: https://arxiv.org/pdf/2505.19126
Copy Paste: [[2505.19126]] MMATH: A Multilingual Benchmark for Mathematical Reasoning(https://arxiv.org/abs/2505.19126)
Keywords: large language model
Abstract: The advent of large reasoning models, such as OpenAI o1 and DeepSeek R1, has significantly advanced complex reasoning tasks. However, their capabilities in multilingual complex reasoning remain underexplored, with existing efforts largely focused on simpler tasks like MGSM. To address this gap, we introduce MMATH, a benchmark for multilingual complex reasoning spanning 374 high-quality math problems across 10 typologically diverse languages. Using MMATH, we observe that even advanced models like DeepSeek R1 exhibit substantial performance disparities across languages and suffer from a critical off-target issue-generating responses in unintended languages. To address this, we explore strategies including prompting and training, demonstrating that reasoning in English and answering in target languages can simultaneously enhance performance and preserve target-language consistency. Our findings offer new insights and practical strategies for advancing the multilingual reasoning capabilities of large language models. Our code and data could be found at this https URL.

Title: RetrieveAll: A Multilingual Named Entity Recognition Framework with Large Language Models

Authors: Jin Zhang, Fan Gao, Linyu Li, Yongbin Yu, Xiangxiang Wang, Nyima Tashi, Gadeng Luosang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19128
Pdf URL: https://arxiv.org/pdf/2505.19128
Copy Paste: [[2505.19128]] RetrieveAll: A Multilingual Named Entity Recognition Framework with Large Language Models(https://arxiv.org/abs/2505.19128)
Keywords: large language model
Abstract: The rise of large language models has led to significant performance breakthroughs in named entity recognition (NER) for high-resource languages, yet there remains substantial room for improvement in low- and medium-resource languages. Existing multilingual NER methods face severe language interference during the multi-language adaptation process, manifested in feature conflicts between different languages and the competitive suppression of low-resource language features by high-resource languages. Although training a dedicated model for each language can mitigate such interference, it lacks scalability and incurs excessive computational costs in real-world applications. To address this issue, we propose RetrieveAll, a universal multilingual NER framework based on dynamic LoRA. The framework decouples task-specific features across languages and demonstrates efficient dynamic adaptability. Furthermore, we introduce a cross-granularity knowledge augmented method that fully exploits the intrinsic potential of the data without relying on external resources. By leveraging a hierarchical prompting mechanism to guide knowledge injection, this approach advances the paradigm from "prompt-guided inference" to "prompt-driven learning." Experimental results show that RetrieveAll outperforms existing baselines; on the PAN-X dataset, it achieves an average F1 improvement of 12.1 percent.

Title: Fast and Accurate Power Load Data Completion via Regularization-optimized Low-Rank Factorization

Authors: Yan Xia, Hao Feng, Hongwei Sun, Junjie Wang, Qicong Hu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19133
Pdf URL: https://arxiv.org/pdf/2505.19133
Copy Paste: [[2505.19133]] Fast and Accurate Power Load Data Completion via Regularization-optimized Low-Rank Factorization(https://arxiv.org/abs/2505.19133)
Keywords: interpretability
Abstract: Low-rank representation learning has emerged as a powerful tool for recovering missing values in power load data due to its ability to exploit the inherent low-dimensional structures of spatiotemporal measurements. Among various techniques, low-rank factorization models are favoured for their efficiency and interpretability. However, their performance is highly sensitive to the choice of regularization parameters, which are typically fixed or manually tuned, resulting in limited generalization capability or slow convergence in practical scenarios. In this paper, we propose a Regularization-optimized Low-Rank Factorization, which introduces a Proportional-Integral-Derivative controller to adaptively adjust the regularization coefficient. Furthermore, we provide a detailed algorithmic complexity analysis, showing that our method preserves the computational efficiency of stochastic gradient descent while improving adaptivity. Experimental results on real-world power load datasets validate the superiority of our method in both imputation accuracy and training efficiency compared to existing baselines.

Title: Veta-GS: View-dependent deformable 3D Gaussian Splatting for thermal infrared Novel-view Synthesis

Authors: Myeongseok Nam, Wongi Park, Minsol Kim, Hyejin Hur, Soomok Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19138
Pdf URL: https://arxiv.org/pdf/2505.19138
Copy Paste: [[2505.19138]] Veta-GS: View-dependent deformable 3D Gaussian Splatting for thermal infrared Novel-view Synthesis(https://arxiv.org/abs/2505.19138)
Keywords: robust
Abstract: Recently, 3D Gaussian Splatting (3D-GS) based on Thermal Infrared (TIR) imaging has gained attention in novel-view synthesis, showing real-time rendering. However, novel-view synthesis with thermal infrared images suffers from transmission effects, emissivity, and low resolution, leading to floaters and blur effects in rendered images. To address these problems, we introduce Veta-GS, which leverages a view-dependent deformation field and a Thermal Feature Extractor (TFE) to precisely capture subtle thermal variations and maintain robustness. Specifically, we design view-dependent deformation field that leverages camera position and viewing direction, which capture thermal variations. Furthermore, we introduce the Thermal Feature Extractor (TFE) and MonoSSIM loss, which consider appearance, edge, and frequency to maintain robustness. Extensive experiments on the TI-NSD benchmark show that our method achieves better performance over existing methods.

Title: The Eye of Sherlock Holmes: Uncovering User Private Attribute Profiling via Vision-Language Model Agentic Framework

Authors: Feiran Liu, Yuzhe Zhang, Xinyi Huang, Yinan Peng, Xinfeng Li, Lixu Wang, Yutong Shen, Ranjie Duan, Simeng Qin, Xiaojun Jia, Qingsong Wen, Wei Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19139
Pdf URL: https://arxiv.org/pdf/2505.19139
Copy Paste: [[2505.19139]] The Eye of Sherlock Holmes: Uncovering User Private Attribute Profiling via Vision-Language Model Agentic Framework(https://arxiv.org/abs/2505.19139)
Keywords: privacy, large language model
Abstract: Our research reveals a new privacy risk associated with the vision-language model (VLM) agentic framework: the ability to infer sensitive attributes (e.g., age and health information) and even abstract ones (e.g., personality and social traits) from a set of personal images, which we term "image private attribute profiling." This threat is particularly severe given that modern apps can easily access users' photo albums, and inference from image sets enables models to exploit inter-image relations for more sophisticated profiling. However, two main challenges hinder our understanding of how well VLMs can profile an individual from a few personal photos: (1) the lack of benchmark datasets with multi-image annotations for private attributes, and (2) the limited ability of current multimodal large language models (MLLMs) to infer abstract attributes from large image collections. In this work, we construct PAPI, the largest dataset for studying private attribute profiling in personal images, comprising 2,510 images from 251 individuals with 3,012 annotated privacy attributes. We also propose HolmesEye, a hybrid agentic framework that combines VLMs and LLMs to enhance privacy inference. HolmesEye uses VLMs to extract both intra-image and inter-image information and LLMs to guide the inference process as well as consolidate the results through forensic analysis, overcoming existing limitations in long-context visual reasoning. Experiments reveal that HolmesEye achieves a 10.8% improvement in average accuracy over state-of-the-art baselines and surpasses human-level performance by 15.0% in predicting abstract attributes. This work highlights the urgency of addressing privacy risks in image-based profiling and offers both a new dataset and an advanced framework to guide future research in this area.

Title: Shifting AI Efficiency From Model-Centric to Data-Centric Compression

Authors: Xuyang Liu, Zichen Wen, Shaobo Wang, Junjie Chen, Zhishan Tao, Yubo Wang, Xiangqi Jin, Chang Zou, Yiyu Wang, Chenfei Liao, Xu Zheng, Honggang Chen, Weijia Li, Xuming Hu, Conghui He, Linfeng Zhang
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.19147
Pdf URL: https://arxiv.org/pdf/2505.19147
Copy Paste: [[2505.19147]] Shifting AI Efficiency From Model-Centric to Data-Centric Compression(https://arxiv.org/abs/2505.19147)
Keywords: large language model
Abstract: The rapid advancement of large language models (LLMs) and multi-modal LLMs (MLLMs) has historically relied on model-centric scaling through increasing parameter counts from millions to hundreds of billions to drive performance gains. However, as we approach hardware limits on model size, the dominant computational bottleneck has fundamentally shifted to the quadratic cost of self-attention over long token sequences, now driven by ultra-long text contexts, high-resolution images, and extended videos. In this position paper, \textbf{we argue that the focus of research for efficient AI is shifting from model-centric compression to data-centric compression}. We position token compression as the new frontier, which improves AI efficiency via reducing the number of tokens during model training or inference. Through comprehensive analysis, we first examine recent developments in long-context AI across various domains and establish a unified mathematical framework for existing model efficiency strategies, demonstrating why token compression represents a crucial paradigm shift in addressing long-context overhead. Subsequently, we systematically review the research landscape of token compression, analyzing its fundamental benefits and identifying its compelling advantages across diverse scenarios. Furthermore, we provide an in-depth analysis of current challenges in token compression research and outline promising future directions. Ultimately, our work aims to offer a fresh perspective on AI efficiency, synthesize existing research, and catalyze innovative developments to address the challenges that increasing context lengths pose to the AI community's advancement.

Title: MIND-Edit: MLLM Insight-Driven Editing via Language-Vision Projection

Authors: Shuyu Wang, Weiqi Li, Qian Wang, Shijie Zhao, Jian Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19149
Pdf URL: https://arxiv.org/pdf/2505.19149
Copy Paste: [[2505.19149]] MIND-Edit: MLLM Insight-Driven Editing via Language-Vision Projection(https://arxiv.org/abs/2505.19149)
Keywords: diffusion, large language model
Abstract: Recent advances in AI-generated content (AIGC) have significantly accelerated image editing techniques, driving increasing demand for diverse and fine-grained edits. Despite these advances, existing image editing methods still face challenges in achieving high precision and semantic accuracy in complex scenarios. Recent studies address this issue by incorporating multimodal large language models (MLLMs) into image editing pipelines. However, current MLLM-based methods mainly rely on interpreting textual instructions, leaving the intrinsic visual understanding of large models largely unexplored, thus resulting in insufficient alignment between textual semantics and visual outcomes. To overcome these limitations, we propose MIND-Edit, an end-to-end image-editing framework integrating pretrained diffusion model with MLLM. MIND-Edit introduces two complementary strategies: (1) a text instruction optimization strategy that clarifies ambiguous user instructions based on semantic reasoning from the MLLM, and (2) an MLLM insight-driven editing strategy that explicitly leverages the intrinsic visual understanding capability of the MLLM to infer editing intent and guide the diffusion process via generated visual embeddings. Furthermore, we propose a joint training approach to effectively integrate both strategies, allowing them to reinforce each other for more accurate instruction interpretation and visually coherent edits aligned with user intent. Extensive experiments demonstrate that MIND-Edit outperforms state-of-the-art image editing methods in both quantitative metrics and visual quality, particularly under complex and challenging scenarios.

Title: FHGS: Feature-Homogenized Gaussian Splatting

Authors: Q. G. Duan, Benyun Zhao, Mingqiao Han Yijun Huang, Ben M. Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19154
Pdf URL: https://arxiv.org/pdf/2505.19154
Copy Paste: [[2505.19154]] FHGS: Feature-Homogenized Gaussian Splatting(https://arxiv.org/abs/2505.19154)
Keywords: robust
Abstract: Scene understanding based on 3D Gaussian Splatting (3DGS) has recently achieved notable advances. Although 3DGS related methods have efficient rendering capabilities, they fail to address the inherent contradiction between the anisotropic color representation of gaussian primitives and the isotropic requirements of semantic features, leading to insufficient cross-view feature consistency. To overcome the limitation, we proposes $\textit{FHGS}$ (Feature-Homogenized Gaussian Splatting), a novel 3D feature fusion framework inspired by physical models, which can achieve high-precision mapping of arbitrary 2D features from pre-trained models to 3D scenes while preserving the real-time rendering efficiency of 3DGS. Specifically, our $\textit{FHGS}$ introduces the following innovations: Firstly, a universal feature fusion architecture is proposed, enabling robust embedding of large-scale pre-trained models' semantic features (e.g., SAM, CLIP) into sparse 3D structures. Secondly, a non-differentiable feature fusion mechanism is introduced, which enables semantic features to exhibit viewpoint independent isotropic distributions. This fundamentally balances the anisotropic rendering of gaussian primitives and the isotropic expression of features; Thirdly, a dual-driven optimization strategy inspired by electric potential fields is proposed, which combines external supervision from semantic feature fields with internal primitive clustering guidance. This mechanism enables synergistic optimization of global semantic alignment and local structural consistency. More interactive results can be accessed on: this https URL.

Title: Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs

Authors: Xuan Zhang, Cunxiao Du, Sicheng Yu, Jiawei Wu, Fengzhuo Zhang, Wei Gao, Qian Liu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.19155
Pdf URL: https://arxiv.org/pdf/2505.19155
Copy Paste: [[2505.19155]] Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs(https://arxiv.org/abs/2505.19155)
Keywords: large language model
Abstract: Due to the auto-regressive nature of current video large language models (Video-LLMs), the inference latency increases as the input sequence length grows, posing challenges for the efficient processing of video sequences that are usually very long. We observe that during decoding, the attention scores of most tokens in Video-LLMs tend to be sparse and concentrated, with only certain tokens requiring comprehensive full attention. Based on this insight, we introduce Sparse-to-Dense (StD), a novel decoding strategy that integrates two distinct modules: one leveraging sparse top-K attention and the other employing dense full attention. These modules collaborate to accelerate Video-LLMs without loss. The fast (sparse) model speculatively decodes multiple tokens, while the slow (dense) model verifies them in parallel. StD is a tuning-free, plug-and-play solution that achieves up to a 1.94$\times$ walltime speedup in video processing. It maintains model performance while enabling a seamless transition from a standard Video-LLM to a sparse Video-LLM with minimal code modifications.

Title: A Joint Learning Framework with Feature Reconstruction and Prediction for Incomplete Satellite Image Time Series in Agricultural Semantic Segmentation

Authors: Yuze Wang, Mariana Belgiu, Haiyang Wu, Dandan Zhong, Yangyang Cao, Chao Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19159
Pdf URL: https://arxiv.org/pdf/2505.19159
Copy Paste: [[2505.19159]] A Joint Learning Framework with Feature Reconstruction and Prediction for Incomplete Satellite Image Time Series in Agricultural Semantic Segmentation(https://arxiv.org/abs/2505.19159)
Keywords: extraction, segmentation
Abstract: Satellite Image Time Series (SITS) is crucial for agricultural semantic segmentation. However, Cloud contamination introduces time gaps in SITS, disrupting temporal dependencies and causing feature shifts, leading to degraded performance of models trained on complete SITS. Existing methods typically address this by reconstructing the entire SITS before prediction or using data augmentation to simulate missing data. Yet, full reconstruction may introduce noise and redundancy, while the data-augmented model can only handle limited missing patterns, leading to poor generalization. We propose a joint learning framework with feature reconstruction and prediction to address incomplete SITS more effectively. During training, we simulate data-missing scenarios using temporal masks. The two tasks are guided by both ground-truth labels and the teacher model trained on complete SITS. The prediction task constrains the model from selectively reconstructing critical features from masked inputs that align with the teacher's temporal feature representations. It reduces unnecessary reconstruction and limits noise propagation. By integrating reconstructed features into the prediction task, the model avoids learning shortcuts and maintains its ability to handle varied missing patterns and complete SITS. Experiments on SITS from Hunan Province, Western France, and Catalonia show that our method improves mean F1-scores by 6.93% in cropland extraction and 7.09% in crop classification over baselines. It also generalizes well across satellite sensors, including Sentinel-2 and PlanetScope, under varying temporal missing rates and model backbones.

Title: SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs

Authors: Firoj Alam, Md Arid Hasan, Shammur Absar Chowdhury
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19163
Pdf URL: https://arxiv.org/pdf/2505.19163
Copy Paste: [[2505.19163]] SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs(https://arxiv.org/abs/2505.19163)
Keywords: robust, large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various disciplines and tasks. However, benchmarking their capabilities with multilingual spoken queries remains largely unexplored. In this study, we introduce SpokenNativQA, the first multilingual and culturally aligned spoken question-answering (SQA) dataset designed to evaluate LLMs in real-world conversational settings. The dataset comprises approximately 33,000 naturally spoken questions and answers in multiple languages, including low-resource and dialect-rich languages, providing a robust benchmark for assessing LLM performance in speech-based interactions. SpokenNativQA addresses the limitations of text-based QA datasets by incorporating speech variability, accents, and linguistic diversity. We benchmark different ASR systems and LLMs for SQA and present our findings. We released the data at (this https URL) and the experimental scripts at (this https URL) for the research community.

Title: JEDI: The Force of Jensen-Shannon Divergence in Disentangling Diffusion Models

Authors: Eric Tillmann Bill, Enis Simsar, Thomas Hofmann
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19166
Pdf URL: https://arxiv.org/pdf/2505.19166
Copy Paste: [[2505.19166]] JEDI: The Force of Jensen-Shannon Divergence in Disentangling Diffusion Models(https://arxiv.org/abs/2505.19166)
Keywords: diffusion
Abstract: We introduce JEDI, a test-time adaptation method that enhances subject separation and compositional alignment in diffusion models without requiring retraining or external supervision. JEDI operates by minimizing semantic entanglement in attention maps using a novel Jensen-Shannon divergence based objective. To improve efficiency, we leverage adversarial optimization, reducing the number of updating steps required. JEDI is model-agnostic and applicable to architectures such as Stable Diffusion 1.5 and 3.5, consistently improving prompt alignment and disentanglement in complex scenes. Additionally, JEDI provides a lightweight, CLIP-free disentanglement score derived from internal attention distributions, offering a principled benchmark for compositional alignment under test-time conditions. We will publicly release the implementation of our method.

Title: EventEgoHands: Event-based Egocentric 3D Hand Mesh Reconstruction

Authors: Ryosei Hara, Wataru Ikeda, Masashi Hatano, Mariko Isogawa
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19169
Pdf URL: https://arxiv.org/pdf/2505.19169
Copy Paste: [[2505.19169]] EventEgoHands: Event-based Egocentric 3D Hand Mesh Reconstruction(https://arxiv.org/abs/2505.19169)
Keywords: segmentation
Abstract: Reconstructing 3D hand mesh is challenging but an important task for human-computer interaction and AR/VR applications. In particular, RGB and/or depth cameras have been widely used in this task. However, methods using these conventional cameras face challenges in low-light environments and during motion blur. Thus, to address these limitations, event cameras have been attracting attention in recent years for their high dynamic range and high temporal resolution. Despite their advantages, event cameras are sensitive to background noise or camera motion, which has limited existing studies to static backgrounds and fixed cameras. In this study, we propose EventEgoHands, a novel method for event-based 3D hand mesh reconstruction in an egocentric view. Our approach introduces a Hand Segmentation Module that extracts hand regions, effectively mitigating the influence of dynamic background events. We evaluated our approach and demonstrated its effectiveness on the N-HOT3D dataset, improving MPJPE by approximately more than 4.5 cm (43%).

Title: Penetration Testing for System Security: Methods and Practical Approaches

Authors: Wei Zhang, Ju Xing, Xiaoqi Li
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.19174
Pdf URL: https://arxiv.org/pdf/2505.19174
Copy Paste: [[2505.19174]] Penetration Testing for System Security: Methods and Practical Approaches(https://arxiv.org/abs/2505.19174)
Keywords: security, attack
Abstract: Penetration testing refers to the process of simulating hacker attacks to evaluate the security of information systems . This study aims not only to clarify the theoretical foundations of penetration testing but also to explain and demonstrate the complete testing process, including how network system administrators may simulate attacks using various penetration testing methods. Methodologically, the paper outlines the five basic stages of a typical penetration test: intelligence gathering, vulnerability scanning, vulnerability exploitation, privilege escalation, and post-exploitation activities. In each phase, specific tools and techniques are examined in detail, along with practical guidance on their use. To enhance the practical relevance of the study, the paper also presents a real-life case study, illustrating how a complete penetration test is conducted in a real-world environment. Through this case, readers can gain insights into the detailed procedures and applied techniques, thereby deepening their understanding of the practical value of penetration testing. Finally, the paper summarizes the importance and necessity of penetration testing in securing information systems and maintaining network integrity, and it explores future trends and development directions for the field. Overall, the findings of this paper offer valuable references for both researchers and practitioners, contributing meaningfully to the improvement of penetration testing practices and the advancement of cybersecurity as a whole.

Title: Assistant-Guided Mitigation of Teacher Preference Bias in LLM-as-a-Judge

Authors: Zhuo Liu, Moxin Li, Xun Deng, Qifan Wang, Fuli Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19176
Pdf URL: https://arxiv.org/pdf/2505.19176
Copy Paste: [[2505.19176]] Assistant-Guided Mitigation of Teacher Preference Bias in LLM-as-a-Judge(https://arxiv.org/abs/2505.19176)
Keywords: large language model
Abstract: LLM-as-a-Judge employs large language models (LLMs), such as GPT-4, to evaluate the quality of LLM-generated responses, gaining popularity for its cost-effectiveness and strong alignment with human evaluations. However, training proxy judge models using evaluation data generated by powerful teacher models introduces a critical yet previously overlooked issue: teacher preference bias, where the proxy judge model learns a biased preference for responses from the teacher model. To tackle this problem, we propose a novel setting that incorporates an additional assistant model, which is not biased toward the teacher model's responses, to complement the training data. Building on this setup, we introduce AGDe-Judge, a three-stage framework designed to debias from both the labels and feedbacks in the training data. Extensive experiments demonstrate that AGDe-Judge effectively reduces teacher preference bias while maintaining strong performance across six evaluation benchmarks. Code is available at this https URL.

Title: Federated Learning: From Theory to Practice

Authors: A. Jung
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.19183
Pdf URL: https://arxiv.org/pdf/2505.19183
Copy Paste: [[2505.19183]] Federated Learning: From Theory to Practice(https://arxiv.org/abs/2505.19183)
Keywords: privacy, federate
Abstract: This book offers a hands-on introduction to building and understanding federated learning (FL) systems. FL enables multiple devices -- such as smartphones, sensors, or local computers -- to collaboratively train machine learning (ML) models, while keeping their data private and local. It is a powerful solution when data cannot or should not be centralized due to privacy, regulatory, or technical reasons. The book is designed for students, engineers, and researchers who want to learn how to design scalable, privacy preserving FL systems. Our main focus is on personalization: enabling each device to train its own model while still benefiting from collaboration with relevant devices. This is achieved by leveraging similarities between (the learning tasks associated with) devices that are encoded by the weighted edges (or links) of a federated learning network (FL network). The key idea is to represent real-world FL systems as networks of devices, where nodes correspond to device and edges represent communication links and data similarities between them. The training of personalized models for these devices can be naturally framed as a distributed optimization problem. This optimization problem is referred to as generalized total variation minimization (GTVMin) and ensures that devices with similar learning tasks learn similar model parameters. Our approach is both mathematically principled and practically motivated. While we introduce some advanced ideas from optimization theory and graph-based learning, we aim to keep the book accessible. Readers are guided through the core ideas step by step, with intuitive explanations.

Title: Two LLMs debate, both are certain they've won

Authors: Minh Nhat Nguyen, Pradyumna Shyama Prasad
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19184
Pdf URL: https://arxiv.org/pdf/2505.19184
Copy Paste: [[2505.19184]] Two LLMs debate, both are certain they've won(https://arxiv.org/abs/2505.19184)
Keywords: large language model
Abstract: Can LLMs accurately adjust their confidence when facing opposition? Building on previous studies measuring calibration on static fact-based question-answering tasks, we evaluate Large Language Models (LLMs) in a dynamic, adversarial debate setting, uniquely combining two realistic factors: (a) a multi-turn format requiring models to update beliefs as new information emerges, and (b) a zero-sum structure to control for task-related uncertainty, since mutual high-confidence claims imply systematic overconfidence. We organized 60 three-round policy debates among ten state-of-the-art LLMs, with models privately rating their confidence (0-100) in winning after each round. We observed five concerning patterns: (1) Systematic overconfidence: models began debates with average initial confidence of 72.9% vs. a rational 50% baseline. (2) Confidence escalation: rather than reducing confidence as debates progressed, debaters increased their win probabilities, averaging 83% by the final round. (3) Mutual overestimation: in 61.7% of debates, both sides simultaneously claimed >=75% probability of victory, a logical impossibility. (4) Persistent self-debate bias: models debating identical copies increased confidence from 64.1% to 75.2%; even when explicitly informed their chance of winning was exactly 50%, confidence still rose (from 50.0% to 57.1%). (5) Misaligned private reasoning: models' private scratchpad thoughts sometimes differed from their public confidence ratings, raising concerns about faithfulness of chain-of-thought reasoning. These results suggest LLMs lack the ability to accurately self-assess or update their beliefs in dynamic, multi-turn tasks; a major concern as LLM outputs are deployed without careful review in assistant roles or agentic settings.

Title: PosePilot: An Edge-AI Solution for Posture Correction in Physical Exercises

Authors: Rushiraj Gadhvi, Priyansh Desai, Siddharth
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19186
Pdf URL: https://arxiv.org/pdf/2505.19186
Copy Paste: [[2505.19186]] PosePilot: An Edge-AI Solution for Posture Correction in Physical Exercises(https://arxiv.org/abs/2505.19186)
Keywords: robust
Abstract: Automated pose correction remains a significant challenge in AI-driven fitness systems, despite extensive research in activity recognition. This work presents PosePilot, a novel system that integrates pose recognition with real-time personalized corrective feedback, overcoming the limitations of traditional fitness solutions. Using Yoga, a discipline requiring precise spatio-temporal alignment as a case study, we demonstrate PosePilot's ability to analyze complex physical movements. Designed for deployment on edge devices, PosePilot can be extended to various at-home and outdoor exercises. We employ a Vanilla LSTM, allowing the system to capture temporal dependencies for pose recognition. Additionally, a BiLSTM with multi-head Attention enhances the model's ability to process motion contexts, selectively focusing on key limb angles for accurate error detection while maintaining computational efficiency. As part of this work, we introduce a high-quality video dataset used for evaluating our models. Most importantly, PosePilot provides instant corrective feedback at every stage of a movement, ensuring precise posture adjustments throughout the exercise routine. The proposed approach 1) performs automatic human posture recognition, 2) provides personalized posture correction feedback at each instant which is crucial in Yoga, and 3) offers a lightweight and robust posture correction model feasible for deploying on edge devices in real-world environments.

Title: LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling

Authors: Yang Xiao, Jiashuo Wang, Ruifeng Yuan, Chunpu Xu, Kaishuai Xu, Wenjie Li, Pengfei Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19187
Pdf URL: https://arxiv.org/pdf/2505.19187
Copy Paste: [[2505.19187]] LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling(https://arxiv.org/abs/2505.19187)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated remarkable reasoning capabilities through test-time scaling approaches, particularly when fine-tuned with chain-of-thought (CoT) data distilled from more powerful large reasoning models (LRMs). However, these reasoning chains often contain verbose elements that mirror human problem-solving, categorized as progressive reasoning (the essential solution development path) and functional elements (verification processes, alternative solution approaches, and error corrections). While progressive reasoning is crucial, the functional elements significantly increase computational demands during test-time inference. We introduce PIR (Perplexity-based Importance Refinement), a principled framework that quantitatively evaluates the importance of each reasoning step based on its impact on answer prediction confidence. PIR systematically identifies and selectively prunes only low-importance functional steps while preserving progressive reasoning components, creating optimized training data that maintains the integrity of the core solution path while reducing verbosity. Models fine-tuned on PIR-optimized data exhibit superior test-time scaling properties, generating more concise reasoning chains while achieving improved accuracy (+0.9\% to +6.6\%) with significantly reduced token usage (-3\% to -41\%) across challenging reasoning benchmarks (AIME, AMC, and GPQA Diamond). Our approach demonstrates strong generalizability across different model sizes, data sources, and token budgets, offering a practical solution for deploying reasoning-capable LLMs in scenarios where efficient test-time scaling, response time, and computational efficiency are valuable constraints.

Title: I2MoE: Interpretable Multimodal Interaction-aware Mixture-of-Experts

Authors: Jiayi Xin, Sukwon Yun, Jie Peng, Inyoung Choi, Jenna L. Ballard, Tianlong Chen, Qi Long
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.19190
Pdf URL: https://arxiv.org/pdf/2505.19190
Copy Paste: [[2505.19190]] I2MoE: Interpretable Multimodal Interaction-aware Mixture-of-Experts(https://arxiv.org/abs/2505.19190)
Keywords: interpretability
Abstract: Modality fusion is a cornerstone of multimodal learning, enabling information integration from diverse data sources. However, vanilla fusion methods are limited by (1) inability to account for heterogeneous interactions between modalities and (2) lack of interpretability in uncovering the multimodal interactions inherent in the data. To this end, we propose I2MoE (Interpretable Multimodal Interaction-aware Mixture of Experts), an end-to-end MoE framework designed to enhance modality fusion by explicitly modeling diverse multimodal interactions, as well as providing interpretation on a local and global level. First, I2MoE utilizes different interaction experts with weakly supervised interaction losses to learn multimodal interactions in a data-driven way. Second, I2MoE deploys a reweighting model that assigns importance scores for the output of each interaction expert, which offers sample-level and dataset-level interpretation. Extensive evaluation of medical and general multimodal datasets shows that I2MoE is flexible enough to be combined with different fusion techniques, consistently improves task performance, and provides interpretation across various real-world scenarios. Code is available at this https URL.

Title: Misleading through Inconsistency: A Benchmark for Political Inconsistencies Detection

Authors: Nursulu Sagimbayeva, Ruveyda Betül Bahçeci, Ingmar Weber
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19191
Pdf URL: https://arxiv.org/pdf/2505.19191
Copy Paste: [[2505.19191]] Misleading through Inconsistency: A Benchmark for Political Inconsistencies Detection(https://arxiv.org/abs/2505.19191)
Keywords: large language model
Abstract: Inconsistent political statements represent a form of misinformation. They erode public trust and pose challenges to accountability, when left unnoticed. Detecting inconsistencies automatically could support journalists in asking clarification questions, thereby helping to keep politicians accountable. We propose the Inconsistency detection task and develop a scale of inconsistency types to prompt NLP-research in this direction. To provide a resource for detecting inconsistencies in a political domain, we present a dataset of 698 human-annotated pairs of political statements with explanations of the annotators' reasoning for 237 samples. The statements mainly come from voting assistant platforms such as Wahl-O-Mat in Germany and Smartvote in Switzerland, reflecting real-world political issues. We benchmark Large Language Models (LLMs) on our dataset and show that in general, they are as good as humans at detecting inconsistencies, and might be even better than individual humans at predicting the crowd-annotated ground-truth. However, when it comes to identifying fine-grained inconsistency types, none of the model have reached the upper bound of performance (due to natural labeling variation), thus leaving room for improvement. We make our dataset and code publicly available.

Title: Interpretable Graph Learning Over Sets of Temporally-Sparse Data

Authors: Andrea Zerio, Maya Bechler-Speicher, Maor Huri, Marie Vibeke Vestergaard, Ran Gilad-Bachrach, Tine Jess, Samir Bhatt, Aleksejs Sazonovs
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19193
Pdf URL: https://arxiv.org/pdf/2505.19193
Copy Paste: [[2505.19193]] Interpretable Graph Learning Over Sets of Temporally-Sparse Data(https://arxiv.org/abs/2505.19193)
Keywords: interpretability
Abstract: Real-world medical data often includes measurements from multiple signals that are collected at irregular and asynchronous time intervals. For example, different types of blood tests can be measured at different times and frequencies, resulting in fragmented and unevenly scattered temporal data. Similar issues of irregular sampling of different attributes occur in other domains, such as monitoring of large systems using event log files or the spread of fake news on social networks. Effectively learning from such data requires models that can handle sets of temporally sparse and heterogeneous signals. In this paper, we propose Graph Mixing Additive Networks (GMAN), a novel and interpretable-by-design model for learning over irregular sets of temporal signals. Our method achieves state-of-the-art performance in real-world medical tasks, including a 4-point increase in the AUROC score of in-hospital mortality prediction, compared to existing methods. We further showcase GMAN's flexibility by applying it to a fake news detection task. We demonstrate how its interpretability capabilities, including node-level, graph-level, and subset-level importance, allow for transition phases detection and gaining medical insights with real-world high-stakes implications. Finally, we provide theoretical insights on GMAN expressive power.

Title: Curvature Dynamic Black-box Attack: revisiting adversarial robustness via dynamic curvature estimation

Authors: Peiran Sun
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19194
Pdf URL: https://arxiv.org/pdf/2505.19194
Copy Paste: [[2505.19194]] Curvature Dynamic Black-box Attack: revisiting adversarial robustness via dynamic curvature estimation(https://arxiv.org/abs/2505.19194)
Keywords: defense, attack, robust
Abstract: Adversarial attack reveals the vulnerability of deep learning models. For about a decade, countless attack and defense methods have been proposed, leading to robustified classifiers and better understanding of models. Among these methods, curvature-based approaches have attracted attention because it is assumed that high curvature may give rise to rough decision boundary. However, the most commonly used \textit{curvature} is the curvature of loss function, scores or other parameters from within the model as opposed to decision boundary curvature, since the former can be relatively easily formed using second order derivative. In this paper, we propose a new query-efficient method, dynamic curvature estimation(DCE), to estimate the decision boundary curvature in a black-box setting. Our approach is based on CGBA, a black-box adversarial attack. By performing DCE on a wide range of classifiers, we discovered, statistically, a connection between decision boundary curvature and adversarial robustness. We also propose a new attack method, curvature dynamic black-box attack(CDBA) with improved performance using the dynamically estimated curvature.

Title: Step-level Reward for Free in RL-based T2I Diffusion Model Fine-tuning

Authors: Xinyao Liao, Wei Wei, Xiaoye Qu, Yu Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19196
Pdf URL: https://arxiv.org/pdf/2505.19196
Copy Paste: [[2505.19196]] Step-level Reward for Free in RL-based T2I Diffusion Model Fine-tuning(https://arxiv.org/abs/2505.19196)
Keywords: diffusion
Abstract: Recent advances in text-to-image (T2I) diffusion model fine-tuning leverage reinforcement learning (RL) to align generated images with learnable reward functions. The existing approaches reformulate denoising as a Markov decision process for RL-driven optimization. However, they suffer from reward sparsity, receiving only a single delayed reward per generated trajectory. This flaw hinders precise step-level attribution of denoising actions, undermines training efficiency. To address this, we propose a simple yet effective credit assignment framework that dynamically distributes dense rewards across denoising steps. Specifically, we track changes in cosine similarity between intermediate and final images to quantify each step's contribution on progressively reducing the distance to the final image. Our approach avoids additional auxiliary neural networks for step-level preference modeling and instead uses reward shaping to highlight denoising phases that have a greater impact on image quality. Our method achieves 1.25 to 2 times higher sample efficiency and better generalization across four human preference reward functions, without compromising the original optimal policy.

Title: DREAM: Drafting with Refined Target Features and Entropy-Adaptive Cross-Attention Fusion for Multimodal Speculative Decoding

Authors: Yunhai Hu, Tianhua Xia, Zining Liu, Rahul Raman, Xingyu Liu, Bo Bao, Eric Sather, Vithursan Thangarasa, Sai Qian Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19201
Pdf URL: https://arxiv.org/pdf/2505.19201
Copy Paste: [[2505.19201]] DREAM: Drafting with Refined Target Features and Entropy-Adaptive Cross-Attention Fusion for Multimodal Speculative Decoding(https://arxiv.org/abs/2505.19201)
Keywords: large language model
Abstract: Speculative decoding (SD) has emerged as a powerful method for accelerating autoregressive generation in large language models (LLMs), yet its integration into vision-language models (VLMs) remains underexplored. We introduce DREAM, a novel speculative decoding framework tailored for VLMs that combines three key innovations: (1) a cross-attention-based mechanism to inject intermediate features from the target model into the draft model for improved alignment, (2) adaptive intermediate feature selection based on attention entropy to guide efficient draft model training, and (3) visual token compression to reduce draft model latency. DREAM enables efficient, accurate, and parallel multimodal decoding with significant throughput improvement. Experiments across a diverse set of recent popular VLMs, including LLaVA, Pixtral, SmolVLM and Gemma3, demonstrate up to 3.6x speedup over conventional decoding and significantly outperform prior SD baselines in both inference throughput and speculative draft acceptance length across a broad range of multimodal benchmarks. The code is publicly available at: this https URL

Title: OptiMindTune: A Multi-Agent Framework for Intelligent Hyperparameter Optimization

Authors: Meher Bhaskar Madiraju, Meher Sai Preetam Madiraju
Subjects: cs.LG, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2505.19205
Pdf URL: https://arxiv.org/pdf/2505.19205
Copy Paste: [[2505.19205]] OptiMindTune: A Multi-Agent Framework for Intelligent Hyperparameter Optimization(https://arxiv.org/abs/2505.19205)
Keywords: robust, large language model
Abstract: Hyperparameter optimization (HPO) is a critical yet challenging aspect of machine learning model development, significantly impacting model performance and generalization. Traditional HPO methods often struggle with high dimensionality, complex interdependencies, and computational expense. This paper introduces OptiMindTune, a novel multi-agent framework designed to intelligently and efficiently optimize hyperparameters. OptiMindTune leverages the collaborative intelligence of three specialized AI agents -- a Recommender Agent, an Evaluator Agent, and a Decision Agent -- each powered by Google's Gemini models. These agents address distinct facets of the HPO problem, from model selection and hyperparameter suggestion to robust evaluation and strategic decision-making. By fostering dynamic interactions and knowledge sharing, OptiMindTune aims to converge to optimal hyperparameter configurations more rapidly and robustly than existing single-agent or monolithic approaches. Our framework integrates principles from advanced large language models, and adaptive search to achieve scalable and intelligent AutoML. We posit that this multi-agent paradigm offers a promising avenue for tackling the increasing complexity of modern machine learning model tuning.

Title: SpeakStream: Streaming Text-to-Speech with Interleaved Data

Authors: Richard He Bai, Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly
Subjects: cs.CL, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.19206
Pdf URL: https://arxiv.org/pdf/2505.19206
Copy Paste: [[2505.19206]] SpeakStream: Streaming Text-to-Speech with Interleaved Data(https://arxiv.org/abs/2505.19206)
Keywords: large language model
Abstract: The latency bottleneck of traditional text-to-speech (TTS) systems fundamentally hinders the potential of streaming large language models (LLMs) in conversational AI. These TTS systems, typically trained and inferenced on complete utterances, introduce unacceptable delays, even with optimized inference speeds, when coupled with streaming LLM outputs. This is particularly problematic for creating responsive conversational agents where low first-token latency is critical. In this paper, we present SpeakStream, a streaming TTS system that generates audio incrementally from streaming text using a decoder-only architecture. SpeakStream is trained using a next-step prediction loss on interleaved text-speech data. During inference, it generates speech incrementally while absorbing streaming input text, making it particularly suitable for cascaded conversational AI agents where an LLM streams text to a TTS system. Our experiments demonstrate that SpeakStream achieves state-of-the-art latency results in terms of first-token latency while maintaining the quality of non-streaming TTS systems.

Title: Domain and Task-Focused Example Selection for Data-Efficient Contrastive Medical Image Segmentation

Authors: Tyler Ward, Aaron Moseley, Abdullah-Al-Zubaer Imran
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19208
Pdf URL: https://arxiv.org/pdf/2505.19208
Copy Paste: [[2505.19208]] Domain and Task-Focused Example Selection for Data-Efficient Contrastive Medical Image Segmentation(https://arxiv.org/abs/2505.19208)
Keywords: segmentation
Abstract: Segmentation is one of the most important tasks in the medical imaging pipeline as it influences a number of image-based decisions. To be effective, fully supervised segmentation approaches require large amounts of manually annotated training data. However, the pixel-level annotation process is expensive, time-consuming, and error-prone, hindering progress and making it challenging to perform effective segmentations. Therefore, models must learn efficiently from limited labeled data. Self-supervised learning (SSL), particularly contrastive learning via pre-training on unlabeled data and fine-tuning on limited annotations, can facilitate such limited labeled image segmentation. To this end, we propose a novel self-supervised contrastive learning framework for medical image segmentation, leveraging inherent relationships of different images, dubbed PolyCL. Without requiring any pixel-level annotations or unreasonable data augmentations, our PolyCL learns and transfers context-aware discriminant features useful for segmentation from an innovative surrogate, in a task-related manner. Additionally, we integrate the Segment Anything Model (SAM) into our framework in two novel ways: as a post-processing refinement module that improves the accuracy of predicted masks using bounding box prompts derived from coarse outputs, and as a propagation mechanism via SAM 2 that generates volumetric segmentations from a single annotated 2D slice. Experimental evaluations on three public computed tomography (CT) datasets demonstrate that PolyCL outperforms fully-supervised and self-supervised baselines in both low-data and cross-domain scenarios. Our code is available at this https URL.

Title: MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search

Authors: Zonglin Yang, Wanhao Liu, Ben Gao, Yujie Liu, Wei Li, Tong Xie, Lidong Bing, Wanli Ouyang, Erik Cambria, Dongzhan Zhou
Subjects: cs.CL, cs.AI, cs.CE, stat.ML
Abstract URL: https://arxiv.org/abs/2505.19209
Pdf URL: https://arxiv.org/pdf/2505.19209
Copy Paste: [[2505.19209]] MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search(https://arxiv.org/abs/2505.19209)
Keywords: large language model
Abstract: Large language models (LLMs) have shown promise in automating scientific hypothesis generation, yet existing approaches primarily yield coarse-grained hypotheses lacking critical methodological and experimental details. We introduce and formally define the novel task of fine-grained scientific hypothesis discovery, which entails generating detailed, experimentally actionable hypotheses from coarse initial research directions. We frame this as a combinatorial optimization problem and investigate the upper limits of LLMs' capacity to solve it when maximally leveraged. Specifically, we explore four foundational questions: (1) how to best harness an LLM's internal heuristics to formulate the fine-grained hypothesis it itself would judge as the most promising among all the possible hypotheses it might generate, based on its own internal scoring-thus defining a latent reward landscape over the hypothesis space; (2) whether such LLM-judged better hypotheses exhibit stronger alignment with ground-truth hypotheses; (3) whether shaping the reward landscape using an ensemble of diverse LLMs of similar capacity yields better outcomes than defining it with repeated instances of the strongest LLM among them; and (4) whether an ensemble of identical LLMs provides a more reliable reward landscape than a single LLM. To address these questions, we propose a hierarchical search method that incrementally proposes and integrates details into the hypothesis, progressing from general concepts to specific experimental configurations. We show that this hierarchical process smooths the reward landscape and enables more effective optimization. Empirical evaluations on a new benchmark of expert-annotated fine-grained hypotheses from recent chemistry literature show that our method consistently outperforms strong baselines.

Title: Towards Understanding the Mechanisms of Classifier-Free Guidance

Authors: Xiang Li, Rongrong Wang, Qing Qu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19210
Pdf URL: https://arxiv.org/pdf/2505.19210
Copy Paste: [[2505.19210]] Towards Understanding the Mechanisms of Classifier-Free Guidance(https://arxiv.org/abs/2505.19210)
Keywords: diffusion
Abstract: Classifier-free guidance (CFG) is a core technique powering state-of-the-art image generation systems, yet its underlying mechanisms remain poorly understood. In this work, we begin by analyzing CFG in a simplified linear diffusion model, where we show its behavior closely resembles that observed in the nonlinear case. Our analysis reveals that linear CFG improves generation quality via three distinct components: (i) a mean-shift term that approximately steers samples in the direction of class means, (ii) a positive Contrastive Principal Components (CPC) term that amplifies class-specific features, and (iii) a negative CPC term that suppresses generic features prevalent in unconditional data. We then verify that these insights in real-world, nonlinear diffusion models: over a broad range of noise levels, linear CFG resembles the behavior of its nonlinear counterpart. Although the two eventually diverge at low noise levels, we discuss how the insights from the linear analysis still shed light on the CFG's mechanism in the nonlinear regime.

Title: When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas

Authors: Steffen Backmann, David Guzman Piedrahita, Emanuel Tewolde, Rada Mihalcea, Bernhard Schölkopf, Zhijing Jin
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2505.19212
Pdf URL: https://arxiv.org/pdf/2505.19212
Copy Paste: [[2505.19212]] When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas(https://arxiv.org/abs/2505.19212)
Keywords: large language model
Abstract: Recent advances in large language models (LLMs) have enabled their use in complex agentic roles, involving decision-making with humans or other agents, making ethical alignment a key AI safety concern. While prior work has examined both LLMs' moral judgment and strategic behavior in social dilemmas, there is limited understanding of how they act when moral imperatives directly conflict with rewards or incentives. To investigate this, we introduce Moral Behavior in Social Dilemma Simulation (MoralSim) and evaluate how LLMs behave in the prisoner's dilemma and public goods game with morally charged contexts. In MoralSim, we test a range of frontier models across both game structures and three distinct moral framings, enabling a systematic examination of how LLMs navigate social dilemmas in which ethical norms conflict with payoff-maximizing strategies. Our results show substantial variation across models in both their general tendency to act morally and the consistency of their behavior across game types, the specific moral framing, and situational factors such as opponent behavior and survival risks. Crucially, no model exhibits consistently moral behavior in MoralSim, highlighting the need for caution when deploying LLMs in agentic roles where the agent's "self-interest" may conflict with ethical expectations. Our code is available at this https URL.

Title: The Overthinker's DIET: Cutting Token Calories with DIfficulty-AwarE Training

Authors: Weize Chen, Jiarui Yuan, Tailin Jin, Ning Ding, Huimin Chen, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19217
Pdf URL: https://arxiv.org/pdf/2505.19217
Copy Paste: [[2505.19217]] The Overthinker's DIET: Cutting Token Calories with DIfficulty-AwarE Training(https://arxiv.org/abs/2505.19217)
Keywords: large language model
Abstract: Recent large language models (LLMs) exhibit impressive reasoning but often over-think, generating excessively long responses that hinder efficiency. We introduce DIET ( DIfficulty-AwarE Training), a framework that systematically cuts these "token calories" by integrating on-the-fly problem difficulty into the reinforcement learning (RL) process. DIET dynamically adapts token compression strategies by modulating token penalty strength and conditioning target lengths on estimated task difficulty, to optimize the performance-efficiency trade-off. We also theoretically analyze the pitfalls of naive reward weighting in group-normalized RL algorithms like GRPO, and propose Advantage Weighting technique, which enables stable and effective implementation of these difficulty-aware objectives. Experimental results demonstrate that DIET significantly reduces token counts while simultaneously improving reasoning performance. Beyond raw token reduction, we show two crucial benefits largely overlooked by prior work: (1) DIET leads to superior inference scaling. By maintaining high per-sample quality with fewer tokens, it enables better scaling performance via majority voting with more samples under fixed computational budgets, an area where other methods falter. (2) DIET enhances the natural positive correlation between response length and problem difficulty, ensuring verbosity is appropriately allocated, unlike many existing compression methods that disrupt this relationship. Our analyses provide a principled and effective framework for developing more efficient, practical, and high-performing LLMs.

Title: LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models

Authors: Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, Chongxuan Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19223
Pdf URL: https://arxiv.org/pdf/2505.19223
Copy Paste: [[2505.19223]] LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models(https://arxiv.org/abs/2505.19223)
Keywords: diffusion
Abstract: While Masked Diffusion Models (MDMs), such as LLaDA, present a promising paradigm for language modeling, there has been relatively little effort in aligning these models with human preferences via reinforcement learning. The challenge primarily arises from the high variance in Evidence Lower Bound (ELBO)-based likelihood estimates required for preference optimization. To address this issue, we propose Variance-Reduced Preference Optimization (VRPO), a framework that formally analyzes the variance of ELBO estimators and derives bounds on both the bias and variance of preference optimization gradients. Building on this theoretical foundation, we introduce unbiased variance reduction strategies, including optimal Monte Carlo budget allocation and antithetic sampling, that significantly improve the performance of MDM alignment. We demonstrate the effectiveness of VRPO by applying it to LLaDA, and the resulting model, LLaDA 1.5, outperforms its SFT-only predecessor consistently and significantly across mathematical (GSM8K +4.7), code (HumanEval +3.0, MBPP +1.8), and alignment benchmarks (IFEval +4.0, Arena-Hard +4.3). Furthermore, LLaDA 1.5 demonstrates a highly competitive mathematical performance compared to strong language MDMs and ARMs. Project page: this https URL.

Title: Scaling Laws for Gradient Descent and Sign Descent for Linear Bigram Models under Zipf's Law

Authors: Frederik Kunstner, Francis Bach
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2505.19227
Pdf URL: https://arxiv.org/pdf/2505.19227
Copy Paste: [[2505.19227]] Scaling Laws for Gradient Descent and Sign Descent for Linear Bigram Models under Zipf's Law(https://arxiv.org/abs/2505.19227)
Keywords: transformer
Abstract: Recent works have highlighted optimization difficulties faced by gradient descent in training the first and last layers of transformer-based language models, which are overcome by optimizers such as Adam. These works suggest that the difficulty is linked to the heavy-tailed distribution of words in text data, where the frequency of the $k$th most frequent word $\pi_k$ is proportional to $1/k$, following Zipf's law. To better understand the impact of the data distribution on training performance, we study a linear bigram model for next-token prediction when the tokens follow a power law $\pi_k \propto 1/k^\alpha$ parameterized by the exponent $\alpha > 0$. We derive optimization scaling laws for deterministic gradient descent and sign descent as a proxy for Adam as a function of the exponent $\alpha$. Existing theoretical investigations in scaling laws assume that the eigenvalues of the data decay as a power law with exponent $\alpha > 1$. This assumption effectively makes the problem ``finite dimensional'' as most of the loss comes from a few of the largest eigencomponents. In comparison, we show that the problem is more difficult when the data have heavier tails. The case $\alpha = 1$ as found in text data is ``worst-case'' for gradient descent, in that the number of iterations required to reach a small relative error scales almost linearly with dimension. While the performance of sign descent also depends on the dimension, for Zipf-distributed data the number of iterations scales only with the square-root of the dimension, leading to a large improvement for large vocabularies.

Title: RAISE: Realness Assessment for Image Synthesis and Evaluation

Authors: Aniruddha Mukherjee, Spriha Dubey, Somdyuti Paul
Subjects: cs.CV, cs.AI, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2505.19233
Pdf URL: https://arxiv.org/pdf/2505.19233
Copy Paste: [[2505.19233]] RAISE: Realness Assessment for Image Synthesis and Evaluation(https://arxiv.org/abs/2505.19233)
Keywords: robust, generative
Abstract: The rapid advancement of generative AI has enabled the creation of highly photorealistic visual content, offering practical substitutes for real images and videos in scenarios where acquiring real data is difficult or expensive. However, reliably substituting real visual content with AI-generated counterparts requires robust assessment of the perceived realness of AI-generated visual content, a challenging task due to its inherent subjective nature. To address this, we conducted a comprehensive human study evaluating the perceptual realness of both real and AI-generated images, resulting in a new dataset, containing images paired with subjective realness scores, introduced as RAISE in this paper. Further, we develop and train multiple models on RAISE to establish baselines for realness prediction. Our experimental results demonstrate that features derived from deep foundation vision models can effectively capture the subjective realness. RAISE thus provides a valuable resource for developing robust, objective models of perceptual realness assessment.

Title: Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator

Authors: Qian Cao, Xiting Wang, Yuzhuo Yuan, Yahui Liu, Fang Luo, Ruihua Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19236
Pdf URL: https://arxiv.org/pdf/2505.19236
Copy Paste: [[2505.19236]] Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator(https://arxiv.org/abs/2505.19236)
Keywords: robust, large language model
Abstract: Creativity evaluation remains a challenging frontier for large language models (LLMs). Current evaluations heavily rely on inefficient and costly human judgments, hindering progress in enhancing machine creativity. While automated methods exist, ranging from psychological testing to heuristic- or prompting-based approaches, they often lack generalizability or alignment with human judgment. To address these issues, in this paper, we propose a novel pairwise-comparison framework for assessing textual creativity, leveraging shared contextual instructions to improve evaluation consistency. We introduce CreataSet, a large-scale dataset with 100K+ human-level and 1M+ synthetic creative instruction-response pairs spanning diverse open-domain tasks. Through training on CreataSet, we develop an LLM-based evaluator named CrEval. CrEval demonstrates remarkable superiority over existing methods in alignment with human judgments. Experimental results underscore the indispensable significance of integrating both human-generated and synthetic data in training highly robust evaluators, and showcase the practical utility of CrEval in boosting the creativity of LLMs. We will release all data, code, and models publicly soon to support further research.

Title: Efficient Policy Optimization in Robust Constrained MDPs with Iteration Complexity Guarantees

Authors: Sourav Ganguly, Arnob Ghosh, Kishan Panaganti, Adam Wierman
Subjects: cs.LG, cs.AI, cs.RO, eess.SY
Abstract URL: https://arxiv.org/abs/2505.19238
Pdf URL: https://arxiv.org/pdf/2505.19238
Copy Paste: [[2505.19238]] Efficient Policy Optimization in Robust Constrained MDPs with Iteration Complexity Guarantees(https://arxiv.org/abs/2505.19238)
Keywords: robust
Abstract: Constrained decision-making is essential for designing safe policies in real-world control systems, yet simulated environments often fail to capture real-world adversities. We consider the problem of learning a policy that will maximize the cumulative reward while satisfying a constraint, even when there is a mismatch between the real model and an accessible simulator/nominal model. In particular, we consider the robust constrained Markov decision problem (RCMDP) where an agent needs to maximize the reward and satisfy the constraint against the worst possible stochastic model under the uncertainty set centered around an unknown nominal model. Primal-dual methods, effective for standard constrained MDP (CMDP), are not applicable here because of the lack of the strong duality property. Further, one cannot apply the standard robust value-iteration based approach on the composite value function either as the worst case models may be different for the reward value function and the constraint value function. We propose a novel technique that effectively minimizes the constraint value function--to satisfy the constraints; on the other hand, when all the constraints are satisfied, it can simply maximize the robust reward value function. We prove that such an algorithm finds a policy with at most $\epsilon$ sub-optimality and feasible policy after $O(\epsilon^{-2})$ iterations. In contrast to the state-of-the-art method, we do not need to employ a binary search, thus, we reduce the computation time by at least 4x for smaller value of discount factor ($\gamma$) and by at least 6x for larger value of $\gamma$.

Title: DriveX: Omni Scene Modeling for Learning Generalizable World Knowledge in Autonomous Driving

Authors: Chen Shi, Shaoshuai Shi, Kehua Sheng, Bo Zhang, Li Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19239
Pdf URL: https://arxiv.org/pdf/2505.19239
Copy Paste: [[2505.19239]] DriveX: Omni Scene Modeling for Learning Generalizable World Knowledge in Autonomous Driving(https://arxiv.org/abs/2505.19239)
Keywords: robust
Abstract: Data-driven learning has advanced autonomous driving, yet task-specific models struggle with out-of-distribution scenarios due to their narrow optimization objectives and reliance on costly annotated data. We present DriveX, a self-supervised world model that learns generalizable scene dynamics and holistic representations (geometric, semantic, and motion) from large-scale driving videos. DriveX introduces Omni Scene Modeling (OSM), a module that unifies multimodal supervision-3D point cloud forecasting, 2D semantic representation, and image generation-to capture comprehensive scene evolution. To simplify learning complex dynamics, we propose a decoupled latent world modeling strategy that separates world representation learning from future state decoding, augmented by dynamic-aware ray sampling to enhance motion modeling. For downstream adaptation, we design Future Spatial Attention (FSA), a unified paradigm that dynamically aggregates spatiotemporal features from DriveX's predictions to enhance task-specific inference. Extensive experiments demonstrate DriveX's effectiveness: it achieves significant improvements in 3D future point cloud prediction over prior work, while attaining state-of-the-art results on diverse tasks including occupancy prediction, flow estimation, and end-to-end driving. These results validate DriveX's capability as a general-purpose world model, paving the way for robust and unified autonomous driving frameworks.

Title: LLLMs: A Data-Driven Survey of Evolving Research on Limitations of Large Language Models

Authors: Aida Kostikova, Zhipin Wang, Deidamea Bajri, Ole Pütz, Benjamin Paaßen, Steffen Eger
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19240
Pdf URL: https://arxiv.org/pdf/2505.19240
Copy Paste: [[2505.19240]] LLLMs: A Data-Driven Survey of Evolving Research on Limitations of Large Language Models(https://arxiv.org/abs/2505.19240)
Keywords: security, large language model
Abstract: Large language model (LLM) research has grown rapidly, along with increasing concern about their limitations such as failures in reasoning, hallucinations, and limited multilingual capability. In this survey, we conduct a data-driven, semi-automated review of research on limitations of LLM (LLLMs) from 2022 to 2024 using a bottom-up approach. From a corpus of 250,000 ACL and arXiv papers, we identify 14,648 relevant papers using keyword filtering, LLM-based classification, validated against expert labels, and topic clustering (via two approaches, HDBSCAN+BERTopic and LlooM). We find that LLM-related research increases over fivefold in ACL and fourfold in arXiv. Since 2022, LLLMs research grows even faster, reaching over 30% of LLM papers by late 2024. Reasoning remains the most studied limitation, followed by generalization, hallucination, bias, and security. The distribution of topics in the ACL dataset stays relatively stable over time, while arXiv shifts toward safety and controllability (with topics like security risks, alignment, hallucinations, knowledge editing), and multimodality between 2022 and 2024. We release a dataset of annotated abstracts and a validated methodology, and offer a quantitative view of trends in LLM limitations research.

Title: ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment

Authors: Xiaoqiang Lin, Arun Verma, Zhongxiang Dai, Daniela Rus, See-Kiong Ng, Bryan Kian Hsiang Low
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19241
Pdf URL: https://arxiv.org/pdf/2505.19241
Copy Paste: [[2505.19241]] ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment(https://arxiv.org/abs/2505.19241)
Keywords: large language model
Abstract: The recent success of using human preferences to align large language models (LLMs) has significantly improved their performance in various downstream tasks like question answering, mathematical reasoning, and code generation. However,3 achieving effective LLM alignment depends on high-quality human preference datasets. Collecting these datasets requires human preference annotation, which is costly and resource-intensive, necessitating efficient active data selection methods. Existing methods either lack a strong theoretical foundation or depend on restrictive reward function assumptions (e.g., linearity). To this end, we propose an algorithm, ActiveDPO, that uses a theoretically grounded data selection criterion for non-linear reward functions while directly leveraging the LLM itself to parameterize the reward model that is used for active data selection. As a result, ActiveDPO explicitly accounts for the influence of LLM on data selection, unlike methods that select the data without considering the LLM that is being aligned, thereby leading to more effective and efficient data collection. Extensive experiments show that ActiveDPO outperforms existing methods across various models and datasets.

Title: Deformable Attentive Visual Enhancement for Referring Segmentation Using Vision-Language Model

Authors: Alaa Dalaq, Muzammil Behzad
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19242
Pdf URL: https://arxiv.org/pdf/2505.19242
Copy Paste: [[2505.19242]] Deformable Attentive Visual Enhancement for Referring Segmentation Using Vision-Language Model(https://arxiv.org/abs/2505.19242)
Keywords: segmentation
Abstract: Image segmentation is a fundamental task in computer vision, aimed at partitioning an image into semantically meaningful regions. Referring image segmentation extends this task by using natural language expressions to localize specific objects, requiring effective integration of visual and linguistic information. In this work, we propose SegVLM, a vision-language model that incorporates architectural improvements to enhance segmentation accuracy and cross-modal alignment. The model integrates squeeze-and-excitation (SE) blocks for dynamic feature recalibration, deformable convolutions for geometric adaptability, and residual connections for deep feature learning. We also introduce a novel referring-aware fusion (RAF) loss that balances region-level alignment, boundary precision, and class imbalance. Extensive experiments and ablation studies demonstrate that each component contributes to consistent performance improvements. SegVLM also shows strong generalization across diverse datasets and referring expression scenarios.

Title: To CoT or To Loop? A Formal Comparison Between Chain-of-Thought and Looped Transformers

Authors: Kevin Xu, Issei Sato
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19245
Pdf URL: https://arxiv.org/pdf/2505.19245
Copy Paste: [[2505.19245]] To CoT or To Loop? A Formal Comparison Between Chain-of-Thought and Looped Transformers(https://arxiv.org/abs/2505.19245)
Keywords: transformer
Abstract: Chain-of-Thought (CoT) and Looped Transformers have been shown to empirically improve performance on reasoning tasks and to theoretically enhance expressivity by recursively increasing the number of computational steps. However, their comparative capabilities are still not well understood. In this paper, we provide a formal analysis of their respective strengths and limitations. We show that Looped Transformers can efficiently simulate parallel computations for deterministic tasks, which we formalize as evaluation over directed acyclic graphs. In contrast, CoT with stochastic decoding excels at approximate inference for compositional structures, namely self-reducible problems. These separations suggest the tasks for which depth-driven recursion is more suitable, thereby offering practical cues for choosing between reasoning paradigms.

Title: Improving Value Estimation Critically Enhances Vanilla Policy Gradient

Authors: Tao Wang, Ruipeng Zhang, Sicun Gao
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2505.19247
Pdf URL: https://arxiv.org/pdf/2505.19247
Copy Paste: [[2505.19247]] Improving Value Estimation Critically Enhances Vanilla Policy Gradient(https://arxiv.org/abs/2505.19247)
Keywords: robust
Abstract: Modern policy gradient algorithms, such as TRPO and PPO, outperform vanilla policy gradient in many RL tasks. Questioning the common belief that enforcing approximate trust regions leads to steady policy improvement in practice, we show that the more critical factor is the enhanced value estimation accuracy from more value update steps in each iteration. To demonstrate, we show that by simply increasing the number of value update steps per iteration, vanilla policy gradient itself can achieve performance comparable to or better than PPO in all the standard continuous control benchmark environments. Importantly, this simple change to vanilla policy gradient is significantly more robust to hyperparameter choices, opening up the possibility that RL algorithms may still become more effective and easier to use.

Title: Unveiling Dual Quality in Product Reviews: An NLP-Based Approach

Authors: Rafał Poświata, Marcin Michał Mirończuk, Sławomir Dadas, Małgorzata Grębowiec, Michał Perełkiewicz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19254
Pdf URL: https://arxiv.org/pdf/2505.19254
Copy Paste: [[2505.19254]] Unveiling Dual Quality in Product Reviews: An NLP-Based Approach(https://arxiv.org/abs/2505.19254)
Keywords: robust, transformer
Abstract: Consumers often face inconsistent product quality, particularly when identical products vary between markets, a situation known as the dual quality problem. To identify and address this issue, automated techniques are needed. This paper explores how natural language processing (NLP) can aid in detecting such discrepancies and presents the full process of developing a solution. First, we describe in detail the creation of a new Polish-language dataset with 1,957 reviews, 540 highlighting dual quality issues. We then discuss experiments with various approaches like SetFit with sentence-transformers, transformer-based encoders, and LLMs, including error analysis and robustness verification. Additionally, we evaluate multilingual transfer using a subset of opinions in English, French, and German. The paper concludes with insights on deployment and practical applications.

Title: VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use

Authors: Mingyuan Wu, Jingcheng Yang, Jize Jiang, Meitang Li, Kaizhuo Yan, Hanchao Yu, Minjia Zhang, Chengxiang Zhai, Klara Nahrstedt
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19255
Pdf URL: https://arxiv.org/pdf/2505.19255
Copy Paste: [[2505.19255]] VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use(https://arxiv.org/abs/2505.19255)
Keywords: large language model
Abstract: Reinforcement Learning Finetuning (RFT) has significantly advanced the reasoning capabilities of large language models (LLMs) by enabling long chains of thought, self-correction, and effective tool use. While recent works attempt to extend RFT to vision-language models (VLMs), these efforts largely produce text-only reasoning conditioned on static image inputs, falling short of true multimodal reasoning in the response. In contrast, test-time methods like Visual Sketchpad incorporate visual steps but lack training mechanisms. We introduce VTool-R1, the first framework that trains VLMs to generate multimodal chains of thought by interleaving text and intermediate visual reasoning steps. VTool-R1 integrates Python-based visual editing tools into the RFT process, enabling VLMs to learn when and how to generate visual reasoning steps that benefit final reasoning. Trained with outcome-based rewards tied to task accuracy, our approach elicits strategic visual tool use for reasoning without relying on process-based supervision. Experiments on structured visual question answering over charts and tables show that VTool-R1 enhances reasoning performance by teaching VLMs to "think with images" and generate multimodal chain of thoughts with tools.

Title: PolyPose: Localizing Deformable Anatomy in 3D from Sparse 2D X-ray Images using Polyrigid Transforms

Authors: Vivek Gopalakrishnan, Neel Dey, Polina Golland
Subjects: cs.CV, physics.med-ph
Abstract URL: https://arxiv.org/abs/2505.19256
Pdf URL: https://arxiv.org/pdf/2505.19256
Copy Paste: [[2505.19256]] PolyPose: Localizing Deformable Anatomy in 3D from Sparse 2D X-ray Images using Polyrigid Transforms(https://arxiv.org/abs/2505.19256)
Keywords: robust
Abstract: Determining the 3D pose of a patient from a limited set of 2D X-ray images is a critical task in interventional settings. While preoperative volumetric imaging (e.g., CT and MRI) provides precise 3D localization and visualization of anatomical targets, these modalities cannot be acquired during procedures, where fast 2D imaging (X-ray) is used instead. To integrate volumetric guidance into intraoperative procedures, we present PolyPose, a simple and robust method for deformable 2D/3D registration. PolyPose parameterizes complex 3D deformation fields as a composition of rigid transforms, leveraging the biological constraint that individual bones do not bend in typical motion. Unlike existing methods that either assume no inter-joint movement or fail outright in this under-determined setting, our polyrigid formulation enforces anatomically plausible priors that respect the piecewise rigid nature of human movement. This approach eliminates the need for expensive deformation regularizers that require patient- and procedure-specific hyperparameter optimization. Across extensive experiments on diverse datasets from orthopedic surgery and radiotherapy, we show that this strong inductive bias enables PolyPose to successfully align the patient's preoperative volume to as few as two X-ray images, thereby providing crucial 3D guidance in challenging sparse-view and limited-angle settings where current registration methods fail.

Title: Towards Large Reasoning Models for Agriculture

Authors: Hossein Zaremehrjerdi, Shreyan Ganguly, Ashlyn Rairdin, Elizabeth Tranel, Benjamin Feuer, Juan Ignacio Di Salvo, Srikanth Panthulugiri, Victoria Moser, Sarah Jones, Joscif G Raigne, Yanben Shen, Heidi M. Dornath, Aditya Balu, Adarsh Krishnamurthy, Asheesh K Singh, Arti Singh, Baskar Ganapathysubramanian, Chinmay Hegde, Soumik Sarkar
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19259
Pdf URL: https://arxiv.org/pdf/2505.19259
Copy Paste: [[2505.19259]] Towards Large Reasoning Models for Agriculture(https://arxiv.org/abs/2505.19259)
Keywords: large language model
Abstract: Agricultural decision-making involves complex, context-specific reasoning, where choices about crops, practices, and interventions depend heavily on geographic, climatic, and economic conditions. Traditional large language models (LLMs) often fall short in navigating this nuanced problem due to limited reasoning capacity. We hypothesize that recent advances in large reasoning models (LRMs) can better handle such structured, domain-specific inference. To investigate this, we introduce AgReason, the first expert-curated open-ended science benchmark with 100 questions for agricultural reasoning. Evaluations across thirteen open-source and proprietary models reveal that LRMs outperform conventional ones, though notable challenges persist, with the strongest Gemini-based baseline achieving 36% accuracy. We also present AgThoughts, a large-scale dataset of 44.6K question-answer pairs generated with human oversight and equipped with synthetically generated reasoning traces. Using AgThoughts, we develop AgThinker, a suite of small reasoning models that can be run on consumer-grade GPUs, and show that our dataset can be effective in unlocking agricultural reasoning abilities in LLMs. Our project page is here: this https URL

Title: ALRPHFS: Adversarially Learned Risk Patterns with Hierarchical Fast \& Slow Reasoning for Robust Agent Defense

Authors: Shiyu Xiang, Tong Zhang, Ronghao Chen
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.19260
Pdf URL: https://arxiv.org/pdf/2505.19260
Copy Paste: [[2505.19260]] ALRPHFS: Adversarially Learned Risk Patterns with Hierarchical Fast \& Slow Reasoning for Robust Agent Defense(https://arxiv.org/abs/2505.19260)
Keywords: defense, robust
Abstract: LLM Agents are becoming central to intelligent systems. However, their deployment raises serious safety concerns. Existing defenses largely rely on "Safety Checks", which struggle to capture the complex semantic risks posed by harmful user inputs or unsafe agent behaviors - creating a significant semantic gap between safety checks and real-world risks. To bridge this gap, we propose a novel defense framework, ALRPHFS (Adversarially Learned Risk Patterns with Hierarchical Fast & Slow Reasoning). ALRPHFS consists of two core components: (1) an offline adversarial self-learning loop to iteratively refine a generalizable and balanced library of risk patterns, substantially enhancing robustness without retraining the base LLM, and (2) an online hierarchical fast & slow reasoning engine that balances detection effectiveness with computational efficiency. Experimental results demonstrate that our approach achieves superior overall performance compared to existing baselines, achieving a best-in-class average accuracy of 80% and exhibiting strong generalizability across agents and tasks.

Title: Enhancing Text-to-Image Diffusion Transformer via Split-Text Conditioning

Authors: Yu Zhang, Jialei Zhou, Xinchen Li, Qi Zhang, Zhongwei Wan, Tianyu Wang, Duoqian Miao, Changwei Wang, Longbing Cao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19261
Pdf URL: https://arxiv.org/pdf/2505.19261
Copy Paste: [[2505.19261]] Enhancing Text-to-Image Diffusion Transformer via Split-Text Conditioning(https://arxiv.org/abs/2505.19261)
Keywords: diffusion, transformer, large language model
Abstract: Current text-to-image diffusion generation typically employs complete-text conditioning. Due to the intricate syntax, diffusion transformers (DiTs) inherently suffer from a comprehension defect of complete-text captions. One-fly complete-text input either overlooks critical semantic details or causes semantic confusion by simultaneously modeling diverse semantic primitive types. To mitigate this defect of DiTs, we propose a novel split-text conditioning framework named DiT-ST. This framework converts a complete-text caption into a split-text caption, a collection of simplified sentences, to explicitly express various semantic primitives and their interconnections. The split-text caption is then injected into different denoising stages of DiT-ST in a hierarchical and incremental manner. Specifically, DiT-ST leverages Large Language Models to parse captions, extracting diverse primitives and hierarchically sorting out and constructing these primitives into a split-text input. Moreover, we partition the diffusion denoising process according to its differential sensitivities to diverse semantic primitive types and determine the appropriate timesteps to incrementally inject tokens of diverse semantic primitive types into input tokens via cross-attention. In this way, DiT-ST enhances the representation learning of specific semantic primitive types across different stages. Extensive experiments validate the effectiveness of our proposed DiT-ST in mitigating the complete-text comprehension defect.

Title: Cellular Traffic Prediction via Byzantine-robust Asynchronous Federated Learning

Authors: Hui Ma, Kai Yang, Yang Jiao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19263
Pdf URL: https://arxiv.org/pdf/2505.19263
Copy Paste: [[2505.19263]] Cellular Traffic Prediction via Byzantine-robust Asynchronous Federated Learning(https://arxiv.org/abs/2505.19263)
Keywords: privacy, attack, robust, federate
Abstract: Network traffic prediction plays a crucial role in intelligent network operation. Traditional prediction methods often rely on centralized training, necessitating the transfer of vast amounts of traffic data to a central server. This approach can lead to latency and privacy concerns. To address these issues, federated learning integrated with differential privacy has emerged as a solution to improve data privacy and model robustness in distributed settings. Nonetheless, existing federated learning protocols are vulnerable to Byzantine attacks, which may significantly compromise model robustness. Developing a robust and privacy-preserving prediction model in the presence of Byzantine clients remains a significant challenge. To this end, we propose an asynchronous differential federated learning framework based on distributionally robust optimization. The proposed framework utilizes multiple clients to train the prediction model collaboratively with local differential privacy. In addition, regularization techniques have been employed to further improve the Byzantine robustness of the models. We have conducted extensive experiments on three real-world datasets, and the results elucidate that our proposed distributed algorithm can achieve superior performance over existing methods.

Title: Improving Novel view synthesis of 360$^\circ$ Scenes in Extremely Sparse Views by Jointly Training Hemisphere Sampled Synthetic Images

Authors: Guangan Chen, Anh Minh Truong, Hanhe Lin, Michiel Vlaminck, Wilfried Philips, Hiep Luong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19264
Pdf URL: https://arxiv.org/pdf/2505.19264
Copy Paste: [[2505.19264]] Improving Novel view synthesis of 360$^\circ$ Scenes in Extremely Sparse Views by Jointly Training Hemisphere Sampled Synthetic Images(https://arxiv.org/abs/2505.19264)
Keywords: diffusion
Abstract: Novel view synthesis in 360$^\circ$ scenes from extremely sparse input views is essential for applications like virtual reality and augmented reality. This paper presents a novel framework for novel view synthesis in extremely sparse-view cases. As typical structure-from-motion methods are unable to estimate camera poses in extremely sparse-view cases, we apply DUSt3R to estimate camera poses and generate a dense point cloud. Using the poses of estimated cameras, we densely sample additional views from the upper hemisphere space of the scenes, from which we render synthetic images together with the point cloud. Training 3D Gaussian Splatting model on a combination of reference images from sparse views and densely sampled synthetic images allows a larger scene coverage in 3D space, addressing the overfitting challenge due to the limited input in sparse-view cases. Retraining a diffusion-based image enhancement model on our created dataset, we further improve the quality of the point-cloud-rendered images by removing artifacts. We compare our framework with benchmark methods in cases of only four input views, demonstrating significant improvement in novel view synthesis under extremely sparse-view conditions for 360$^\circ$ scenes.

Title: A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning

Authors: Yuzheng Hu, Fan Wu, Haotian Ye, David Forsyth, James Zou, Nan Jiang, Jiaqi W. Ma, Han Zhao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19281
Pdf URL: https://arxiv.org/pdf/2505.19281
Copy Paste: [[2505.19281]] A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning(https://arxiv.org/abs/2505.19281)
Keywords: interpretability, large language model
Abstract: Online reinforcement learning (RL) excels in complex, safety-critical domains, yet it faces challenges such as sample inefficiency, training instability, and a lack of interpretability. Data attribution offers a principled way to trace model behavior back to individual training samples. However, in online RL, each training sample not only drives policy updates but also influences future data collection, violating the fixed dataset assumption in existing attribution methods. In this paper, we initiate the study of data attribution for online RL, focusing on the widely used Proximal Policy Optimization (PPO) algorithm. We start by establishing a local attribution framework, interpreting model checkpoints with respect to the records in the recent training buffer. We design two target functions, capturing agent action and cumulative return respectively, and measure each record's contribution through gradient similarity between its training loss and these targets. We demonstrate the power of this framework through three concrete applications: diagnosis of learning, temporal analysis of behavior formation, and targeted intervention during training. Leveraging this framework, we further propose an algorithm, iterative influence-based filtering (IIF), for online RL training that iteratively performs experience filtering to refine policy updates. Across standard RL benchmarks (classic control, navigation, locomotion) to RLHF for large language models, IIF reduces sample complexity, speeds up training, and achieves higher returns. Overall, these results advance interpretability, efficiency, and effectiveness of online RL.

Title: BSAGIoT: A Bayesian Security Aspect Graph for Internet of Things (IoT)

Authors: Zeinab Lashkaripour, Masoud Khosravi-Farmad, AhmadReza Montazerolghaem, Razieh Rezaee
Subjects: cs.CR, cs.NI
Abstract URL: https://arxiv.org/abs/2505.19283
Pdf URL: https://arxiv.org/pdf/2505.19283
Copy Paste: [[2505.19283]] BSAGIoT: A Bayesian Security Aspect Graph for Internet of Things (IoT)(https://arxiv.org/abs/2505.19283)
Keywords: security, privacy, attack
Abstract: IoT is a dynamic network of interconnected things that communicate and exchange data, where security is a significant issue. Previous studies have mainly focused on attack classifications and open issues rather than presenting a comprehensive overview on the existing threats and vulnerabilities. This knowledge helps analyzing the network in the early stages even before any attack takes place. In this paper, the researchers have proposed different security aspects and a novel Bayesian Security Aspects Dependency Graph for IoT (BSAGIoT) to illustrate their relations. The proposed BSAGIoT is a generic model applicable to any IoT network and contains aspects from five categories named data, access control, standard, network, and loss. This proposed Bayesian Security Aspect Graph (BSAG) presents an overview of the security aspects in any given IoT network. The purpose of BSAGIoT is to assist security experts in analyzing how a successful compromise and/or a failed breach could impact the overall security and privacy of the respective IoT network. In addition, root cause identification of security challenges, how they affect one another, their impact on IoT networks via topological sorting, and risk assessment could be achieved. Hence, to demonstrate the feasibility of the proposed method, experimental results with various scenarios has been presented, in which the security aspects have been quantified based on the network configurations. The results indicate the impact of the aspects on each other and how they could be utilized to mitigate and/or eliminate the security and privacy deficiencies in IoT networks.

Title: A Graph Perspective to Probe Structural Patterns of Knowledge in Large Language Models

Authors: Utkarsh Sahu, Zhisheng Qi, Yongjia Lei, Ryan A. Rossi, Franck Dernoncourt, Nesreen K. Ahmed, Mahantesh M Halappanavar, Yao Ma, Yu Wang
Subjects: cs.CL, cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2505.19286
Pdf URL: https://arxiv.org/pdf/2505.19286
Copy Paste: [[2505.19286]] A Graph Perspective to Probe Structural Patterns of Knowledge in Large Language Models(https://arxiv.org/abs/2505.19286)
Keywords: explainability, large language model
Abstract: Large language models have been extensively studied as neural knowledge bases for their knowledge access, editability, reasoning, and explainability. However, few works focus on the structural patterns of their knowledge. Motivated by this gap, we investigate these structural patterns from a graph perspective. We quantify the knowledge of LLMs at both the triplet and entity levels, and analyze how it relates to graph structural properties such as node degree. Furthermore, we uncover the knowledge homophily, where topologically close entities exhibit similar levels of knowledgeability, which further motivates us to develop graph machine learning models to estimate entity knowledge based on its local neighbors. This model further enables valuable knowledge checking by selecting triplets less known to LLMs. Empirical results show that using selected triplets for fine-tuning leads to superior performance.

Title: Hypercube-RAG: Hypercube-Based Retrieval-Augmented Generation for In-domain Scientific Question-Answering

Authors: Jimeng Shi, Sizhe Zhou, Bowen Jin, Wei Hu, Shaowen Wang, Giri Narasimhan, Jiawei Han
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19288
Pdf URL: https://arxiv.org/pdf/2505.19288
Copy Paste: [[2505.19288]] Hypercube-RAG: Hypercube-Based Retrieval-Augmented Generation for In-domain Scientific Question-Answering(https://arxiv.org/abs/2505.19288)
Keywords: explainability, large language model
Abstract: Large language models (LLMs) often need to incorporate external knowledge to solve theme-specific problems. Retrieval-augmented generation (RAG), which empowers LLMs to generate more qualified responses with retrieved external data and knowledge, has shown its high promise. However, traditional semantic similarity-based RAGs struggle to return concise yet highly relevant information for domain knowledge-intensive tasks, such as scientific question-answering (QA). Built on a multi-dimensional (cube) structure called Hypercube, which can index documents in an application-driven, human-defined, multi-dimensional space, we introduce the Hypercube-RAG, a novel RAG framework for precise and efficient retrieval. Given a query, Hypercube-RAG first decomposes it based on its entities and topics and then retrieves relevant documents from cubes by aligning these decomposed components with hypercube dimensions. Experiments on three in-domain scientific QA datasets demonstrate that our method improves accuracy by 3.7% and boosts retrieval efficiency by 81.2%, measured as relative gains over the strongest RAG baseline. More importantly, our Hypercube-RAG inherently offers explainability by revealing the underlying predefined hypercube dimensions used for retrieval. The code and data sets are available at this https URL.

Title: TextDiffuser-RL: Efficient and Robust Text Layout Optimization for High-Fidelity Text-to-Image Synthesis

Authors: Kazi Mahathir Rahman, Showrin Rahman, Sharmin Sultana Srishty
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19291
Pdf URL: https://arxiv.org/pdf/2505.19291
Copy Paste: [[2505.19291]] TextDiffuser-RL: Efficient and Robust Text Layout Optimization for High-Fidelity Text-to-Image Synthesis(https://arxiv.org/abs/2505.19291)
Keywords: robust, diffusion
Abstract: Text-embedded image generation plays a critical role in industries such as graphic design, advertising, and digital content creation. Text-to-Image generation methods leveraging diffusion models, such as TextDiffuser-2, have demonstrated promising results in producing images with embedded text. TextDiffuser-2 effectively generates bounding box layouts that guide the rendering of visual text, achieving high fidelity and coherence. However, existing approaches often rely on resource-intensive processes and are limited in their ability to run efficiently on both CPU and GPU platforms. To address these challenges, we propose a novel two-stage pipeline that integrates reinforcement learning (RL) for rapid and optimized text layout generation with a diffusion-based image synthesis model. Our RL-based approach significantly accelerates the bounding box prediction step while reducing overlaps, allowing the system to run efficiently on both CPUs and GPUs. Extensive evaluations demonstrate that our framework maintains or surpasses TextDiffuser-2's quality in text placement and image synthesis, with markedly faster runtime and increased flexibility. Extensive evaluations demonstrate that our framework maintains or surpasses TextDiffuser-2's quality in text placement and image synthesis, with markedly faster runtime and increased flexibility. Our approach has been evaluated on the MARIOEval benchmark, achieving OCR and CLIPScore metrics close to state-of-the-art models, while being 97.64% more faster and requiring only 2MB of memory to run.

Title: Alchemist: Turning Public Text-to-Image Data into Generative Gold

Authors: Valerii Startsev, Alexander Ustyuzhanin, Alexey Kirillov, Dmitry Baranchuk, Sergey Kastryulin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19297
Pdf URL: https://arxiv.org/pdf/2505.19297
Copy Paste: [[2505.19297]] Alchemist: Turning Public Text-to-Image Data into Generative Gold(https://arxiv.org/abs/2505.19297)
Keywords: generative
Abstract: Pre-training equips text-to-image (T2I) models with broad world knowledge, but this alone is often insufficient to achieve high aesthetic quality and alignment. Consequently, supervised fine-tuning (SFT) is crucial for further refinement. However, its effectiveness highly depends on the quality of the fine-tuning dataset. Existing public SFT datasets frequently target narrow domains (e.g., anime or specific art styles), and the creation of high-quality, general-purpose SFT datasets remains a significant challenge. Current curation methods are often costly and struggle to identify truly impactful samples. This challenge is further complicated by the scarcity of public general-purpose datasets, as leading models often rely on large, proprietary, and poorly documented internal data, hindering broader research progress. This paper introduces a novel methodology for creating general-purpose SFT datasets by leveraging a pre-trained generative model as an estimator of high-impact training samples. We apply this methodology to construct and release Alchemist, a compact (3,350 samples) yet highly effective SFT dataset. Experiments demonstrate that Alchemist substantially improves the generative quality of five public T2I models while preserving diversity and style. Additionally, we release the fine-tuned models' weights to the public.

Title: A Necessary Step toward Faithfulness: Measuring and Improving Consistency in Free-Text Explanations

Authors: Lingjun Zhao, Hal Daumé III
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19299
Pdf URL: https://arxiv.org/pdf/2505.19299
Copy Paste: [[2505.19299]] A Necessary Step toward Faithfulness: Measuring and Improving Consistency in Free-Text Explanations(https://arxiv.org/abs/2505.19299)
Keywords: large language model
Abstract: Faithful free-text explanations are important to ensure transparency in high-stakes AI decision-making contexts, but they are challenging to generate by language models and assess by humans. In this paper, we present a measure for Prediction-EXplanation (PEX) consistency, by extending the concept of weight of evidence. This measure quantifies how much a free-text explanation supports or opposes a prediction, serving as an important aspect of explanation faithfulness. Our analysis reveals that more than 62% explanations generated by large language models lack this consistency. We show that applying direct preference optimization improves the consistency of generated explanations across three model families, with improvement ranging from 43.1% to 292.3%. Furthermore, we demonstrate that optimizing this consistency measure can improve explanation faithfulness by up to 9.7%.

Title: SituatedThinker: Grounding LLM Reasoning with Real-World through Situated Thinking

Authors: Junnan Liu, Linhao Luo, Thuy-Trang Vu, Gholamreza Haffari
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19300
Pdf URL: https://arxiv.org/pdf/2505.19300
Copy Paste: [[2505.19300]] SituatedThinker: Grounding LLM Reasoning with Real-World through Situated Thinking(https://arxiv.org/abs/2505.19300)
Keywords: large language model
Abstract: Recent advances in large language models (LLMs) demonstrate their impressive reasoning capabilities. However, the reasoning confined to internal parametric space limits LLMs' access to real-time information and understanding of the physical world. To overcome this constraint, we introduce SituatedThinker, a novel framework that enables LLMs to ground their reasoning in real-world contexts through situated thinking, which adaptively combines both internal knowledge and external information with predefined interfaces. By utilizing reinforcement learning, SituatedThinker incentivizes deliberate reasoning with the real world to acquire information and feedback, allowing LLMs to surpass their knowledge boundaries and enhance reasoning. Experimental results demonstrate significant performance improvements on multi-hop question-answering and mathematical reasoning benchmarks. Furthermore, SituatedThinker demonstrates strong performance on unseen tasks, such as KBQA, TableQA, and text-based games, showcasing the generalizable real-world grounded reasoning capability. Our codes are available at this https URL.

Title: A Novel Zero-Trust Identity Framework for Agentic AI: Decentralized Authentication and Fine-Grained Access Control

Authors: Ken Huang, Vineeth Sai Narajala, John Yeoh, Ramesh Raskar, Youssef Harkati, Jerry Huang, Idan Habler, Chris Hughes
Subjects: cs.CR, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2505.19301
Pdf URL: https://arxiv.org/pdf/2505.19301
Copy Paste: [[2505.19301]] A Novel Zero-Trust Identity Framework for Agentic AI: Decentralized Authentication and Fine-Grained Access Control(https://arxiv.org/abs/2505.19301)
Keywords: secure, security, privacy
Abstract: Traditional Identity and Access Management (IAM) systems, primarily designed for human users or static machine identities via protocols such as OAuth, OpenID Connect (OIDC), and SAML, prove fundamentally inadequate for the dynamic, interdependent, and often ephemeral nature of AI agents operating at scale within Multi Agent Systems (MAS), a computational system composed of multiple interacting intelligent agents that work collectively. This paper posits the imperative for a novel Agentic AI IAM framework: We deconstruct the limitations of existing protocols when applied to MAS, illustrating with concrete examples why their coarse-grained controls, single-entity focus, and lack of context-awareness falter. We then propose a comprehensive framework built upon rich, verifiable Agent Identities (IDs), leveraging Decentralized Identifiers (DIDs) and Verifiable Credentials (VCs), that encapsulate an agents capabilities, provenance, behavioral scope, and security posture. Our framework includes an Agent Naming Service (ANS) for secure and capability-aware discovery, dynamic fine-grained access control mechanisms, and critically, a unified global session management and policy enforcement layer for real-time control and consistent revocation across heterogeneous agent communication protocols. We also explore how Zero-Knowledge Proofs (ZKPs) enable privacy-preserving attribute disclosure and verifiable policy compliance. We outline the architecture, operational lifecycle, innovative contributions, and security considerations of this new IAM paradigm, aiming to establish the foundational trust, accountability, and security necessary for the burgeoning field of agentic AI and the complex ecosystems they will inhabit.

Title: Concept Reachability in Diffusion Models: Beyond Dataset Constraints

Authors: Marta Aparicio Rodriguez, Xenia Miscouridou, Anastasia Borovykh
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19313
Pdf URL: https://arxiv.org/pdf/2505.19313
Copy Paste: [[2505.19313]] Concept Reachability in Diffusion Models: Beyond Dataset Constraints(https://arxiv.org/abs/2505.19313)
Keywords: diffusion
Abstract: Despite significant advances in quality and complexity of the generations in text-to-image models, prompting does not always lead to the desired outputs. Controlling model behaviour by directly steering intermediate model activations has emerged as a viable alternative allowing to reach concepts in latent space that may otherwise remain inaccessible by prompt. In this work, we introduce a set of experiments to deepen our understanding of concept reachability. We design a training data setup with three key obstacles: scarcity of concepts, underspecification of concepts in the captions, and data biases with tied concepts. Our results show: (i) concept reachability in latent space exhibits a distinct phase transition, with only a small number of samples being sufficient to enable reachability, (ii) where in the latent space the intervention is performed critically impacts reachability, showing that certain concepts are reachable only at certain stages of transformation, and (iii) while prompting ability rapidly diminishes with a decrease in quality of the dataset, concepts often remain reliably reachable through steering. Model providers can leverage this to bypass costly retraining and dataset curation and instead innovate with user-facing control mechanisms.

Title: Likert or Not: LLM Absolute Relevance Judgments on Fine-Grained Ordinal Scales

Authors: Charles Godfrey, Ping Nie, Natalia Ostapuk, David Ken, Shang Gao, Souheil Inati
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2505.19334
Pdf URL: https://arxiv.org/pdf/2505.19334
Copy Paste: [[2505.19334]] Likert or Not: LLM Absolute Relevance Judgments on Fine-Grained Ordinal Scales(https://arxiv.org/abs/2505.19334)
Keywords: large language model
Abstract: Large language models (LLMs) obtain state of the art zero shot relevance ranking performance on a variety of information retrieval tasks. The two most common prompts to elicit LLM relevance judgments are pointwise scoring (a.k.a. relevance generation), where the LLM sees a single query-document pair and outputs a single relevance score, and listwise ranking (a.k.a. permutation generation), where the LLM sees a query and a list of documents and outputs a permutation, sorting the documents in decreasing order of relevance. The current research community consensus is that listwise ranking yields superior performance, and significant research effort has been devoted to crafting LLM listwise ranking algorithms. The underlying hypothesis is that LLMs are better at making relative relevance judgments than absolute ones. In tension with this hypothesis, we find that the gap between pointwise scoring and listwise ranking shrinks when pointwise scoring is implemented using a sufficiently large ordinal relevance label space, becoming statistically insignificant for many LLM-benchmark dataset combinations (where ``significant'' means ``95\% confidence that listwise ranking improves NDCG@10''). Our evaluations span four LLMs, eight benchmark datasets from the BEIR and TREC-DL suites, and two proprietary datasets with relevance labels collected after the training cut-off of all LLMs evaluated.

Title: Prompting Decision Transformers for Zero-Shot Reach-Avoid Policies

Authors: Kevin Li, Marinka Zitnik
Subjects: cs.LG, cs.AI, q-bio.QM
Abstract URL: https://arxiv.org/abs/2505.19337
Pdf URL: https://arxiv.org/pdf/2505.19337
Copy Paste: [[2505.19337]] Prompting Decision Transformers for Zero-Shot Reach-Avoid Policies(https://arxiv.org/abs/2505.19337)
Keywords: transformer
Abstract: Offline goal-conditioned reinforcement learning methods have shown promise for reach-avoid tasks, where an agent must reach a target state while avoiding undesirable regions of the state space. Existing approaches typically encode avoid-region information into an augmented state space and cost function, which prevents flexible, dynamic specification of novel avoid-region information at evaluation time. They also rely heavily on well-designed reward and cost functions, limiting scalability to complex or poorly structured environments. We introduce RADT, a decision transformer model for offline, reward-free, goal-conditioned, avoid region-conditioned RL. RADT encodes goals and avoid regions directly as prompt tokens, allowing any number of avoid regions of arbitrary size to be specified at evaluation time. Using only suboptimal offline trajectories from a random policy, RADT learns reach-avoid behavior through a novel combination of goal and avoid-region hindsight relabeling. We benchmark RADT against 3 existing offline goal-conditioned RL models across 11 tasks, environments, and experimental settings. RADT generalizes in a zero-shot manner to out-of-distribution avoid region sizes and counts, outperforming baselines that require retraining. In one such zero-shot setting, RADT achieves 35.7% improvement in normalized cost over the best retrained baseline while maintaining high goal-reaching success. We apply RADT to cell reprogramming in biology, where it reduces visits to undesirable intermediate gene expression states during trajectories to desired target states, despite stochastic transitions and discrete, structured state dynamics.

Title: Communication-Efficient Multi-Device Inference Acceleration for Transformer Models

Authors: Xiao Liu, Lijun Zhang, Deepak Ganesan, Hui Guan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19342
Pdf URL: https://arxiv.org/pdf/2505.19342
Copy Paste: [[2505.19342]] Communication-Efficient Multi-Device Inference Acceleration for Transformer Models(https://arxiv.org/abs/2505.19342)
Keywords: transformer
Abstract: Transformer models power many AI applications but suffer from high inference latency, limiting their use in real-time settings. Multi-device inference can reduce latency by parallelizing computation. Yet, existing methods require high inter-device bandwidth, making them impractical for bandwidth-constrained environments. We propose ASTRA, a communication-efficient framework that accelerates Transformer inference through a novel integration of sequence parallelism and a Mixed-Precision Attention mechanism designed to minimize inter-device communication. ASTRA compresses non-local token embeddings via vector quantization and preserves task accuracy through two optimizations, Noise-Augmented Quantization and Distributed Class Tokens. Experiments on ViT and GPT2 across vision and NLP tasks show that ASTRA achieves up to 2.64X speedups over single-device inference and up to 15.25X speedups over state-of-the-art multi-device inferences, while operating under bandwidths as low as 10 Mbps. ASTRA is open-sourced at this https URL.

Title: PatentScore: Multi-dimensional Evaluation of LLM-Generated Patent Claims

Authors: Yongmin Yoo, Qiongkai Xu, Longbing Cao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19345
Pdf URL: https://arxiv.org/pdf/2505.19345
Copy Paste: [[2505.19345]] PatentScore: Multi-dimensional Evaluation of LLM-Generated Patent Claims(https://arxiv.org/abs/2505.19345)
Keywords: protect, robust, large language model
Abstract: Natural language generation (NLG) metrics play a central role in evaluating generated texts, but are not well suited for the structural and legal characteristics of patent documents. Large language models (LLMs) offer strong potential in automating patent generation, yet research on evaluating LLM-generated patents remains limited, especially in evaluating the generation quality of patent claims, which are central to defining the scope of protection. Effective claim evaluation requires addressing legal validity, technical accuracy, and structural compliance. To address this gap, we introduce PatentScore, a multi-dimensional evaluation framework for assessing LLM-generated patent claims. PatentScore incorporates: (1) hierarchical decomposition for claim analysis; (2) domain-specific validation patterns based on legal and technical standards; and (3) scoring across structural, semantic, and legal dimensions. Unlike general-purpose NLG metrics, PatentScore reflects patent-specific constraints and document structures, enabling evaluation beyond surface similarity. We evaluate 400 GPT-4o-mini generated Claim 1s and report a Pearson correlation of $r = 0.819$ with expert annotations, outperforming existing NLG metrics. Furthermore, we conduct additional evaluations using open models such as Claude-3.5-Haiku and Gemini-1.5-flash, all of which show strong correlations with expert judgments, confirming the robustness and generalizability of our framework.

Title: Beyond Editing Pairs: Fine-Grained Instructional Image Editing via Multi-Scale Learnable Regions

Authors: Chenrui Ma, Xi Xiao, Tianyang Wang, Yanning Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19352
Pdf URL: https://arxiv.org/pdf/2505.19352
Copy Paste: [[2505.19352]] Beyond Editing Pairs: Fine-Grained Instructional Image Editing via Multi-Scale Learnable Regions(https://arxiv.org/abs/2505.19352)
Keywords: generative
Abstract: Current text-driven image editing methods typically follow one of two directions: relying on large-scale, high-quality editing pair datasets to improve editing precision and diversity, or exploring alternative dataset-free techniques. However, constructing large-scale editing datasets requires carefully designed pipelines, is time-consuming, and often results in unrealistic samples or unwanted artifacts. Meanwhile, dataset-free methods may suffer from limited instruction comprehension and restricted editing capabilities. Faced with these challenges, the present work develops a novel paradigm for instruction-driven image editing that leverages widely available and enormous text-image pairs, instead of relying on editing pair datasets. Our approach introduces a multi-scale learnable region to localize and guide the editing process. By treating the alignment between images and their textual descriptions as supervision and learning to generate task-specific editing regions, our method achieves high-fidelity, precise, and instruction-consistent image editing. Extensive experiments demonstrate that the proposed approach attains state-of-the-art performance across various tasks and benchmarks, while exhibiting strong adaptability to various types of generative models.

Title: GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance

Authors: Mohammad Mahdi Moradi, Sudhir Mudur
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.19354
Pdf URL: https://arxiv.org/pdf/2505.19354
Copy Paste: [[2505.19354]] GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance(https://arxiv.org/abs/2505.19354)
Keywords: large language model
Abstract: Knowledge-Based Visual Question Answering (KB-VQA) methods focus on tasks that demand reasoning with information extending beyond the explicit content depicted in the image. Early methods relied on explicit knowledge bases to provide this auxiliary information. Recent approaches leverage Large Language Models (LLMs) as implicit knowledge sources. While KB-VQA methods have demonstrated promising results, their potential remains constrained as the auxiliary text provided may not be relevant to the question context, and may also include irrelevant information that could misguide the answer predictor. We introduce a novel four-stage framework called Grounding Caption-Guided Knowledge-Based Visual Question Answering (GC-KBVQA), which enables LLMs to effectively perform zero-shot VQA tasks without the need for end-to-end multimodal training. Innovations include grounding question-aware caption generation to move beyond generic descriptions and have compact, yet detailed and context-rich information. This is combined with knowledge from external sources to create highly informative prompts for the LLM. GC-KBVQA can address a variety of VQA tasks, and does not require task-specific fine-tuning, thus reducing both costs and deployment complexity by leveraging general-purpose, pre-trained LLMs. Comparison with competing KB-VQA methods shows significantly improved performance. Our code will be made public.

Title: ChartLens: Fine-grained Visual Attribution in Charts

Authors: Manan Suri, Puneet Mathur, Nedim Lipka, Franck Dernoncourt, Ryan A. Rossi, Dinesh Manocha
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19360
Pdf URL: https://arxiv.org/pdf/2505.19360
Copy Paste: [[2505.19360]] ChartLens: Fine-grained Visual Attribution in Charts(https://arxiv.org/abs/2505.19360)
Keywords: large language model, segmentation
Abstract: The growing capabilities of multimodal large language models (MLLMs) have advanced tasks like chart understanding. However, these models often suffer from hallucinations, where generated text sequences conflict with the provided visual data. To address this, we introduce Post-Hoc Visual Attribution for Charts, which identifies fine-grained chart elements that validate a given chart-associated response. We propose ChartLens, a novel chart attribution algorithm that uses segmentation-based techniques to identify chart objects and employs set-of-marks prompting with MLLMs for fine-grained visual attribution. Additionally, we present ChartVA-Eval, a benchmark with synthetic and real-world charts from diverse domains like finance, policy, and economics, featuring fine-grained attribution annotations. Our evaluations show that ChartLens improves fine-grained attributions by 26-66%.

Title: RADEP: A Resilient Adaptive Defense Framework Against Model Extraction Attacks

Authors: Amit Chakraborty, Sayyed Farid Ahamed, Sandip Roy, Soumya Banerjee, Kevin Choi, Abdul Rahman, Alison Hu, Edward Bowen, Sachin Shetty
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.19364
Pdf URL: https://arxiv.org/pdf/2505.19364
Copy Paste: [[2505.19364]] RADEP: A Resilient Adaptive Defense Framework Against Model Extraction Attacks(https://arxiv.org/abs/2505.19364)
Keywords: security, protect, defense, attack, extraction, watermark
Abstract: Machine Learning as a Service (MLaaS) enables users to leverage powerful machine learning models through cloud-based APIs, offering scalability and ease of deployment. However, these services are vulnerable to model extraction attacks, where adversaries repeatedly query the application programming interface (API) to reconstruct a functionally similar model, compromising intellectual property and security. Despite various defense strategies being proposed, many suffer from high computational costs, limited adaptability to evolving attack techniques, and a reduction in performance for legitimate users. In this paper, we introduce a Resilient Adaptive Defense Framework for Model Extraction Attack Protection (RADEP), a multifaceted defense framework designed to counteract model extraction attacks through a multi-layered security approach. RADEP employs progressive adversarial training to enhance model resilience against extraction attempts. Malicious query detection is achieved through a combination of uncertainty quantification and behavioral pattern analysis, effectively identifying adversarial queries. Furthermore, we develop an adaptive response mechanism that dynamically modifies query outputs based on their suspicion scores, reducing the utility of stolen models. Finally, ownership verification is enforced through embedded watermarking and backdoor triggers, enabling reliable identification of unauthorized model use. Experimental evaluations demonstrate that RADEP significantly reduces extraction success rates while maintaining high detection accuracy with minimal impact on legitimate queries. Extensive experiments show that RADEP effectively defends against model extraction attacks and remains resilient even against adaptive adversaries, making it a reliable security framework for MLaaS models.

Title: SETransformer: A Hybrid Attention-Based Architecture for Robust Human Activity Recognition

Authors: Yunbo Liu, Xukui Qin, Yifan Gao, Xiang Li, Chengwei Feng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19369
Pdf URL: https://arxiv.org/pdf/2505.19369
Copy Paste: [[2505.19369]] SETransformer: A Hybrid Attention-Based Architecture for Robust Human Activity Recognition(https://arxiv.org/abs/2505.19369)
Keywords: robust, transformer
Abstract: Human Activity Recognition (HAR) using wearable sensor data has become a central task in mobile computing, healthcare, and human-computer interaction. Despite the success of traditional deep learning models such as CNNs and RNNs, they often struggle to capture long-range temporal dependencies and contextual relevance across multiple sensor channels. To address these limitations, we propose SETransformer, a hybrid deep neural architecture that combines Transformer-based temporal modeling with channel-wise squeeze-and-excitation (SE) attention and a learnable temporal attention pooling mechanism. The model takes raw triaxial accelerometer data as input and leverages global self-attention to capture activity-specific motion dynamics over extended time windows, while adaptively emphasizing informative sensor channels and critical time steps. We evaluate SETransformer on the WISDM dataset and demonstrate that it significantly outperforms conventional models including LSTM, GRU, BiLSTM, and CNN baselines. The proposed model achieves a validation accuracy of 84.68\% and a macro F1-score of 84.64\%, surpassing all baseline architectures by a notable margin. Our results show that SETransformer is a competitive and interpretable solution for real-world HAR tasks, with strong potential for deployment in mobile and ubiquitous sensing applications.

Title: DiSa: Directional Saliency-Aware Prompt Learning for Generalizable Vision-Language Models

Authors: Niloufar Alipour Talemi, Hossein Kashiani, Hossein R. Nowdeh, Fatemeh Afghah
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19373
Pdf URL: https://arxiv.org/pdf/2505.19373
Copy Paste: [[2505.19373]] DiSa: Directional Saliency-Aware Prompt Learning for Generalizable Vision-Language Models(https://arxiv.org/abs/2505.19373)
Keywords: robust
Abstract: Prompt learning has emerged as a powerful paradigm for adapting vision-language models such as CLIP to downstream tasks. However, existing methods often overfit to seen data, leading to significant performance degradation when generalizing to novel classes or unseen domains. To address this limitation, we propose DiSa, a Directional Saliency-Aware Prompt Learning framework that integrates two complementary regularization strategies to enhance generalization. First, our Cross-Interactive Regularization (CIR) fosters cross-modal alignment by enabling cooperative learning between prompted and frozen encoders. Within CIR, a saliency-aware masking strategy guides the image encoder to prioritize semantically critical image regions, reducing reliance on less informative patches. Second, we introduce a directional regularization strategy that aligns visual embeddings with class-wise prototype features in a directional manner to prioritize consistency in feature orientation over strict proximity. This approach ensures robust generalization by leveraging stable prototype directions derived from class-mean statistics. Extensive evaluations on 11 diverse image classification benchmarks demonstrate that DiSa consistently outperforms state-of-the-art prompt learning methods across various settings, including base-to-novel generalization, cross-dataset transfer, domain generalization, and few-shot learning.

Title: Belief Attribution as Mental Explanation: The Role of Accuracy, Informativity, and Causality

Authors: Lance Ying, Almog Hillel, Ryan Truong, Vikash K. Mansinghka, Joshua B. Tenenbaum, Tan Zhi-Xuan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19376
Pdf URL: https://arxiv.org/pdf/2505.19376
Copy Paste: [[2505.19376]] Belief Attribution as Mental Explanation: The Role of Accuracy, Informativity, and Causality(https://arxiv.org/abs/2505.19376)
Keywords: generative
Abstract: A key feature of human theory-of-mind is the ability to attribute beliefs to other agents as mentalistic explanations for their behavior. But given the wide variety of beliefs that agents may hold about the world and the rich language we can use to express them, which specific beliefs are people inclined to attribute to others? In this paper, we investigate the hypothesis that people prefer to attribute beliefs that are good explanations for the behavior they observe. We develop a computational model that quantifies the explanatory strength of a (natural language) statement about an agent's beliefs via three factors: accuracy, informativity, and causal relevance to actions, each of which can be computed from a probabilistic generative model of belief-driven behavior. Using this model, we study the role of each factor in how people selectively attribute beliefs to other agents. We investigate this via an experiment where participants watch an agent collect keys hidden in boxes in order to reach a goal, then rank a set of statements describing the agent's beliefs about the boxes' contents. We find that accuracy and informativity perform reasonably well at predicting these rankings when combined, but that causal relevance is the single factor that best explains participants' responses.

Title: Absolute Coordinates Make Motion Generation Easy

Authors: Zichong Meng, Zeyu Han, Xiaogang Peng, Yiming Xie, Huaizu Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19377
Pdf URL: https://arxiv.org/pdf/2505.19377
Copy Paste: [[2505.19377]] Absolute Coordinates Make Motion Generation Easy(https://arxiv.org/abs/2505.19377)
Keywords: diffusion, transformer
Abstract: State-of-the-art text-to-motion generation models rely on the kinematic-aware, local-relative motion representation popularized by HumanML3D, which encodes motion relative to the pelvis and to the previous frame with built-in redundancy. While this design simplifies training for earlier generation models, it introduces critical limitations for diffusion models and hinders applicability to downstream tasks. In this work, we revisit the motion representation and propose a radically simplified and long-abandoned alternative for text-to-motion generation: absolute joint coordinates in global space. Through systematic analysis of design choices, we show that this formulation achieves significantly higher motion fidelity, improved text alignment, and strong scalability, even with a simple Transformer backbone and no auxiliary kinematic-aware losses. Moreover, our formulation naturally supports downstream tasks such as text-driven motion control and temporal/spatial editing without additional task-specific reengineering and costly classifier guidance generation from control signals. Finally, we demonstrate promising generalization to directly generate SMPL-H mesh vertices in motion from text, laying a strong foundation for future research and motion-related applications.

Title: GSA-TTS : Toward Zero-Shot Speech Synthesis based on Gradual Style Adaptor

Authors: Seokgi Lee, Jungjun Kim
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.19384
Pdf URL: https://arxiv.org/pdf/2505.19384
Copy Paste: [[2505.19384]] GSA-TTS : Toward Zero-Shot Speech Synthesis based on Gradual Style Adaptor(https://arxiv.org/abs/2505.19384)
Keywords: robust, interpretability
Abstract: We present the gradual style adaptor TTS (GSA-TTS) with a novel style encoder that gradually encodes speaking styles from an acoustic reference for zero-shot speech synthesis. GSA first captures the local style of each semantic sound unit. Then the local styles are combined by self-attention to obtain a global style condition. This semantic and hierarchical encoding strategy provides a robust and rich style representation for an acoustic model. We test GSA-TTS on unseen speakers and obtain promising results regarding naturalness, speaker similarity, and intelligibility. Additionally, we explore the potential of GSA in terms of interpretability and controllability, which stems from its hierarchical structure.

Title: Advancing Limited-Angle CT Reconstruction Through Diffusion-Based Sinogram Completion

Authors: Jiaqi Guo, Santiago Lopez-Tapia, Aggelos K. Katsaggelos
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19385
Pdf URL: https://arxiv.org/pdf/2505.19385
Copy Paste: [[2505.19385]] Advancing Limited-Angle CT Reconstruction Through Diffusion-Based Sinogram Completion(https://arxiv.org/abs/2505.19385)
Keywords: diffusion
Abstract: Limited Angle Computed Tomography (LACT) often faces significant challenges due to missing angular information. Unlike previous methods that operate in the image domain, we propose a new method that focuses on sinogram inpainting. We leverage MR-SDEs, a variant of diffusion models that characterize the diffusion process with mean-reverting stochastic differential equations, to fill in missing angular data at the projection level. Furthermore, by combining distillation with constraining the output of the model using the pseudo-inverse of the inpainting matrix, the diffusion process is accelerated and done in a step, enabling efficient and accurate sinogram completion. A subsequent post-processing module back-projects the inpainted sinogram into the image domain and further refines the reconstruction, effectively suppressing artifacts while preserving critical structural details. Quantitative experimental results demonstrate that the proposed method achieves state-of-the-art performance in both perceptual and fidelity quality, offering a promising solution for LACT reconstruction in scientific and clinical applications.

Title: Alignment of large language models with constrained learning

Authors: Botong Zhang, Shuo Li, Ignacio Hounie, Osbert Bastani, Dongsheng Ding, Alejandro Ribeiro
Subjects: cs.LG, eess.SY, math.OC
Abstract URL: https://arxiv.org/abs/2505.19387
Pdf URL: https://arxiv.org/pdf/2505.19387
Copy Paste: [[2505.19387]] Alignment of large language models with constrained learning(https://arxiv.org/abs/2505.19387)
Keywords: large language model
Abstract: We study the problem of computing an optimal large language model (LLM) policy for a constrained alignment problem, where the goal is to maximize a primary reward objective while satisfying constraints on secondary utilities. Despite the popularity of Lagrangian-based LLM policy search in constrained alignment, iterative primal-dual methods often fail to converge, and non-iterative dual-based methods do not achieve optimality in the LLM parameter space. To address these challenges, we employ Lagrangian duality to develop an iterative dual-based alignment method that alternates between updating the LLM policy via Lagrangian maximization and updating the dual variable via dual descent. In theory, we characterize the primal-dual gap between the primal value in the distribution space and the dual value in the LLM parameter space. We further quantify the optimality gap of the learned LLM policies at near-optimal dual variables with respect to both the objective and the constraint functions. These results prove that dual-based alignment methods can find an optimal constrained LLM policy, up to an LLM parametrization gap. We demonstrate the effectiveness and merits of our approach through extensive experiments conducted on the PKU-SafeRLHF dataset.

Title: gec-metrics: A Unified Library for Grammatical Error Correction Evaluation

Authors: Takumi Goto, Yusuke Sakai, Taro Watanabe
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19388
Pdf URL: https://arxiv.org/pdf/2505.19388
Copy Paste: [[2505.19388]] gec-metrics: A Unified Library for Grammatical Error Correction Evaluation(https://arxiv.org/abs/2505.19388)
Keywords: fair
Abstract: We introduce gec-metrics, a library for using and developing grammatical error correction (GEC) evaluation metrics through a unified interface. Our library enables fair system comparisons by ensuring that everyone conducts evaluations using a consistent implementation. Moreover, it is designed with a strong focus on API usage, making it highly extensible. It also includes meta-evaluation functionalities and provides analysis and visualization scripts, contributing to developing GEC evaluation metrics. Our code is released under the MIT license and is also distributed as an installable package. The video is available on YouTube.

Title: VADER: A Human-Evaluated Benchmark for Vulnerability Assessment, Detection, Explanation, and Remediation

Authors: Ethan TS. Liu, Austin Wang, Spencer Mateega, Carlos Georgescu, Danny Tang
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19395
Pdf URL: https://arxiv.org/pdf/2505.19395
Copy Paste: [[2505.19395]] VADER: A Human-Evaluated Benchmark for Vulnerability Assessment, Detection, Explanation, and Remediation(https://arxiv.org/abs/2505.19395)
Keywords: secure, security, robust, large language model
Abstract: Ensuring that large language models (LLMs) can effectively assess, detect, explain, and remediate software vulnerabilities is critical for building robust and secure software systems. We introduce VADER, a human-evaluated benchmark designed explicitly to assess LLM performance across four key vulnerability-handling dimensions: assessment, detection, explanation, and remediation. VADER comprises 174 real-world software vulnerabilities, each carefully curated from GitHub repositories and annotated by security experts. For each vulnerability case, models are tasked with identifying the flaw, classifying it using Common Weakness Enumeration (CWE), explaining its underlying cause, proposing a patch, and formulating a test plan. Using a one-shot prompting strategy, we benchmark six state-of-the-art LLMs (Claude 3.7 Sonnet, Gemini 2.5 Pro, GPT-4.1, GPT-4.5, Grok 3 Beta, and o3) on VADER, and human security experts evaluated each response according to a rigorous scoring rubric emphasizing remediation (quality of the code fix, 50%), explanation (20%), and classification and test plan (30%) according to a standardized rubric. Our results show that current state-of-the-art LLMs achieve only moderate success on VADER - OpenAI's o3 attained 54.7% accuracy overall, with others in the 49-54% range, indicating ample room for improvement. Notably, remediation quality is strongly correlated (Pearson r > 0.97) with accurate classification and test plans, suggesting that models that effectively categorize vulnerabilities also tend to fix them well. VADER's comprehensive dataset, detailed evaluation rubrics, scoring tools, and visualized results with confidence intervals are publicly released, providing the community with an interpretable, reproducible benchmark to advance vulnerability-aware LLMs. All code and data are available at: this https URL

Title: Are Time-Series Foundation Models Deployment-Ready? A Systematic Study of Adversarial Robustness Across Domains

Authors: Jiawen Zhang, Zhenwei Zhang, Shun Zheng, Xumeng Wen, Jia Li, Jiang Bian
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19397
Pdf URL: https://arxiv.org/pdf/2505.19397
Copy Paste: [[2505.19397]] Are Time-Series Foundation Models Deployment-Ready? A Systematic Study of Adversarial Robustness Across Domains(https://arxiv.org/abs/2505.19397)
Keywords: attack, robust
Abstract: Time Series Foundation Models (TSFMs), which are pretrained on large-scale, cross-domain data and capable of zero-shot forecasting in new scenarios without further training, are increasingly adopted in real-world applications. However, as the zero-shot forecasting paradigm gets popular, a critical yet overlooked question emerges: Are TSFMs robust to adversarial input perturbations? Such perturbations could be exploited in man-in-the-middle attacks or data poisoning. To address this gap, we conduct a systematic investigation into the adversarial robustness of TSFMs. Our results show that even minimal perturbations can induce significant and controllable changes in forecast behaviors, including trend reversal, temporal drift, and amplitude shift, posing serious risks to TSFM-based services. Through experiments on representative TSFMs and multiple datasets, we reveal their consistent vulnerabilities and identify potential architectural designs, such as structural sparsity and multi-task pretraining, that may improve robustness. Our findings offer actionable guidance for designing more resilient forecasting systems and provide a critical assessment of the adversarial robustness of TSFMs.

Title: Erasing Concepts, Steering Generations: A Comprehensive Survey of Concept Suppression

Authors: Yiwei Xie, Ping Liu, Zheng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19398
Pdf URL: https://arxiv.org/pdf/2505.19398
Copy Paste: [[2505.19398]] Erasing Concepts, Steering Generations: A Comprehensive Survey of Concept Suppression(https://arxiv.org/abs/2505.19398)
Keywords: robust, diffusion, generative
Abstract: Text-to-Image (T2I) models have demonstrated impressive capabilities in generating high-quality and diverse visual content from natural language prompts. However, uncontrolled reproduction of sensitive, copyrighted, or harmful imagery poses serious ethical, legal, and safety challenges. To address these concerns, the concept erasure paradigm has emerged as a promising direction, enabling the selective removal of specific semantic concepts from generative models while preserving their overall utility. This survey provides a comprehensive overview and in-depth synthesis of concept erasure techniques in T2I diffusion models. We systematically categorize existing approaches along three key dimensions: intervention level, which identifies specific model components targeted for concept removal; optimization structure, referring to the algorithmic strategies employed to achieve suppression; and semantic scope, concerning the complexity and nature of the concepts addressed. This multi-dimensional taxonomy enables clear, structured comparisons across diverse methodologies, highlighting fundamental trade-offs between erasure specificity, generalization, and computational complexity. We further discuss current evaluation benchmarks, standardized metrics, and practical datasets, emphasizing gaps that limit comprehensive assessment, particularly regarding robustness and practical effectiveness. Finally, we outline major challenges and promising future directions, including disentanglement of concept representations, adaptive and incremental erasure strategies, adversarial robustness, and new generative architectures. This survey aims to guide researchers toward safer, more ethically aligned generative models, providing foundational knowledge and actionable recommendations to advance responsible development in generative AI.

Title: Exploring the Possibility of TypiClust for Low-Budget Federated Active Learning

Authors: Yuta Ono, Hiroshi Nakamura, Hideki Takase
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.19404
Pdf URL: https://arxiv.org/pdf/2505.19404
Copy Paste: [[2505.19404]] Exploring the Possibility of TypiClust for Low-Budget Federated Active Learning(https://arxiv.org/abs/2505.19404)
Keywords: extraction, federate
Abstract: Federated Active Learning (FAL) seeks to reduce the burden of annotation under the realistic constraints of federated learning by leveraging Active Learning (AL). As FAL settings make it more expensive to obtain ground truth labels, FAL strategies that work well in low-budget regimes, where the amount of annotation is very limited, are needed. In this work, we investigate the effectiveness of TypiClust, a successful low-budget AL strategy, in low-budget FAL settings. Our empirical results show that TypiClust works well even in low-budget FAL settings contrasted with relatively low performances of other methods, although these settings present additional challenges, such as data heterogeneity, compared to AL. In addition, we show that FAL settings cause distribution shifts in terms of typicality, but TypiClust is not very vulnerable to the shifts. We also analyze the sensitivity of TypiClust to feature extraction methods, and it suggests a way to perform FAL even in limited data situations.

Title: CoTGuard: Using Chain-of-Thought Triggering for Copyright Protection in Multi-Agent LLM Systems

Authors: Yan Wen, Junfeng Guo, Heng Huang
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2505.19405
Pdf URL: https://arxiv.org/pdf/2505.19405
Copy Paste: [[2505.19405]] CoTGuard: Using Chain-of-Thought Triggering for Copyright Protection in Multi-Agent LLM Systems(https://arxiv.org/abs/2505.19405)
Keywords: protect, large language model
Abstract: As large language models (LLMs) evolve into autonomous agents capable of collaborative reasoning and task execution, multi-agent LLM systems have emerged as a powerful paradigm for solving complex problems. However, these systems pose new challenges for copyright protection, particularly when sensitive or copyrighted content is inadvertently recalled through inter-agent communication and reasoning. Existing protection techniques primarily focus on detecting content in final outputs, overlooking the richer, more revealing reasoning processes within the agents themselves. In this paper, we introduce CoTGuard, a novel framework for copyright protection that leverages trigger-based detection within Chain-of-Thought (CoT) reasoning. Specifically, we can activate specific CoT segments and monitor intermediate reasoning steps for unauthorized content reproduction by embedding specific trigger queries into agent prompts. This approach enables fine-grained, interpretable detection of copyright violations in collaborative agent scenarios. We evaluate CoTGuard on various benchmarks in extensive experiments and show that it effectively uncovers content leakage with minimal interference to task performance. Our findings suggest that reasoning-level monitoring offers a promising direction for safeguarding intellectual property in LLM-based agent systems.

Title: Self-Reflective Planning with Knowledge Graphs: Enhancing LLM Reasoning Reliability for Question Answering

Authors: Jiajun Zhu, Ye Liu, Meikai Bao, Kai Zhang, Yanghai Zhang, Qi Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19410
Pdf URL: https://arxiv.org/pdf/2505.19410
Copy Paste: [[2505.19410]] Self-Reflective Planning with Knowledge Graphs: Enhancing LLM Reasoning Reliability for Question Answering(https://arxiv.org/abs/2505.19410)
Keywords: large language model
Abstract: Recently, large language models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks, yet they remain prone to hallucinations when reasoning with insufficient internal knowledge. While integrating LLMs with knowledge graphs (KGs) provides access to structured, verifiable information, existing approaches often generate incomplete or factually inconsistent reasoning paths. To this end, we propose Self-Reflective Planning (SRP), a framework that synergizes LLMs with KGs through iterative, reference-guided reasoning. Specifically, given a question and topic entities, SRP first searches for references to guide planning and reflection. In the planning process, it checks initial relations and generates a reasoning path. After retrieving knowledge from KGs through a reasoning path, it implements iterative reflection by judging the retrieval result and editing the reasoning path until the answer is correctly retrieved. Extensive experiments on three public datasets demonstrate that SRP surpasses various strong baselines and further underscore its reliable reasoning ability.

Title: ADD-SLAM: Adaptive Dynamic Dense SLAM with Gaussian Splatting

Authors: Wenhua Wu, Chenpeng Su, Siting Zhu, Tianchen Deng, Zhe Liu, Hesheng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19420
Pdf URL: https://arxiv.org/pdf/2505.19420
Copy Paste: [[2505.19420]] ADD-SLAM: Adaptive Dynamic Dense SLAM with Gaussian Splatting(https://arxiv.org/abs/2505.19420)
Keywords: segmentation
Abstract: Recent advancements in Neural Radiance Fields (NeRF) and 3D Gaussian-based Simultaneous Localization and Mapping (SLAM) methods have demonstrated exceptional localization precision and remarkable dense mapping performance. However, dynamic objects introduce critical challenges by disrupting scene consistency, leading to tracking drift and mapping artifacts. Existing methods that employ semantic segmentation or object detection for dynamic identification and filtering typically rely on predefined categorical priors, while discarding dynamic scene information crucial for robotic applications such as dynamic obstacle avoidance and environmental interaction. To overcome these challenges, we propose ADD-SLAM: an Adaptive Dynamic Dense SLAM framework based on Gaussian splitting. We design an adaptive dynamic identification mechanism grounded in scene consistency analysis, comparing geometric and textural discrepancies between real-time observations and historical maps. Ours requires no predefined semantic category priors and adaptively discovers scene dynamics. Precise dynamic object recognition effectively mitigates interference from moving targets during localization. Furthermore, we propose a dynamic-static separation mapping strategy that constructs a temporal Gaussian model to achieve online incremental dynamic modeling. Experiments conducted on multiple dynamic datasets demonstrate our method's flexible and accurate dynamic segmentation capabilities, along with state-of-the-art performance in both localization and mapping.

Title: LlamaSeg: Image Segmentation via Autoregressive Mask Generation

Authors: Jiru Deng, Tengjin Weng, Tianyu Yang, Wenhan Luo, Zhiheng Li, Wenhao Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19422
Pdf URL: https://arxiv.org/pdf/2505.19422
Copy Paste: [[2505.19422]] LlamaSeg: Image Segmentation via Autoregressive Mask Generation(https://arxiv.org/abs/2505.19422)
Keywords: transformer, generative, segmentation
Abstract: We present LlamaSeg, a visual autoregressive framework that unifies multiple image segmentation tasks via natural language instructions. We reformulate image segmentation as a visual generation problem, representing masks as "visual" tokens and employing a LLaMA-style Transformer to predict them directly from image inputs. By adhering to the next-token prediction paradigm, our approach naturally integrates segmentation tasks into autoregressive architectures. To support large-scale training, we introduce a data annotation pipeline and construct the SA-OVRS dataset, which contains 2M segmentation masks annotated with over 5,800 open-vocabulary labels or diverse textual descriptions, covering a wide spectrum of real-world scenarios. This enables our model to localize objects in images based on text prompts and to generate fine-grained masks. To more accurately evaluate the quality of masks produced by visual generative models, we further propose a composite metric that combines Intersection over Union (IoU) with Average Hausdorff Distance (AHD), offering a more precise assessment of contour fidelity. Experimental results demonstrate that our method surpasses existing generative models across multiple datasets and yields more detailed segmentation masks.

Title: Surrogate-Assisted Evolutionary Reinforcement Learning Based on Autoencoder and Hyperbolic Neural Network

Authors: Bingdong Li, Mei Jiang, Hong Qian, Peng Yang, Wenjing Hong, Hong Qian, Ke Tang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19423
Pdf URL: https://arxiv.org/pdf/2505.19423
Copy Paste: [[2505.19423]] Surrogate-Assisted Evolutionary Reinforcement Learning Based on Autoencoder and Hyperbolic Neural Network(https://arxiv.org/abs/2505.19423)
Keywords: robust
Abstract: Evolutionary Reinforcement Learning (ERL), training the Reinforcement Learning (RL) policies with Evolutionary Algorithms (EAs), have demonstrated enhanced exploration capabilities and greater robustness than using traditional policy gradient. However, ERL suffers from the high computational costs and low search efficiency, as EAs require evaluating numerous candidate policies with expensive simulations, many of which are ineffective and do not contribute meaningfully to the training. One intuitive way to reduce the ineffective evaluations is to adopt the surrogates. Unfortunately, existing ERL policies are often modeled as deep neural networks (DNNs) and thus naturally represented as high-dimensional vectors containing millions of weights, which makes the building of effective surrogates for ERL policies extremely challenging. This paper proposes a novel surrogate-assisted ERL that integrates Autoencoders (AE) and Hyperbolic Neural Networks (HNN). Specifically, AE compresses high-dimensional policies into low-dimensional representations while extracting key features as the inputs for the surrogate. HNN, functioning as a classification-based surrogate model, can learn complex nonlinear relationships from sampled data and enable more accurate pre-selection of the sampled policies without real evaluations. The experiments on 10 Atari and 4 Mujoco games have verified that the proposed method outperforms previous approaches significantly. The search trajectories guided by AE and HNN are also visually demonstrated to be more effective, in terms of both exploration and convergence. This paper not only presents the first learnable policy embedding and surrogate-modeling modules for high-dimensional ERL policies, but also empirically reveals when and why they can be successful.

Title: Structure Disruption: Subverting Malicious Diffusion-Based Inpainting via Self-Attention Query Perturbation

Authors: Yuhao He, Jinyu Tian, Haiwei Wu, Jianqing Li
Subjects: cs.CV, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19425
Pdf URL: https://arxiv.org/pdf/2505.19425
Copy Paste: [[2505.19425]] Structure Disruption: Subverting Malicious Diffusion-Based Inpainting via Self-Attention Query Perturbation(https://arxiv.org/abs/2505.19425)
Keywords: protect, attack, robust, diffusion
Abstract: The rapid advancement of diffusion models has enhanced their image inpainting and editing capabilities but also introduced significant societal risks. Adversaries can exploit user images from social media to generate misleading or harmful content. While adversarial perturbations can disrupt inpainting, global perturbation-based methods fail in mask-guided editing tasks due to spatial constraints. To address these challenges, we propose Structure Disruption Attack (SDA), a powerful protection framework for safeguarding sensitive image regions against inpainting-based editing. Building upon the contour-focused nature of self-attention mechanisms of diffusion models, SDA optimizes perturbations by disrupting queries in self-attention during the initial denoising step to destroy the contour generation process. This targeted interference directly disrupts the structural generation capability of diffusion models, effectively preventing them from producing coherent images. We validate our motivation through visualization techniques and extensive experiments on public datasets, demonstrating that SDA achieves state-of-the-art (SOTA) protection performance while maintaining strong robustness.

Title: The Role of Diversity in In-Context Learning for Large Language Models

Authors: Wenyang Xiao, Haoyu Zhao, Lingxiao Huang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19426
Pdf URL: https://arxiv.org/pdf/2505.19426
Copy Paste: [[2505.19426]] The Role of Diversity in In-Context Learning for Large Language Models(https://arxiv.org/abs/2505.19426)
Keywords: robust, large language model
Abstract: In-context learning (ICL) is a crucial capability of current large language models (LLMs), where the selection of examples plays a key role in performance. While most existing approaches focus on selecting the most similar examples to the query, the impact of diversity in example selection remains underexplored. We systematically investigate the role of diversity in in-context example selection through experiments across a range of tasks, from sentiment classification to more challenging math and code problems. Experiments on Llama-3.1, Gemma-2, and Mistral-v0.3 families of models show that diversity-aware selection methods improve performance, particularly on complex tasks like math and code, and enhance robustness to out-of-distribution queries. To support these findings, we introduce a theoretical framework that explains the benefits of incorporating diversity in in-context example selection.

Title: WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference

Authors: Sihan Chen, Dan Zhao, Jongwoo Ko, Colby Banbury, Huiping Zhuang, Luming Liang, Tianyi Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19427
Pdf URL: https://arxiv.org/pdf/2505.19427
Copy Paste: [[2505.19427]] WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference(https://arxiv.org/abs/2505.19427)
Keywords: robust, large language model
Abstract: The growing computational demands of large language models (LLMs) make efficient inference and activation strategies increasingly critical. While recent approaches, such as Mixture-of-Experts (MoE), leverage selective activation but require specialized training, training-free sparse activation methods offer broader applicability and superior resource efficiency through their plug-and-play design. However, many existing methods rely solely on hidden state magnitudes to determine activation, resulting in high approximation errors and suboptimal inference accuracy. To address these limitations, we propose WINA (Weight Informed Neuron Activation), a novel, simple, and training-free sparse activation framework that jointly considers hidden state magnitudes and the column-wise $\ell_2$-norms of weight matrices. We show that this leads to a sparsification strategy that obtains optimal approximation error bounds with theoretical guarantees tighter than existing techniques. Empirically, WINA also outperforms state-of-the-art methods (e.g., TEAL) by up to $2.94\%$ in average performance at the same sparsity levels, across a diverse set of LLM architectures and datasets. These results position WINA as a new performance frontier for training-free sparse activation in LLM inference, advancing training-free sparse activation methods and setting a robust baseline for efficient inference. The source code is available at this https URL.

Title: Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation

Authors: Keane Ong, Rui Mao, Deeksha Varshney, Paul Pu Liang, Erik Cambria, Gianmarco Mengaldo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19430
Pdf URL: https://arxiv.org/pdf/2505.19430
Copy Paste: [[2505.19430]] Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation(https://arxiv.org/abs/2505.19430)
Keywords: large language model
Abstract: Counterfactual reasoning typically involves considering alternatives to actual events. While often applied to understand past events, a distinct form-forward counterfactual reasoning-focuses on anticipating plausible future developments. This type of reasoning is invaluable in dynamic financial markets, where anticipating market developments can powerfully unveil potential risks and opportunities for stakeholders, guiding their decision-making. However, performing this at scale is challenging due to the cognitive demands involved, underscoring the need for automated solutions. Large Language Models (LLMs) offer promise, but remain unexplored for this application. To address this gap, we introduce a novel benchmark, Fin-Force-FINancial FORward Counterfactual Evaluation. By curating financial news headlines and providing structured evaluation, Fin-Force supports LLM based forward counterfactual generation. This paves the way for scalable and automated solutions for exploring and anticipating future market developments, thereby providing structured insights for decision-making. Through experiments on Fin-Force, we evaluate state-of-the-art LLMs and counterfactual generation methods, analyzing their limitations and proposing insights for future research.

Title: Importance Weighted Score Matching for Diffusion Samplers with Enhanced Mode Coverage

Authors: Chenguang Wang, Xiaoyu Zhang, Kaiyuan Cui, Weichen Zhao, Yongtao Guan, Tianshu Yu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19431
Pdf URL: https://arxiv.org/pdf/2505.19431
Copy Paste: [[2505.19431]] Importance Weighted Score Matching for Diffusion Samplers with Enhanced Mode Coverage(https://arxiv.org/abs/2505.19431)
Keywords: diffusion
Abstract: Training neural samplers directly from unnormalized densities without access to target distribution samples presents a significant challenge. A critical desideratum in these settings is achieving comprehensive mode coverage, ensuring the sampler captures the full diversity of the target distribution. However, prevailing methods often circumvent the lack of target data by optimizing reverse KL-based objectives. Such objectives inherently exhibit mode-seeking behavior, potentially leading to incomplete representation of the underlying distribution. While alternative approaches strive for better mode coverage, they typically rely on implicit mechanisms like heuristics or iterative refinement. In this work, we propose a principled approach for training diffusion-based samplers by directly targeting an objective analogous to the forward KL divergence, which is conceptually known to encourage mode coverage. We introduce \textit{Importance Weighted Score Matching}, a method that optimizes this desired mode-covering objective by re-weighting the score matching loss using tractable importance sampling estimates, thereby overcoming the absence of target distribution data. We also provide theoretical analysis of the bias and variance for our proposed Monte Carlo estimator and the practical loss function used in our method. Experiments on increasingly complex multi-modal distributions, including 2D Gaussian Mixture Models with up to 120 modes and challenging particle systems with inherent symmetries -- demonstrate that our approach consistently outperforms existing neural samplers across all distributional distance metrics, achieving state-of-the-art results on all benchmarks.

Title: Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression

Authors: Peijie Dong, Zhenheng Tang, Xiang Liu, Lujun Li, Xiaowen Chu, Bo Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19433
Pdf URL: https://arxiv.org/pdf/2505.19433
Copy Paste: [[2505.19433]] Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression(https://arxiv.org/abs/2505.19433)
Keywords: large language model
Abstract: Post-training compression reduces the computational and memory costs of large language models (LLMs), enabling resource-efficient deployment. However, existing compression benchmarks only focus on language modeling (e.g., perplexity) and natural language understanding tasks (e.g., GLUE accuracy), ignoring the agentic capabilities - workflow, tool use/function call, long-context understanding and real-world application. We introduce the Agent Compression Benchmark (ACBench), the first comprehensive benchmark for evaluating how compression impacts LLMs' agentic abilities. ACBench spans (1) 12 tasks across 4 capabilities (e.g., WorfBench for workflow generation, Needle-in-Haystack for long-context retrieval), (2) quantization (GPTQ, AWQ) and pruning (Wanda, SparseGPT), and (3) 15 models, including small (Gemma-2B), standard (Qwen2.5 7B-32B), and distilled reasoning LLMs (DeepSeek-R1-Distill). Our experiments reveal compression tradeoffs: 4-bit quantization preserves workflow generation and tool use (1%-3% drop) but degrades real-world application accuracy by 10%-15%. We introduce ERank, Top-k Ranking Correlation and Energy to systematize analysis. ACBench provides actionable insights for optimizing LLM compression in agentic scenarios. The code can be found in this https URL.

Title: Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers

Authors: Rihui Xin, Han Liu, Zecheng Wang, Yupeng Zhang, Dianbo Sui, Xiaolin Hu, Bingning Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19439
Pdf URL: https://arxiv.org/pdf/2505.19439
Copy Paste: [[2505.19439]] Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers(https://arxiv.org/abs/2505.19439)
Keywords: large language model
Abstract: Large Language Models have achieved remarkable success in natural language processing tasks, with Reinforcement Learning playing a key role in adapting them to specific applications. However, obtaining ground truth answers for training LLMs in mathematical problem-solving is often challenging, costly, and sometimes unfeasible. This research delves into the utilization of format and length as surrogate signals to train LLMs for mathematical problem-solving, bypassing the need for traditional ground truth this http URL study shows that a reward function centered on format correctness alone can yield performance improvements comparable to the standard GRPO algorithm in early phases. Recognizing the limitations of format-only rewards in the later phases, we incorporate length-based rewards. The resulting GRPO approach, leveraging format-length surrogate signals, not only matches but surpasses the performance of the standard GRPO algorithm relying on ground truth answers in certain scenarios, achieving 40.0\% accuracy on AIME2024 with a 7B base model. Through systematic exploration and experimentation, this research not only offers a practical solution for training LLMs to solve mathematical problems and reducing the dependence on extensive ground truth data collection, but also reveals the essence of why our label-free approach succeeds: base model is like an excellent student who has already mastered mathematical and logical reasoning skills, but performs poorly on the test paper, it simply needs to develop good answering habits to achieve outstanding results in exams , in other words, to unlock the capabilities it already possesses.

Title: The Birth of Knowledge: Emergent Features across Time, Space, and Scale in Large Language Models

Authors: Shashata Sawmya, Micah Adler, Nir Shavit
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19440
Pdf URL: https://arxiv.org/pdf/2505.19440
Copy Paste: [[2505.19440]] The Birth of Knowledge: Emergent Features across Time, Space, and Scale in Large Language Models(https://arxiv.org/abs/2505.19440)
Keywords: interpretability, transformer, large language model
Abstract: This paper studies the emergence of interpretable categorical features within large language models (LLMs), analyzing their behavior across training checkpoints (time), transformer layers (space), and varying model sizes (scale). Using sparse autoencoders for mechanistic interpretability, we identify when and where specific semantic concepts emerge within neural activations. Results indicate clear temporal and scale-specific thresholds for feature emergence across multiple domains. Notably, spatial analysis reveals unexpected semantic reactivation, with early-layer features re-emerging at later layers, challenging standard assumptions about representational dynamics in transformer models.

Title: MetaGMT: Improving Actionable Interpretability of Graph Multilinear Networks via Meta-Learning Filtration

Authors: Rishabh Bhattacharya, Hari Shankar, Vaishnavi Shivkumar, Ponnurangam Kumaraguru
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19445
Pdf URL: https://arxiv.org/pdf/2505.19445
Copy Paste: [[2505.19445]] MetaGMT: Improving Actionable Interpretability of Graph Multilinear Networks via Meta-Learning Filtration(https://arxiv.org/abs/2505.19445)
Keywords: robust, interpretability
Abstract: The growing adoption of Graph Neural Networks (GNNs) in high-stakes domains like healthcare and finance demands reliable explanations of their decision-making processes. While inherently interpretable GNN architectures like Graph Multi-linear Networks (GMT) have emerged, they remain vulnerable to generating explanations based on spurious correlations, potentially undermining trust in critical applications. We present MetaGMT, a meta-learning framework that enhances explanation fidelity through a novel bi-level optimization approach. We demonstrate that MetaGMT significantly improves both explanation quality (AUC-ROC, Precision@K) and robustness to spurious patterns, across BA-2Motifs, MUTAG, and SP-Motif benchmarks. Our approach maintains competitive classification accuracy while producing more faithful explanations (with an increase up to 8% of Explanation ROC on SP-Motif 0.5) compared to baseline methods. These advancements in interpretability could enable safer deployment of GNNs in sensitive domains by (1) facilitating model debugging through more reliable explanations, (2) supporting targeted retraining when biases are identified, and (3) enabling meaningful human oversight. By addressing the critical challenge of explanation reliability, our work contributes to building more trustworthy and actionable GNN systems for real-world applications.

Title: An Empirical Study of JavaScript Inclusion Security Issues in Chrome Extensions

Authors: Chong Guan
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.19456
Pdf URL: https://arxiv.org/pdf/2505.19456
Copy Paste: [[2505.19456]] An Empirical Study of JavaScript Inclusion Security Issues in Chrome Extensions(https://arxiv.org/abs/2505.19456)
Keywords: security
Abstract: JavaScript, a scripting language employed to augment the capabilities of web browsers within web pages or browser extensions, utilizes code segments termed JavaScript inclusions. While the security aspects of JavaScript inclusions in web pages have undergone substantial scrutiny, a thorough investigation into the security of such inclusions within browser extensions remains absent, despite the divergent security paradigms governing these environments. This study presents a systematic measurement of JavaScript inclusions in Chrome extensions, employing a hybrid methodology encompassing static and dynamic analysis to identify these inclusions. The analysis of 36,324 extensions revealed 350,784 JavaScript inclusions. Subsequent security assessment indicated that, although the majority of these inclusions originate from local files within the extensions rather than external servers, 22 instances of vulnerable remote JavaScript inclusions were identified. These remote inclusions present potential avenues for malicious actors to execute arbitrary code within the extension's execution context. Furthermore, an analysis of JavaScript library utilization within Chrome extensions disclosed the prevalent use of susceptible and outdated libraries, notably within numerous widely adopted extensions.

Title: Your Classifier Can Do More: Towards Bridging the Gaps in Classification, Robustness, and Generation

Authors: Kaichao Jiang, He Wang, Xiaoshuai Hao, Xiulong Yang, Ajian Liu, Qi Chu, Yunfeng Diao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19459
Pdf URL: https://arxiv.org/pdf/2505.19459
Copy Paste: [[2505.19459]] Your Classifier Can Do More: Towards Bridging the Gaps in Classification, Robustness, and Generation(https://arxiv.org/abs/2505.19459)
Keywords: robust, generative
Abstract: Joint Energy-based Models (JEMs), a class of hybrid generative-discriminative models, are well known for their ability to achieve both high classification accuracy and generative capability within a single model. However, their robustness still lags significantly behind the classifiers based adversarial training (AT). Conversely, while AT is currently the most effective approach to improving the classifier's robustness, it typically sacrifices accuracy on clean data and lacks generative capability. The triple trade-off between classification accuracy, generative capability and robustness, raises a natural question: Can a single model simultaneously achieve high classification accuracy, adversarial robustness, and generative performance? -- a goal that has been rarely explored. To address this question, we systematically analyze the energy distribution differences of clean, adversarial, and generated samples across various JEM variants and adversarially trained models. We observe that AT tends to reduce the energy gap between clean and adversarial samples, while JEMs reduce the gap between clean and synthetic ones. This observation suggests a key insight: if the energy distributions of all three data types can be aligned, we might unify the strengths of AT and JEMs, resolving their inherent trade-offs. Building on this idea, we propose Energy-based Joint Distribution Adversarial Training (EB-JDAT), to jointly model the clean data distribution, the adversarial distribution, and the classifier by maximizing their joint probability. EB-JDAT is a general and flexible optimization method, compatible with various JEM variants. Extensive experimental results demonstrate that EB-JDAT not only maintains near original accuracy and generative capability of JEMs, but also significantly enhances robustness, even surpassing state-of-the-art ATs.

Title: Residual Cross-Attention Transformer-Based Multi-User CSI Feedback with Deep Joint Source-Channel Coding

Authors: Hengwei Zhang, Minghui Wu, Li Qiao, Ling Liu, Ziqi Han, Zhen Gao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19465
Pdf URL: https://arxiv.org/pdf/2505.19465
Copy Paste: [[2505.19465]] Residual Cross-Attention Transformer-Based Multi-User CSI Feedback with Deep Joint Source-Channel Coding(https://arxiv.org/abs/2505.19465)
Keywords: transformer
Abstract: This letter proposes a deep-learning (DL)-based multi-user channel state information (CSI) feedback framework for massive multiple-input multiple-output systems, where the deep joint source-channel coding (DJSCC) is utilized to improve the CSI reconstruction accuracy. Specifically, we design a multi-user joint CSI feedback framework, whereby the CSI correlation of nearby users is utilized to reduce the feedback overhead. Under the framework, we propose a new residual cross-attention transformer architecture, which is deployed at the base station to further improve the CSI feedback performance. Moreover, to tackle the "cliff-effect" of conventional bit-level CSI feedback approaches, we integrated DJSCC into the multi-user CSI feedback, together with utilizing a two-stage training scheme to adapt to varying uplink noise levels. Experimental results demonstrate the superiority of our methods in CSI feedback performance, with low network complexity and better scalability.

Title: Diversity-Driven Generative Dataset Distillation Based on Diffusion Model with Self-Adaptive Memory

Authors: Mingzhuo Li, Guang Li, Jiafeng Mao, Takahiro Ogawa, Miki Haseyama
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.19469
Pdf URL: https://arxiv.org/pdf/2505.19469
Copy Paste: [[2505.19469]] Diversity-Driven Generative Dataset Distillation Based on Diffusion Model with Self-Adaptive Memory(https://arxiv.org/abs/2505.19469)
Keywords: diffusion, generative
Abstract: Dataset distillation enables the training of deep neural networks with comparable performance in significantly reduced time by compressing large datasets into small and representative ones. Although the introduction of generative models has made great achievements in this field, the distributions of their distilled datasets are not diverse enough to represent the original ones, leading to a decrease in downstream validation accuracy. In this paper, we present a diversity-driven generative dataset distillation method based on a diffusion model to solve this problem. We introduce self-adaptive memory to align the distribution between distilled and real datasets, assessing the representativeness. The degree of alignment leads the diffusion model to generate more diverse datasets during the distillation process. Extensive experiments show that our method outperforms existing state-of-the-art methods in most situations, proving its ability to tackle dataset distillation tasks.

Title: Balancing Computation Load and Representation Expressivity in Parallel Hybrid Neural Networks

Authors: Mohammad Mahdi Moradi, Walid Ahmed, Shuangyue Wen, Sudhir Mudur, Weiwei Zhang, Yang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19472
Pdf URL: https://arxiv.org/pdf/2505.19472
Copy Paste: [[2505.19472]] Balancing Computation Load and Representation Expressivity in Parallel Hybrid Neural Networks(https://arxiv.org/abs/2505.19472)
Keywords: transformer
Abstract: Attention and State-Space Models (SSMs) when combined in a hybrid network in sequence or in parallel provide complementary strengths. In a hybrid sequential pipeline they alternate between applying a transformer to the input and then feeding its output into a SSM. This results in idle periods in the individual components increasing end-to-end latency and lowering throughput caps. In the parallel hybrid architecture, the transformer operates independently in parallel with the SSM, and these pairs are cascaded, with output from one pair forming the input to the next. Two issues are (i) creating an expressive knowledge representation with the inherently divergent outputs from these separate branches, and (ii) load balancing the computation between these parallel branches, while maintaining representation fidelity. In this work we present FlowHN, a novel parallel hybrid network architecture that accommodates various strategies for load balancing, achieved through appropriate distribution of input tokens between the two branches. Two innovative differentiating factors in FlowHN include a FLOP aware dynamic token split between the attention and SSM branches yielding efficient balance in compute load, and secondly, a method to fuse the highly divergent outputs from individual branches for enhancing representation expressivity. Together they enable much better token processing speeds, avoid bottlenecks, and at the same time yield significantly improved accuracy as compared to other competing works. We conduct comprehensive experiments on autoregressive language modeling for models with 135M, 350M, and 1B parameters. FlowHN outperforms sequential hybrid models and its parallel counterpart, achieving up to 4* higher Tokens per Second (TPS) and 2* better Model FLOPs Utilization (MFU).

Title: Continuous Self-Improvement of Large Language Models by Test-time Training with Verifier-Driven Sample Selection

Authors: Mohammad Mahdi Moradi, Hossam Amer, Sudhir Mudur, Weiwei Zhang, Yang Liu, Walid Ahmed
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19475
Pdf URL: https://arxiv.org/pdf/2505.19475
Copy Paste: [[2505.19475]] Continuous Self-Improvement of Large Language Models by Test-time Training with Verifier-Driven Sample Selection(https://arxiv.org/abs/2505.19475)
Keywords: large language model
Abstract: Learning to adapt pretrained language models to unlabeled, out-of-distribution data is a critical challenge, as models often falter on structurally novel reasoning tasks even while excelling within their training distribution. We introduce a new framework called VDS-TTT - Verifier-Driven Sample Selection for Test-Time Training to efficiently address this. We use a learned verifier to score a pool of generated responses and select only from high ranking pseudo-labeled examples for fine-tuned adaptation. Specifically, for each input query our LLM generates N candidate answers; the verifier assigns a reliability score to each, and the response with the highest confidence and above a fixed threshold is paired with its query for test-time training. We fine-tune only low-rank LoRA adapter parameters, ensuring adaptation efficiency and fast convergence. Our proposed self-supervised framework is the first to synthesize verifier driven test-time training data for continuous self-improvement of the model. Experiments across three diverse benchmarks and three state-of-the-art LLMs demonstrate that VDS-TTT yields up to a 32.29% relative improvement over the base model and a 6.66% gain compared to verifier-based methods without test-time training, highlighting its effectiveness and efficiency for on-the-fly large language model adaptation.

Title: Win Fast or Lose Slow: Balancing Speed and Accuracy in Latency-Sensitive Decisions of LLMs

Authors: Hao Kang, Qingru Zhang, Han Cai, Weiyuan Xu, Tushar Krishna, Yilun Du, Tsachy Weissman
Subjects: cs.LG, cs.AI, cs.DC, cs.MA
Abstract URL: https://arxiv.org/abs/2505.19481
Pdf URL: https://arxiv.org/pdf/2505.19481
Copy Paste: [[2505.19481]] Win Fast or Lose Slow: Balancing Speed and Accuracy in Latency-Sensitive Decisions of LLMs(https://arxiv.org/abs/2505.19481)
Keywords: large language model
Abstract: Large language models (LLMs) have shown remarkable performance across diverse reasoning and generation tasks, and are increasingly deployed as agents in dynamic environments such as code generation and recommendation systems. However, many real-world applications, such as high-frequency trading and real-time competitive gaming, require decisions under strict latency constraints, where faster responses directly translate into higher rewards. Despite the importance of this latency quality trade off, it remains underexplored in the context of LLM based agents. In this work, we present the first systematic study of this trade off in real time decision making tasks. To support our investigation, we introduce two new benchmarks: HFTBench, a high frequency trading simulation, and StreetFighter, a competitive gaming platform. Our analysis reveals that optimal latency quality balance varies by task, and that sacrificing quality for lower latency can significantly enhance downstream performance. To address this, we propose FPX, an adaptive framework that dynamically selects model size and quantization level based on real time demands. Our method achieves the best performance on both benchmarks, improving win rate by up to 80% in Street Fighter and boosting daily yield by up to 26.52% in trading, underscoring the need for latency aware evaluation and deployment strategies for LLM based agents. These results demonstrate the critical importance of latency aware evaluation and deployment strategies for real world LLM based agents. Our benchmarks are available at Latency Sensitive Benchmarks.

Title: Language of Network: A Generative Pre-trained Model for Encrypted Traffic Comprehension

Authors: Di Zhao, Bo Jiang, Song Liu, Susu Cui, Meng Shen, Dongqi Han, Xingmao Guan, Zhigang Lu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.19482
Pdf URL: https://arxiv.org/pdf/2505.19482
Copy Paste: [[2505.19482]] Language of Network: A Generative Pre-trained Model for Encrypted Traffic Comprehension(https://arxiv.org/abs/2505.19482)
Keywords: security, privacy, protect, attack, generative
Abstract: The increasing demand for privacy protection and security considerations leads to a significant rise in the proportion of encrypted network traffic. Since traffic content becomes unrecognizable after encryption, accurate analysis is challenging, making it difficult to classify applications and detect attacks. Deep learning is currently the predominant approach for encrypted traffic classification through feature analysis. However, these methods face limitations due to their high dependence on labeled data and difficulties in detecting attack variants. First, their performance is highly sensitive to data quality, where the highcost manual labeling process and dataset imbalance significantly degrade results. Second, the rapid evolution of attack patterns makes it challenging for models to identify new types of attacks. To tackle these challenges, we present GBC, a generative model based on pre-training for encrypted traffic comprehension. Since traditional tokenization methods are primarily designed for natural language, we propose a protocol-aware tokenization approach for encrypted traffic that improves model comprehension of fields specific to network traffic. In addition, GBC employs pretraining to learn general representations from extensive unlabeled traffic data. Through prompt learning, it effectively adapts to various downstream tasks, enabling both high-quality traffic generation and effective detection. Evaluations across multiple datasets demonstrate that GBC achieves superior results in both traffic classification and generation tasks, resulting in a 5% improvement in F1 score compared to state-of-the-art methods for classification tasks.

Title: CulFiT: A Fine-grained Cultural-aware LLM Training Paradigm via Multilingual Critique Data Synthesis

Authors: Ruixiang Feng, Shen Gao, Xiuying Chen, Lisi Chen, Shuo Shang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19484
Pdf URL: https://arxiv.org/pdf/2505.19484
Copy Paste: [[2505.19484]] CulFiT: A Fine-grained Cultural-aware LLM Training Paradigm via Multilingual Critique Data Synthesis(https://arxiv.org/abs/2505.19484)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet they often exhibit a specific cultural biases, neglecting the values and linguistic diversity of low-resource regions. This cultural bias not only undermines universal equality, but also risks reinforcing stereotypes and perpetuating discrimination. To address this, we propose CulFiT, a novel culturally-aware training paradigm that leverages multilingual data and fine-grained reward modeling to enhance cultural sensitivity and inclusivity. Our approach synthesizes diverse cultural-related questions, constructs critique data in culturally relevant languages, and employs fine-grained rewards to decompose cultural texts into verifiable knowledge units for interpretable evaluation. We also introduce GlobalCultureQA, a multilingual open-ended question-answering dataset designed to evaluate culturally-aware responses in a global context. Extensive experiments on three existing benchmarks and our GlobalCultureQA demonstrate that CulFiT achieves state-of-the-art open-source model performance in cultural alignment and general reasoning.

Title: Understanding Transformer from the Perspective of Associative Memory

Authors: Shu Zhong, Mingyu Xu, Tenglong Ao, Guang Shi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19488
Pdf URL: https://arxiv.org/pdf/2505.19488
Copy Paste: [[2505.19488]] Understanding Transformer from the Perspective of Associative Memory(https://arxiv.org/abs/2505.19488)
Keywords: transformer
Abstract: In this paper, we share our reflections and insights on understanding Transformer architectures through the lens of associative memory--a classic psychological concept inspired by human cognition. We start with the basics of associative memory (think simple linear attention) and then dive into two dimensions: Memory Capacity: How much can a Transformer really remember, and how well? We introduce retrieval SNR to measure this and use a kernel perspective to mathematically reveal why Softmax Attention is so effective. We also show how FFNs can be seen as a type of associative memory, leading to insights on their design and potential improvements. Memory Update: How do these memories learn and evolve? We present a unified framework for understanding how different Transformer variants (like DeltaNet and Softmax Attention) update their "knowledge base". This leads us to tackle two provocative questions: 1. Are Transformers fundamentally limited in what they can express, and can we break these barriers? 2. If a Transformer had infinite context, would it become infinitely intelligent? We want to demystify Transformer architecture, offering a clearer understanding of existing designs. This exploration aims to provide fresh insights and spark new avenues for Transformer innovation.

Title: ViewCraft3D: High-Fidelity and View-Consistent 3D Vector Graphics Synthesis

Authors: Chuang Wang, Haitao Zhou, Ling Luo, Qian Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19492
Pdf URL: https://arxiv.org/pdf/2505.19492
Copy Paste: [[2505.19492]] ViewCraft3D: High-Fidelity and View-Consistent 3D Vector Graphics Synthesis(https://arxiv.org/abs/2505.19492)
Keywords: extraction
Abstract: 3D vector graphics play a crucial role in various applications including 3D shape retrieval, conceptual design, and virtual reality interactions due to their ability to capture essential structural information with minimal representation. While recent approaches have shown promise in generating 3D vector graphics, they often suffer from lengthy processing times and struggle to maintain view consistency. To address these limitations, we propose ViewCraft3D (VC3D), an efficient method that leverages 3D priors to generate 3D vector graphics. Specifically, our approach begins with 3D object analysis, employs a geometric extraction algorithm to fit 3D vector graphics to the underlying structure, and applies view-consistent refinement process to enhance visual quality. Our comprehensive experiments demonstrate that VC3D outperforms previous methods in both qualitative and quantitative evaluations, while significantly reducing computational overhead. The resulting 3D sketches maintain view consistency and effectively capture the essential characteristics of the original objects.

Title: The Role of Video Generation in Enhancing Data-Limited Action Understanding

Authors: Wei Li, Dezhao Luo, Dongbao Yang, Zhenhang Li, Weiping Wang, Yu Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19495
Pdf URL: https://arxiv.org/pdf/2505.19495
Copy Paste: [[2505.19495]] The Role of Video Generation in Enhancing Data-Limited Action Understanding(https://arxiv.org/abs/2505.19495)
Keywords: diffusion, transformer
Abstract: Video action understanding tasks in real-world scenarios always suffer data limitations. In this paper, we address the data-limited action understanding problem by bridging data scarcity. We propose a novel method that employs a text-to-video diffusion transformer to generate annotated data for model training. This paradigm enables the generation of realistic annotated data on an infinite scale without human intervention. We proposed the information enhancement strategy and the uncertainty-based label smoothing tailored to generate sample training. Through quantitative and qualitative analysis, we observed that real samples generally contain a richer level of information than generated samples. Based on this observation, the information enhancement strategy is proposed to enhance the informative content of the generated samples from two aspects: the environments and the characters. Furthermore, we observed that some low-quality generated samples might negatively affect model training. To address this, we devised the uncertainty-based label smoothing strategy to increase the smoothing of these samples, thus reducing their impact. We demonstrate the effectiveness of the proposed method on four datasets across five tasks and achieve state-of-the-art performance for zero-shot action recognition.

Title: DOGe: Defensive Output Generation for LLM Protection Against Knowledge Distillation

Authors: Pingzhi Li, Zhen Tan, Huaizhi Qu, Huan Liu, Tianlong Chen
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.19504
Pdf URL: https://arxiv.org/pdf/2505.19504
Copy Paste: [[2505.19504]] DOGe: Defensive Output Generation for LLM Protection Against Knowledge Distillation(https://arxiv.org/abs/2505.19504)
Keywords: protect, defense, watermark, large language model
Abstract: Large Language Models (LLMs) represent substantial intellectual and economic investments, yet their effectiveness can inadvertently facilitate model imitation via knowledge distillation (KD).In practical scenarios, competitors can distill proprietary LLM capabilities by simply observing publicly accessible outputs, akin to reverse-engineering a complex performance by observation alone. Existing protective methods like watermarking only identify imitation post-hoc, while other defenses assume the student model mimics the teacher's internal logits, rendering them ineffective against distillation purely from observed output text. This paper confronts the challenge of actively protecting LLMs within the realistic constraints of API-based access. We introduce an effective and efficient Defensive Output Generation (DOGe) strategy that subtly modifies the output behavior of an LLM. Its outputs remain accurate and useful for legitimate users, yet are designed to be misleading for distillation, significantly undermining imitation attempts. We achieve this by fine-tuning only the final linear layer of the teacher LLM with an adversarial loss. This targeted training approach anticipates and disrupts distillation attempts during inference time. Our experiments show that, while preserving or even improving the original performance of the teacher model, student models distilled from the defensively generated teacher outputs demonstrate catastrophically reduced performance, demonstrating our method's effectiveness as a practical safeguard against KD-based model imitation.

Title: LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study

Authors: Dongil Yang, Minjin Kim, Sunghwan Kim, Beong-woo Kwak, Minjun Park, Jinseok Hong, Woontack Woo, Jinyoung Yeo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19510
Pdf URL: https://arxiv.org/pdf/2505.19510
Copy Paste: [[2505.19510]] LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study(https://arxiv.org/abs/2505.19510)
Keywords: large language model
Abstract: The remarkable reasoning and generalization capabilities of Large Language Models (LLMs) have paved the way for their expanding applications in embodied AI, robotics, and other real-world tasks. To effectively support these applications, grounding in spatial and temporal understanding in multimodal environments is essential. To this end, recent works have leveraged scene graphs, a structured representation that encodes entities, attributes, and their relationships in a scene. However, a comprehensive evaluation of LLMs' ability to utilize scene graphs remains limited. In this work, we introduce Text-Scene Graph (TSG) Bench, a benchmark designed to systematically assess LLMs' ability to (1) understand scene graphs and (2) generate them from textual narratives. With TSG Bench we evaluate 11 LLMs and reveal that, while models perform well on scene graph understanding, they struggle with scene graph generation, particularly for complex narratives. Our analysis indicates that these models fail to effectively decompose discrete scenes from a complex narrative, leading to a bottleneck when generating scene graphs. These findings underscore the need for improved methodologies in scene graph generation and provide valuable insights for future research. The demonstration of our benchmark is available at this https URL. Additionally, our code and evaluation data are publicly available at this https URL.

Title: Causal Distillation: Transferring Structured Explanations from Large to Compact Language Models

Authors: Aggrey Muhebwa, Khalid K. Osman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19511
Pdf URL: https://arxiv.org/pdf/2505.19511
Copy Paste: [[2505.19511]] Causal Distillation: Transferring Structured Explanations from Large to Compact Language Models(https://arxiv.org/abs/2505.19511)
Keywords: robust
Abstract: Large proprietary language models exhibit strong causal reasoning abilities that smaller open-source models struggle to replicate. We introduce a novel framework for distilling causal explanations that transfers causal reasoning skills from a powerful teacher model to a compact open-source model. The key idea is to train the smaller model to develop causal reasoning abilities by generating structured cause-and-effect explanations consistent with those of the teacher model. To evaluate the quality of the student-generated explanations, we introduce a new metric called Causal Explanation Coherence (CEC) to assess the structural and logical consistency of causal reasoning. This metric uses sentence-level semantic alignment to measure how well each part of the generated explanation corresponds to the teacher's reference, capturing both faithfulness and coverage of the underlying causal chain. Our framework and the CEC metric provide a principled foundation for training smaller models to perform robust causal reasoning and for systematically assessing the coherence of explanations in language model outputs.

Title: SIPDO: Closed-Loop Prompt Optimization via Synthetic Data Feedback

Authors: Yaoning Yu, Ye Yu, Kai Wei, Haojing Luo, Haohan Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19514
Pdf URL: https://arxiv.org/pdf/2505.19514
Copy Paste: [[2505.19514]] SIPDO: Closed-Loop Prompt Optimization via Synthetic Data Feedback(https://arxiv.org/abs/2505.19514)
Keywords: large language model
Abstract: Prompt quality plays a critical role in the performance of large language models (LLMs), motivating a growing body of work on prompt optimization. Most existing methods optimize prompts over a fixed dataset, assuming static input distributions and offering limited support for iterative improvement. We introduce SIPDO (Self-Improving Prompts through Data-Augmented Optimization), a closed-loop framework for prompt learning that integrates synthetic data generation into the optimization process. SIPDO couples a synthetic data generator with a prompt optimizer, where the generator produces new examples that reveal current prompt weaknesses and the optimizer incrementally refines the prompt in response. This feedback-driven loop enables systematic improvement of prompt performance without assuming access to external supervision or new tasks. Experiments across question answering and reasoning benchmarks show that SIPDO outperforms standard prompt tuning methods, highlighting the value of integrating data synthesis into prompt learning workflows.

Title: Toward Patient-specific Partial Point Cloud to Surface Completion for Pre- to Intra-operative Registration in Image-guided Liver Interventions

Authors: Nakul Poudel, Zixin Yang, Kelly Merrell, Richard Simon, Cristian A. Linte
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19518
Pdf URL: https://arxiv.org/pdf/2505.19518
Copy Paste: [[2505.19518]] Toward Patient-specific Partial Point Cloud to Surface Completion for Pre- to Intra-operative Registration in Image-guided Liver Interventions(https://arxiv.org/abs/2505.19518)
Keywords: robust
Abstract: Intra-operative data captured during image-guided surgery lacks sub-surface information, where key regions of interest, such as vessels and tumors, reside. Image-to-physical registration enables the fusion of pre-operative information and intra-operative data, typically represented as a point cloud. However, this registration process struggles due to partial visibility of the intra-operative point cloud. In this research, we propose a patient-specific point cloud completion approach to assist with the registration process. Specifically, we leverage VN-OccNet to generate a complete liver surface from a partial intra-operative point cloud. The network is trained in a patient-specific manner, where simulated deformations from the pre-operative model are used to train the model. First, we conduct an in-depth analysis of VN-OccNet's rotation-equivariant property and its effectiveness in recovering complete surfaces from partial intra-operative surfaces. Next, we integrate the completed intra-operative surface into the Go-ICP registration algorithm to demonstrate its utility in improving initial rigid registration outcomes. Our results highlight the promise of this patient-specific completion approach in mitigating the challenges posed by partial intra-operative visibility. The rotation equivariant and surface generation capabilities of VN-OccNet hold strong promise for developing robust registration frameworks for variations of the intra-operative point cloud.

Title: Regularized Personalization of Text-to-Image Diffusion Models without Distributional Drift

Authors: Gihoon Kim, Hyungjin Park, Taesup Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19519
Pdf URL: https://arxiv.org/pdf/2505.19519
Copy Paste: [[2505.19519]] Regularized Personalization of Text-to-Image Diffusion Models without Distributional Drift(https://arxiv.org/abs/2505.19519)
Keywords: diffusion, generative
Abstract: Personalization using text-to-image diffusion models involves adapting a pretrained model to novel subjects with only a few image examples. This task presents a fundamental challenge, as the model must not only learn the new subject effectively but also preserve its ability to generate diverse and coherent outputs across a wide range of prompts. In other words, successful personalization requires integrating new concepts without forgetting previously learned generative capabilities. Forgetting denotes unintended distributional drift, where the model's output distribution deviates from that of the original pretrained model. In this paper, we provide an analysis of this issue and identify a mismatch between standard training objectives and the goals of personalization. To address this, we propose a new training objective based on a Lipschitz-bounded formulation that explicitly constrains deviation from the pretrained distribution. Our method provides improved control over distributional drift and performs well even in data-scarce scenarios. Experimental results demonstrate that our approach consistently outperforms existing personalization methods, achieving higher CLIP-T, CLIP-I, and DINO scores.

Title: Applications and Effect Evaluation of Generative Adversarial Networks in Semi-Supervised Learning

Authors: Jiyu Hu, Haijiang Zeng, Zhen Tian
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19522
Pdf URL: https://arxiv.org/pdf/2505.19522
Copy Paste: [[2505.19522]] Applications and Effect Evaluation of Generative Adversarial Networks in Semi-Supervised Learning(https://arxiv.org/abs/2505.19522)
Keywords: generative
Abstract: In recent years, image classification, as a core task in computer vision, relies on high-quality labelled data, which restricts the wide application of deep learning models in practical scenarios. To alleviate the problem of insufficient labelled samples, semi-supervised learning has gradually become a research hotspot. In this paper, we construct a semi-supervised image classification model based on Generative Adversarial Networks (GANs), and through the introduction of the collaborative training mechanism of generators, discriminators and classifiers, we achieve the effective use of limited labelled data and a large amount of unlabelled data, improve the quality of image generation and classification accuracy, and provide an effective solution for the task of image recognition in complex environments.

Title: Navigating loss manifolds via rigid body dynamics: A promising avenue for robustness and generalisation

Authors: Mohammed D. Belgoumri, Mohamed Reda Bouadjenek, Hakim Hacid, Imran Razzak, Sunil Aryal
Subjects: cs.LG, cs.AI, math.OC
Abstract URL: https://arxiv.org/abs/2505.19527
Pdf URL: https://arxiv.org/pdf/2505.19527
Copy Paste: [[2505.19527]] Navigating loss manifolds via rigid body dynamics: A promising avenue for robustness and generalisation(https://arxiv.org/abs/2505.19527)
Keywords: robust
Abstract: Training large neural networks through gradient-based optimization requires navigating high-dimensional loss landscapes, which often exhibit pathological geometry, leading to undesirable training dynamics. In particular, poor generalization frequently results from convergence to sharp minima that are highly sensitive to input perturbations, causing the model to overfit the training data while failing to generalize to unseen examples. Furthermore, these optimization procedures typically display strong dependence on the fine structure of the loss landscape, leading to unstable training dynamics, due to the fractal-like nature of the loss surface. In this work, we propose an alternative optimizer that simultaneously reduces this dependence, and avoids sharp minima, thereby improving generalization. This is achieved by simulating the motion of the center of a ball rolling on the loss landscape. The degree to which our optimizer departs from the standard gradient descent is controlled by a hyperparameter, representing the radius of the ball. Changing this hyperparameter allows for probing the loss landscape at different scales, making it a valuable tool for understanding its geometry.

Title: AmpleHate: Amplifying the Attention for Versatile Implicit Hate Detection

Authors: Yejin Lee, Joonghyuk Hahn, Hyeseon Ahn, Yo-Sub Han
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2505.19528
Pdf URL: https://arxiv.org/pdf/2505.19528
Copy Paste: [[2505.19528]] AmpleHate: Amplifying the Attention for Versatile Implicit Hate Detection(https://arxiv.org/abs/2505.19528)
Keywords: robust, interpretability
Abstract: Implicit hate speech detection is challenging due to its subtlety and reliance on contextual interpretation rather than explicit offensive words. Current approaches rely on contrastive learning, which are shown to be effective on distinguishing hate and non-hate sentences. Humans, however, detect implicit hate speech by first identifying specific targets within the text and subsequently interpreting how these target relate to their surrounding context. Motivated by this reasoning process, we propose AmpleHate, a novel approach designed to mirror human inference for implicit hate detection. AmpleHate identifies explicit target using a pretrained Named Entity Recognition model and capture implicit target information via [CLS] tokens. It computes attention-based relationships between explicit, implicit targets and sentence context and then, directly injects these relational vectors into the final sentence representation. This amplifies the critical signals of target-context relations for determining implicit hate. Experiments demonstrate that AmpleHate achieves state-of-the-art performance, outperforming contrastive learning baselines by an average of 82.14% and achieve faster convergence. Qualitative analyses further reveal that attention patterns produced by AmpleHate closely align with human judgement, underscoring its interpretability and robustness.

Title: Minimalist Softmax Attention Provably Learns Constrained Boolean Functions

Authors: Jerry Yao-Chieh Hu, Xiwen Zhang, Maojiang Su, Zhao Song, Han Liu
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2505.19531
Pdf URL: https://arxiv.org/pdf/2505.19531
Copy Paste: [[2505.19531]] Minimalist Softmax Attention Provably Learns Constrained Boolean Functions(https://arxiv.org/abs/2505.19531)
Keywords: transformer
Abstract: We study the computational limits of learning $k$-bit Boolean functions (specifically, $\mathrm{AND}$, $\mathrm{OR}$, and their noisy variants), using a minimalist single-head softmax-attention mechanism, where $k=\Theta(d)$ relevant bits are selected from $d$ inputs. We show that these simple $\mathrm{AND}$ and $\mathrm{OR}$ functions are unsolvable with a single-head softmax-attention mechanism alone. However, with teacher forcing, the same minimalist attention is capable of solving them. These findings offer two key insights: Architecturally, solving these Boolean tasks requires only minimalist attention, without deep Transformer blocks or FFNs. Methodologically, one gradient descent update with supervision suffices and replaces the multi-step Chain-of-Thought (CoT) reasoning scheme of [Kim and Suzuki, ICLR 2025] for solving Boolean problems. Together, the bounds expose a fundamental gap between what this minimal architecture achieves under ideal supervision and what is provably impossible under standard training.

Title: Fox in the Henhouse: Supply-Chain Backdoor Attacks Against Reinforcement Learning

Authors: Shijie Liu, Andrew C. Cullen, Paul Montague, Sarah Erfani, Benjamin I. P. Rubinstein
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19532
Pdf URL: https://arxiv.org/pdf/2505.19532
Copy Paste: [[2505.19532]] Fox in the Henhouse: Supply-Chain Backdoor Attacks Against Reinforcement Learning(https://arxiv.org/abs/2505.19532)
Keywords: attack
Abstract: The current state-of-the-art backdoor attacks against Reinforcement Learning (RL) rely upon unrealistically permissive access models, that assume the attacker can read (or even write) the victim's policy parameters, observations, or rewards. In this work, we question whether such a strong assumption is required to launch backdoor attacks against RL. To answer this question, we propose the \underline{S}upply-\underline{C}h\underline{a}in \underline{B}ackdoor (SCAB) attack, which targets a common RL workflow: training agents using external agents that are provided separately or embedded within the environment. In contrast to prior works, our attack only relies on legitimate interactions of the RL agent with the supplied agents. Despite this limited access model, by poisoning a mere $3\%$ of training experiences, our attack can successfully activate over $90\%$ of triggered actions, reducing the average episodic return by $80\%$ for the victim. Our novel attack demonstrates that RL attacks are likely to become a reality under untrusted RL training supply-chains.

Title: ExAnte: A Benchmark for Ex-Ante Inference in Large Language Models

Authors: Yachuan Liu, Xiaochun Wei, Lin Shi, Xinnuo Li, Bohan Zhang, Paramveer Dhillon, Qiaozhu Mei
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19533
Pdf URL: https://arxiv.org/pdf/2505.19533
Copy Paste: [[2505.19533]] ExAnte: A Benchmark for Ex-Ante Inference in Large Language Models(https://arxiv.org/abs/2505.19533)
Keywords: large language model
Abstract: Large language models (LLMs) face significant challenges in ex-ante reasoning, where analysis, inference, or predictions must be made without access to information from future events. Even with explicit prompts enforcing temporal cutoffs, LLMs often generate outputs influenced by internalized knowledge of events beyond the specified cutoff. This paper introduces a novel task and benchmark designed to evaluate the ability of LLMs to reason while adhering to such temporal constraints. The benchmark includes a variety of tasks: stock prediction, Wikipedia event prediction, scientific publication prediction, and Question Answering (QA), designed to assess factual knowledge under temporal cutoff constraints. We use leakage rate to quantify models' reliance on future information beyond cutoff timestamps. Experimental results reveal that LLMs struggle to consistently adhere to temporal cutoffs across common prompting strategies and tasks, demonstrating persistent challenges in ex-ante reasoning. This benchmark provides a potential evaluation framework to advance the development of LLMs' temporal reasoning ability for time-sensitive applications.

Title: TDVE-Assessor: Benchmarking and Evaluating the Quality of Text-Driven Video Editing with LMMs

Authors: Juntong Wang, Jiarui Wang, Huiyu Duan, Guangtao Zhai, Xiongkuo Min
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19535
Pdf URL: https://arxiv.org/pdf/2505.19535
Copy Paste: [[2505.19535]] TDVE-Assessor: Benchmarking and Evaluating the Quality of Text-Driven Video Editing with LMMs(https://arxiv.org/abs/2505.19535)
Keywords: large language model
Abstract: Text-driven video editing is rapidly advancing, yet its rigorous evaluation remains challenging due to the absence of dedicated video quality assessment (VQA) models capable of discerning the nuances of editing quality. To address this critical gap, we introduce TDVE-DB, a large-scale benchmark dataset for text-driven video editing. TDVE-DB consists of 3,857 edited videos generated from 12 diverse models across 8 editing categories, and is annotated with 173,565 human subjective ratings along three crucial dimensions, i.e., edited video quality, editing alignment, and structural consistency. Based on TDVE-DB, we first conduct a comprehensive evaluation for the 12 state-of-the-art editing models revealing the strengths and weaknesses of current video techniques, and then benchmark existing VQA methods in the context of text-driven video editing evaluation. Building on these insights, we propose TDVE-Assessor, a novel VQA model specifically designed for text-driven video editing assessment. TDVE-Assessor integrates both spatial and temporal video features into a large language model (LLM) for rich contextual understanding to provide comprehensive quality assessment. Extensive experiments demonstrate that TDVE-Assessor substantially outperforms existing VQA models on TDVE-DB across all three evaluation dimensions, setting a new state-of-the-art. Both TDVE-DB and TDVE-Assessor will be released upon the publication.

Title: SMART-PC: Skeletal Model Adaptation for Robust Test-Time Training in Point Clouds

Authors: Ali Bahri, Moslem Yazdanpanah, Sahar Dastani, Mehrdad Noori, Gustavo Adolfo Vargas Hakim, David Osowiechi, Farzad Beizaee, Ismail Ben Ayed, Christian Desrosiers
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19546
Pdf URL: https://arxiv.org/pdf/2505.19546
Copy Paste: [[2505.19546]] SMART-PC: Skeletal Model Adaptation for Robust Test-Time Training in Point Clouds(https://arxiv.org/abs/2505.19546)
Keywords: robust
Abstract: Test-Time Training (TTT) has emerged as a promising solution to address distribution shifts in 3D point cloud classification. However, existing methods often rely on computationally expensive backpropagation during adaptation, limiting their applicability in real-world, time-sensitive scenarios. In this paper, we introduce SMART-PC, a skeleton-based framework that enhances resilience to corruptions by leveraging the geometric structure of 3D point clouds. During pre-training, our method predicts skeletal representations, enabling the model to extract robust and meaningful geometric features that are less sensitive to corruptions, thereby improving adaptability to test-time distribution shifts. Unlike prior approaches, SMART-PC achieves real-time adaptation by eliminating backpropagation and updating only BatchNorm statistics, resulting in a lightweight and efficient framework capable of achieving high frame-per-second rates while maintaining superior classification performance. Extensive experiments on benchmark datasets, including ModelNet40-C, ShapeNet-C, and ScanObjectNN-C, demonstrate that SMART-PC achieves state-of-the-art results, outperforming existing methods such as MATE in terms of both accuracy and computational efficiency. The implementation is available at: this https URL.

Title: STRAP: Spatio-Temporal Pattern Retrieval for Out-of-Distribution Generalization

Authors: Haoyu Zhang, Wentao Zhang, Hao Miao, Xinke Jiang, Yuchen Fang, Yifan Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19547
Pdf URL: https://arxiv.org/pdf/2505.19547
Copy Paste: [[2505.19547]] STRAP: Spatio-Temporal Pattern Retrieval for Out-of-Distribution Generalization(https://arxiv.org/abs/2505.19547)
Keywords: robust
Abstract: Spatio-Temporal Graph Neural Networks (STGNNs) have emerged as a powerful tool for modeling dynamic graph-structured data across diverse domains. However, they often fail to generalize in Spatio-Temporal Out-of-Distribution (STOOD) scenarios, where both temporal dynamics and spatial structures evolve beyond the training distribution. To address this problem, we propose an innovative Spatio-Temporal Retrieval-Augmented Pattern Learning framework,STRAP, which enhances model generalization by integrating retrieval-augmented learning into the STGNN continue learning pipeline. The core of STRAP is a compact and expressive pattern library that stores representative spatio-temporal patterns enriched with historical, structural, and semantic information, which is obtained and optimized during the training phase. During inference, STRAP retrieves relevant patterns from this library based on similarity to the current input and injects them into the model via a plug-and-play prompting mechanism. This not only strengthens spatio-temporal representations but also mitigates catastrophic forgetting. Moreover, STRAP introduces a knowledge-balancing objective to harmonize new information with retrieved knowledge. Extensive experiments across multiple real-world streaming graph datasets show that STRAP consistently outperforms state-of-the-art STGNN baselines on STOOD tasks, demonstrating its robustness, adaptability, and strong generalization capability without task-specific fine-tuning.

Title: How Syntax Specialization Emerges in Language Models

Authors: Xufeng Duan, Zhaoqian Yao, Yunhao Zhang, Shaonan Wang, Zhenguang G. Cai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19548
Pdf URL: https://arxiv.org/pdf/2505.19548
Copy Paste: [[2505.19548]] How Syntax Specialization Emerges in Language Models(https://arxiv.org/abs/2505.19548)
Keywords: large language model
Abstract: Large language models (LLMs) have been found to develop surprising internal specializations: Individual neurons, attention heads, and circuits become selectively sensitive to syntactic structure, reflecting patterns observed in the human brain. While this specialization is well-documented, how it emerges during training and what influences its development remains largely unknown. In this work, we tap into the black box of specialization by tracking its formation over time. By quantifying internal syntactic consistency across minimal pairs from various syntactic phenomena, we identify a clear developmental trajectory: Syntactic sensitivity emerges gradually, concentrates in specific layers, and exhibits a 'critical period' of rapid internal specialization. This process is consistent across architectures and initialization parameters (e.g., random seeds), and is influenced by model scale and training data. We therefore reveal not only where syntax arises in LLMs but also how some models internalize it during training. To support future research, we will release the code, models, and training checkpoints upon acceptance.

Title: Towards Multi-Granularity Memory Association and Selection for Long-Term Conversational Agents

Authors: Derong Xu, Yi Wen, Pengyue Jia, Yingyi Zhang, wenlin zhang, Yichao Wang, Huifeng Guo, Ruiming Tang, Xiangyu Zhao, Enhong Chen, Tong Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19549
Pdf URL: https://arxiv.org/pdf/2505.19549
Copy Paste: [[2505.19549]] Towards Multi-Granularity Memory Association and Selection for Long-Term Conversational Agents(https://arxiv.org/abs/2505.19549)
Keywords: large language model, segmentation
Abstract: Large Language Models (LLMs) have recently been widely adopted in conversational agents. However, the increasingly long interactions between users and agents accumulate extensive dialogue records, making it difficult for LLMs with limited context windows to maintain a coherent long-term dialogue memory and deliver personalized responses. While retrieval-augmented memory systems have emerged to address this issue, existing methods often depend on single-granularity memory segmentation and retrieval. This approach falls short in capturing deep memory connections, leading to partial retrieval of useful information or substantial noise, resulting in suboptimal performance. To tackle these limits, we propose MemGAS, a framework that enhances memory consolidation by constructing multi-granularity association, adaptive selection, and retrieval. MemGAS is based on multi-granularity memory units and employs Gaussian Mixture Models to cluster and associate new memories with historical ones. An entropy-based router adaptively selects optimal granularity by evaluating query relevance distributions and balancing information completeness and noise. Retrieved memories are further refined via LLM-based filtering. Experiments on four long-term memory benchmarks demonstrate that MemGAS outperforms state-of-the-art methods on both question answer and retrieval tasks, achieving superior performance across different query types and top-K settings.

Title: On scalable and efficient training of diffusion samplers

Authors: Minkyu Kim, Kiyoung Seong, Dongyeop Woo, Sungsoo Ahn, Minsu Kim
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19552
Pdf URL: https://arxiv.org/pdf/2505.19552
Copy Paste: [[2505.19552]] On scalable and efficient training of diffusion samplers(https://arxiv.org/abs/2505.19552)
Keywords: diffusion
Abstract: We address the challenge of training diffusion models to sample from unnormalized energy distributions in the absence of data, the so-called diffusion samplers. Although these approaches have shown promise, they struggle to scale in more demanding scenarios where energy evaluations are expensive and the sampling space is high-dimensional. To address this limitation, we propose a scalable and sample-efficient framework that properly harmonizes the powerful classical sampling method and the diffusion sampler. Specifically, we utilize Monte Carlo Markov chain (MCMC) samplers with a novelty-based auxiliary energy as a Searcher to collect off-policy samples, using an auxiliary energy function to compensate for exploring modes the diffusion sampler rarely visits. These off-policy samples are then combined with on-policy data to train the diffusion sampler, thereby expanding its coverage of the energy landscape. Furthermore, we identify primacy bias, i.e., the preference of samplers for early experience during training, as the main cause of mode collapse during training, and introduce a periodic re-initialization trick to resolve this issue. Our method significantly improves sample efficiency on standard benchmarks for diffusion samplers and also excels at higher-dimensional problems and real-world molecular conformer generation.

Title: Aggregated Structural Representation with Large Language Models for Human-Centric Layout Generation

Authors: Jiongchao Jin, Shengchu Zhao, Dajun Chen, Wei Jiang, Yong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19554
Pdf URL: https://arxiv.org/pdf/2505.19554
Copy Paste: [[2505.19554]] Aggregated Structural Representation with Large Language Models for Human-Centric Layout Generation(https://arxiv.org/abs/2505.19554)
Keywords: transformer, generative, large language model
Abstract: Time consumption and the complexity of manual layout design make automated layout generation a critical task, especially for multiple applications across different mobile devices. Existing graph-based layout generation approaches suffer from limited generative capability, often resulting in unreasonable and incompatible outputs. Meanwhile, vision based generative models tend to overlook the original structural information, leading to component intersections and overlaps. To address these challenges, we propose an Aggregation Structural Representation (ASR) module that integrates graph networks with large language models (LLMs) to preserve structural information while enhancing generative capability. This novel pipeline utilizes graph features as hierarchical prior knowledge, replacing the traditional Vision Transformer (ViT) module in multimodal large language models (MLLM) to predict full layout information for the first time. Moreover, the intermediate graph matrix used as input for the LLM is human editable, enabling progressive, human centric design generation. A comprehensive evaluation on the RICO dataset demonstrates the strong performance of ASR, both quantitatively using mean Intersection over Union (mIoU), and qualitatively through a crowdsourced user study. Additionally, sampling on relational features ensures diverse layout generation, further enhancing the adaptability and creativity of the proposed approach.

Title: Few-Shot Class-Incremental Learning For Efficient SAR Automatic Target Recognition

Authors: George Karantaidis, Athanasios Pantsios, Ioannis Kompatsiaris, Symeon Papadopoulos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19565
Pdf URL: https://arxiv.org/pdf/2505.19565
Copy Paste: [[2505.19565]] Few-Shot Class-Incremental Learning For Efficient SAR Automatic Target Recognition(https://arxiv.org/abs/2505.19565)
Keywords: robust, extraction
Abstract: Synthetic aperture radar automatic target recognition (SAR-ATR) systems have rapidly evolved to tackle incremental recognition challenges in operational settings. Data scarcity remains a major hurdle that conventional SAR-ATR techniques struggle to address. To cope with this challenge, we propose a few-shot class-incremental learning (FSCIL) framework based on a dual-branch architecture that focuses on local feature extraction and leverages the discrete Fourier transform and global filters to capture long-term spatial dependencies. This incorporates a lightweight cross-attention mechanism that fuses domain-specific features with global dependencies to ensure robust feature interaction, while maintaining computational efficiency by introducing minimal scale-shift parameters. The framework combines focal loss for class distinction under imbalance and center loss for compact intra-class distributions to enhance class separation boundaries. Experimental results on the MSTAR benchmark dataset demonstrate that the proposed framework consistently outperforms state-of-the-art methods in FSCIL SAR-ATR, attesting to its effectiveness in real-world scenarios.

Title: What You Perceive Is What You Conceive: A Cognition-Inspired Framework for Open Vocabulary Image Segmentation

Authors: Jianghang Lin, Yue Hu, Jiangtao Shen, Yunhang Shen, Liujuan Cao, Shengchuan Zhang, Rongrong Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19569
Pdf URL: https://arxiv.org/pdf/2505.19569
Copy Paste: [[2505.19569]] What You Perceive Is What You Conceive: A Cognition-Inspired Framework for Open Vocabulary Image Segmentation(https://arxiv.org/abs/2505.19569)
Keywords: generative, segmentation
Abstract: Open vocabulary image segmentation tackles the challenge of recognizing dynamically adjustable, predefined novel categories at inference time by leveraging vision-language alignment. However, existing paradigms typically perform class-agnostic region segmentation followed by category matching, which deviates from the human visual system's process of recognizing objects based on semantic concepts, leading to poor alignment between region segmentation and target concepts. To bridge this gap, we propose a novel Cognition-Inspired Framework for open vocabulary image segmentation that emulates the human visual recognition process: first forming a conceptual understanding of an object, then perceiving its spatial extent. The framework consists of three core components: (1) A Generative Vision-Language Model (G-VLM) that mimics human cognition by generating object concepts to provide semantic guidance for region segmentation. (2) A Concept-Aware Visual Enhancer Module that fuses textual concept features with global visual representations, enabling adaptive visual perception based on target concepts. (3) A Cognition-Inspired Decoder that integrates local instance features with G-VLM-provided semantic cues, allowing selective classification over a subset of relevant categories. Extensive experiments demonstrate that our framework achieves significant improvements, reaching $27.2$ PQ, $17.0$ mAP, and $35.3$ mIoU on A-150. It further attains $56.2$, $28.2$, $15.4$, $59.2$, $18.7$, and $95.8$ mIoU on Cityscapes, Mapillary Vistas, A-847, PC-59, PC-459, and PAS-20, respectively. In addition, our framework supports vocabulary-free segmentation, offering enhanced flexibility in recognizing unseen categories. Code will be public.

Title: DocMEdit: Towards Document-Level Model Editing

Authors: Li Zeng, Zeming Liu, Chong Feng, Heyan Huang, Yuhang Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19572
Pdf URL: https://arxiv.org/pdf/2505.19572
Copy Paste: [[2505.19572]] DocMEdit: Towards Document-Level Model Editing(https://arxiv.org/abs/2505.19572)
Keywords: large language model
Abstract: Model editing aims to correct errors and outdated knowledge in the Large language models (LLMs) with minimal cost. Prior research has proposed a variety of datasets to assess the effectiveness of these model editing methods. However, most existing datasets only require models to output short phrases or sentences, overlooks the widespread existence of document-level tasks in the real world, raising doubts about their practical usability. Aimed at addressing this limitation and promoting the application of model editing in real-world scenarios, we propose the task of document-level model editing. To tackle such challenges and enhance model capabilities in practical settings, we introduce \benchmarkname, a dataset focused on document-level model editing, characterized by document-level inputs and outputs, extrapolative, and multiple facts within a single edit. We propose a series of evaluation metrics and experiments. The results show that the difficulties in document-level model editing pose challenges for existing model editing methods.

Title: Guard Me If You Know Me: Protecting Specific Face-Identity from Deepfakes

Authors: Kaiqing Lin, Zhiyuan Yan, Ke-Yue Zhang, Li Hao, Yue Zhou, Yuzhen Lin, Weixiang Li, Taiping Yao, Shouhong Ding, Bin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19582
Pdf URL: https://arxiv.org/pdf/2505.19582
Copy Paste: [[2505.19582]] Guard Me If You Know Me: Protecting Specific Face-Identity from Deepfakes(https://arxiv.org/abs/2505.19582)
Keywords: protect, attack, large language model
Abstract: Securing personal identity against deepfake attacks is increasingly critical in the digital age, especially for celebrities and political figures whose faces are easily accessible and frequently targeted. Most existing deepfake detection methods focus on general-purpose scenarios and often ignore the valuable prior knowledge of known facial identities, e.g., "VIP individuals" whose authentic facial data are already available. In this paper, we propose \textbf{VIPGuard}, a unified multimodal framework designed to capture fine-grained and comprehensive facial representations of a given identity, compare them against potentially fake or similar-looking faces, and reason over these comparisons to make accurate and explainable predictions. Specifically, our framework consists of three main stages. First, fine-tune a multimodal large language model (MLLM) to learn detailed and structural facial attributes. Second, we perform identity-level discriminative learning to enable the model to distinguish subtle differences between highly similar faces, including real and fake variations. Finally, we introduce user-specific customization, where we model the unique characteristics of the target face identity and perform semantic reasoning via MLLM to enable personalized and explainable deepfake detection. Our framework shows clear advantages over previous detection works, where traditional detectors mainly rely on low-level visual cues and provide no human-understandable explanations, while other MLLM-based models often lack a detailed understanding of specific face identities. To facilitate the evaluation of our method, we built a comprehensive identity-aware benchmark called \textbf{VIPBench} for personalized deepfake detection, involving the latest 7 face-swapping and 7 entire face synthesis techniques for generation.

Title: Beyond Segmentation: Confidence-Aware and Debiased Estimation of Ratio-based Biomarkers

Authors: Jiameng Li, Teodora Popordanoska, Sebastian G. Gruber, Frederik Maes, Matthew B. Blaschko
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19585
Pdf URL: https://arxiv.org/pdf/2505.19585
Copy Paste: [[2505.19585]] Beyond Segmentation: Confidence-Aware and Debiased Estimation of Ratio-based Biomarkers(https://arxiv.org/abs/2505.19585)
Keywords: segmentation
Abstract: Ratio-based biomarkers -- such as the proportion of necrotic tissue within a tumor -- are widely used in clinical practice to support diagnosis, prognosis and treatment planning. These biomarkers are typically estimated from soft segmentation outputs by computing region-wise ratios. Despite the high-stakes nature of clinical decision making, existing methods provide only point estimates, offering no measure of uncertainty. In this work, we propose a unified \textit{confidence-aware} framework for estimating ratio-based biomarkers. We conduct a systematic analysis of error propagation in the segmentation-to-biomarker pipeline and identify model miscalibration as the dominant source of uncertainty. To mitigate this, we incorporate a lightweight, post-hoc calibration module that can be applied using internal hospital data without retraining. We leverage a tunable parameter $Q$ to control the confidence level of the derived bounds, allowing adaptation towards clinical practice. Extensive experiments show that our method produces statistically sound confidence intervals, with tunable confidence levels, enabling more trustworthy application of predictive biomarkers in clinical workflows.

Title: TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization

Authors: Dingyu Yao, Bowen Shen, Zheng Lin, Wei Liu, Jian Luan, Bin Wang, Weiping Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19586
Pdf URL: https://arxiv.org/pdf/2505.19586
Copy Paste: [[2505.19586]] TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization(https://arxiv.org/abs/2505.19586)
Keywords: generative, large language model
Abstract: The Key-Value (KV) cache in generative large language models (LLMs) introduces substantial memory overhead. Existing works mitigate this burden by offloading or compressing the KV cache. However, loading the entire cache incurs significant latency due to PCIe bandwidth bottlenecks in CPU-GPU communication, while aggressive compression causes notable performance degradation. We identify that certain layers in the LLM need to maintain global information and are unsuitable for selective loading. In contrast, other layers primarily focus on a few tokens with dominant activations that potentially incur substantial quantization error. This observation leads to a key insight that loading dominant tokens and quantizing all tokens can complement each other. Building on this insight, we propose a hybrid compression method, TailorKV, which seamlessly integrates quantization and offloading. TailorKV develops an inference framework along with a hardware-friendly implementation that leverages these complementary characteristics. Extensive long-context evaluations exhibit that TailorKV achieves nearly lossless performance under aggressive compression settings, outperforming the state-of-the-art. Particularly, the Llama-3.1-8B with 128k context can be served within a single RTX 3090 GPU, reaching 82 ms per token during decoding.

Title: WQLCP: Weighted Adaptive Conformal Prediction for Robust Uncertainty Quantification Under Distribution Shifts

Authors: Shadi Alijani, Homayoun Najjaran
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.19587
Pdf URL: https://arxiv.org/pdf/2505.19587
Copy Paste: [[2505.19587]] WQLCP: Weighted Adaptive Conformal Prediction for Robust Uncertainty Quantification Under Distribution Shifts(https://arxiv.org/abs/2505.19587)
Keywords: robust
Abstract: Conformal prediction (CP) provides a framework for constructing prediction sets with guaranteed coverage, assuming exchangeable data. However, real-world scenarios often involve distribution shifts that violate exchangeability, leading to unreliable coverage and inflated prediction sets. To address this challenge, we first introduce Reconstruction Loss-Scaled Conformal Prediction (RLSCP), which utilizes reconstruction losses derived from a Variational Autoencoder (VAE) as an uncertainty metric to scale score functions. While RLSCP demonstrates performance improvements, mainly resulting in better coverage, it quantifies quantiles based on a fixed calibration dataset without considering the discrepancies between test and train datasets in an unexchangeable setting. In the next step, we propose Weighted Quantile Loss-scaled Conformal Prediction (WQLCP), which refines RLSCP by incorporating a weighted notion of exchangeability, adjusting the calibration quantile threshold based on weights with respect to the ratio of calibration and test loss values. This approach improves the CP-generated prediction set outputs in the presence of distribution shifts. Experiments on large-scale datasets, including ImageNet variants, demonstrate that WQLCP outperforms existing baselines by consistently maintaining coverage while reducing prediction set sizes, providing a robust solution for CP under distribution shifts.

Title: Model Agnostic Differentially Private Causal Inference

Authors: Christiant Lebeda, Mathieu Even, Aurélien Bellet, Julie Josse
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.19589
Pdf URL: https://arxiv.org/pdf/2505.19589
Copy Paste: [[2505.19589]] Model Agnostic Differentially Private Causal Inference(https://arxiv.org/abs/2505.19589)
Keywords: privacy, protect
Abstract: Estimating causal effects from observational data is essential in fields such as medicine, economics and social sciences, where privacy concerns are paramount. We propose a general, model-agnostic framework for differentially private estimation of average treatment effects (ATE) that avoids strong structural assumptions on the data-generating process or the models used to estimate propensity scores and conditional outcomes. In contrast to prior work, which enforces differential privacy by directly privatizing these nuisance components and results in a privacy cost that scales with model complexity, our approach decouples nuisance estimation from privacy protection. This separation allows the use of flexible, state-of-the-art black-box models, while differential privacy is achieved by perturbing only predictions and aggregation steps within a fold-splitting scheme with ensemble techniques. We instantiate the framework for three classical estimators -- the G-formula, inverse propensity weighting (IPW), and augmented IPW (AIPW) -- and provide formal utility and privacy guarantees. Empirical results show that our methods maintain competitive performance under realistic privacy budgets. We further extend our framework to support meta-analysis of multiple private ATE estimates. Our results bridge a critical gap between causal inference and privacy-preserving data analysis.

Title: Learning to Reason without External Rewards

Authors: Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2505.19590
Pdf URL: https://arxiv.org/pdf/2505.19590
Copy Paste: [[2505.19590]] Learning to Reason without External Rewards(https://arxiv.org/abs/2505.19590)
Keywords: large language model
Abstract: Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at this https URL

Title: Multi-Agent Collaboration via Evolving Orchestration

Authors: Yufan Dang, Chen Qian, Xueheng Luo, Jingru Fan, Zihao Xie, Ruijie Shi, Weize Chen, Cheng Yang, Xiaoyin Che, Ye Tian, Xuantang Xiong, Lei Han, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2505.19591
Pdf URL: https://arxiv.org/pdf/2505.19591
Copy Paste: [[2505.19591]] Multi-Agent Collaboration via Evolving Orchestration(https://arxiv.org/abs/2505.19591)
Keywords: large language model
Abstract: Large language models (LLMs) have achieved remarkable results across diverse downstream tasks, but their monolithic nature restricts scalability and efficiency in complex problem-solving. While recent research explores multi-agent collaboration among LLMs, most approaches rely on static organizational structures that struggle to adapt as task complexity and agent numbers grow, resulting in coordination overhead and inefficiencies. To this end, we propose a puppeteer-style paradigm for LLM-based multi-agent collaboration, where a centralized orchestrator ("puppeteer") dynamically directs agents ("puppets") in response to evolving task states. This orchestrator is trained via reinforcement learning to adaptively sequence and prioritize agents, enabling flexible and evolvable collective reasoning. Experiments on closed- and open-domain scenarios show that this method achieves superior performance with reduced computational costs. Analyses further reveal that the key improvements consistently stem from the emergence of more compact, cyclic reasoning structures under the orchestrator's evolution.

Title: Evaluating Robustness of Large Audio Language Models to Audio Injection: An Empirical Study

Authors: Guanyu Hou, Jiaming He, Yinhang Zhou, Ji Guo, Yitong Qiao, Rui Zhang, Wenbo Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19598
Pdf URL: https://arxiv.org/pdf/2505.19598
Copy Paste: [[2505.19598]] Evaluating Robustness of Large Audio Language Models to Audio Injection: An Empirical Study(https://arxiv.org/abs/2505.19598)
Keywords: secure, defense, attack, robust
Abstract: Large Audio-Language Models (LALMs) are increasingly deployed in real-world applications, yet their robustness against malicious audio injection attacks remains underexplored. This study systematically evaluates five leading LALMs across four attack scenarios: Audio Interference Attack, Instruction Following Attack, Context Injection Attack, and Judgment Hijacking Attack. Using metrics like Defense Success Rate, Context Robustness Score, and Judgment Robustness Index, their vulnerabilities and resilience were quantitatively assessed. Experimental results reveal significant performance disparities among models; no single model consistently outperforms others across all attack types. The position of malicious content critically influences attack effectiveness, particularly when placed at the beginning of sequences. A negative correlation between instruction-following capability and robustness suggests models adhering strictly to instructions may be more susceptible, contrasting with greater resistance by safety-aligned models. Additionally, system prompts show mixed effectiveness, indicating the need for tailored strategies. This work introduces a benchmark framework and highlights the importance of integrating robustness into training pipelines. Findings emphasize developing multi-modal defenses and architectural designs that decouple capability from susceptibility for secure LALMs deployment.

Title: Preference Optimization by Estimating the Ratio of the Data Distribution

Authors: Yeongmin Kim, Heesun Bae, Byeonghu Na, Il-Chul Moon
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.19601
Pdf URL: https://arxiv.org/pdf/2505.19601
Copy Paste: [[2505.19601]] Preference Optimization by Estimating the Ratio of the Data Distribution(https://arxiv.org/abs/2505.19601)
Keywords: large language model
Abstract: Direct preference optimization (DPO) is widely used as a simple and stable method for aligning large language models (LLMs) with human preferences. This paper investigates a generalized DPO loss that enables a policy model to match the target policy from a likelihood ratio estimation perspective. The ratio of the target policy provides a unique identification of the policy distribution without relying on reward models or partition functions. This allows the generalized loss to retain both simplicity and theoretical guarantees, which prior work such as $f$-PO fails to achieve simultaneously. We propose Bregman preference optimization (BPO), a generalized framework for ratio matching that provides a family of objective functions achieving target policy optimality. BPO subsumes DPO as a special case and offers tractable forms for all instances, allowing implementation with a few lines of code. We further develop scaled Basu's power divergence (SBA), a gradient scaling method that can be used for BPO instances. The BPO framework complements other DPO variants and is applicable to target policies defined by these variants. In experiments, unlike other probabilistic loss extensions such as $f$-DPO or $f$-PO, which exhibit a trade-off between generation fidelity and diversity, instances of BPO improve both win rate and entropy compared with DPO. When applied to Llama-3-Instruct-8B, BPO achieves state-of-the-art performance among Llama-3-8B backbones, with a 55.9\% length-controlled win rate on AlpacaEval2.

Title: Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression

Authors: Kunjun Li, Zigeng Chen, Cheng-Yen Yang, Jenq-Neng Hwang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19602
Pdf URL: https://arxiv.org/pdf/2505.19602
Copy Paste: [[2505.19602]] Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression(https://arxiv.org/abs/2505.19602)
Keywords: transformer
Abstract: Visual Autoregressive (VAR) modeling has garnered significant attention for its innovative next-scale prediction approach, which yields substantial improvements in efficiency, scalability, and zero-shot generalization. Nevertheless, the coarse-to-fine methodology inherent in VAR results in exponential growth of the KV cache during inference, causing considerable memory consumption and computational redundancy. To address these bottlenecks, we introduce ScaleKV, a novel KV cache compression framework tailored for VAR architectures. ScaleKV leverages two critical observations: varying cache demands across transformer layers and distinct attention patterns at different scales. Based on these insights, ScaleKV categorizes transformer layers into two functional groups: drafters and refiners. Drafters exhibit dispersed attention across multiple scales, thereby requiring greater cache capacity. Conversely, refiners focus attention on the current token map to process local details, consequently necessitating substantially reduced cache capacity. ScaleKV optimizes the multi-scale inference pipeline by identifying scale-specific drafters and refiners, facilitating differentiated cache management tailored to each scale. Evaluation on the state-of-the-art text-to-image VAR model family, Infinity, demonstrates that our approach effectively reduces the required KV cache memory to 10% while preserving pixel-level fidelity.

Title: Rep3D: Re-parameterize Large 3D Kernels with Low-Rank Receptive Modeling for Medical Imaging

Authors: Ho Hin Lee, Quan Liu, Shunxing Bao, Yuankai Huo, Bennett A. Landman
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19603
Pdf URL: https://arxiv.org/pdf/2505.19603
Copy Paste: [[2505.19603]] Rep3D: Re-parameterize Large 3D Kernels with Low-Rank Receptive Modeling for Medical Imaging(https://arxiv.org/abs/2505.19603)
Keywords: transformer, segmentation
Abstract: In contrast to vision transformers, which model long-range dependencies through global self-attention, large kernel convolutions provide a more efficient and scalable alternative, particularly in high-resolution 3D volumetric settings. However, naively increasing kernel size often leads to optimization instability and degradation in performance. Motivated by the spatial bias observed in effective receptive fields (ERFs), we hypothesize that different kernel elements converge at variable rates during training. To support this, we derive a theoretical connection between element-wise gradients and first-order optimization, showing that structurally re-parameterized convolution blocks inherently induce spatially varying learning rates. Building on this insight, we introduce Rep3D, a 3D convolutional framework that incorporates a learnable spatial prior into large kernel training. A lightweight two-stage modulation network generates a receptive-biased scaling mask, adaptively re-weighting kernel updates and enabling local-to-global convergence behavior. Rep3D adopts a plain encoder design with large depthwise convolutions, avoiding the architectural complexity of multi-branch compositions. We evaluate Rep3D on five challenging 3D segmentation benchmarks and demonstrate consistent improvements over state-of-the-art baselines, including transformer-based and fixed-prior re-parameterization methods. By unifying spatial inductive bias with optimization-aware learning, Rep3D offers an interpretable, and scalable solution for 3D medical image analysis. The source code is publicly available at this https URL.

Title: Kuramoto-FedAvg: Using Synchronization Dynamics to Improve Federated Learning Optimization under Statistical Heterogeneity

Authors: Aggrey Muhebwa, Khotso Selialia, Fatima Anwar, Khalid K. Osman
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19605
Pdf URL: https://arxiv.org/pdf/2505.19605
Copy Paste: [[2505.19605]] Kuramoto-FedAvg: Using Synchronization Dynamics to Improve Federated Learning Optimization under Statistical Heterogeneity(https://arxiv.org/abs/2505.19605)
Keywords: federate
Abstract: Federated learning on heterogeneous (non-IID) client data experiences slow convergence due to client drift. To address this challenge, we propose Kuramoto-FedAvg, a federated optimization algorithm that reframes the weight aggregation step as a synchronization problem inspired by the Kuramoto model of coupled oscillators. The server dynamically weighs each client's update based on its phase alignment with the global update, amplifying contributions that align with the global gradient direction while minimizing the impact of updates that are out of phase. We theoretically prove that this synchronization mechanism reduces client drift, providing a tighter convergence bound compared to the standard FedAvg under heterogeneous data distributions. Empirical validation supports our theoretical findings, showing that Kuramoto-FedAvg significantly accelerates convergence and improves accuracy across multiple benchmark datasets. Our work highlights the potential of coordination and synchronization-based strategies for managing gradient diversity and accelerating federated optimization in realistic non-IID settings.

Title: Energy-based Preference Optimization for Test-time Adaptation

Authors: Yewon Han, Seoyun Yang, Taesup Kim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19607
Pdf URL: https://arxiv.org/pdf/2505.19607
Copy Paste: [[2505.19607]] Energy-based Preference Optimization for Test-time Adaptation(https://arxiv.org/abs/2505.19607)
Keywords: robust
Abstract: Test-Time Adaptation (TTA) enhances model robustness by enabling adaptation to target distributions that differ from training distributions, improving real-world generalizability. Existing TTA approaches focus on adjusting the conditional distribution; however these methods often depend on uncertain predictions in the absence of label information, leading to unreliable performance. Energy-based frameworks suggest a promising alternative to address distribution shifts without relying on uncertain predictions, instead computing the marginal distribution of target data. However, they involve the critical challenge of requiring extensive SGLD sampling, which is impractical for test-time scenarios requiring immediate adaptation. In this work, we propose Energy-based Preference Optimization for Test-time Adaptation (EPOTTA), which is based on a sampling free strategy. We first parameterize the target model using a pretrained model and residual energy function, enabling marginal likelihood maximization of target data without sampling. Building on the observation that the parameterization is mathematically equivalent to DPO objective, we then directly adapt the model to a target distribution without explicitly training the residual. Our experiments verify that EPOTTA is well-calibrated and performant while achieving computational efficiency.

Title: Skrull: Towards Efficient Long Context Fine-tuning through Dynamic Data Scheduling

Authors: Hongtao Xu, Wenting Shen, Yuanxin Wei, Ang Wang, Guo Runfan, Tianxing Wang, Yong Li, Mingzhen Li, Weile Jia
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19609
Pdf URL: https://arxiv.org/pdf/2505.19609
Copy Paste: [[2505.19609]] Skrull: Towards Efficient Long Context Fine-tuning through Dynamic Data Scheduling(https://arxiv.org/abs/2505.19609)
Keywords: large language model
Abstract: Long-context supervised fine-tuning (Long-SFT) plays a vital role in enhancing the performance of large language models (LLMs) on long-context tasks. To smoothly adapt LLMs to long-context scenarios, this process typically entails training on mixed datasets containing both long and short sequences. However, this heterogeneous sequence length distribution poses significant challenges for existing training systems, as they fail to simultaneously achieve high training efficiency for both long and short sequences, resulting in sub-optimal end-to-end system performance in Long-SFT. In this paper, we present a novel perspective on data scheduling to address the challenges posed by the heterogeneous data distributions in Long-SFT. We propose Skrull, a dynamic data scheduler specifically designed for efficient long-SFT. Through dynamic data scheduling, Skrull balances the computation requirements of long and short sequences, improving overall training efficiency. Furthermore, we formulate the scheduling process as a joint optimization problem and thoroughly analyze the trade-offs involved. Based on those analysis, Skrull employs a lightweight scheduling algorithm to achieve near-zero cost online scheduling in Long-SFT. Finally, we implement Skrull upon DeepSpeed, a state-of-the-art distributed training system for LLMs. Experimental results demonstrate that Skrull outperforms DeepSpeed by 3.76x on average (up to 7.54x) in real-world long-SFT scenarios.

Title: JailBound: Jailbreaking Internal Safety Boundaries of Vision-Language Models

Authors: Jiaxin Song, Yixu Wang, Jie Li, Rui Yu, Yan Teng, Xingjun Ma, Yingchun Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19610
Pdf URL: https://arxiv.org/pdf/2505.19610
Copy Paste: [[2505.19610]] JailBound: Jailbreaking Internal Safety Boundaries of Vision-Language Models(https://arxiv.org/abs/2505.19610)
Keywords: defense, attack, robust
Abstract: Vision-Language Models (VLMs) exhibit impressive performance, yet the integration of powerful vision encoders has significantly broadened their attack surface, rendering them increasingly susceptible to jailbreak attacks. However, lacking well-defined attack objectives, existing jailbreak methods often struggle with gradient-based strategies prone to local optima and lacking precise directional guidance, and typically decouple visual and textual modalities, thereby limiting their effectiveness by neglecting crucial cross-modal interactions. Inspired by the Eliciting Latent Knowledge (ELK) framework, we posit that VLMs encode safety-relevant information within their internal fusion-layer representations, revealing an implicit safety decision boundary in the latent space. This motivates exploiting boundary to steer model behavior. Accordingly, we propose JailBound, a novel latent space jailbreak framework comprising two stages: (1) Safety Boundary Probing, which addresses the guidance issue by approximating decision boundary within fusion layer's latent space, thereby identifying optimal perturbation directions towards the target region; and (2) Safety Boundary Crossing, which overcomes the limitations of decoupled approaches by jointly optimizing adversarial perturbations across both image and text inputs. This latter stage employs an innovative mechanism to steer the model's internal state towards policy-violating outputs while maintaining cross-modal semantic consistency. Extensive experiments on six diverse VLMs demonstrate JailBound's efficacy, achieves 94.32% white-box and 67.28% black-box attack success averagely, which are 6.17% and 21.13% higher than SOTA methods, respectively. Our findings expose a overlooked safety risk in VLMs and highlight the urgent need for more robust defenses. Warning: This paper contains potentially sensitive, harmful and offensive content.

Title: TESSER: Transfer-Enhancing Adversarial Attacks from Vision Transformers via Spectral and Semantic Regularization

Authors: Amira Guesmi, Bassem Ouni, Muhammad Shafique
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19613
Pdf URL: https://arxiv.org/pdf/2505.19613
Copy Paste: [[2505.19613]] TESSER: Transfer-Enhancing Adversarial Attacks from Vision Transformers via Spectral and Semantic Regularization(https://arxiv.org/abs/2505.19613)
Keywords: security, attack, robust, transformer
Abstract: Adversarial transferability remains a critical challenge in evaluating the robustness of deep neural networks. In security-critical applications, transferability enables black-box attacks without access to model internals, making it a key concern for real-world adversarial threat assessment. While Vision Transformers (ViTs) have demonstrated strong adversarial performance, existing attacks often fail to transfer effectively across architectures, especially from ViTs to Convolutional Neural Networks (CNNs) or hybrid models. In this paper, we introduce \textbf{TESSER} -- a novel adversarial attack framework that enhances transferability via two key strategies: (1) \textit{Feature-Sensitive Gradient Scaling (FSGS)}, which modulates gradients based on token-wise importance derived from intermediate feature activations, and (2) \textit{Spectral Smoothness Regularization (SSR)}, which suppresses high-frequency noise in perturbations using a differentiable Gaussian prior. These components work in tandem to generate perturbations that are both semantically meaningful and spectrally smooth. Extensive experiments on ImageNet across 12 diverse architectures demonstrate that TESSER achieves +10.9\% higher attack succes rate (ASR) on CNNs and +7.2\% on ViTs compared to the state-of-the-art Adaptive Token Tuning (ATT) method. Moreover, TESSER significantly improves robustness against defended models, achieving 53.55\% ASR on adversarially trained CNNs. Qualitative analysis shows strong alignment between TESSER's perturbations and salient visual regions identified via Grad-CAM, while frequency-domain analysis reveals a 12\% reduction in high-frequency energy, confirming the effectiveness of spectral regularization.

Title: Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models

Authors: Rui Cai, Bangzheng Li, Xiaofei Wen, Muhao Chen, Zhe Zhao
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.19616
Pdf URL: https://arxiv.org/pdf/2505.19616
Copy Paste: [[2505.19616]] Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models(https://arxiv.org/abs/2505.19616)
Keywords: robust, fair, large language model
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across tasks, yet they often exhibit difficulty in distinguishing task-relevant from irrelevant signals, particularly in tasks like Visual Question Answering (VQA), which can lead to susceptibility to misleading or spurious inputs. We refer to this broader limitation as the Cross-Modality Competency Problem: the model's inability to fairly evaluate all modalities. This vulnerability becomes more evident in modality-specific tasks such as image classification or pure text question answering, where models are expected to rely solely on one modality. In such tasks, spurious information from irrelevant modalities often leads to significant performance degradation. We refer to this failure as Modality Interference, which serves as a concrete and measurable instance of the cross-modality competency problem. We further design a perturbation-based causal diagnostic experiment to verify and quantify this problem. To mitigate modality interference, we propose a novel framework to fine-tune MLLMs, including perturbation-based data augmentations with both heuristic perturbations and adversarial perturbations via Projected Gradient Descent (PGD), and a consistency regularization strategy applied to model outputs with original and perturbed inputs. Experiments on multiple benchmark datasets (image-heavy, text-heavy, and VQA tasks) and multiple model families with different scales demonstrate significant improvements in robustness and cross-modality competency, indicating our method's effectiveness in boosting unimodal reasoning ability while enhancing performance on multimodal tasks.

Title: SESaMo: Symmetry-Enforcing Stochastic Modulation for Normalizing Flows

Authors: Janik Kreit, Dominic Schuh, Kim A. Nicoli, Lena Funcke
Subjects: cs.LG, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2505.19619
Pdf URL: https://arxiv.org/pdf/2505.19619
Copy Paste: [[2505.19619]] SESaMo: Symmetry-Enforcing Stochastic Modulation for Normalizing Flows(https://arxiv.org/abs/2505.19619)
Keywords: generative
Abstract: Deep generative models have recently garnered significant attention across various fields, from physics to chemistry, where sampling from unnormalized Boltzmann-like distributions represents a fundamental challenge. In particular, autoregressive models and normalizing flows have become prominent due to their appealing ability to yield closed-form probability densities. Moreover, it is well-established that incorporating prior knowledge - such as symmetries - into deep neural networks can substantially improve training performances. In this context, recent advances have focused on developing symmetry-equivariant generative models, achieving remarkable results. Building upon these foundations, this paper introduces Symmetry-Enforcing Stochastic Modulation (SESaMo). Similar to equivariant normalizing flows, SESaMo enables the incorporation of inductive biases (e.g., symmetries) into normalizing flows through a novel technique called stochastic modulation. This approach enhances the flexibility of the generative model, allowing to effectively learn a variety of exact and broken symmetries. Our numerical experiments benchmark SESaMo in different scenarios, including an 8-Gaussian mixture model and physically relevant field theories, such as the $\phi^4$ theory and the Hubbard model.

Title: Decoupling Spatio-Temporal Prediction: When Lightweight Large Models Meet Adaptive Hypergraphs

Authors: Jiawen Chen, Qi Shao, Duxin Chen, Wenwu Yu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19620
Pdf URL: https://arxiv.org/pdf/2505.19620
Copy Paste: [[2505.19620]] Decoupling Spatio-Temporal Prediction: When Lightweight Large Models Meet Adaptive Hypergraphs(https://arxiv.org/abs/2505.19620)
Keywords: large language model
Abstract: Spatio-temporal prediction is a pivotal task with broad applications in traffic management, climate monitoring, energy scheduling, etc. However, existing methodologies often struggle to balance model expressiveness and computational efficiency, especially when scaling to large real-world datasets. To tackle these challenges, we propose STH-SepNet (Spatio-Temporal Hypergraph Separation Networks), a novel framework that decouples temporal and spatial modeling to enhance both efficiency and precision. Therein, the temporal dimension is modeled using lightweight large language models, which effectively capture low-rank temporal dynamics. Concurrently, the spatial dimension is addressed through an adaptive hypergraph neural network, which dynamically constructs hyperedges to model intricate, higher-order interactions. A carefully designed gating mechanism is integrated to seamlessly fuse temporal and spatial representations. By leveraging the fundamental principles of low-rank temporal dynamics and spatial interactions, STH-SepNet offers a pragmatic and scalable solution for spatio-temporal prediction in real-world applications. Extensive experiments on large-scale real-world datasets across multiple benchmarks demonstrate the effectiveness of STH-SepNet in boosting predictive performance while maintaining computational efficiency. This work may provide a promising lightweight framework for spatio-temporal prediction, aiming to reduce computational demands and while enhancing predictive performance. Our code is avaliable at this https URL.

Title: HomeBench: Evaluating LLMs in Smart Homes with Valid and Invalid Instructions Across Single and Multiple Devices

Authors: Silin Li, Yuhang Guo, Jiashu Yao, Zeming Liu, Haifeng Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19628
Pdf URL: https://arxiv.org/pdf/2505.19628
Copy Paste: [[2505.19628]] HomeBench: Evaluating LLMs in Smart Homes with Valid and Invalid Instructions Across Single and Multiple Devices(https://arxiv.org/abs/2505.19628)
Keywords: large language model
Abstract: Large language models (LLMs) have the potential to revolutionize smart home assistants by enhancing their ability to accurately understand user needs and respond appropriately, which is extremely beneficial for building a smarter home environment. While recent studies have explored integrating LLMs into smart home systems, they primarily focus on handling straightforward, valid single-device operation instructions. However, real-world scenarios are far more complex and often involve users issuing invalid instructions or controlling multiple devices simultaneously. These have two main challenges: LLMs must accurately identify and rectify errors in user instructions and execute multiple user instructions perfectly. To address these challenges and advance the development of LLM-based smart home assistants, we introduce HomeBench, the first smart home dataset with valid and invalid instructions across single and multiple devices in this paper. We have experimental results on 13 distinct LLMs; e.g., GPT-4o achieves only a 0.0% success rate in the scenario of invalid multi-device instructions, revealing that the existing state-of-the-art LLMs still cannot perform well in this situation even with the help of in-context learning, retrieval-augmented generation, and fine-tuning. Our code and dataset are publicly available at this https URL.

Title: DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue

Authors: Yichun Feng, Jiawei Wang, Lu Zhou, Yixue Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19630
Pdf URL: https://arxiv.org/pdf/2505.19630
Copy Paste: [[2505.19630]] DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue(https://arxiv.org/abs/2505.19630)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated excellent capabilities in the field of biomedical question answering, but their application in real-world clinical consultations still faces core challenges. Existing systems rely on a one-way information transmission mode where patients must fully describe their symptoms in a single round, leading to nonspecific diagnostic recommendations when complaints are vague. Traditional multi-turn dialogue methods based on supervised learning are constrained by static data-driven paradigms, lacking generalizability and struggling to intelligently extract key clinical information. To address these limitations, we propose DoctorAgent-RL, a reinforcement learning (RL)-based multi-agent collaborative framework that models medical consultations as a dynamic decision-making process under uncertainty. The doctor agent continuously optimizes its questioning strategy within the RL framework through multi-turn interactions with the patient agent, dynamically adjusting its information-gathering path based on comprehensive rewards from the Consultation Evaluator. This RL fine-tuning mechanism enables LLMs to autonomously develop interaction strategies aligned with clinical reasoning logic, rather than superficially imitating patterns in existing dialogue data. Notably, we constructed MTMedDialog, the first English multi-turn medical consultation dataset capable of simulating patient interactions. Experiments demonstrate that DoctorAgent-RL outperforms existing models in both multi-turn reasoning capability and final diagnostic performance, demonstrating practical value in assisting clinical consultations. this https URL

Title: Segment First or Comprehend First? Explore the Limit of Unsupervised Word Segmentation with Large Language Models

Authors: Zihong Zhang, Liqi He, Zuchao Li, Lefei Zhang, Hai Zhao, Bo Du
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19631
Pdf URL: https://arxiv.org/pdf/2505.19631
Copy Paste: [[2505.19631]] Segment First or Comprehend First? Explore the Limit of Unsupervised Word Segmentation with Large Language Models(https://arxiv.org/abs/2505.19631)
Keywords: large language model, segmentation
Abstract: Word segmentation stands as a cornerstone of Natural Language Processing (NLP). Based on the concept of "comprehend first, segment later", we propose a new framework to explore the limit of unsupervised word segmentation with Large Language Models (LLMs) and evaluate the semantic understanding capabilities of LLMs based on word segmentation. We employ current mainstream LLMs to perform word segmentation across multiple languages to assess LLMs' "comprehension". Our findings reveal that LLMs are capable of following simple prompts to segment raw text into words. There is a trend suggesting that models with more parameters tend to perform better on multiple languages. Additionally, we introduce a novel unsupervised method, termed LLACA ($\textbf{L}$arge $\textbf{L}$anguage Model-Inspired $\textbf{A}$ho-$\textbf{C}$orasick $\textbf{A}$utomaton). Leveraging the advanced pattern recognition capabilities of Aho-Corasick automata, LLACA innovatively combines these with the deep insights of well-pretrained LLMs. This approach not only enables the construction of a dynamic $n$-gram model that adjusts based on contextual information but also integrates the nuanced understanding of LLMs, offering significant improvements over traditional methods. Our source code is available at this https URL

Title: Weak-Jamming Detection in IEEE 802.11 Networks: Techniques, Scenarios and Mobility

Authors: Martijn Hanegraaf, Savio Sciancalepore, Gabriele Oligeri
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.19633
Pdf URL: https://arxiv.org/pdf/2505.19633
Copy Paste: [[2505.19633]] Weak-Jamming Detection in IEEE 802.11 Networks: Techniques, Scenarios and Mobility(https://arxiv.org/abs/2505.19633)
Keywords: attack
Abstract: State-of-the-art solutions detect jamming attacks ex-post, i.e., only when jamming has already disrupted the wireless communication link. In many scenarios, e.g., mobile networks or static deployments distributed over a large geographical area, it is often desired to detect jamming at the early stage, when it affects the communication link enough to be detected but not sufficiently to disrupt it (detection of weak jamming signals). Under such assumptions, devices can enhance situational awareness and promptly apply mitigation, e.g., moving away from the jammed area in mobile scenarios or changing communication frequency in static deployments, before jamming fully disrupts the communication link. Although some contributions recently demonstrated the feasibility of detecting low-power and weak jamming signals, they make simplistic assumptions far from real-world deployments. Given the current state of the art, no evidence exists that detection of weak jamming can be considered with real-world communication technologies. In this paper, we provide and comprehensively analyze new general-purpose strategies for detecting weak jamming signals, compatible by design with one of the most relevant communication technologies used by commercial-off-the-shelf devices, i.e., IEEE 802.11. We describe two operational modes: (i) binary classification via Convolutional Neural Networks and (ii) one-class classification via Sparse Autoencoders. We evaluate and compare the proposed approaches with the current state-of-the-art using data collected through an extensive real-world experimental campaign in three relevant environments. At the same time, we made the dataset available to the public. Our results demonstrate that detecting weak jamming signals is feasible in all considered real-world environments, and we provide an in-depth analysis considering different techniques, scenarios, and mobility patterns.

Title: Faster and Better LLMs via Latency-Aware Test-Time Scaling

Authors: Zili Wang, Tianyu Zhang, Haoli Bai, Lu Hou, Xianzhi Yu, Wulong Liu, Shiming Xiang, Lei Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19634
Pdf URL: https://arxiv.org/pdf/2505.19634
Copy Paste: [[2505.19634]] Faster and Better LLMs via Latency-Aware Test-Time Scaling(https://arxiv.org/abs/2505.19634)
Keywords: large language model
Abstract: Test-Time Scaling (TTS) has proven effective in improving the performance of Large Language Models (LLMs) during inference. However, existing research has overlooked the efficiency of TTS from a latency-sensitive perspective. Through a latency-aware evaluation of representative TTS methods, we demonstrate that a compute-optimal TTS does not always result in the lowest latency in scenarios where latency is critical. To address this gap and achieve latency-optimal TTS, we propose two key approaches by optimizing the concurrency configurations: (1) branch-wise parallelism, which leverages multiple concurrent inference branches, and (2) sequence-wise parallelism, enabled by speculative decoding. By integrating these two approaches and allocating computational resources properly to each, our latency-optimal TTS enables a 32B model to reach 82.3% accuracy on MATH-500 within 1 minute and a smaller 3B model to achieve 72.4% within 10 seconds. Our work emphasizes the importance of latency-aware TTS and demonstrates its ability to deliver both speed and accuracy in latency-sensitive scenarios.

Title: Interleaved Reasoning for Large Language Models via Reinforcement Learning

Authors: Roy Xie, David Qiu, Deepak Gopinath, Dong Lin, Yanchao Sun, Chong Wang, Saloni Potdar, Bhuwan Dhingra
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19640
Pdf URL: https://arxiv.org/pdf/2505.19640
Copy Paste: [[2505.19640]] Interleaved Reasoning for Large Language Models via Reinforcement Learning(https://arxiv.org/abs/2505.19640)
Keywords: large language model
Abstract: Long chain-of-thought (CoT) significantly enhances large language models' (LLM) reasoning capabilities. However, the extensive reasoning traces lead to inefficiencies and an increased time-to-first-token (TTFT). We propose a novel training paradigm that uses reinforcement learning (RL) to guide reasoning LLMs to interleave thinking and answering for multi-hop questions. We observe that models inherently possess the ability to perform interleaved reasoning, which can be further enhanced through RL. We introduce a simple yet effective rule-based reward to incentivize correct intermediate steps, which guides the policy model toward correct reasoning paths by leveraging intermediate signals generated during interleaved reasoning. Extensive experiments conducted across five diverse datasets and three RL algorithms (PPO, GRPO, and REINFORCE++) demonstrate consistent improvements over traditional think-answer reasoning, without requiring external tools. Specifically, our approach reduces TTFT by over 80% on average and improves up to 19.3% in Pass@1 accuracy. Furthermore, our method, trained solely on question answering and logical reasoning datasets, exhibits strong generalization ability to complex reasoning datasets such as MATH, GPQA, and MMLU. Additionally, we conduct in-depth analysis to reveal several valuable insights into conditional reward modeling.

Title: MoESD: Unveil Speculative Decoding's Potential for Accelerating Sparse MoE

Authors: Zongle Huang, Lei Zhu, Zongyuan Zhan, Ting Hu, Weikai Mao, Xianzhi Yu, Yongpan Liu, Tianyu Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19645
Pdf URL: https://arxiv.org/pdf/2505.19645
Copy Paste: [[2505.19645]] MoESD: Unveil Speculative Decoding's Potential for Accelerating Sparse MoE(https://arxiv.org/abs/2505.19645)
Keywords: large language model
Abstract: Large Language Models (LLMs) have achieved remarkable success across many applications, with Mixture of Experts (MoE) models demonstrating great potential. Compared to traditional dense models, MoEs achieve better performance with less computation. Speculative decoding (SD) is a widely used technique to accelerate LLM inference without accuracy loss, but it has been considered efficient only for dense models. In this work, we first demonstrate that, under medium batch sizes, MoE surprisingly benefits more from SD than dense models. Furthermore, as MoE becomes sparser -- the prevailing trend in MoE designs -- the batch size range where SD acceleration is expected to be effective becomes broader. To quantitatively understand tradeoffs involved in SD, we develop a reliable modeling based on theoretical analyses. While current SD research primarily focuses on improving acceptance rates of algorithms, changes in workload and model architecture can still lead to degraded SD acceleration even with high acceptance rates. To address this limitation, we introduce a new metric 'target efficiency' that characterizes these effects, thus helping researchers identify system bottlenecks and understand SD acceleration more comprehensively. For scenarios like private serving, this work unveils a new perspective to speed up MoE inference, where existing solutions struggle. Experiments on different GPUs show up to 2.29x speedup for Qwen2-57B-A14B at medium batch sizes and validate our theoretical predictions.

Title: Energy-based generator matching: A neural sampler for general state space

Authors: Dongyeop Woo, Minsu Kim, Minkyu Kim, Kiyoung Seong, Sungsoo Ahn
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19646
Pdf URL: https://arxiv.org/pdf/2505.19646
Copy Paste: [[2505.19646]] Energy-based generator matching: A neural sampler for general state space(https://arxiv.org/abs/2505.19646)
Keywords: diffusion, generative
Abstract: We propose Energy-based generator matching (EGM), a modality-agnostic approach to train generative models from energy functions in the absence of data. Extending the recently proposed generator matching, EGM enables training of arbitrary continuous-time Markov processes, e.g., diffusion, flow, and jump, and can generate data from continuous, discrete, and a mixture of two modalities. To this end, we propose estimating the generator matching loss using self-normalized importance sampling with an additional bootstrapping trick to reduce variance in the importance weight. We validate EGM on both discrete and multimodal tasks up to 100 and 20 dimensions, respectively.

Title: Modality Curation: Building Universal Embeddings for Advanced Multimodal Information Retrieval

Authors: Fanheng Kong, Jingyuan Zhang, Yahui Liu, Hongzhi Zhang, Shi Feng, Xiaocui Yang, Daling Wang, Yu Tian, Qi Wang, Fuzheng Zhang, Guorui Zhou
Subjects: cs.CV, cs.IR, cs.MM
Abstract URL: https://arxiv.org/abs/2505.19650
Pdf URL: https://arxiv.org/pdf/2505.19650
Copy Paste: [[2505.19650]] Modality Curation: Building Universal Embeddings for Advanced Multimodal Information Retrieval(https://arxiv.org/abs/2505.19650)
Keywords: robust
Abstract: Multimodal information retrieval (MIR) faces inherent challenges due to the heterogeneity of data sources and the complexity of cross-modal alignment. While previous studies have identified modal gaps in feature spaces, a systematic approach to address these challenges remains unexplored. In this work, we introduce UNITE, a universal framework that tackles these challenges through two critical yet underexplored aspects: data curation and modality-aware training configurations. Our work provides the first comprehensive analysis of how modality-specific data properties influence downstream task performance across diverse scenarios. Moreover, we propose Modal-Aware Masked Contrastive Learning (MAMCL) to mitigate the competitive relationships among the instances of different modalities. Our framework achieves state-of-the-art results on multiple multimodal retrieval benchmarks, outperforming existing methods by notable margins. Through extensive experiments, we demonstrate that strategic modality curation and tailored training protocols are pivotal for robust cross-modal representation learning. This work not only advances MIR performance but also provides a foundational blueprint for future research in multimodal systems. Our project is available at this https URL.

Title: ReDDiT: Rehashing Noise for Discrete Visual Generation

Authors: Tianren Ma, Xiaosong Zhang, Boyu Yang, Junlan Feng, Qixiang Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19656
Pdf URL: https://arxiv.org/pdf/2505.19656
Copy Paste: [[2505.19656]] ReDDiT: Rehashing Noise for Discrete Visual Generation(https://arxiv.org/abs/2505.19656)
Keywords: diffusion, transformer, generative
Abstract: Discrete diffusion models are gaining traction in the visual generative area for their efficiency and compatibility. However, the pioneered attempts still fall behind the continuous counterparts, which we attribute to the noise (absorbing state) design and sampling heuristics. In this study, we propose the rehashing noise framework for discrete diffusion transformer, termed ReDDiT, to extend absorbing states and improve expressive capacity of discrete diffusion models. ReDDiT enriches the potential paths that latent variables can traverse during training with randomized multi-index corruption. The derived rehash sampler, which reverses the randomized absorbing paths, guarantees the diversity and low discrepancy of the generation process. These reformulations lead to more consistent and competitive generation quality, mitigating the need for heavily tuned randomness. Experiments show that ReDDiT significantly outperforms the baseline (reducing gFID from 6.18 to 1.61) and is on par with the continuous counterparts with higher efficiency.

Title: LangDAug: Langevin Data Augmentation for Multi-Source Domain Generalization in Medical Image Segmentation

Authors: Piyush Tiwary, Kinjawl Bhattacharyya, Prathosh A.P
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19659
Pdf URL: https://arxiv.org/pdf/2505.19659
Copy Paste: [[2505.19659]] LangDAug: Langevin Data Augmentation for Multi-Source Domain Generalization in Medical Image Segmentation(https://arxiv.org/abs/2505.19659)
Keywords: segmentation
Abstract: Medical image segmentation models often struggle to generalize across different domains due to various reasons. Domain Generalization (DG) methods overcome this either through representation learning or data augmentation (DAug). While representation learning methods seek domain-invariant features, they often rely on ad-hoc techniques and lack formal guarantees. DAug methods, which enrich model representations through synthetic samples, have shown comparable or superior performance to representation learning approaches. We propose LangDAug, a novel $\textbf{Lang}$evin $\textbf{D}$ata $\textbf{Aug}$mentation for multi-source domain generalization in 2D medical image segmentation. LangDAug leverages Energy-Based Models (EBMs) trained via contrastive divergence to traverse between source domains, generating intermediate samples through Langevin dynamics. Theoretical analysis shows that LangDAug induces a regularization effect, and for GLMs, it upper-bounds the Rademacher complexity by the intrinsic dimensionality of the data manifold. Through extensive experiments on Fundus segmentation and 2D MRI prostate segmentation benchmarks, we show that LangDAug outperforms state-of-the-art domain generalization methods and effectively complements existing domain-randomization approaches. The codebase for our method is available at this https URL.

Title: GenKI: Enhancing Open-Domain Question Answering with Knowledge Integration and Controllable Generation in Large Language Models

Authors: Tingjia Shen, Hao Wang, Chuan Qin, Ruijun Sun, Yang Song, Defu Lian, Hengshu Zhu, Enhong Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19660
Pdf URL: https://arxiv.org/pdf/2505.19660
Copy Paste: [[2505.19660]] GenKI: Enhancing Open-Domain Question Answering with Knowledge Integration and Controllable Generation in Large Language Models(https://arxiv.org/abs/2505.19660)
Keywords: large language model
Abstract: Open-domain question answering (OpenQA) represents a cornerstone in natural language processing (NLP), primarily focused on extracting answers from unstructured textual data. With the rapid advancements in Large Language Models (LLMs), LLM-based OpenQA methods have reaped the benefits of emergent understanding and answering capabilities enabled by massive parameters compared to traditional methods. However, most of these methods encounter two critical challenges: how to integrate knowledge into LLMs effectively and how to adaptively generate results with specific answer formats for various task situations. To address these challenges, we propose a novel framework named GenKI, which aims to improve the OpenQA performance by exploring Knowledge Integration and controllable Generation on LLMs simultaneously. Specifically, we first train a dense passage retrieval model to retrieve associated knowledge from a given knowledge base. Subsequently, we introduce a novel knowledge integration model that incorporates the retrieval knowledge into instructions during fine-tuning to intensify the model. Furthermore, to enable controllable generation in LLMs, we leverage a certain fine-tuned LLM and an ensemble based on text consistency incorporating all coherence, fluency, and answer format assurance. Finally, extensive experiments conducted on the TriviaQA, MSMARCO, and CMRC2018 datasets, featuring diverse answer formats, have demonstrated the effectiveness of GenKI with comparison of state-of-the-art baselines. Moreover, ablation studies have disclosed a linear relationship between the frequency of retrieved knowledge and the model's ability to recall knowledge accurately against the ground truth. Our code of GenKI is available at this https URL

Title: LeCoDe: A Benchmark Dataset for Interactive Legal Consultation Dialogue Evaluation

Authors: Weikang Yuan, Kaisong Song, Zhuoren Jiang, Junjie Cao, Yujie Zhang, Jun Lin, Kun Kuang, Ji Zhang, Xiaozhong Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19667
Pdf URL: https://arxiv.org/pdf/2505.19667
Copy Paste: [[2505.19667]] LeCoDe: A Benchmark Dataset for Interactive Legal Consultation Dialogue Evaluation(https://arxiv.org/abs/2505.19667)
Keywords: large language model
Abstract: Legal consultation is essential for safeguarding individual rights and ensuring access to justice, yet remains costly and inaccessible to many individuals due to the shortage of professionals. While recent advances in Large Language Models (LLMs) offer a promising path toward scalable, low-cost legal assistance, current systems fall short in handling the interactive and knowledge-intensive nature of real-world consultations. To address these challenges, we introduce LeCoDe, a real-world multi-turn benchmark dataset comprising 3,696 legal consultation dialogues with 110,008 dialogue turns, designed to evaluate and improve LLMs' legal consultation capability. With LeCoDe, we innovatively collect live-streamed consultations from short-video platforms, providing authentic multi-turn legal consultation dialogues. The rigorous annotation by legal experts further enhances the dataset with professional insights and expertise. Furthermore, we propose a comprehensive evaluation framework that assesses LLMs' consultation capabilities in terms of (1) clarification capability and (2) professional advice quality. This unified framework incorporates 12 metrics across two dimensions. Through extensive experiments on various general and domain-specific LLMs, our results reveal significant challenges in this task, with even state-of-the-art models like GPT-4 achieving only 39.8% recall for clarification and 59% overall score for advice quality, highlighting the complexity of professional consultation scenarios. Based on these findings, we further explore several strategies to enhance LLMs' legal consultation abilities. Our benchmark contributes to advancing research in legal domain dialogue systems, particularly in simulating more real-world user-expert interactions.

Title: Burst Image Super-Resolution via Multi-Cross Attention Encoding and Multi-Scan State-Space Decoding

Authors: Tengda Huang, Yu Zhang, Tianren Li, Yufu Qu, Fulin Liu, Zhenzhong Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19668
Pdf URL: https://arxiv.org/pdf/2505.19668
Copy Paste: [[2505.19668]] Burst Image Super-Resolution via Multi-Cross Attention Encoding and Multi-Scan State-Space Decoding(https://arxiv.org/abs/2505.19668)
Keywords: extraction, transformer
Abstract: Multi-image super-resolution (MISR) can achieve higher image quality than single-image super-resolution (SISR) by aggregating sub-pixel information from multiple spatially shifted frames. Among MISR tasks, burst super-resolution (BurstSR) has gained significant attention due to its wide range of applications. Recent methods have increasingly adopted Transformers over convolutional neural networks (CNNs) in super-resolution tasks, due to their superior ability to capture both local and global context. However, most existing approaches still rely on fixed and narrow attention windows that restrict the perception of features beyond the local field. This limitation hampers alignment and feature aggregation, both of which are crucial for high-quality super-resolution. To address these limitations, we propose a novel feature extractor that incorporates two newly designed attention mechanisms: overlapping cross-window attention and cross-frame attention, enabling more precise and efficient extraction of sub-pixel information across multiple frames. Furthermore, we introduce a Multi-scan State-Space Module with the cross-frame attention mechanism to enhance feature aggregation. Extensive experiments on both synthetic and real-world benchmarks demonstrate the superiority of our approach. Additional evaluations on ISO 12233 resolution test charts further confirm its enhanced super-resolution performance.

Title: Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models

Authors: Hao Yang, Lizhen Qu, Ehsan Shareghi, Gholamreza Haffari
Subjects: cs.CL, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.19670
Pdf URL: https://arxiv.org/pdf/2505.19670
Copy Paste: [[2505.19670]] Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models(https://arxiv.org/abs/2505.19670)
Keywords: large language model
Abstract: Large Audio Language Models (LALMs) have extended the capabilities of Large Language Models (LLMs) by enabling audio-based human interactions. However, recent research has revealed that LALMs remain vulnerable to harmful queries due to insufficient safety-alignment. Despite advances in defence measures for text and vision LLMs, effective safety-alignment strategies and audio-safety dataset specifically targeting LALMs are notably absent. Meanwhile defence measures based on Supervised Fine-tuning (SFT) struggle to address safety improvement while avoiding over-rejection issues, significantly compromising helpfulness. In this work, we propose an unsupervised safety-fine-tuning strategy as remedy that reshapes model's representation space to enhance existing LALMs safety-alignment while balancing the risk of over-rejection. Our experiments, conducted across three generations of Qwen LALMs, demonstrate that our approach significantly improves LALMs safety under three modality input conditions (audio-text, text-only, and audio-only) while increasing over-rejection rate by only 0.88% on average. Warning: this paper contains harmful examples.

Title: Comparing Moral Values in Western English-speaking societies and LLMs with Word Associations

Authors: Chaoyi Xiang, Chunhua Liu, Simon De Deyne, Lea Frermann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19674
Pdf URL: https://arxiv.org/pdf/2505.19674
Copy Paste: [[2505.19674]] Comparing Moral Values in Western English-speaking societies and LLMs with Word Associations(https://arxiv.org/abs/2505.19674)
Keywords: robust, large language model
Abstract: As the impact of large language models increases, understanding the moral values they reflect becomes ever more important. Assessing the nature of moral values as understood by these models via direct prompting is challenging due to potential leakage of human norms into model training data, and their sensitivity to prompt formulation. Instead, we propose to use word associations, which have been shown to reflect moral reasoning in humans, as low-level underlying representations to obtain a more robust picture of LLMs' moral reasoning. We study moral differences in associations from western English-speaking communities and LLMs trained predominantly on English data. First, we create a large dataset of LLM-generated word associations, resembling an existing data set of human word associations. Next, we propose a novel method to propagate moral values based on seed words derived from Moral Foundation Theory through the human and LLM-generated association graphs. Finally, we compare the resulting moral conceptualizations, highlighting detailed but systematic differences between moral values emerging from English speakers and LLM associations.

Title: Calibrating Pre-trained Language Classifiers on LLM-generated Noisy Labels via Iterative Refinement

Authors: Liqin Ye, Agam Shah, Chao Zhang, Sudheer Chava
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19675
Pdf URL: https://arxiv.org/pdf/2505.19675
Copy Paste: [[2505.19675]] Calibrating Pre-trained Language Classifiers on LLM-generated Noisy Labels via Iterative Refinement(https://arxiv.org/abs/2505.19675)
Keywords: robust, diffusion, large language model
Abstract: The traditional process of creating labeled datasets is labor-intensive and expensive. Recent breakthroughs in open-source large language models (LLMs) have opened up a new avenue in generating labeled datasets automatically for various natural language processing (NLP) tasks, providing an alternative to such an expensive annotation process. However, the reliability of such auto-generated labels remains a significant concern due to inherent inaccuracies. When learning from noisy labels, the model's generalization is likely to be harmed as it is prone to overfit to those label noises. While previous studies in learning from noisy labels mainly focus on synthetic noise and real-world noise, LLM-generated label noise receives less attention. In this paper, we propose SiDyP: Simplex Label Diffusion with Dynamic Prior to calibrate the classifier's prediction, thus enhancing its robustness towards LLM-generated noisy labels. SiDyP retrieves potential true label candidates by neighborhood label distribution in text embedding space and iteratively refines noisy candidates using a simplex diffusion model. Our framework can increase the performance of the BERT classifier fine-tuned on both zero-shot and few-shot LLM-generated noisy label datasets by an average of 7.21% and 7.30% respectively. We demonstrate the effectiveness of SiDyP by conducting extensive benchmarking for different LLMs over a variety of NLP tasks. Our code is available on Github.

Title: Deep Actor-Critics with Tight Risk Certificates

Authors: Bahareh Tasdighi, Manuel Haussmann, Yi-Shan Wu, Andres R. Masegosa, Melih Kandemir
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19682
Pdf URL: https://arxiv.org/pdf/2505.19682
Copy Paste: [[2505.19682]] Deep Actor-Critics with Tight Risk Certificates(https://arxiv.org/abs/2505.19682)
Keywords: large language model
Abstract: After a period of research, deep actor-critic algorithms have reached a level where they influence our everyday lives. They serve as the driving force behind the continual improvement of large language models through user-collected feedback. However, their deployment in physical systems is not yet widely adopted, mainly because no validation scheme that quantifies their risk of malfunction. We demonstrate that it is possible to develop tight risk certificates for deep actor-critic algorithms that predict generalization performance from validation-time observations. Our key insight centers on the effectiveness of minimal evaluation data. Surprisingly, a small feasible of evaluation roll-outs collected from a pretrained policy suffices to produce accurate risk certificates when combined with a simple adaptation of PAC-Bayes theory. Specifically, we adopt a recently introduced recursive PAC-Bayes approach, which splits validation data into portions and recursively builds PAC-Bayes bounds on the excess loss of each portion's predictor, using the predictor from the previous portion as a data-informed prior. Our empirical results across multiple locomotion tasks and policy expertise levels demonstrate risk certificates that are tight enough to be considered for practical use.

Title: VisCRA: A Visual Chain Reasoning Attack for Jailbreaking Multimodal Large Language Models

Authors: Bingrui Sima, Linhua Cong, Wenxuan Wang, Kun He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19684
Pdf URL: https://arxiv.org/pdf/2505.19684
Copy Paste: [[2505.19684]] VisCRA: A Visual Chain Reasoning Attack for Jailbreaking Multimodal Large Language Models(https://arxiv.org/abs/2505.19684)
Keywords: security, attack, large language model
Abstract: The emergence of Multimodal Large Language Models (MLRMs) has enabled sophisticated visual reasoning capabilities by integrating reinforcement learning and Chain-of-Thought (CoT) supervision. However, while these enhanced reasoning capabilities improve performance, they also introduce new and underexplored safety risks. In this work, we systematically investigate the security implications of advanced visual reasoning in MLRMs. Our analysis reveals a fundamental trade-off: as visual reasoning improves, models become more vulnerable to jailbreak attacks. Motivated by this critical finding, we introduce VisCRA (Visual Chain Reasoning Attack), a novel jailbreak framework that exploits the visual reasoning chains to bypass safety mechanisms. VisCRA combines targeted visual attention masking with a two-stage reasoning induction strategy to precisely control harmful outputs. Extensive experiments demonstrate VisCRA's significant effectiveness, achieving high attack success rates on leading closed-source MLRMs: 76.48% on Gemini 2.0 Flash Thinking, 68.56% on QvQ-Max, and 56.60% on GPT-4o. Our findings highlight a critical insight: the very capability that empowers MLRMs -- their visual reasoning -- can also serve as an attack vector, posing significant security risks.

Title: Graph Guided Diffusion: Unified Guidance for Conditional Graph Generation

Authors: Victor M. Tenorio, Nicolas Zilberstein, Santiago Segarra, Antonio G. Marques
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19685
Pdf URL: https://arxiv.org/pdf/2505.19685
Copy Paste: [[2505.19685]] Graph Guided Diffusion: Unified Guidance for Conditional Graph Generation(https://arxiv.org/abs/2505.19685)
Keywords: fair, diffusion, generative
Abstract: Diffusion models have emerged as powerful generative models for graph generation, yet their use for conditional graph generation remains a fundamental challenge. In particular, guiding diffusion models on graphs under arbitrary reward signals is difficult: gradient-based methods, while powerful, are often unsuitable due to the discrete and combinatorial nature of graphs, and non-differentiable rewards further complicate gradient-based guidance. We propose Graph Guided Diffusion (GGDiff), a novel guidance framework that interprets conditional diffusion on graphs as a stochastic control problem to address this challenge. GGDiff unifies multiple guidance strategies, including gradient-based guidance (for differentiable rewards), control-based guidance (using control signals from forward reward evaluations), and zero-order approximations (bridging gradient-based and gradient-free optimization). This comprehensive, plug-and-play framework enables zero-shot guidance of pre-trained diffusion models under both differentiable and non-differentiable reward functions, adapting well-established guidance techniques to graph generation--a direction largely unexplored. Our formulation balances computational efficiency, reward alignment, and sample quality, enabling practical conditional generation across diverse reward types. We demonstrate the efficacy of GGDiff in various tasks, including constraints on graph motifs, fairness, and link prediction, achieving superior alignment with target rewards while maintaining diversity and fidelity.

Title: DriveCamSim: Generalizable Camera Simulation via Explicit Camera Modeling for Autonomous Driving

Authors: Wenchao Sun, Xuewu Lin, Keyu Chen, Zixiang Pei, Yining Shi, Chuang Zhang, Sifa Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19692
Pdf URL: https://arxiv.org/pdf/2505.19692
Copy Paste: [[2505.19692]] DriveCamSim: Generalizable Camera Simulation via Explicit Camera Modeling for Autonomous Driving(https://arxiv.org/abs/2505.19692)
Keywords: generative
Abstract: Camera sensor simulation serves as a critical role for autonomous driving (AD), e.g. evaluating vision-based AD algorithms. While existing approaches have leveraged generative models for controllable image/video generation, they remain constrained to generating multi-view video sequences with fixed camera viewpoints and video frequency, significantly limiting their downstream applications. To address this, we present a generalizable camera simulation framework DriveCamSim, whose core innovation lies in the proposed Explicit Camera Modeling (ECM) mechanism. Instead of implicit interaction through vanilla attention, ECM establishes explicit pixel-wise correspondences across multi-view and multi-frame dimensions, decoupling the model from overfitting to the specific camera configurations (intrinsic/extrinsic parameters, number of views) and temporal sampling rates presented in the training data. For controllable generation, we identify the issue of information loss inherent in existing conditional encoding and injection pipelines, proposing an information-preserving control mechanism. This control mechanism not only improves conditional controllability, but also can be extended to be identity-aware to enhance temporal consistency in foreground object rendering. With above designs, our model demonstrates superior performance in both visual quality and controllability, as well as generalization capability across spatial-level (camera parameters variations) and temporal-level (video frame rate variations), enabling flexible user-customizable camera simulation tailored to diverse application scenarios. Code will be avaliable at this https URL for facilitating future research.

Title: Knowledge-Aligned Counterfactual-Enhancement Diffusion Perception for Unsupervised Cross-Domain Visual Emotion Recognition

Authors: Wen Yin, Yong Wang, Guiduo Duan, Dongyang Zhang, Xin Hu, Yuan-Fang Li, Tao He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19694
Pdf URL: https://arxiv.org/pdf/2505.19694
Copy Paste: [[2505.19694]] Knowledge-Aligned Counterfactual-Enhancement Diffusion Perception for Unsupervised Cross-Domain Visual Emotion Recognition(https://arxiv.org/abs/2505.19694)
Keywords: diffusion
Abstract: Visual Emotion Recognition (VER) is a critical yet challenging task aimed at inferring emotional states of individuals based on visual cues. However, existing works focus on single domains, e.g., realistic images or stickers, limiting VER models' cross-domain generalizability. To fill this gap, we introduce an Unsupervised Cross-Domain Visual Emotion Recognition (UCDVER) task, which aims to generalize visual emotion recognition from the source domain (e.g., realistic images) to the low-resource target domain (e.g., stickers) in an unsupervised manner. Compared to the conventional unsupervised domain adaptation problems, UCDVER presents two key challenges: a significant emotional expression variability and an affective distribution shift. To mitigate these issues, we propose the Knowledge-aligned Counterfactual-enhancement Diffusion Perception (KCDP) framework. Specifically, KCDP leverages a VLM to align emotional representations in a shared knowledge space and guides diffusion models for improved visual affective perception. Furthermore, a Counterfactual-Enhanced Language-image Emotional Alignment (CLIEA) method generates high-quality pseudo-labels for the target domain. Extensive experiments demonstrate that our model surpasses SOTA models in both perceptibility and generalization, e.g., gaining 12% improvements over the SOTA VER model TGCA-PVT. The project page is at this https URL.

Title: Modeling Beyond MOS: Quality Assessment Models Must Integrate Context, Reasoning, and Multimodality

Authors: Mohamed Amine Kerkouri, Marouane Tliba, Aladine Chetouani, Nour Aburaed, Alessandro Bruno
Subjects: cs.CV, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2505.19696
Pdf URL: https://arxiv.org/pdf/2505.19696
Copy Paste: [[2505.19696]] Modeling Beyond MOS: Quality Assessment Models Must Integrate Context, Reasoning, and Multimodality(https://arxiv.org/abs/2505.19696)
Keywords: robust
Abstract: This position paper argues that Mean Opinion Score (MOS), while historically foundational, is no longer sufficient as the sole supervisory signal for multimedia quality assessment models. MOS reduces rich, context-sensitive human judgments to a single scalar, obscuring semantic failures, user intent, and the rationale behind quality decisions. We contend that modern quality assessment models must integrate three interdependent capabilities: (1) context-awareness, to adapt evaluations to task-specific goals and viewing conditions; (2) reasoning, to produce interpretable, evidence-grounded justifications for quality judgments; and (3) multimodality, to align perceptual and semantic cues using vision-language models. We critique the limitations of current MOS-centric benchmarks and propose a roadmap for reform: richer datasets with contextual metadata and expert rationales, and new evaluation metrics that assess semantic alignment, reasoning fidelity, and contextual sensitivity. By reframing quality assessment as a contextual, explainable, and multimodal modeling task, we aim to catalyze a shift toward more robust, human-aligned, and trustworthy evaluation systems.

Title: JEDI: Latent End-to-end Diffusion Mitigates Agent-Human Performance Asymmetry in Model-Based Reinforcement Learning

Authors: Jing Yu Lim, Zarif Ikram, Samson Yu, Haozhe Ma, Tze-Yun Leong, Dianbo Liu
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2505.19698
Pdf URL: https://arxiv.org/pdf/2505.19698
Copy Paste: [[2505.19698]] JEDI: Latent End-to-end Diffusion Mitigates Agent-Human Performance Asymmetry in Model-Based Reinforcement Learning(https://arxiv.org/abs/2505.19698)
Keywords: diffusion
Abstract: Recent advances in model-based reinforcement learning (MBRL) have achieved super-human level performance on the Atari100k benchmark, driven by reinforcement learning agents trained on powerful diffusion world models. However, we identify that the current aggregates mask a major performance asymmetry: MBRL agents dramatically outperform humans in some tasks despite drastically underperforming in others, with the former inflating the aggregate metrics. This is especially pronounced in pixel-based agents trained with diffusion world models. In this work, we address the pronounced asymmetry observed in pixel-based agents as an initial attempt to reverse the worrying upward trend observed in them. We address the problematic aggregates by delineating all tasks as Agent-Optimal or Human-Optimal and advocate for equal importance on metrics from both sets. Next, we hypothesize this pronounced asymmetry is due to the lack of temporally-structured latent space trained with the World Model objective in pixel-based methods. Lastly, to address this issue, we propose Joint Embedding DIffusion (JEDI), a novel latent diffusion world model trained end-to-end with the self-consistency objective. JEDI outperforms SOTA models in human-optimal tasks while staying competitive across the Atari100k benchmark, and runs 3 times faster with 43% lower memory than the latest pixel-based diffusion baseline. Overall, our work rethinks what it truly means to cross human-level performance in Atari100k.

Title: Mosaic: Data-Free Knowledge Distillation via Mixture-of-Experts for Heterogeneous Distributed Environments

Authors: Junming Liu, Yanting Gao, Siyuan Meng, Yifei Sun, Aoqi Wu, Yufei Jin, Yirong Chen, Ding Wang, Guosun Zeng
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2505.19699
Pdf URL: https://arxiv.org/pdf/2505.19699
Copy Paste: [[2505.19699]] Mosaic: Data-Free Knowledge Distillation via Mixture-of-Experts for Heterogeneous Distributed Environments(https://arxiv.org/abs/2505.19699)
Keywords: privacy, robust, federate, data-free, generative
Abstract: Federated Learning (FL) is a decentralized machine learning paradigm that enables clients to collaboratively train models while preserving data privacy. However, the coexistence of model and data heterogeneity gives rise to inconsistent representations and divergent optimization dynamics across clients, ultimately hindering robust global performance. To transcend these challenges, we propose Mosaic, a novel data-free knowledge distillation framework tailored for heterogeneous distributed environments. Mosaic first trains local generative models to approximate each client's personalized distribution, enabling synthetic data generation that safeguards privacy through strict separation from real data. Subsequently, Mosaic forms a Mixture-of-Experts (MoE) from client models based on their specialized knowledge, and distills it into a global model using the generated data. To further enhance the MoE architecture, Mosaic integrates expert predictions via a lightweight meta model trained on a few representative prototypes. Extensive experiments on standard image classification benchmarks demonstrate that Mosaic consistently outperforms state-of-the-art approaches under both model and data heterogeneity. The source code has been published at this https URL.

Title: Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models

Authors: Yi Liu, Dianqing Liu, Mingye Zhu, Junbo Guo, Yongdong Zhang, Zhendong Mao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19700
Pdf URL: https://arxiv.org/pdf/2505.19700
Copy Paste: [[2505.19700]] Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models(https://arxiv.org/abs/2505.19700)
Keywords: large language model
Abstract: The widespread adoption of large language models (LLMs) across industries has increased the demand for high-quality and customizable outputs. However, traditional alignment methods often require retraining large pretrained models, making it difficult to quickly adapt and optimize LLMs for diverse applications. To address this limitation, we propose a novel \textit{Residual Alignment Model} (\textit{RAM}) that formalizes the alignment process as a type of importance sampling. In this framework, the unaligned upstream model serves as the proposal distribution, while the alignment process is framed as secondary sampling based on an autoregressive alignment module that acts as an estimator of the importance weights. This design enables a natural detachment of the alignment module from the target aligned model, improving flexibility and scalability. Based on this model, we derive an efficient sequence-level training strategy for the alignment module, which operates independently of the proposal module. Additionally, we develop a resampling algorithm with iterative token-level decoding to address the common first-token latency issue in comparable methods. Experimental evaluations on two leading open-source LLMs across diverse tasks, including instruction following, domain adaptation, and preference optimization, demonstrate that our approach consistently outperforms baseline models.

Title: Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning

Authors: Minheng Ni, Zhengyuan Yang, Linjie Li, Chung-Ching Lin, Kevin Lin, Wangmeng Zuo, Lijuan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19702
Pdf URL: https://arxiv.org/pdf/2505.19702
Copy Paste: [[2505.19702]] Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning(https://arxiv.org/abs/2505.19702)
Keywords: large language model
Abstract: Recent advances in large language models have significantly improved textual reasoning through the effective use of Chain-of-Thought (CoT) and reinforcement learning. However, extending these successes to vision-language tasks remains challenging due to inherent limitations in text-only CoT, such as visual hallucinations and insufficient multimodal integration. In this paper, we introduce Point-RFT, a multimodal reasoning framework explicitly designed to leverage visually grounded CoT reasoning for visual document understanding. Our approach consists of two stages: First, we conduct format finetuning using a curated dataset of 71K diverse visual reasoning problems, each annotated with detailed, step-by-step rationales explicitly grounded to corresponding visual elements. Second, we employ reinforcement finetuning targeting visual document understanding. On ChartQA, our approach improves accuracy from 70.88% (format-finetuned baseline) to 90.04%, surpassing the 83.92% accuracy achieved by reinforcement finetuning relying solely on text-based CoT. The result shows that our grounded CoT is more effective for multimodal reasoning compared with the text-only CoT. Moreover, Point-RFT exhibits superior generalization capability across several out-of-domain visual document reasoning benchmarks, including CharXiv, PlotQA, IconQA, TabMWP, etc., and highlights its potential in complex real-world scenarios.

Title: Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision

Authors: Tej Deep Pala, Panshul Sharma, Amir Zadeh, Chuan Li, Soujanya Poria
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19706
Pdf URL: https://arxiv.org/pdf/2505.19706
Copy Paste: [[2505.19706]] Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision(https://arxiv.org/abs/2505.19706)
Keywords: large language model
Abstract: Large Language Models (LLMs) are prone to hallucination, especially during multi-hop and reasoning-intensive tasks such as mathematical problem solving. While Outcome Reward Models verify only final answers, Process Reward Models (PRMs) score each intermediate step to steer generation toward coherent solutions. We introduce PathFinder-PRM, a novel hierarchical, error-aware discriminative PRM that first classifies math and consistency errors at each step, then combines these fine-grained signals to estimate step correctness. To train PathFinder-PRM, we construct a 400K-sample dataset by enriching the human-annotated PRM800K corpus and RLHFlow Mistral traces with three-dimensional step-level labels. On PRMBench, PathFinder-PRM achieves a new state-of-the-art PRMScore of 67.7, outperforming the prior best (65.5) while using 3 times less data. When applied to reward guided greedy search, our model yields prm@8 48.3, a +1.5 point gain over the strongest baseline. These results demonstrate that decoupled error detection and reward estimation not only boost fine-grained error detection but also substantially improve end-to-end, reward-guided mathematical reasoning with greater data efficiency.

Title: MLLM-Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval

Authors: Rong-Cheng Tu, Zhao Jin, Jingyi Liao, Xiao Luo, Yingjie Wang, Li Shen, Dacheng Tao
Subjects: cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2505.19707
Pdf URL: https://arxiv.org/pdf/2505.19707
Copy Paste: [[2505.19707]] MLLM-Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval(https://arxiv.org/abs/2505.19707)
Keywords: robust, large language model
Abstract: Existing Zero-Shot Composed Image Retrieval (ZS-CIR) methods typically train adapters that convert reference images into pseudo-text tokens, which are concatenated with the modifying text and processed by frozen text encoders in pretrained VLMs or LLMs. While this design leverages the strengths of large pretrained models, it only supervises the adapter to produce encoder-compatible tokens that loosely preserve visual semantics. Crucially, it does not directly optimize the composed query representation to capture the full intent of the composition or to align with the target semantics, thereby limiting retrieval performance, particularly in cases involving fine-grained or complex visual transformations. To address this problem, we propose MLLM-Guided VLM Fine-Tuning with Joint Inference (MVFT-JI), a novel approach that leverages a pretrained multimodal large language model (MLLM) to construct two complementary training tasks using only unlabeled images: target text retrieval taskand text-to-image retrieval task. By jointly optimizing these tasks, our method enables the VLM to inherently acquire robust compositional retrieval capabilities, supported by the provided theoretical justifications and empirical validation. Furthermore, during inference, we further prompt the MLLM to generate target texts from composed queries and compute retrieval scores by integrating similarities between (i) the composed query and candidate images, and (ii) the MLLM-generated target text and candidate images. This strategy effectively combines the VLM's semantic alignment strengths with the MLLM's reasoning capabilities.

Title: On the Relation between Rectified Flows and Optimal Transport

Authors: Johannes Hertrich, Antonin Chambolle, Julie Delon
Subjects: cs.LG, math.PR, stat.ML
Abstract URL: https://arxiv.org/abs/2505.19712
Pdf URL: https://arxiv.org/pdf/2505.19712
Copy Paste: [[2505.19712]] On the Relation between Rectified Flows and Optimal Transport(https://arxiv.org/abs/2505.19712)
Keywords: generative
Abstract: This paper investigates the connections between rectified flows, flow matching, and optimal transport. Flow matching is a recent approach to learning generative models by estimating velocity fields that guide transformations from a source to a target distribution. Rectified flow matching aims to straighten the learned transport paths, yielding more direct flows between distributions. Our first contribution is a set of invariance properties of rectified flows and explicit velocity fields. In addition, we also provide explicit constructions and analysis in the Gaussian (not necessarily independent) and Gaussian mixture settings and study the relation to optimal transport. Our second contribution addresses recent claims suggesting that rectified flows, when constrained such that the learned velocity field is a gradient, can yield (asymptotically) solutions to optimal transport problems. We study the existence of solutions for this problem and demonstrate that they only relate to optimal transport under assumptions that are significantly stronger than those previously acknowledged. In particular, we present several counter-examples that invalidate earlier equivalence results in the literature, and we argue that enforcing a gradient constraint on rectified flows is, in general, not a reliable method for computing optimal transport maps.

Title: MT$^{3}$: Scaling MLLM-based Text Image Machine Translation via Multi-Task Reinforcement Learning

Authors: Zhaopeng Feng, Yupu Liang, Shaosheng Cao, Jiayuan Su, Jiahan Ren, Zhe Xu, Yao Hu, Wenxuan Huang, Jian Wu, Zuozhu Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19714
Pdf URL: https://arxiv.org/pdf/2505.19714
Copy Paste: [[2505.19714]] MT$^{3}$: Scaling MLLM-based Text Image Machine Translation via Multi-Task Reinforcement Learning(https://arxiv.org/abs/2505.19714)
Keywords: robust, large language model
Abstract: Text Image Machine Translation (TIMT)-the task of translating textual content embedded in images-is critical for applications in accessibility, cross-lingual information access, and real-world document understanding. However, TIMT remains a complex challenge due to the need for accurate optical character recognition (OCR), robust visual-text reasoning, and high-quality translation, often requiring cascading multi-stage pipelines. Recent advances in large-scale Reinforcement Learning (RL) have improved reasoning in Large Language Models (LLMs) and Multimodal LLMs (MLLMs), but their application to end-to-end TIMT is still underexplored. To bridge this gap, we introduce MT$^{3}$, the first framework to apply Multi-Task RL to MLLMs for end-to-end TIMT. MT$^{3}$ adopts a multi-task optimization paradigm targeting three key sub-skills: text recognition, context-aware reasoning, and translation. It is trained using a novel multi-mixed reward mechanism that adapts rule-based RL strategies to TIMT's intricacies, offering fine-grained, non-binary feedback across tasks. Furthermore, to facilitate the evaluation of TIMT in authentic cross-cultural and real-world social media contexts, we introduced XHSPost, the first social media TIMT benchmark. Our MT$^{3}$-7B-Zero achieves state-of-the-art results on the latest in-domain MIT-10M benchmark, outperforming strong baselines such as Qwen2.5-VL-72B and InternVL2.5-78B by notable margins across multiple metrics. Additionally, the model shows strong generalization to out-of-distribution language pairs and datasets. In-depth analyses reveal how multi-task synergy, reinforcement learning initialization, curriculum design, and reward formulation contribute to advancing MLLM-driven TIMT.

Title: Graceful Forgetting in Generative Language Models

Authors: Chunyang Jiang, Chi-min Chan, Yiyang Cai, Yulong Liu, Wei Xue, Yike Guo
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19715
Pdf URL: https://arxiv.org/pdf/2505.19715
Copy Paste: [[2505.19715]] Graceful Forgetting in Generative Language Models(https://arxiv.org/abs/2505.19715)
Keywords: generative
Abstract: Recently, the pretrain-finetune paradigm has become a cornerstone in various deep learning areas. While in general the pre-trained model would promote both effectiveness and efficiency of downstream tasks fine-tuning, studies have shown that not all knowledge acquired during pre-training is beneficial. Some of the knowledge may actually bring detrimental effects to the fine-tuning tasks, which is also known as negative transfer. To address this problem, graceful forgetting has emerged as a promising approach. The core principle of graceful forgetting is to enhance the learning plasticity of the target task by selectively discarding irrelevant knowledge. However, this approach remains underexplored in the context of generative language models, and it is often challenging to migrate existing forgetting algorithms to these models due to architecture incompatibility. To bridge this gap, in this paper we propose a novel framework, Learning With Forgetting (LWF), to achieve graceful forgetting in generative language models. With Fisher Information Matrix weighting the intended parameter updates, LWF computes forgetting confidence to evaluate self-generated knowledge regarding the forgetting task, and consequently, knowledge with high confidence is periodically unlearned during fine-tuning. Our experiments demonstrate that, although thoroughly uncovering the mechanisms of knowledge interaction remains challenging in pre-trained language models, applying graceful forgetting can contribute to enhanced fine-tuning performance.

Title: Distilling Closed-Source LLM's Knowledge for Locally Stable and Economic Biomedical Entity Linking

Authors: Yihao Ai, Zhiyuan Ning, Weiwei Dai, Pengfei Wang, Yi Du, Wenjuan Cui, Kunpeng Liu, Yuanchun Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19722
Pdf URL: https://arxiv.org/pdf/2505.19722
Copy Paste: [[2505.19722]] Distilling Closed-Source LLM's Knowledge for Locally Stable and Economic Biomedical Entity Linking(https://arxiv.org/abs/2505.19722)
Keywords: large language model
Abstract: Biomedical entity linking aims to map nonstandard entities to standard entities in a knowledge base. Traditional supervised methods perform well but require extensive annotated data to transfer, limiting their usage in low-resource scenarios. Large language models (LLMs), especially closed-source LLMs, can address these but risk stability issues and high economic costs: using these models is restricted by commercial companies and brings significant economic costs when dealing with large amounts of data. To address this, we propose ``RPDR'', a framework combining closed-source LLMs and open-source LLMs for re-ranking candidates retrieved by a retriever fine-tuned with a small amount of data. By prompting a closed-source LLM to generate training data from unannotated data and fine-tuning an open-source LLM for re-ranking, we effectively distill the knowledge to the open-source LLM that can be deployed locally, thus avoiding the stability issues and the problem of high economic costs. We evaluate RPDR on two datasets, including one real-world dataset and one publicly available dataset involving two languages: Chinese and English. RPDR achieves 0.019 Acc@1 improvement and 0.036 Acc@1 improvement on the Aier dataset and the Ask A Patient dataset when the amount of training data is not enough. The results demonstrate the superiority and generalizability of the proposed framework.

Title: Cross-Sequence Semi-Supervised Learning for Multi-Parametric MRI-Based Visual Pathway Delineation

Authors: Alou Diakite, Cheng Li, Lei Xie, Yuanjing Feng, Ruoyou Wu, Jianzhong He, Hairong Zheng, Shanshan Wang
Subjects: cs.CV, cs.CE
Abstract URL: https://arxiv.org/abs/2505.19733
Pdf URL: https://arxiv.org/pdf/2505.19733
Copy Paste: [[2505.19733]] Cross-Sequence Semi-Supervised Learning for Multi-Parametric MRI-Based Visual Pathway Delineation(https://arxiv.org/abs/2505.19733)
Keywords: diffusion
Abstract: Accurately delineating the visual pathway (VP) is crucial for understanding the human visual system and diagnosing related disorders. Exploring multi-parametric MR imaging data has been identified as an important way to delineate VP. However, due to the complex cross-sequence relationships, existing methods cannot effectively model the complementary information from different MRI sequences. In addition, these existing methods heavily rely on large training data with labels, which is labor-intensive and time-consuming to obtain. In this work, we propose a novel semi-supervised multi-parametric feature decomposition framework for VP delineation. Specifically, a correlation-constrained feature decomposition (CFD) is designed to handle the complex cross-sequence relationships by capturing the unique characteristics of each MRI sequence and easing the multi-parametric information fusion process. Furthermore, a consistency-based sample enhancement (CSE) module is developed to address the limited labeled data issue, by generating and promoting meaningful edge information from unlabeled data. We validate our framework using two public datasets, and one in-house Multi-Shell Diffusion MRI (MDM) dataset. Experimental results demonstrate the superiority of our approach in terms of delineation performance when compared to seven state-of-the-art approaches.

Title: Machine Learning Algorithm for Noise Reduction and Disease-Causing Gene Feature Extraction in Gene Sequencing Data

Authors: Weichen Si, Yihao Ou, Zhen Tian
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19740
Pdf URL: https://arxiv.org/pdf/2505.19740
Copy Paste: [[2505.19740]] Machine Learning Algorithm for Noise Reduction and Disease-Causing Gene Feature Extraction in Gene Sequencing Data(https://arxiv.org/abs/2505.19740)
Keywords: extraction
Abstract: In this study, we propose a machine learning-based method for noise reduction and disease-causing gene feature extraction in gene sequencing DeepSeqDenoise algorithm combines CNN and RNN to effectively remove the sequencing noise, and improves the signal-to-noise ratio by 9.4 dB. We screened 17 key features by feature engineering, and constructed an integrated learning model to predict disease-causing genes with 94.3% accuracy. We successfully identified 57 new candidate disease-causing genes in a cardiovascular disease cohort validation, and detected 3 missed variants in clinical applications. The method significantly outperforms existing tools and provides strong support for accurate diagnosis of genetic diseases.

Title: HAODiff: Human-Aware One-Step Diffusion via Dual-Prompt Guidance

Authors: Jue Gong, Tingyu Yang, Jingkai Wang, Zheng Chen, Xing Liu, Hong Gu, Yulun Zhang, Xiaokang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19742
Pdf URL: https://arxiv.org/pdf/2505.19742
Copy Paste: [[2505.19742]] HAODiff: Human-Aware One-Step Diffusion via Dual-Prompt Guidance(https://arxiv.org/abs/2505.19742)
Keywords: robust, fair, diffusion, segmentation
Abstract: Human-centered images often suffer from severe generic degradation during transmission and are prone to human motion blur (HMB), making restoration challenging. Existing research lacks sufficient focus on these issues, as both problems often coexist in practice. To address this, we design a degradation pipeline that simulates the coexistence of HMB and generic noise, generating synthetic degraded data to train our proposed HAODiff, a human-aware one-step diffusion. Specifically, we propose a triple-branch dual-prompt guidance (DPG), which leverages high-quality images, residual noise (LQ minus HQ), and HMB segmentation masks as training targets. It produces a positive-negative prompt pair for classifier-free guidance (CFG) in a single diffusion step. The resulting adaptive dual prompts let HAODiff exploit CFG more effectively, boosting robustness against diverse degradations. For fair evaluation, we introduce MPII-Test, a benchmark rich in combined noise and HMB cases. Extensive experiments show that our HAODiff surpasses existing state-of-the-art (SOTA) methods in terms of both quantitative metrics and visual quality on synthetic and real-world datasets, including our introduced MPII-Test. Code is available at: this https URL.

Title: Token-level Accept or Reject: A Micro Alignment Approach for Large Language Models

Authors: Yang Zhang, Yu Yu, Bo Tang, Yu Zhu, Chuxiong Sun, Wenqiang Wei, Jie Hu, Zipeng Xie, Zhiyu Li, Feiyu Xiong, Edward Chung
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19743
Pdf URL: https://arxiv.org/pdf/2505.19743
Copy Paste: [[2505.19743]] Token-level Accept or Reject: A Micro Alignment Approach for Large Language Models(https://arxiv.org/abs/2505.19743)
Keywords: large language model
Abstract: With the rapid development of Large Language Models (LLMs), aligning these models with human preferences and values is critical to ensuring ethical and safe applications. However, existing alignment techniques such as RLHF or DPO often require direct fine-tuning on LLMs with billions of parameters, resulting in substantial computational costs and inefficiencies. To address this, we propose Micro token-level Accept-Reject Aligning (MARA) approach designed to operate independently of the language models. MARA simplifies the alignment process by decomposing sentence-level preference learning into token-level binary classification, where a compact three-layer fully-connected network determines whether candidate tokens are "Accepted" or "Rejected" as part of the response. Extensive experiments across seven different LLMs and three open-source datasets show that MARA achieves significant improvements in alignment performance while reducing computational costs.

Title: Improving Heart Rejection Detection in XPCI Images Using Synthetic Data Augmentation

Authors: Jakov Samardžija, Donik Vršnak, Sven Lončarić
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19746
Pdf URL: https://arxiv.org/pdf/2505.19746
Copy Paste: [[2505.19746]] Improving Heart Rejection Detection in XPCI Images Using Synthetic Data Augmentation(https://arxiv.org/abs/2505.19746)
Keywords: robust
Abstract: Accurate identification of acute cellular rejection (ACR) in endomyocardial biopsies is essential for effective management of heart transplant patients. However, the rarity of high-grade rejection cases (3R) presents a significant challenge for training robust deep learning models. This work addresses the class imbalance problem by leveraging synthetic data generation using StyleGAN to augment the limited number of real 3R images. Prior to GAN training, histogram equalization was applied to standardize image appearance and improve the consistency of tissue representation. StyleGAN was trained on available 3R biopsy patches and subsequently used to generate 10,000 realistic synthetic images. These were combined with real 0R samples, that is samples without rejection, in various configurations to train ResNet-18 classifiers for binary rejection classification. Three classifier variants were evaluated: one trained on real 0R and synthetic 3R images, another using both synthetic and additional real samples, and a third trained solely on real data. All models were tested on an independent set of real biopsy images. Results demonstrate that synthetic data improves classification performance, particularly when used in combination with real samples. The highest-performing model, which used both real and synthetic images, achieved strong precision and recall for both classes. These findings underscore the value of hybrid training strategies and highlight the potential of GAN-based data augmentation in biomedical image analysis, especially in domains constrained by limited annotated datasets.

Title: SuperAD: A Training-free Anomaly Classification and Segmentation Method for CVPR 2025 VAND 3.0 Workshop Challenge Track 1: Adapt & Detect

Authors: Huaiyuan Zhang, Hang Chen, Yu Cheng, Shunyi Wu, Linghao Sun, Linao Han, Zeyu Shi, Lei Qi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19750
Pdf URL: https://arxiv.org/pdf/2505.19750
Copy Paste: [[2505.19750]] SuperAD: A Training-free Anomaly Classification and Segmentation Method for CVPR 2025 VAND 3.0 Workshop Challenge Track 1: Adapt & Detect(https://arxiv.org/abs/2505.19750)
Keywords: robust, extraction, segmentation
Abstract: In this technical report, we present our solution to the CVPR 2025 Visual Anomaly and Novelty Detection (VAND) 3.0 Workshop Challenge Track 1: Adapt & Detect: Robust Anomaly Detection in Real-World Applications. In real-world industrial anomaly detection, it is crucial to accurately identify anomalies with physical complexity, such as transparent or reflective surfaces, occlusions, and low-contrast contaminations. The recently proposed MVTec AD 2 dataset significantly narrows the gap between publicly available benchmarks and anomalies found in real-world industrial environments. To address the challenges posed by this dataset--such as complex and varying lighting conditions and real anomalies with large scale differences--we propose a fully training-free anomaly detection and segmentation method based on feature extraction using the DINOv2 model named SuperAD. Our method carefully selects a small number of normal reference images and constructs a memory bank by leveraging the strong representational power of DINOv2. Anomalies are then segmented by performing nearest neighbor matching between test image features and the memory bank. Our method achieves competitive results on both test sets of the MVTec AD 2 dataset.

Title: SAIL: Self-supervised Albedo Estimation from Real Images with a Latent Diffusion Model

Authors: Hala Djeghim, Nathan Piasco, Luis Roldão, Moussab Bennehar, Dzmitry Tsishkou, Céline Loscos, Désiré Sidibé
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19751
Pdf URL: https://arxiv.org/pdf/2505.19751
Copy Paste: [[2505.19751]] SAIL: Self-supervised Albedo Estimation from Real Images with a Latent Diffusion Model(https://arxiv.org/abs/2505.19751)
Keywords: diffusion
Abstract: Intrinsic image decomposition aims at separating an image into its underlying albedo and shading components, isolating the base color from lighting effects to enable downstream applications such as virtual relighting and scene editing. Despite the rise and success of learning-based approaches, intrinsic image decomposition from real-world images remains a significant challenging task due to the scarcity of labeled ground-truth data. Most existing solutions rely on synthetic data as supervised setups, limiting their ability to generalize to real-world scenes. Self-supervised methods, on the other hand, often produce albedo maps that contain reflections and lack consistency under different lighting conditions. To address this, we propose SAIL, an approach designed to estimate albedo-like representations from single-view real-world images. We repurpose the prior knowledge of a latent diffusion model for unconditioned scene relighting as a surrogate objective for albedo estimation. To extract the albedo, we introduce a novel intrinsic image decomposition fully formulated in the latent space. To guide the training of our latent diffusion model, we introduce regularization terms that constrain both the lighting-dependent and independent components of our latent image decomposition. SAIL predicts stable albedo under varying lighting conditions and generalizes to multiple scenes, using only unlabeled multi-illumination data available online.

Title: Discrete Markov Bridge

Authors: Hengli Li, Yuxuan Wang, Song-Chun Zhu, Ying Nian Wu, Zilong Zheng
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.19752
Pdf URL: https://arxiv.org/pdf/2505.19752
Copy Paste: [[2505.19752]] Discrete Markov Bridge(https://arxiv.org/abs/2505.19752)
Keywords: diffusion
Abstract: Discrete diffusion has recently emerged as a promising paradigm in discrete data modeling. However, existing methods typically rely on a fixed rate transition matrix during training, which not only limits the expressiveness of latent representations, a fundamental strength of variational methods, but also constrains the overall design space. To address these limitations, we propose Discrete Markov Bridge, a novel framework specifically designed for discrete representation learning. Our approach is built upon two key components: Matrix Learning and Score Learning. We conduct a rigorous theoretical analysis, establishing formal performance guarantees for Matrix Learning and proving the convergence of the overall framework. Furthermore, we analyze the space complexity of our method, addressing practical constraints identified in prior studies. Extensive empirical evaluations validate the effectiveness of the proposed Discrete Markov Bridge, which achieves an Evidence Lower Bound (ELBO) of 1.38 on the Text8 dataset, outperforming established baselines. Moreover, the proposed model demonstrates competitive performance on the CIFAR-10 dataset, achieving results comparable to those obtained by image-specific generation approaches.

Title: NeuSym-RAG: Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering

Authors: Ruisheng Cao, Hanchong Zhang, Tiancheng Huang, Zhangyi Kang, Yuxin Zhang, Liangtai Sun, Hanqi Li, Yuxun Miao, Shuai Fan, Lu Chen, Kai Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19754
Pdf URL: https://arxiv.org/pdf/2505.19754
Copy Paste: [[2505.19754]] NeuSym-RAG: Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering(https://arxiv.org/abs/2505.19754)
Keywords: large language model
Abstract: The increasing number of academic papers poses significant challenges for researchers to efficiently acquire key details. While retrieval augmented generation (RAG) shows great promise in large language model (LLM) based automated question answering, previous works often isolate neural and symbolic retrieval despite their complementary strengths. Moreover, conventional single-view chunking neglects the rich structure and layout of PDFs, e.g., sections and tables. In this work, we propose NeuSym-RAG, a hybrid neural symbolic retrieval framework which combines both paradigms in an interactive process. By leveraging multi-view chunking and schema-based parsing, NeuSym-RAG organizes semi-structured PDF content into both the relational database and vectorstore, enabling LLM agents to iteratively gather context until sufficient to generate answers. Experiments on three full PDF-based QA datasets, including a self-annotated one AIRQA-REAL, show that NeuSym-RAG stably defeats both the vector-based RAG and various structured baselines, highlighting its capacity to unify both retrieval schemes and utilize multiple views. Code and data are publicly available at this https URL.

Title: Agentic Predictor: Performance Prediction for Agentic Workflows via Multi-View Encoding

Authors: Patara Trirat, Wonyong Jeong, Sung Ju Hwang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19764
Pdf URL: https://arxiv.org/pdf/2505.19764
Copy Paste: [[2505.19764]] Agentic Predictor: Performance Prediction for Agentic Workflows via Multi-View Encoding(https://arxiv.org/abs/2505.19764)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but optimizing LLM-based agentic systems remains challenging due to the vast search space of agent configurations, prompting strategies, and communication patterns. Existing approaches often rely on heuristic-based tuning or exhaustive evaluation, which can be computationally expensive and suboptimal. This paper proposes Agentic Predictor, a lightweight predictor for efficient agentic workflow evaluation. Agentic Predictor is equipped with a multi-view workflow encoding technique that leverages multi-view representation learning of agentic systems by incorporating code architecture, textual prompts, and interaction graph features. To achieve high predictive accuracy while significantly reducing the number of required workflow evaluations for training a predictor, Agentic Predictor employs cross-domain unsupervised pretraining. By learning to approximate task success rates, Agentic Predictor enables fast and accurate selection of optimal agentic workflow configurations for a given task, significantly reducing the need for expensive trial-and-error evaluations. Experiments on a carefully curated benchmark spanning three domains show that our predictor outperforms state-of-the-art methods in both predictive accuracy and workflow utility, highlighting the potential of performance predictors in streamlining the design of LLM-based agentic workflows.

Title: SGM: A Framework for Building Specification-Guided Moderation Filters

Authors: Masoomali Fatehkia, Enes Altinisik, Husrev Taha Sencar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19766
Pdf URL: https://arxiv.org/pdf/2505.19766
Copy Paste: [[2505.19766]] SGM: A Framework for Building Specification-Guided Moderation Filters(https://arxiv.org/abs/2505.19766)
Keywords: large language model
Abstract: Aligning large language models (LLMs) with deployment-specific requirements is critical but inherently imperfect. Despite extensive training, models remain susceptible to misalignment and adversarial inputs such as jailbreaks. Content moderation filters are commonly used as external safeguards, though they typically focus narrowly on safety. We introduce SGM (Specification-Guided Moderation), a flexible framework for training moderation filters grounded in user-defined specifications that go beyond standard safety concerns. SGM automates training data generation without relying on human-written examples, enabling scalable support for diverse, application-specific alignment goals. SGM-trained filters perform on par with state-of-the-art safety filters built on curated datasets, while supporting fine-grained and user-defined alignment control.

Title: What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs

Authors: Sangyeop Kim, Yohan Lee, Yongwoo Song, Kimin Lee
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2505.19773
Pdf URL: https://arxiv.org/pdf/2505.19773
Copy Paste: [[2505.19773]] What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs(https://arxiv.org/abs/2505.19773)
Keywords: attack, large language model
Abstract: We investigate long-context vulnerabilities in Large Language Models (LLMs) through Many-Shot Jailbreaking (MSJ). Our experiments utilize context length of up to 128K tokens. Through comprehensive analysis with various many-shot attack settings with different instruction styles, shot density, topic, and format, we reveal that context length is the primary factor determining attack effectiveness. Critically, we find that successful attacks do not require carefully crafted harmful content. Even repetitive shots or random dummy text can circumvent model safety measures, suggesting fundamental limitations in long-context processing capabilities of LLMs. The safety behavior of well-aligned models becomes increasingly inconsistent with longer contexts. These findings highlight significant safety gaps in context expansion capabilities of LLMs, emphasizing the need for new safety mechanisms.

Title: Analyzing Political Bias in LLMs via Target-Oriented Sentiment Classification

Authors: Akram Elbouanani, Evan Dufraisse, Adrian Popescu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19776
Pdf URL: https://arxiv.org/pdf/2505.19776
Copy Paste: [[2505.19776]] Analyzing Political Bias in LLMs via Target-Oriented Sentiment Classification(https://arxiv.org/abs/2505.19776)
Keywords: robust
Abstract: Political biases encoded by LLMs might have detrimental effects on downstream applications. Existing bias analysis methods rely on small-size intermediate tasks (questionnaire answering or political content generation) and rely on the LLMs themselves for analysis, thus propagating bias. We propose a new approach leveraging the observation that LLM sentiment predictions vary with the target entity in the same sentence. We define an entropy-based inconsistency metric to encode this prediction variability. We insert 1319 demographically and politically diverse politician names in 450 political sentences and predict target-oriented sentiment using seven models in six widely spoken languages. We observe inconsistencies in all tested combinations and aggregate them in a statistically robust analysis at different granularity levels. We observe positive and negative bias toward left and far-right politicians and positive correlations between politicians with similar alignment. Bias intensity is higher for Western languages than for others. Larger models exhibit stronger and more consistent biases and reduce discrepancies between similar languages. We partially mitigate LLM unreliability in target-oriented sentiment classification (TSC) by replacing politician names with fictional but plausible counterparts.

Title: What Can RL Bring to VLA Generalization? An Empirical Study

Authors: Jijia Liu, Feng Gao, Bingwen Wei, Xinlei Chen, Qingmin Liao, Yi Wu, Chao Yu, Yu Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19789
Pdf URL: https://arxiv.org/pdf/2505.19789
Copy Paste: [[2505.19789]] What Can RL Bring to VLA Generalization? An Empirical Study(https://arxiv.org/abs/2505.19789)
Keywords: robust
Abstract: Large Vision-Language Action (VLA) models have shown significant potential for embodied AI. However, their predominant training via supervised fine-tuning (SFT) limits generalization due to susceptibility to compounding errors under distribution shifts. Reinforcement learning (RL) offers a path to overcome these limitations by optimizing for task objectives via trial-and-error, yet a systematic understanding of its specific generalization benefits for VLAs compared to SFT is lacking. To address this, our study introduces a comprehensive benchmark for evaluating VLA generalization and systematically investigates the impact of RL fine-tuning across diverse visual, semantic, and execution dimensions. Our extensive experiments reveal that RL fine-tuning, particularly with PPO, significantly enhances generalization in semantic understanding and execution robustness over SFT, while maintaining comparable visual robustness. We identify PPO as a more effective RL algorithm for VLAs than LLM-derived methods like DPO and GRPO. We also develop a simple recipe for efficient PPO training on VLAs, and demonstrate its practical utility for improving VLA generalization. The project page is at this https URL

Title: The Missing Point in Vision Transformers for Universal Image Segmentation

Authors: Sajjad Shahabodini, Mobina Mansoori, Farnoush Bayatmakou, Jamshid Abouei, Konstantinos N. Plataniotis, Arash Mohammadi
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2505.19795
Pdf URL: https://arxiv.org/pdf/2505.19795
Copy Paste: [[2505.19795]] The Missing Point in Vision Transformers for Universal Image Segmentation(https://arxiv.org/abs/2505.19795)
Keywords: robust, transformer, segmentation
Abstract: Image segmentation remains a challenging task in computer vision, demanding robust mask generation and precise classification. Recent mask-based approaches yield high-quality masks by capturing global context. However, accurately classifying these masks, especially in the presence of ambiguous boundaries and imbalanced class distributions, remains an open challenge. In this work, we introduce ViT-P, a novel two-stage segmentation framework that decouples mask generation from classification. The first stage employs a proposal generator to produce class-agnostic mask proposals, while the second stage utilizes a point-based classification model built on the Vision Transformer (ViT) to refine predictions by focusing on mask central points. ViT-P serves as a pre-training-free adapter, allowing the integration of various pre-trained vision transformers without modifying their architecture, ensuring adaptability to dense prediction tasks. Furthermore, we demonstrate that coarse and bounding box annotations can effectively enhance classification without requiring additional training on fine annotation datasets, reducing annotation costs while maintaining strong performance. Extensive experiments across COCO, ADE20K, and Cityscapes datasets validate the effectiveness of ViT-P, achieving state-of-the-art results with 54.0 PQ on ADE20K panoptic segmentation, 87.4 mIoU on Cityscapes semantic segmentation, and 63.6 mIoU on ADE20K semantic segmentation. The code and pretrained models are available at: this https URL}{this https URL.

Title: The Avengers: A Simple Recipe for Uniting Smaller Language Models to Challenge Proprietary Giants

Authors: Yiqun Zhang, Hao Li, Chenxu Wang, Linyao Chen, Qiaosheng Zhang, Peng Ye, Shi Feng, Daling Wang, Zhen Wang, Xinrun Wang, Jia Xu, Lei Bai, Wanli Ouyang, Shuyue Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19797
Pdf URL: https://arxiv.org/pdf/2505.19797
Copy Paste: [[2505.19797]] The Avengers: A Simple Recipe for Uniting Smaller Language Models to Challenge Proprietary Giants(https://arxiv.org/abs/2505.19797)
Keywords: robust
Abstract: As proprietary giants increasingly dominate the race for ever-larger language models, a pressing question arises for the open-source community: can smaller models remain competitive across a broad range of tasks? In this paper, we present the Avengers--a simple recipe that effectively leverages the collective intelligence of open-source, smaller language models. Our framework is built upon four lightweight operations: (i) embedding: encode queries using a text embedding model; (ii) clustering: group queries based on their semantic similarity; (iii) scoring: scores each model's performance within each cluster; and (iv) voting: improve outputs via repeated sampling and voting. At inference time, each query is embedded and assigned to its nearest cluster. The top-performing model(s) within that cluster are selected to generate the response using the Self-Consistency or its multi-model variant. Remarkably, with 10 open-source models (~7B parameters each), the Avengers collectively outperforms GPT-4.1 on 10 out of 15 datasets (spanning mathematics, code, logic, knowledge, and affective tasks). In particular, it surpasses GPT-4.1 on mathematics tasks by 18.21% and on code tasks by 7.46%. Furthermore, the Avengers delivers superior out-of-distribution generalization, and remains robust across various embedding models, clustering algorithms, ensemble strategies, and values of its sole parameter--the number of clusters. We have open-sourced the code on GitHub: this https URL

Title: MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs

Authors: Zaid Alyafeai, Maged S. Al-Shaibani, Bernard Ghanem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19800
Pdf URL: https://arxiv.org/pdf/2505.19800
Copy Paste: [[2505.19800]] MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs(https://arxiv.org/abs/2505.19800)
Keywords: robust, extraction, large language model
Abstract: Metadata extraction is essential for cataloging and preserving datasets, enabling effective research discovery and reproducibility, especially given the current exponential growth in scientific research. While Masader (Alyafeai et al.,2021) laid the groundwork for extracting a wide range of metadata attributes from Arabic NLP datasets' scholarly articles, it relies heavily on manual annotation. In this paper, we present MOLE, a framework that leverages Large Language Models (LLMs) to automatically extract metadata attributes from scientific papers covering datasets of languages other than Arabic. Our schema-driven methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output. Additionally, we introduce a new benchmark to evaluate the research progress on this task. Through systematic analysis of context length, few-shot learning, and web browsing integration, we demonstrate that modern LLMs show promising results in automating this task, highlighting the need for further future work improvements to ensure consistent and reliable performance. We release the code: this https URL and dataset: this https URL for the research community.

Title: GraphAU-Pain: Graph-based Action Unit Representation for Pain Intensity Estimation

Authors: Zhiyu Wang, Yang Liu, Hatice Gunes
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.19802
Pdf URL: https://arxiv.org/pdf/2505.19802
Copy Paste: [[2505.19802]] GraphAU-Pain: Graph-based Action Unit Representation for Pain Intensity Estimation(https://arxiv.org/abs/2505.19802)
Keywords: interpretability
Abstract: Understanding pain-related facial behaviors is essential for digital healthcare in terms of effective monitoring, assisted diagnostics, and treatment planning, particularly for patients unable to communicate verbally. Existing data-driven methods of detecting pain from facial expressions are limited due to interpretability and severity quantification. To this end, we propose GraphAU-Pain, leveraging a graph-based framework to model facial Action Units (AUs) and their interrelationships for pain intensity estimation. AUs are represented as graph nodes, with co-occurrence relationships as edges, enabling a more expressive depiction of pain-related facial behaviors. By utilizing a relational graph neural network, our framework offers improved interpretability and significant performance gains. Experiments conducted on the publicly available UNBC dataset demonstrate the effectiveness of the GraphAU-Pain, achieving an F1-score of 66.21% and accuracy of 87.61% in pain intensity estimation.

Title: Compliance-to-Code: Enhancing Financial Compliance Checking via Code Generation

Authors: Siyuan Li, Jian Chen, Rui Yao, Xuming Hu, Peilin Zhou, Weihua Qiu, Simin Zhang, Chucheng Dong, Zhiyao Li, Qipeng Xie, Zixuan Yuan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19804
Pdf URL: https://arxiv.org/pdf/2505.19804
Copy Paste: [[2505.19804]] Compliance-to-Code: Enhancing Financial Compliance Checking via Code Generation(https://arxiv.org/abs/2505.19804)
Keywords: large language model
Abstract: Nowadays, regulatory compliance has become a cornerstone of corporate governance, ensuring adherence to systematic legal frameworks. At its core, financial regulations often comprise highly intricate provisions, layered logical structures, and numerous exceptions, which inevitably result in labor-intensive or comprehension challenges. To mitigate this, recent Regulatory Technology (RegTech) and Large Language Models (LLMs) have gained significant attention in automating the conversion of regulatory text into executable compliance logic. However, their performance remains suboptimal particularly when applied to Chinese-language financial regulations, due to three key limitations: (1) incomplete domain-specific knowledge representation, (2) insufficient hierarchical reasoning capabilities, and (3) failure to maintain temporal and logical coherence. One promising solution is to develop a domain specific and code-oriented datasets for model training. Existing datasets such as LexGLUE, LegalBench, and CODE-ACCORD are often English-focused, domain-mismatched, or lack fine-grained granularity for compliance code generation. To fill these gaps, we present Compliance-to-Code, the first large-scale Chinese dataset dedicated to financial regulatory compliance. Covering 1,159 annotated clauses from 361 regulations across ten categories, each clause is modularly structured with four logical elements-subject, condition, constraint, and contextual information-along with regulation relations. We provide deterministic Python code mappings, detailed code reasoning, and code explanations to facilitate automated auditing. To demonstrate utility, we present FinCheck: a pipeline for regulation structuring, code generation, and report generation.

Title: Exploring Consciousness in LLMs: A Systematic Survey of Theories, Implementations, and Frontier Risks

Authors: Sirui Chen, Shuqin Ma, Shu Yu, Hanwang Zhang, Shengjie Zhao, Chaochao Lu
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19806
Pdf URL: https://arxiv.org/pdf/2505.19806
Copy Paste: [[2505.19806]] Exploring Consciousness in LLMs: A Systematic Survey of Theories, Implementations, and Frontier Risks(https://arxiv.org/abs/2505.19806)
Keywords: large language model
Abstract: Consciousness stands as one of the most profound and distinguishing features of the human mind, fundamentally shaping our understanding of existence and agency. As large language models (LLMs) develop at an unprecedented pace, questions concerning intelligence and consciousness have become increasingly significant. However, discourse on LLM consciousness remains largely unexplored territory. In this paper, we first clarify frequently conflated terminologies (e.g., LLM consciousness and LLM awareness). Then, we systematically organize and synthesize existing research on LLM consciousness from both theoretical and empirical perspectives. Furthermore, we highlight potential frontier risks that conscious LLMs might introduce. Finally, we discuss current challenges and outline future directions in this emerging field. The references discussed in this paper are organized at this https URL.

Title: Density Ratio-Free Doubly Robust Proxy Causal Learning

Authors: Bariscan Bozkurt, Houssam Zenati, Dimitri Meunier, Liyuan Xu, Arthur Gretton
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.19807
Pdf URL: https://arxiv.org/pdf/2505.19807
Copy Paste: [[2505.19807]] Density Ratio-Free Doubly Robust Proxy Causal Learning(https://arxiv.org/abs/2505.19807)
Keywords: robust
Abstract: We study the problem of causal function estimation in the Proxy Causal Learning (PCL) framework, where confounders are not observed but proxies for the confounders are available. Two main approaches have been proposed: outcome bridge-based and treatment bridge-based methods. In this work, we propose two kernel-based doubly robust estimators that combine the strengths of both approaches, and naturally handle continuous and high-dimensional variables. Our identification strategy builds on a recent density ratio-free method for treatment bridge-based PCL; furthermore, in contrast to previous approaches, it does not require indicator functions or kernel smoothing over the treatment variable. These properties make it especially well-suited for continuous or high-dimensional treatments. By using kernel mean embeddings, we have closed-form solutions and strong consistency guarantees. Our estimators outperform existing methods on PCL benchmarks, including a prior doubly robust method that requires both kernel smoothing and density ratio estimation.

Title: Efficient Multi-modal Long Context Learning for Training-free Adaptation

Authors: Zehong Ma, Shiliang Zhang, Longhui Wei, Qi Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19812
Pdf URL: https://arxiv.org/pdf/2505.19812
Copy Paste: [[2505.19812]] Efficient Multi-modal Long Context Learning for Training-free Adaptation(https://arxiv.org/abs/2505.19812)
Keywords: large language model
Abstract: Traditional approaches to adapting multi-modal large language models (MLLMs) to new tasks have relied heavily on fine-tuning. This paper introduces Efficient Multi-Modal Long Context Learning (EMLoC), a novel training-free alternative that embeds demonstration examples directly into the model input. EMLoC offers a more efficient, flexible, and scalable solution for task adaptation. Because extremely lengthy inputs introduce prohibitive computational and memory overhead, EMLoC contributes a chunk-wise compression mechanism combined with layer-wise adaptive pruning. It condenses long-context multimodal inputs into compact, task-specific memory representations. By adaptively pruning tokens at each layer under a Jensen-Shannon divergence constraint, our method achieves a dramatic reduction in inference complexity without sacrificing performance. This approach is the first to seamlessly integrate compression and pruning techniques for multi-modal long-context learning, offering a scalable and efficient solution for real-world applications. Extensive experiments on diverse vision-language benchmarks demonstrate that EMLoC achieves performance on par with or superior to naive long-context approaches. Our results highlight the potential of EMLoC as a groundbreaking framework for efficient and flexible adaptation of multi-modal models in resource-constrained environments. Codes are publicly available at this https URL.

Title: GoLF-NRT: Integrating Global Context and Local Geometry for Few-Shot View Synthesis

Authors: You Wang, Li Fang, Hao Zhu, Fei Hu, Long Ye, Zhan Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19813
Pdf URL: https://arxiv.org/pdf/2505.19813
Copy Paste: [[2505.19813]] GoLF-NRT: Integrating Global Context and Local Geometry for Few-Shot View Synthesis(https://arxiv.org/abs/2505.19813)
Keywords: transformer
Abstract: Neural Radiance Fields (NeRF) have transformed novel view synthesis by modeling scene-specific volumetric representations directly from images. While generalizable NeRF models can generate novel views across unknown scenes by learning latent ray representations, their performance heavily depends on a large number of multi-view observations. However, with limited input views, these methods experience significant degradation in rendering quality. To address this limitation, we propose GoLF-NRT: a Global and Local feature Fusion-based Neural Rendering Transformer. GoLF-NRT enhances generalizable neural rendering from few input views by leveraging a 3D transformer with efficient sparse attention to capture global scene context. In parallel, it integrates local geometric features extracted along the epipolar line, enabling high-quality scene reconstruction from as few as 1 to 3 input views. Furthermore, we introduce an adaptive sampling strategy based on attention weights and kernel regression, improving the accuracy of transformer-based neural rendering. Extensive experiments on public datasets show that GoLF-NRT achieves state-of-the-art performance across varying numbers of input views, highlighting the effectiveness and superiority of our approach. Code is available at this https URL.

Title: Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective

Authors: Junnan Liu, Hongwei Liu, Linchen Xiao, Shudong Liu, Taolin Zhang, Zihan Ma, Songyang Zhang, Kai Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19815
Pdf URL: https://arxiv.org/pdf/2505.19815
Copy Paste: [[2505.19815]] Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective(https://arxiv.org/abs/2505.19815)
Keywords: large language model
Abstract: We propose a novel framework for comprehending the reasoning capabilities of large language models (LLMs) through the perspective of meta-learning. By conceptualizing reasoning trajectories as pseudo-gradient descent updates to the LLM's parameters, we identify parallels between LLM reasoning and various meta-learning paradigms. We formalize the training process for reasoning tasks as a meta-learning setup, with each question treated as an individual task, and reasoning trajectories serving as the inner loop optimization for adapting model parameters. Once trained on a diverse set of questions, the LLM develops fundamental reasoning capabilities that can generalize to previously unseen questions. Extensive empirical evaluations substantiate the strong connection between LLM reasoning and meta-learning, exploring several issues of significant interest from a meta-learning standpoint. Our work not only enhances the understanding of LLM reasoning but also provides practical insights for improving these models through established meta-learning techniques.

Title: InfoCons: Identifying Interpretable Critical Concepts in Point Clouds via Information Theory

Authors: Feifei Li, Mi Zhang, Zhaoxiang Wang, Min Yang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19820
Pdf URL: https://arxiv.org/pdf/2505.19820
Copy Paste: [[2505.19820]] InfoCons: Identifying Interpretable Critical Concepts in Point Clouds via Information Theory(https://arxiv.org/abs/2505.19820)
Keywords: interpretability
Abstract: Interpretability of point cloud (PC) models becomes imperative given their deployment in safety-critical scenarios such as autonomous vehicles. We focus on attributing PC model outputs to interpretable critical concepts, defined as meaningful subsets of the input point cloud. To enable human-understandable diagnostics of model failures, an ideal critical subset should be *faithful* (preserving points that causally influence predictions) and *conceptually coherent* (forming semantically meaningful structures that align with human perception). We propose InfoCons, an explanation framework that applies information-theoretic principles to decompose the point cloud into 3D concepts, enabling the examination of their causal effect on model predictions with learnable priors. We evaluate InfoCons on synthetic datasets for classification, comparing it qualitatively and quantitatively with four baselines. We further demonstrate its scalability and flexibility on two real-world datasets and in two applications that utilize critical scores of PC.

Title: Poison in the Well: Feature Embedding Disruption in Backdoor Attacks

Authors: Zhou Feng, Jiahao Chen, Chunyi Zhou, Yuwen Pu, Qingming Li, Shouling Ji
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19821
Pdf URL: https://arxiv.org/pdf/2505.19821
Copy Paste: [[2505.19821]] Poison in the Well: Feature Embedding Disruption in Backdoor Attacks(https://arxiv.org/abs/2505.19821)
Keywords: defense, attack, robust, steal
Abstract: Backdoor attacks embed malicious triggers into training data, enabling attackers to manipulate neural network behavior during inference while maintaining high accuracy on benign inputs. However, existing backdoor attacks face limitations manifesting in excessive reliance on training data, poor stealth, and instability, which hinder their effectiveness in real-world applications. Therefore, this paper introduces ShadowPrint, a versatile backdoor attack that targets feature embeddings within neural networks to achieve high ASRs and stealthiness. Unlike traditional approaches, ShadowPrint reduces reliance on training data access and operates effectively with exceedingly low poison rates (as low as 0.01%). It leverages a clustering-based optimization strategy to align feature embeddings, ensuring robust performance across diverse scenarios while maintaining stability and stealth. Extensive evaluations demonstrate that ShadowPrint achieves superior ASR (up to 100%), steady CA (with decay no more than 1% in most cases), and low DDR (averaging below 5%) across both clean-label and dirty-label settings, and with poison rates ranging from as low as 0.01% to 0.05%, setting a new standard for backdoor attack capabilities and emphasizing the need for advanced defense strategies focused on feature space manipulations.

Title: LAPA-based Dynamic Privacy Optimization for Wireless Federated Learning in Heterogeneous Environments

Authors: Pengcheng Sun, Erwu Liu, Wei Ni, Rui Wang, Yuanzhe Geng, Lijuan Lai, Abbas Jamalipour
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19823
Pdf URL: https://arxiv.org/pdf/2505.19823
Copy Paste: [[2505.19823]] LAPA-based Dynamic Privacy Optimization for Wireless Federated Learning in Heterogeneous Environments(https://arxiv.org/abs/2505.19823)
Keywords: privacy, protect, attack, federate
Abstract: Federated Learning (FL) is a distributed machine learning paradigm based on protecting data privacy of devices, which however, can still be broken by gradient leakage attack via parameter inversion techniques. Differential privacy (DP) technology reduces the risk of private data leakage by adding artificial noise to the gradients, but detrimental to the FL utility at the same time, especially in the scenario where the data is Non-Independent Identically Distributed (Non-IID). Based on the impact of heterogeneous data on aggregation performance, this paper proposes a Lightweight Adaptive Privacy Allocation (LAPA) strategy, which assigns personalized privacy budgets to devices in each aggregation round without transmitting any additional information beyond gradients, ensuring both privacy protection and aggregation efficiency. Furthermore, the Deep Deterministic Policy Gradient (DDPG) algorithm is employed to optimize the transmission power, in order to determine the optimal timing at which the adaptively attenuated artificial noise aligns with the communication noise, enabling an effective balance between DP and system utility. Finally, a reliable aggregation strategy is designed by integrating communication quality and data distribution characteristics, which improves aggregation performance while preserving privacy. Experimental results demonstrate that the personalized noise allocation and dynamic optimization strategy based on LAPA proposed in this paper enhances convergence performance while satisfying the privacy requirements of FL.

Title: Foundation Models for Tabular Data within Systemic Contexts Need Grounding

Authors: Tassilo Klein, Johannes Hoffart
Subjects: cs.LG, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2505.19825
Pdf URL: https://arxiv.org/pdf/2505.19825
Copy Paste: [[2505.19825]] Foundation Models for Tabular Data within Systemic Contexts Need Grounding(https://arxiv.org/abs/2505.19825)
Keywords: robust
Abstract: Current research on tabular foundation models often overlooks the complexities of large-scale, real-world data by treating tables as isolated entities and assuming information completeness, thereby neglecting the vital operational context. To address this, we introduce the concept of Semantically Linked Tables (SLT), recognizing that tables are inherently connected to both declarative and procedural operational knowledge. We propose Foundation Models for Semantically Linked Tables (FMSLT), which integrate these components to ground tabular data within its true operational context. This comprehensive representation unlocks the full potential of machine learning for complex, interconnected tabular data across diverse domains. Realizing FMSLTs requires access to operational knowledge that is often unavailable in public datasets, highlighting the need for close collaboration between domain experts and researchers. Our work exposes the limitations of current tabular foundation models and proposes a new direction centered on FMSLTs, aiming to advance robust, context-aware models for structured data.

Title: FoodTaxo: Generating Food Taxonomies with Large Language Models

Authors: Pascal Wullschleger, Majid Zarharan, Donnacha Daly, Marc Pouly, Jennifer Foster
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19838
Pdf URL: https://arxiv.org/pdf/2505.19838
Copy Paste: [[2505.19838]] FoodTaxo: Generating Food Taxonomies with Large Language Models(https://arxiv.org/abs/2505.19838)
Keywords: large language model
Abstract: We investigate the utility of Large Language Models for automated taxonomy generation and completion specifically applied to taxonomies from the food technology industry. We explore the extent to which taxonomies can be completed from a seed taxonomy or generated without a seed from a set of known concepts, in an iterative fashion using recent prompting techniques. Experiments on five taxonomies using an open-source LLM (Llama-3), while promising, point to the difficulty of correctly placing inner nodes.

Title: One Surrogate to Fool Them All: Universal, Transferable, and Targeted Adversarial Attacks with CLIP

Authors: Binyan Xu, Xilin Dai, Di Tang, Kehuan Zhang
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19840
Pdf URL: https://arxiv.org/pdf/2505.19840
Copy Paste: [[2505.19840]] One Surrogate to Fool Them All: Universal, Transferable, and Targeted Adversarial Attacks with CLIP(https://arxiv.org/abs/2505.19840)
Keywords: security, attack
Abstract: Deep Neural Networks (DNNs) have achieved widespread success yet remain prone to adversarial attacks. Typically, such attacks either involve frequent queries to the target model or rely on surrogate models closely mirroring the target model -- often trained with subsets of the target model's training data -- to achieve high attack success rates through transferability. However, in realistic scenarios where training data is inaccessible and excessive queries can raise alarms, crafting adversarial examples becomes more challenging. In this paper, we present UnivIntruder, a novel attack framework that relies solely on a single, publicly available CLIP model and publicly available datasets. By using textual concepts, UnivIntruder generates universal, transferable, and targeted adversarial perturbations that mislead DNNs into misclassifying inputs into adversary-specified classes defined by textual concepts. Our extensive experiments show that our approach achieves an Attack Success Rate (ASR) of up to 85% on ImageNet and over 99% on CIFAR-10, significantly outperforming existing transfer-based methods. Additionally, we reveal real-world vulnerabilities, showing that even without querying target models, UnivIntruder compromises image search engines like Google and Baidu with ASR rates up to 84%, and vision language models like GPT-4 and Claude-3.5 with ASR rates up to 80%. These findings underscore the practicality of our attack in scenarios where traditional avenues are blocked, highlighting the need to reevaluate security paradigms in AI applications.

Title: Zero-Shot Pseudo Labels Generation Using SAM and CLIP for Semi-Supervised Semantic Segmentation

Authors: Nagito Saito, Shintaro Ito, Koichi Ito, Takafumi Aoki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19846
Pdf URL: https://arxiv.org/pdf/2505.19846
Copy Paste: [[2505.19846]] Zero-Shot Pseudo Labels Generation Using SAM and CLIP for Semi-Supervised Semantic Segmentation(https://arxiv.org/abs/2505.19846)
Keywords: segmentation
Abstract: Semantic segmentation is a fundamental task in medical image analysis and autonomous driving and has a problem with the high cost of annotating the labels required in training. To address this problem, semantic segmentation methods based on semi-supervised learning with a small number of labeled data have been proposed. For example, one approach is to train a semantic segmentation model using images with annotated labels and pseudo labels. In this approach, the accuracy of the semantic segmentation model depends on the quality of the pseudo labels, and the quality of the pseudo labels depends on the performance of the model to be trained and the amount of data with annotated labels. In this paper, we generate pseudo labels using zero-shot annotation with the Segment Anything Model (SAM) and Contrastive Language-Image Pretraining (CLIP), improve the accuracy of the pseudo labels using the Unified Dual-Stream Perturbations Approach (UniMatch), and use them as enhanced labels to train a semantic segmentation model. The effectiveness of the proposed method is demonstrated through the experiments using the public datasets: PASCAL and MS COCO.

Title: Improving Multilingual Math Reasoning for African Languages

Authors: Odunayo Ogundepo, Akintunde Oladipo, Kelechi Ogueji, Esther Adenuga, David Ifeoluwa Adelani, Jimmy Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19848
Pdf URL: https://arxiv.org/pdf/2505.19848
Copy Paste: [[2505.19848]] Improving Multilingual Math Reasoning for African Languages(https://arxiv.org/abs/2505.19848)
Keywords: large language model
Abstract: Researchers working on low-resource languages face persistent challenges due to limited data availability and restricted access to computational resources. Although most large language models (LLMs) are predominantly trained in high-resource languages, adapting them to low-resource contexts, particularly African languages, requires specialized techniques. Several strategies have emerged for adapting models to low-resource languages in todays LLM landscape, defined by multi-stage pre-training and post-training paradigms. However, the most effective approaches remain uncertain. This work systematically investigates which adaptation strategies yield the best performance when extending existing LLMs to African languages. We conduct extensive experiments and ablation studies to evaluate different combinations of data types (translated versus synthetically generated), training stages (pre-training versus post-training), and other model adaptation configurations. Our experiments focuses on mathematical reasoning tasks, using the Llama 3.1 model family as our base model.

Title: Beyond Specialization: Benchmarking LLMs for Transliteration of Indian Languages

Authors: Gulfarogh Azam, Mohd Sadique, Saif Ali, Mohammad Nadeem, Erik Cambria, Shahab Saquib Sohail, Mohammad Sultan Alam
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19851
Pdf URL: https://arxiv.org/pdf/2505.19851
Copy Paste: [[2505.19851]] Beyond Specialization: Benchmarking LLMs for Transliteration of Indian Languages(https://arxiv.org/abs/2505.19851)
Keywords: robust, large language model
Abstract: Transliteration, the process of mapping text from one script to another, plays a crucial role in multilingual natural language processing, especially within linguistically diverse contexts such as India. Despite significant advancements through specialized models like IndicXlit, recent developments in large language models suggest a potential for general-purpose models to excel at this task without explicit task-specific training. The current work systematically evaluates the performance of prominent LLMs, including GPT-4o, GPT-4.5, GPT-4.1, Gemma-3-27B-it, and Mistral-Large against IndicXlit, a state-of-the-art transliteration model, across ten major Indian languages. Experiments utilized standard benchmarks, including Dakshina and Aksharantar datasets, with performance assessed via Top-1 Accuracy and Character Error Rate. Our findings reveal that while GPT family models generally outperform other LLMs and IndicXlit for most instances. Additionally, fine-tuning GPT-4o improves performance on specific languages notably. An extensive error analysis and robustness testing under noisy conditions further elucidate strengths of LLMs compared to specialized models, highlighting the efficacy of foundational models for a wide spectrum of specialized applications with minimal overhead.

Title: Editing as Unlearning: Are Knowledge Editing Methods Strong Baselines for Large Language Model Unlearning?

Authors: Zexi Li, Xiangzhu Wang, William F. Shen, Meghdad Kurmanji, Xinchi Qiu, Dongqi Cai, Chao Wu, Nicholas D. Lane
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19855
Pdf URL: https://arxiv.org/pdf/2505.19855
Copy Paste: [[2505.19855]] Editing as Unlearning: Are Knowledge Editing Methods Strong Baselines for Large Language Model Unlearning?(https://arxiv.org/abs/2505.19855)
Keywords: large language model
Abstract: Large language Model (LLM) unlearning, i.e., selectively removing information from LLMs, is vital for responsible model deployment. Differently, LLM knowledge editing aims to modify LLM knowledge instead of removing it. Though editing and unlearning seem to be two distinct tasks, we find there is a tight connection between them. In this paper, we conceptualize unlearning as a special case of editing where information is modified to a refusal or "empty set" $\emptyset$ response, signifying its removal. This paper thus investigates if knowledge editing techniques are strong baselines for LLM unlearning. We evaluate state-of-the-art (SOTA) editing methods (e.g., ROME, MEMIT, GRACE, WISE, and AlphaEdit) against existing unlearning approaches on pretrained and finetuned knowledge. Results show certain editing methods, notably WISE and AlphaEdit, are effective unlearning baselines, especially for pretrained knowledge, and excel in generating human-aligned refusal answers. To better adapt editing methods for unlearning applications, we propose practical recipes including self-improvement and query merging. The former leverages the LLM's own in-context learning ability to craft a more human-aligned unlearning target, and the latter enables ROME and MEMIT to perform well in unlearning longer sample sequences. We advocate for the unlearning community to adopt SOTA editing methods as baselines and explore unlearning from an editing perspective for more holistic LLM memory control.

Title: CPA-RAG:Covert Poisoning Attacks on Retrieval-Augmented Generation in Large Language Models

Authors: Chunyang Li, Junwei Zhang, Anda Cheng, Zhuo Ma, Xinghua Li, Jianfeng Ma
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.19864
Pdf URL: https://arxiv.org/pdf/2505.19864
Copy Paste: [[2505.19864]] CPA-RAG:Covert Poisoning Attacks on Retrieval-Augmented Generation in Large Language Models(https://arxiv.org/abs/2505.19864)
Keywords: secure, defense, attack, robust, large language model
Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, but its openness introduces vulnerabilities that can be exploited by poisoning attacks. Existing poisoning methods for RAG systems have limitations, such as poor generalization and lack of fluency in adversarial texts. In this paper, we propose CPA-RAG, a black-box adversarial framework that generates query-relevant texts capable of manipulating the retrieval process to induce target answers. The proposed method integrates prompt-based text generation, cross-guided optimization through multiple LLMs, and retriever-based scoring to construct high-quality adversarial samples. We conduct extensive experiments across multiple datasets and LLMs to evaluate its effectiveness. Results show that the framework achieves over 90\% attack success when the top-k retrieval setting is 5, matching white-box performance, and maintains a consistent advantage of approximately 5 percentage points across different top-k values. It also outperforms existing black-box baselines by 14.5 percentage points under various defense strategies. Furthermore, our method successfully compromises a commercial RAG system deployed on Alibaba's BaiLian platform, demonstrating its practical threat in real-world applications. These findings underscore the need for more robust and secure RAG frameworks to defend against poisoning attacks.

Title: Deep Active Inference Agents for Delayed and Long-Horizon Environments

Authors: Yavar Taheri Yeganeh, Mohsen Jafari, Andrea Matta
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19867
Pdf URL: https://arxiv.org/pdf/2505.19867
Copy Paste: [[2505.19867]] Deep Active Inference Agents for Delayed and Long-Horizon Environments(https://arxiv.org/abs/2505.19867)
Keywords: generative
Abstract: With the recent success of world-model agents, which extend the core idea of model-based reinforcement learning by learning a differentiable model for sample-efficient control across diverse tasks, active inference (AIF) offers a complementary, neuroscience-grounded paradigm that unifies perception, learning, and action within a single probabilistic framework powered by a generative model. Despite this promise, practical AIF agents still rely on accurate immediate predictions and exhaustive planning, a limitation that is exacerbated in delayed environments requiring plans over long horizons, tens to hundreds of steps. Moreover, most existing agents are evaluated on robotic or vision benchmarks which, while natural for biological agents, fall short of real-world industrial complexity. We address these limitations with a generative-policy architecture featuring (i) a multi-step latent transition that lets the generative model predict an entire horizon in a single look-ahead, (ii) an integrated policy network that enables the transition and receives gradients of the expected free energy, (iii) an alternating optimization scheme that updates model and policy from a replay buffer, and (iv) a single gradient step that plans over long horizons, eliminating exhaustive planning from the control loop. We evaluate our agent in an environment that mimics a realistic industrial scenario with delayed and long-horizon settings. The empirical results confirm the effectiveness of the proposed approach, demonstrating the coupled world-model with the AIF formalism yields an end-to-end probabilistic controller capable of effective decision making in delayed, long-horizon settings without handcrafted rewards or expensive planning.

Title: Harnessing the Power of Training-Free Techniques in Text-to-2D Generation for Text-to-3D Generation via Score Distillation Sampling

Authors: Junhong Lee, Seungwook Kim, Minsu Cho
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19868
Pdf URL: https://arxiv.org/pdf/2505.19868
Copy Paste: [[2505.19868]] Harnessing the Power of Training-Free Techniques in Text-to-2D Generation for Text-to-3D Generation via Score Distillation Sampling(https://arxiv.org/abs/2505.19868)
Keywords: diffusion
Abstract: Recent studies show that simple training-free techniques can dramatically improve the quality of text-to-2D generation outputs, e.g. Classifier-Free Guidance (CFG) or FreeU. However, these training-free techniques have been underexplored in the lens of Score Distillation Sampling (SDS), which is a popular and effective technique to leverage the power of pretrained text-to-2D diffusion models for various tasks. In this paper, we aim to shed light on the effect such training-free techniques have on SDS, via a particular application of text-to-3D generation via 2D lifting. We present our findings, which show that varying the scales of CFG presents a trade-off between object size and surface smoothness, while varying the scales of FreeU presents a trade-off between texture details and geometric errors. Based on these findings, we provide insights into how we can effectively harness training-free techniques for SDS, via a strategic scaling of such techniques in a dynamic manner with respect to the timestep or optimization iteration step. We show that using our proposed scheme strikes a favorable balance between texture details and surface smoothness in text-to-3D generations, while preserving the size of the output and mitigating the occurrence of geometric defects.

Title: Deep Spectral Prior

Authors: Yanqi Cheng, Tieyong Zeng, Pietro Lio, Carola-Bibiane Schönlieb, Angelica I Aviles-Rivero
Subjects: cs.CV, math.NA
Abstract URL: https://arxiv.org/abs/2505.19873
Pdf URL: https://arxiv.org/pdf/2505.19873
Copy Paste: [[2505.19873]] Deep Spectral Prior(https://arxiv.org/abs/2505.19873)
Keywords: robust, interpretability
Abstract: We introduce Deep Spectral Prior (DSP), a new formulation of Deep Image Prior (DIP) that redefines image reconstruction as a frequency-domain alignment problem. Unlike traditional DIP, which relies on pixel-wise loss and early stopping to mitigate overfitting, DSP directly matches Fourier coefficients between the network output and observed measurements. This shift introduces an explicit inductive bias towards spectral coherence, aligning with the known frequency structure of images and the spectral bias of convolutional neural networks. We provide a rigorous theoretical framework demonstrating that DSP acts as an implicit spectral regulariser, suppressing high-frequency noise by design and eliminating the need for early stopping. Our analysis spans four core dimensions establishing smooth convergence dynamics, local stability, and favourable bias-variance tradeoffs. We further show that DSP naturally projects reconstructions onto a frequency-consistent manifold, enhancing interpretability and robustness. These theoretical guarantees are supported by empirical results across denoising, inpainting, and super-resolution tasks, where DSP consistently outperforms classical DIP and other unsupervised baselines.

Title: StyleAR: Customizing Multimodal Autoregressive Model for Style-Aligned Text-to-Image Generation

Authors: Yi Wu, Lingting Zhu, Shengju Qian, Lei Liu, Wandi Qiao, Lequan Yu, Bin Li
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2505.19874
Pdf URL: https://arxiv.org/pdf/2505.19874
Copy Paste: [[2505.19874]] StyleAR: Customizing Multimodal Autoregressive Model for Style-Aligned Text-to-Image Generation(https://arxiv.org/abs/2505.19874)
Keywords: generative
Abstract: In the current research landscape, multimodal autoregressive (AR) models have shown exceptional capabilities across various domains, including visual understanding and generation. However, complex tasks such as style-aligned text-to-image generation present significant challenges, particularly in data acquisition. In analogy to instruction-following tuning for image editing of AR models, style-aligned generation requires a reference style image and prompt, resulting in a text-image-to-image triplet where the output shares the style and semantics of the input. However, acquiring large volumes of such triplet data with specific styles is considerably more challenging than obtaining conventional text-to-image data used for training generative models. To address this issue, we propose StyleAR, an innovative approach that combines a specially designed data curation method with our proposed AR models to effectively utilize text-to-image binary data for style-aligned text-to-image generation. Our method synthesizes target stylized data using a reference style image and prompt, but only incorporates the target stylized image as the image modality to create high-quality binary data. To facilitate binary data training, we introduce a CLIP image encoder with a perceiver resampler that translates the image input into style tokens aligned with multimodal tokens in AR models and implement a style-enhanced token technique to prevent content leakage which is a common issue in previous work. Furthermore, we mix raw images drawn from large-scale text-image datasets with stylized images to enhance StyleAR's ability to extract richer stylistic features and ensure style consistency. Extensive qualitative and quantitative experiments demonstrate our superior performance.

Title: Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought

Authors: Chao Huang, Benfeng Wang, Jie Wen, Chengliang Liu, Wei Wang, Li Shen, Xiaochun Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19877
Pdf URL: https://arxiv.org/pdf/2505.19877
Copy Paste: [[2505.19877]] Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought(https://arxiv.org/abs/2505.19877)
Keywords: large language model
Abstract: Recent advancements in reasoning capability of Multimodal Large Language Models (MLLMs) demonstrate its effectiveness in tackling complex visual tasks. However, existing MLLM-based Video Anomaly Detection (VAD) methods remain limited to shallow anomaly descriptions without deep reasoning. In this paper, we propose a new task named Video Anomaly Reasoning (VAR), which aims to enable deep analysis and understanding of anomalies in the video by requiring MLLMs to think explicitly before answering. To this end, we propose Vad-R1, an end-to-end MLLM-based framework for VAR. Specifically, we design a Perception-to-Cognition Chain-of-Thought (P2C-CoT) that simulates the human process of recognizing anomalies, guiding the MLLM to reason anomaly step-by-step. Based on the structured P2C-CoT, we construct Vad-Reasoning, a dedicated dataset for VAR. Furthermore, we propose an improved reinforcement learning algorithm AVA-GRPO, which explicitly incentivizes the anomaly reasoning capability of MLLMs through a self-verification mechanism with limited annotations. Experimental results demonstrate that Vad-R1 achieves superior performance, outperforming both open-source and proprietary models on VAD and VAR tasks. Codes and datasets will be released at this https URL.

Title: Generalized and Personalized Federated Learning with Foundation Models via Orthogonal Transformations

Authors: Eun Gyung Kong, Je Won Yeom, Yonghoon Jeon, Taesup Kim
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19888
Pdf URL: https://arxiv.org/pdf/2505.19888
Copy Paste: [[2505.19888]] Generalized and Personalized Federated Learning with Foundation Models via Orthogonal Transformations(https://arxiv.org/abs/2505.19888)
Keywords: security, privacy, robust, federate
Abstract: Federated Learning (FL) aims to train models across decentralized clients or devices holding local data without the need for centralized data collection, thus enhancing data privacy and security. However, achieving both generalization and personalization in heterogeneous settings remains a significant challenge. To address this, we introduce FedOT, a novel approach that leverages black-box foundation models. FedOT shares only a global task-dependent classifier across clients while locally adapting features through orthogonal transformations. By enforcing orthogonality, FedOT mitigates gradient conflicts across diverse clients, preserves semantic integrity, and achieves robust performance even in the presence of substantial data heterogeneity. The strategy of combining global and local parameters enables a more balanced approach for both generalization and personalization, outperforming baseline FL methods across multiple benchmarks. Furthermore, our extensive analysis confirms that joint optimization of global classifiers and local orthogonal transformations yields superior performance and suggests broader applicability.

Title: OmniFall: A Unified Staged-to-Wild Benchmark for Human Fall Detection

Authors: David Schneider, Zdravko Marinov, Rafael Baur, Zeyun Zhong, Rodi Düger, Rainer Stiefelhagen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19889
Pdf URL: https://arxiv.org/pdf/2505.19889
Copy Paste: [[2505.19889]] OmniFall: A Unified Staged-to-Wild Benchmark for Human Fall Detection(https://arxiv.org/abs/2505.19889)
Keywords: robust, fair, segmentation
Abstract: Current video-based fall detection research mostly relies on small, staged datasets with significant domain biases concerning background, lighting, and camera setup resulting in unknown real-world performance. We introduce OmniFall, unifying eight public fall detection datasets (roughly 14 h of recordings, roughly 42 h of multiview data, 101 subjects, 29 camera views) under a consistent ten-class taxonomy with standardized evaluation protocols. Our benchmark provides complete video segmentation labels and enables fair cross-dataset comparison previously impossible with incompatible annotation schemes. For real-world evaluation we curate OOPS-Fall from genuine accident videos and establish a staged-to-wild protocol measuring generalization from controlled to uncontrolled environments. Experiments with frozen pre-trained backbones such as I3D or VideoMAE reveal significant performance gaps between in-distribution and in-the-wild scenarios, highlighting critical challenges in developing robust fall detection systems. OmniFall Dataset at this https URL , Code at this https URL

Title: ESLM: Risk-Averse Selective Language Modeling for Efficient Pretraining

Authors: Melis Ilayda Bal, Volkan Cevher, Michael Muehlebach
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2505.19893
Pdf URL: https://arxiv.org/pdf/2505.19893
Copy Paste: [[2505.19893]] ESLM: Risk-Averse Selective Language Modeling for Efficient Pretraining(https://arxiv.org/abs/2505.19893)
Keywords: robust, large language model
Abstract: Large language model pretraining is compute-intensive, yet many tokens contribute marginally to learning, resulting in inefficiency. We introduce Efficient Selective Language Modeling (ESLM), a risk-aware algorithm that improves training efficiency and distributional robustness by performing online token-level batch selection. ESLM leverages per-token statistics (e.g., entropy or loss) and applies value-at-risk thresholding to retain only the most informative tokens per batch. This data-centric mechanism reshapes the training loss, prioritizing high-risk tokens and eliminating redundant gradient computation. We frame ESLM as a bilevel game: the model competes with a masking adversary that selects worst-case token subsets under a constrained thresholding rule. In the loss-based setting, ESLM recovers conditional value-at-risk loss minimization, providing a principled connection to distributionally robust optimization. We extend our approach to Ada-ESLM, which adaptively tunes the selection confidence during training. Experiments on GPT-2 pretraining show that ESLM significantly reduces training FLOPs while maintaining or improving both perplexity and downstream performance compared to baselines. Our approach also scales across model sizes, pretraining corpora, and integrates naturally with knowledge distillation.

Title: Underwater Diffusion Attention Network with Contrastive Language-Image Joint Learning for Underwater Image Enhancement

Authors: Afrah Shaahid, Muzammil Behzad
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19895
Pdf URL: https://arxiv.org/pdf/2505.19895
Copy Paste: [[2505.19895]] Underwater Diffusion Attention Network with Contrastive Language-Image Joint Learning for Underwater Image Enhancement(https://arxiv.org/abs/2505.19895)
Keywords: diffusion
Abstract: Underwater images are often affected by complex degradations such as light absorption, scattering, color casts, and artifacts, making enhancement critical for effective object detection, recognition, and scene understanding in aquatic environments. Existing methods, especially diffusion-based approaches, typically rely on synthetic paired datasets due to the scarcity of real underwater references, introducing bias and limiting generalization. Furthermore, fine-tuning these models can degrade learned priors, resulting in unrealistic enhancements due to domain shifts. To address these challenges, we propose UDAN-CLIP, an image-to-image diffusion framework pre-trained on synthetic underwater datasets and enhanced with a customized classifier based on vision-language model, a spatial attention module, and a novel CLIP-Diffusion loss. The classifier preserves natural in-air priors and semantically guides the diffusion process, while the spatial attention module focuses on correcting localized degradations such as haze and low contrast. The proposed CLIP-Diffusion loss further strengthens visual-textual alignment and helps maintain semantic consistency during enhancement. The proposed contributions empower our UDAN-CLIP model to perform more effective underwater image enhancement, producing results that are not only visually compelling but also more realistic and detail-preserving. These improvements are consistently validated through both quantitative metrics and qualitative visual comparisons, demonstrating the model's ability to correct distortions and restore natural appearance in challenging underwater conditions.

Title: Dynamic-I2V: Exploring Image-to-Video Generaion Models via Multimodal LLM

Authors: Peng Liu, Xiaoming Ren, Fengkai Liu, Qingsong Xie, Quanlong Zheng, Yanhao Zhang, Haonan Lu, Yujiu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19901
Pdf URL: https://arxiv.org/pdf/2505.19901
Copy Paste: [[2505.19901]] Dynamic-I2V: Exploring Image-to-Video Generaion Models via Multimodal LLM(https://arxiv.org/abs/2505.19901)
Keywords: diffusion, transformer, large language model
Abstract: Recent advancements in image-to-video (I2V) generation have shown promising performance in conventional scenarios. However, these methods still encounter significant challenges when dealing with complex scenes that require a deep understanding of nuanced motion and intricate object-action relationships. To address these challenges, we present Dynamic-I2V, an innovative framework that integrates Multimodal Large Language Models (MLLMs) to jointly encode visual and textual conditions for a diffusion transformer (DiT) architecture. By leveraging the advanced multimodal understanding capabilities of MLLMs, our model significantly improves motion controllability and temporal coherence in synthesized videos. The inherent multimodality of Dynamic-I2V further enables flexible support for diverse conditional inputs, extending its applicability to various downstream generation tasks. Through systematic analysis, we identify a critical limitation in current I2V benchmarks: a significant bias towards favoring low-dynamic videos, stemming from an inadequate balance between motion complexity and visual quality metrics. To resolve this evaluation gap, we propose DIVE - a novel assessment benchmark specifically designed for comprehensive dynamic quality measurement in I2V generation. In conclusion, extensive quantitative and qualitative experiments confirm that Dynamic-I2V attains state-of-the-art performance in image-to-video generation, particularly revealing significant improvements of 42.5%, 7.9%, and 11.8% in dynamic range, controllability, and quality, respectively, as assessed by the DIVE metric in comparison to existing methods.

Title: Attention! You Vision Language Model Could Be Maliciously Manipulated

Authors: Xiaosen Wang, Shaokang Wang, Zhijin Ge, Yuyang Luo, Shudong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19911
Pdf URL: https://arxiv.org/pdf/2505.19911
Copy Paste: [[2505.19911]] Attention! You Vision Language Model Could Be Maliciously Manipulated(https://arxiv.org/abs/2505.19911)
Keywords: privacy, protect, attack, watermark
Abstract: Large Vision-Language Models (VLMs) have achieved remarkable success in understanding complex real-world scenarios and supporting data-driven decision-making processes. However, VLMs exhibit significant vulnerability against adversarial examples, either text or image, which can lead to various adversarial outcomes, e.g., jailbreaking, hijacking, and hallucination, etc. In this work, we empirically and theoretically demonstrate that VLMs are particularly susceptible to image-based adversarial examples, where imperceptible perturbations can precisely manipulate each output token. To this end, we propose a novel attack called Vision-language model Manipulation Attack (VMA), which integrates first-order and second-order momentum optimization techniques with a differentiable transformation mechanism to effectively optimize the adversarial perturbation. Notably, VMA can be a double-edged sword: it can be leveraged to implement various attacks, such as jailbreaking, hijacking, privacy breaches, Denial-of-Service, and the generation of sponge examples, etc, while simultaneously enabling the injection of watermarks for copyright protection. Extensive empirical evaluations substantiate the efficacy and generalizability of VMA across diverse scenarios and datasets.

Title: APE: A Data-Centric Benchmark for Efficient LLM Adaptation in Text Summarization

Authors: Javier Marín
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19912
Pdf URL: https://arxiv.org/pdf/2505.19912
Copy Paste: [[2505.19912]] APE: A Data-Centric Benchmark for Efficient LLM Adaptation in Text Summarization(https://arxiv.org/abs/2505.19912)
Keywords: large language model
Abstract: We present Adjacent Possible Exploration (APE), a simple yet effective method for adapting large language models to specific tasks using minimal computational resources. Unlike traditional fine-tuning that requires extensive compute, APE iteratively fine-tunes models on small, carefully selected data batches (200 examples), retaining only improvements. On news summarization, APE achieves 40 percent BLEU improvement using just a T4 GPU in 60 minutes, matching or exceeding more complex methods like LoRA while remaining conceptually simple. Our approach is particularly valuable for researchers and practitioners with limited computational resources. We provide open-source code and demonstrate APE's effectiveness through both automatic metrics and human evaluation. While inspired by evolutionary theory's "adjacent possible", APE's core insight has a very practical application: small, iterative data perturbations can efficiently guide LLMs toward task-specific performance without expensive retraining.

Title: Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles

Authors: Jiangjie Chen, Qianyu He, Siyu Yuan, Aili Chen, Zhicheng Cai, Weinan Dai, Hongli Yu, Qiying Yu, Xuefeng Li, Jiaze Chen, Hao Zhou, Mingxuan Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19914
Pdf URL: https://arxiv.org/pdf/2505.19914
Copy Paste: [[2505.19914]] Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles(https://arxiv.org/abs/2505.19914)
Keywords: large language model
Abstract: Large Language Models (LLMs), such as OpenAI's o1 and DeepSeek's R1, excel at advanced reasoning tasks like math and coding via Reinforcement Learning with Verifiable Rewards (RLVR), but still struggle with puzzles solvable by humans without domain knowledge. We introduce Enigmata, the first comprehensive suite tailored for improving LLMs with puzzle reasoning skills. It includes 36 tasks across seven categories, each with 1) a generator that produces unlimited examples with controllable difficulty and 2) a rule-based verifier for automatic evaluation. This generator-verifier design supports scalable, multi-task RL training, fine-grained analysis, and seamless RLVR integration. We further propose Enigmata-Eval, a rigorous benchmark, and develop optimized multi-task RLVR strategies. Our trained model, Qwen2.5-32B-Enigmata, consistently surpasses o3-mini-high and o1 on the puzzle reasoning benchmarks like Enigmata-Eval, ARC-AGI (32.8%), and ARC-AGI 2 (0.6%). It also generalizes well to out-of-domain puzzle benchmarks and mathematical reasoning, with little multi-tasking trade-off. When trained on larger models like Seed1.5-Thinking (20B activated parameters and 200B total parameters), puzzle data from Enigmata further boosts SoTA performance on advanced math and STEM reasoning tasks such as AIME (2024-2025), BeyondAIME and GPQA (Diamond), showing nice generalization benefits of Enigmata. This work offers a unified, controllable framework for advancing logical reasoning in LLMs. Resources of this work can be found at this https URL.

Title: A Responsible Face Recognition Approach for Small and Mid-Scale Systems Through Personalized Neural Networks

Authors: Sebastian Groß, Stefan Heindorf, Philipp Terhörst
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19920
Pdf URL: https://arxiv.org/pdf/2505.19920
Copy Paste: [[2505.19920]] A Responsible Face Recognition Approach for Small and Mid-Scale Systems Through Personalized Neural Networks(https://arxiv.org/abs/2505.19920)
Keywords: privacy, fair, explainability
Abstract: Traditional face recognition systems rely on extracting fixed face representations, known as templates, to store and verify identities. These representations are typically generated by neural networks that often lack explainability and raise concerns regarding fairness and privacy. In this work, we propose a novel model-template (MOTE) approach that replaces vector-based face templates with small personalized neural networks. This design enables more responsible face recognition for small and medium-scale systems. During enrollment, MOTE creates a dedicated binary classifier for each identity, trained to determine whether an input face matches the enrolled identity. Each classifier is trained using only a single reference sample, along with synthetically balanced samples to allow adjusting fairness at the level of a single individual during enrollment. Extensive experiments across multiple datasets and recognition systems demonstrate substantial improvements in fairness and particularly in privacy. Although the method increases inference time and storage requirements, it presents a strong solution for small- and mid-scale applications where fairness and privacy are critical.

Title: CA3D: Convolutional-Attentional 3D Nets for Efficient Video Activity Recognition on the Edge

Authors: Gabriele Lagani, Fabrizio Falchi, Claudio Gennaro, Giuseppe Amato
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19928
Pdf URL: https://arxiv.org/pdf/2505.19928
Copy Paste: [[2505.19928]] CA3D: Convolutional-Attentional 3D Nets for Efficient Video Activity Recognition on the Edge(https://arxiv.org/abs/2505.19928)
Keywords: privacy, robust
Abstract: In this paper, we introduce a deep learning solution for video activity recognition that leverages an innovative combination of convolutional layers with a linear-complexity attention mechanism. Moreover, we introduce a novel quantization mechanism to further improve the efficiency of our model during both training and inference. Our model maintains a reduced computational cost, while preserving robust learning and generalization capabilities. Our approach addresses the issues related to the high computing requirements of current models, with the goal of achieving competitive accuracy on consumer and edge devices, enabling smart home and smart healthcare applications where efficiency and privacy issues are of concern. We experimentally validate our model on different established and publicly available video activity recognition benchmarks, improving accuracy over alternative models at a competitive computing cost.

Title: Logic Gate Neural Networks are Good for Verification

Authors: Fabian Kresse, Emily Yu, Christoph H. Lampert, Thomas A. Henzinger
Subjects: cs.LG, cs.LO
Abstract URL: https://arxiv.org/abs/2505.19932
Pdf URL: https://arxiv.org/pdf/2505.19932
Copy Paste: [[2505.19932]] Logic Gate Neural Networks are Good for Verification(https://arxiv.org/abs/2505.19932)
Keywords: robust, fair
Abstract: Learning-based systems are increasingly deployed across various domains, yet the complexity of traditional neural networks poses significant challenges for formal verification. Unlike conventional neural networks, learned Logic Gate Networks (LGNs) replace multiplications with Boolean logic gates, yielding a sparse, netlist-like architecture that is inherently more amenable to symbolic verification, while still delivering promising performance. In this paper, we introduce a SAT encoding for verifying global robustness and fairness in LGNs. We evaluate our method on five benchmark datasets, including a newly constructed 5-class variant, and find that LGNs are both verification-friendly and maintain strong predictive performance.

Title: ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs

Authors: Pooneh Mousavi, Yingzhi Wang, Mirco Ravanelli, Cem Subakan
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.19937
Pdf URL: https://arxiv.org/pdf/2505.19937
Copy Paste: [[2505.19937]] ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs(https://arxiv.org/abs/2505.19937)
Keywords: transformer, large language model
Abstract: Large Language Models (LLMs) are widely used in Spoken Language Understanding (SLU). Recent SLU models process audio directly by adapting speech input into LLMs for better multimodal learning. A key consideration for these models is the cross-modal alignment between text and audio modalities, which is a telltale sign as to whether or not LLM is able to associate semantic meaning to audio segments. While various methods exist for fusing these modalities, there is no standard metric to evaluate alignment quality in LLMs. In this work, we propose a new metric, ALAS (Automatic Latent Alignment Score). Our study examines the correlation between audio and text representations across transformer layers, for two different tasks (Spoken Question Answering and Emotion Recognition). We showcase that our metric behaves as expected across different layers and different tasks.

Title: Multi-Timescale Motion-Decoupled Spiking Transformer for Audio-Visual Zero-Shot Learning

Authors: Wenrui Li, Penghong Wang, Xingtao Wang, Wangmeng Zuo, Xiaopeng Fan, Yonghong Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19938
Pdf URL: https://arxiv.org/pdf/2505.19938
Copy Paste: [[2505.19938]] Multi-Timescale Motion-Decoupled Spiking Transformer for Audio-Visual Zero-Shot Learning(https://arxiv.org/abs/2505.19938)
Keywords: robust, transformer
Abstract: Audio-visual zero-shot learning (ZSL) has been extensively researched for its capability to classify video data from unseen classes during training. Nevertheless, current methodologies often struggle with background scene biases and inadequate motion detail. This paper proposes a novel dual-stream Multi-Timescale Motion-Decoupled Spiking Transformer (MDST++), which decouples contextual semantic information and sparse dynamic motion information. The recurrent joint learning unit is proposed to extract contextual semantic information and capture joint knowledge across various modalities to understand the environment of actions. By converting RGB images to events, our method captures motion information more accurately and mitigates background scene biases. Moreover, we introduce a discrepancy analysis block to model audio motion information. To enhance the robustness of SNNs in extracting temporal and motion cues, we dynamically adjust the threshold of Leaky Integrate-and-Fire neurons based on global motion and contextual semantic information. Our experiments validate the effectiveness of MDST++, demonstrating their consistent superiority over state-of-the-art methods on mainstream benchmarks. Additionally, incorporating motion and multi-timescale information significantly improves HM and ZSL accuracy by 26.2\% and 39.9\%.

Title: Task-Oriented Low-Label Semantic Communication With Self-Supervised Learning

Authors: Run Gu, Wei Xu, Zhaohui Yang, Dusit Niyato, Aylin Yener
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2505.19940
Pdf URL: https://arxiv.org/pdf/2505.19940
Copy Paste: [[2505.19940]] Task-Oriented Low-Label Semantic Communication With Self-Supervised Learning(https://arxiv.org/abs/2505.19940)
Keywords: extraction
Abstract: Task-oriented semantic communication enhances transmission efficiency by conveying semantic information rather than exact messages. Deep learning (DL)-based semantic communication can effectively cultivate the essential semantic knowledge for semantic extraction, transmission, and interpretation by leveraging massive labeled samples for downstream task training. In this paper, we propose a self-supervised learning-based semantic communication framework (SLSCom) to enhance task inference performance, particularly in scenarios with limited access to labeled samples. Specifically, we develop a task-relevant semantic encoder using unlabeled samples, which can be collected by devices in real-world edge networks. To facilitate task-relevant semantic extraction, we introduce self-supervision for learning contrastive features and formulate the information bottleneck (IB) problem to balance the tradeoff between the informativeness of the extracted features and task inference performance. Given the computational challenges of the IB problem, we devise a practical and effective solution by employing self-supervised classification and reconstruction pretext tasks. We further propose efficient joint training methods to enhance end-to-end inference accuracy over wireless channels, even with few labeled samples. We evaluate the proposed framework on image classification tasks over multipath wireless channels. Extensive simulation results demonstrate that SLSCom significantly outperforms conventional digital coding methods and existing DL-based approaches across varying labeled data set sizes and SNR conditions, even when the unlabeled samples are irrelevant to the downstream tasks.

Title: SaSi: A Self-augmented and Self-interpreted Deep Learning Approach for Few-shot Cryo-ET Particle Detection

Authors: Gokul Adethya, Bhanu Pratyush Mantha, Tianyang Wang, Xingjian Li, Min Xu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19948
Pdf URL: https://arxiv.org/pdf/2505.19948
Copy Paste: [[2505.19948]] SaSi: A Self-augmented and Self-interpreted Deep Learning Approach for Few-shot Cryo-ET Particle Detection(https://arxiv.org/abs/2505.19948)
Keywords: robust, segmentation
Abstract: Cryo-electron tomography (cryo-ET) has emerged as a powerful technique for imaging macromolecular complexes in their near-native states. However, the localization of 3D particles in cellular environments still presents a significant challenge due to low signal-to-noise ratios and missing wedge artifacts. Deep learning approaches have shown great potential, but they need huge amounts of data, which can be a challenge in cryo-ET scenarios where labeled data is often scarce. In this paper, we propose a novel Self-augmented and Self-interpreted (SaSi) deep learning approach towards few-shot particle detection in 3D cryo-ET images. Our method builds upon self-augmentation techniques to further boost data utilization and introduces a self-interpreted segmentation strategy for alleviating dependency on labeled data, hence improving generalization and robustness. As demonstrated by experiments conducted on both simulated and real-world cryo-ET datasets, the SaSi approach significantly outperforms existing state-of-the-art methods for particle localization. This research increases understanding of how to detect particles with very few labels in cryo-ET and thus sets a new benchmark for few-shot learning in structural biology.

Title: Which Data Attributes Stimulate Math and Code Reasoning? An Investigation via Influence Functions

Authors: Siqi Kou, Qingyuan Tian, Hanwen Xu, Zihao Zeng, Zhijie Deng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19949
Pdf URL: https://arxiv.org/pdf/2505.19949
Copy Paste: [[2505.19949]] Which Data Attributes Stimulate Math and Code Reasoning? An Investigation via Influence Functions(https://arxiv.org/abs/2505.19949)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated remarkable reasoning capabilities in math and coding, often bolstered by post-training on the chain-of-thoughts (CoTs) generated by stronger models. However, existing strategies for curating such training data predominantly rely on heuristics, limiting generalizability and failing to capture subtleties underlying in data. To address these limitations, we leverage influence functions to systematically attribute LLMs' reasoning ability on math and coding to individual training examples, sequences, and tokens, enabling deeper insights into effective data characteristics. Our Influence-based Reasoning Attribution (Infra) uncovers nontrivial cross-domain effects across math and coding tasks: high-difficulty math examples improve both math and code reasoning, while low-difficulty code tasks most effectively benefit code reasoning. Based on these findings, we introduce a simple yet effective dataset reweighting strategy by flipping task difficulty, which doubles AIME24 accuracy from 10\% to 20\% and boosts LiveCodeBench accuracy from 33.8\% to 35.3\% for Qwen2.5-7B-Instruct. Moreover, our fine-grained attribution reveals that the sequence-level exploratory behaviors enhance reasoning performance in both math and code, and the token-level influence patterns are distinct for math and code reasoning: the former prefers natural language logic connectors and the latter emphasizes structural syntax.

Title: Multimodal Reasoning Agent for Zero-Shot Composed Image Retrieval

Authors: Rong-Cheng Tu, Wenhao Sun, Hanzhe You, Yingjie Wang, Jiaxing Huang, Li Shen, Dacheng Tao
Subjects: cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2505.19952
Pdf URL: https://arxiv.org/pdf/2505.19952
Copy Paste: [[2505.19952]] Multimodal Reasoning Agent for Zero-Shot Composed Image Retrieval(https://arxiv.org/abs/2505.19952)
Keywords: large language model
Abstract: Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a compositional query, consisting of a reference image and a modifying text-without relying on annotated training data. Existing approaches often generate a synthetic target text using large language models (LLMs) to serve as an intermediate anchor between the compositional query and the target image. Models are then trained to align the compositional query with the generated text, and separately align images with their corresponding texts using contrastive learning. However, this reliance on intermediate text introduces error propagation, as inaccuracies in query-to-text and text-to-image mappings accumulate, ultimately degrading retrieval performance. To address these problems, we propose a novel framework by employing a Multimodal Reasoning Agent (MRA) for ZS-CIR. MRA eliminates the dependence on textual intermediaries by directly constructing triplets, , using only unlabeled image data. By training on these synthetic triplets, our model learns to capture the relationships between compositional queries and candidate images directly. Extensive experiments on three standard CIR benchmarks demonstrate the effectiveness of our approach. On the FashionIQ dataset, our method improves Average R@10 by at least 7.5\% over existing baselines; on CIRR, it boosts R@1 by 9.6\%; and on CIRCO, it increases mAP@5 by 9.5\%.

Title: An Explainable Diagnostic Framework for Neurodegenerative Dementias via Reinforcement-Optimized LLM Reasoning

Authors: Andrew Zamai, Nathanael Fijalkow, Boris Mansencal, Laurent Simon, Eloi Navet, Pierrick Coupe
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2505.19954
Pdf URL: https://arxiv.org/pdf/2505.19954
Copy Paste: [[2505.19954]] An Explainable Diagnostic Framework for Neurodegenerative Dementias via Reinforcement-Optimized LLM Reasoning(https://arxiv.org/abs/2505.19954)
Keywords: explainability, transformer, generative, large language model
Abstract: The differential diagnosis of neurodegenerative dementias is a challenging clinical task, mainly because of the overlap in symptom presentation and the similarity of patterns observed in structural neuroimaging. To improve diagnostic efficiency and accuracy, deep learning-based methods such as Convolutional Neural Networks and Vision Transformers have been proposed for the automatic classification of brain MRIs. However, despite their strong predictive performance, these models find limited clinical utility due to their opaque decision making. In this work, we propose a framework that integrates two core components to enhance diagnostic transparency. First, we introduce a modular pipeline for converting 3D T1-weighted brain MRIs into textual radiology reports. Second, we explore the potential of modern Large Language Models (LLMs) to assist clinicians in the differential diagnosis between Frontotemporal dementia subtypes, Alzheimer's disease, and normal aging based on the generated reports. To bridge the gap between predictive accuracy and explainability, we employ reinforcement learning to incentivize diagnostic reasoning in LLMs. Without requiring supervised reasoning traces or distillation from larger models, our approach enables the emergence of structured diagnostic rationales grounded in neuroimaging findings. Unlike post-hoc explainability methods that retrospectively justify model decisions, our framework generates diagnostic rationales as part of the inference process-producing causally grounded explanations that inform and guide the model's decision-making process. In doing so, our framework matches the diagnostic performance of existing deep learning methods while offering rationales that support its diagnostic conclusions.

Title: UltraVSR: Achieving Ultra-Realistic Video Super-Resolution with Efficient One-Step Diffusion Space

Authors: Yong Liu, Jinshan Pan, Yinchuan Li, Qingji Dong, Chao Zhu, Yu Guo, Fei Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19958
Pdf URL: https://arxiv.org/pdf/2505.19958
Copy Paste: [[2505.19958]] UltraVSR: Achieving Ultra-Realistic Video Super-Resolution with Efficient One-Step Diffusion Space(https://arxiv.org/abs/2505.19958)
Keywords: diffusion
Abstract: Diffusion models have shown great potential in generating realistic image detail. However, adapting these models to video super-resolution (VSR) remains challenging due to their inherent stochasticity and lack of temporal modeling. In this paper, we propose UltraVSR, a novel framework that enables ultra-realistic and temporal-coherent VSR through an efficient one-step diffusion space. A central component of UltraVSR is the Degradation-aware Restoration Schedule (DRS), which estimates a degradation factor from the low-resolution input and transforms iterative denoising process into a single-step reconstruction from from low-resolution to high-resolution videos. This design eliminates randomness from diffusion noise and significantly speeds up inference. To ensure temporal consistency, we propose a lightweight yet effective Recurrent Temporal Shift (RTS) module, composed of an RTS-convolution unit and an RTS-attention unit. By partially shifting feature components along the temporal dimension, these two units collaboratively facilitate effective feature propagation, fusion, and alignment across neighboring frames, without relying on explicit temporal layers. The RTS module is integrated into a pretrained text-to-image diffusion model and is further enhanced through Spatio-temporal Joint Distillation (SJD), which improves temporal coherence while preserving realistic details. Additionally, we introduce a Temporally Asynchronous Inference (TAI) strategy to capture long-range temporal dependencies under limited memory constraints. Extensive experiments show that UltraVSR achieves state-of-the-art performance, both qualitatively and quantitatively, in a single sampling step.

Title: MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models

Authors: Zhongzhan Huang, Guoming Ling, Shanshan Zhong, Hefeng Wu, Liang Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19959
Pdf URL: https://arxiv.org/pdf/2505.19959
Copy Paste: [[2505.19959]] MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models(https://arxiv.org/abs/2505.19959)
Keywords: large language model
Abstract: Long Context Understanding (LCU) is a critical area for exploration in current large language models (LLMs). However, due to the inherently lengthy nature of long-text data, existing LCU benchmarks for LLMs often result in prohibitively high evaluation costs, like testing time and inference expenses. Through extensive experimentation, we discover that existing LCU benchmarks exhibit significant redundancy, which means the inefficiency in evaluation. In this paper, we propose a concise data compression method tailored for long-text data with sparse information characteristics. By pruning the well-known LCU benchmark LongBench, we create MiniLongBench. This benchmark includes only 237 test samples across six major task categories and 21 distinct tasks. Through empirical analysis of over 60 LLMs, MiniLongBench achieves an average evaluation cost reduced to only 4.5% of the original while maintaining an average rank correlation coefficient of 0.97 with LongBench results. Therefore, our MiniLongBench, as a low-cost benchmark, holds great potential to substantially drive future research into the LCU capabilities of LLMs. See this https URL for our code, data and tutorial.

Title: The Limits of Preference Data for Post-Training

Authors: Eric Zhao, Jessica Dai, Pranjal Awasthi
Subjects: cs.LG, cs.AI, cs.CL, cs.GT
Abstract URL: https://arxiv.org/abs/2505.19964
Pdf URL: https://arxiv.org/pdf/2505.19964
Copy Paste: [[2505.19964]] The Limits of Preference Data for Post-Training(https://arxiv.org/abs/2505.19964)
Keywords: robust, large language model
Abstract: Recent progress in strengthening the capabilities of large language models has stemmed from applying reinforcement learning to domains with automatically verifiable outcomes. A key question is whether we can similarly use RL to optimize for outcomes in domains where evaluating outcomes inherently requires human feedback; for example, in tasks like deep research and trip planning, outcome evaluation is qualitative and there are many possible degrees of success. One attractive and scalable modality for collecting human feedback is preference data: ordinal rankings (pairwise or $k$-wise) that indicate, for $k$ given outcomes, which one is preferred. In this work, we study a critical roadblock: preference data fundamentally and significantly limits outcome-based optimization. Even with idealized preference data (infinite, noiseless, and online), the use of ordinal feedback can prevent obtaining even approximately optimal solutions. We formalize this impossibility using voting theory, drawing an analogy between how a model chooses to answer a query with how voters choose a candidate to elect. This indicates that grounded human scoring and algorithmic innovations are necessary for extending the success of RL post-training to domains demanding human feedback. We also explore why these limitations have disproportionately impacted RLHF when it comes to eliciting reasoning behaviors (e.g., backtracking) versus situations where RLHF has been historically successful (e.g., instruction-tuning and safety training), finding that the limitations of preference data primarily suppress RLHF's ability to elicit robust strategies -- a class that encompasses most reasoning behaviors.

Title: Learning to Select In-Context Demonstration Preferred by Large Language Model

Authors: Zheng Zhang, Shaocheng Lan, Lei Song, Jiang Bian, Yexin Li, Kan Ren
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19966
Pdf URL: https://arxiv.org/pdf/2505.19966
Copy Paste: [[2505.19966]] Learning to Select In-Context Demonstration Preferred by Large Language Model(https://arxiv.org/abs/2505.19966)
Keywords: generative, large language model
Abstract: In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks during inference using only a few demonstrations. However, ICL performance is highly dependent on the selection of these demonstrations. Recent work explores retrieval-based methods for selecting query-specific demonstrations, but these approaches often rely on surrogate objectives such as metric learning, failing to directly optimize ICL performance. Consequently, they struggle to identify truly beneficial demonstrations. Moreover, their discriminative retrieval paradigm is ineffective when the candidate pool lacks sufficient high-quality demonstrations. To address these challenges, we propose GenICL, a novel generative preference learning framework that leverages LLM feedback to directly optimize demonstration selection for ICL. Experiments on 19 datasets across 11 task categories demonstrate that GenICL achieves superior performance than existing methods in selecting the most effective demonstrations, leading to better ICL performance.

Title: Differential Privacy Analysis of Decentralized Gossip Averaging under Varying Threat Models

Authors: Antti Koskela, Tejas Kulkarni
Subjects: cs.LG, cs.CR, cs.DC
Abstract URL: https://arxiv.org/abs/2505.19969
Pdf URL: https://arxiv.org/pdf/2505.19969
Copy Paste: [[2505.19969]] Differential Privacy Analysis of Decentralized Gossip Averaging under Varying Threat Models(https://arxiv.org/abs/2505.19969)
Keywords: secure, privacy, robust
Abstract: Fully decentralized training of machine learning models offers significant advantages in scalability, robustness, and fault tolerance. However, achieving differential privacy (DP) in such settings is challenging due to the absence of a central aggregator and varying trust assumptions among nodes. In this work, we present a novel privacy analysis of decentralized gossip-based averaging algorithms with additive node-level noise, both with and without secure summation over each node's direct neighbors. Our main contribution is a new analytical framework based on a linear systems formulation that accurately characterizes privacy leakage across these scenarios. This framework significantly improves upon prior analyses, for example, reducing the Rényi DP parameter growth from $O(T^2)$ to $O(T)$, where $T$ is the number of training rounds. We validate our analysis with numerical results demonstrating superior DP bounds compared to existing approaches. We further illustrate our analysis with a logistic regression experiment on MNIST image classification in a fully decentralized setting, demonstrating utility comparable to central aggregation methods.

Title: CP-Router: An Uncertainty-Aware Router Between LLM and LRM

Authors: Jiayuan Su, Fulin Lin, Zhaopeng Feng, Han Zheng, Teng Wang, Zhenyu Xiao, Xinlong Zhao, Zuozhu Liu, Lu Cheng, Hongwei Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19970
Pdf URL: https://arxiv.org/pdf/2505.19970
Copy Paste: [[2505.19970]] CP-Router: An Uncertainty-Aware Router Between LLM and LRM(https://arxiv.org/abs/2505.19970)
Keywords: robust, large language model
Abstract: Recent advances in Large Reasoning Models (LRMs) have significantly improved long-chain reasoning capabilities over Large Language Models (LLMs). However, LRMs often produce unnecessarily lengthy outputs even for simple queries, leading to inefficiencies or even accuracy degradation compared to LLMs. To overcome this, we propose CP-Router, a training-free and model-agnostic routing framework that dynamically selects between an LLM and an LRM, demonstrated with multiple-choice question answering (MCQA) prompts. The routing decision is guided by the prediction uncertainty estimates derived via Conformal Prediction (CP), which provides rigorous coverage guarantees. To further refine the uncertainty differentiation across inputs, we introduce Full and Binary Entropy (FBE), a novel entropy-based criterion that adaptively selects the appropriate CP threshold. Experiments across diverse MCQA benchmarks, including mathematics, logical reasoning, and Chinese chemistry, demonstrate that CP-Router efficiently reduces token usage while maintaining or even improving accuracy compared to using LRM alone. We also extend CP-Router to diverse model pairings and open-ended QA, where it continues to demonstrate strong performance, validating its generality and robustness.

Title: Conversational Lexicography: Querying Lexicographic Data on Knowledge Graphs with SPARQL through Natural Language

Authors: Kilian Sennrich, Sina Ahmadi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19971
Pdf URL: https://arxiv.org/pdf/2505.19971
Copy Paste: [[2505.19971]] Conversational Lexicography: Querying Lexicographic Data on Knowledge Graphs with SPARQL through Natural Language(https://arxiv.org/abs/2505.19971)
Keywords: robust
Abstract: Knowledge graphs offer an excellent solution for representing the lexical-semantic structures of lexicographic data. However, working with the SPARQL query language represents a considerable hurdle for many non-expert users who could benefit from the advantages of this technology. This paper addresses the challenge of creating natural language interfaces for lexicographic data retrieval on knowledge graphs such as Wikidata. We develop a multidimensional taxonomy capturing the complexity of Wikidata's lexicographic data ontology module through four dimensions and create a template-based dataset with over 1.2 million mappings from natural language utterances to SPARQL queries. Our experiments with GPT-2 (124M), Phi-1.5 (1.3B), and GPT-3.5-Turbo reveal significant differences in model capabilities. While all models perform well on familiar patterns, only GPT-3.5-Turbo demonstrates meaningful generalization capabilities, suggesting that model size and diverse pre-training are crucial for adaptability in this domain. However, significant challenges remain in achieving robust generalization, handling diverse linguistic data, and developing scalable solutions that can accommodate the full complexity of lexicographic knowledge representation.

Title: DFIR-Metric: A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response

Authors: Bilel Cherif, Tamas Bisztray, Richard A. Dubniczky, Aaesha Aldahmani, Saeed Alshehhi, Norbert Tihanyi
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19973
Pdf URL: https://arxiv.org/pdf/2505.19973
Copy Paste: [[2505.19973]] DFIR-Metric: A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response(https://arxiv.org/abs/2505.19973)
Keywords: large language model
Abstract: Digital Forensics and Incident Response (DFIR) involves analyzing digital evidence to support legal investigations. Large Language Models (LLMs) offer new opportunities in DFIR tasks such as log analysis and memory forensics, but their susceptibility to errors and hallucinations raises concerns in high-stakes contexts. Despite growing interest, there is no comprehensive benchmark to evaluate LLMs across both theoretical and practical DFIR domains. To address this gap, we present DFIR-Metric, a benchmark with three components: (1) Knowledge Assessment: a set of 700 expert-reviewed multiple-choice questions sourced from industry-standard certifications and official documentation; (2) Realistic Forensic Challenges: 150 CTF-style tasks testing multi-step reasoning and evidence correlation; and (3) Practical Analysis: 500 disk and memory forensics cases from the NIST Computer Forensics Tool Testing Program (CFTT). We evaluated 14 LLMs using DFIR-Metric, analyzing both their accuracy and consistency across trials. We also introduce a new metric, the Task Understanding Score (TUS), designed to more effectively evaluate models in scenarios where they achieve near-zero accuracy. This benchmark offers a rigorous, reproducible foundation for advancing AI in digital forensics. All scripts, artifacts, and results are available on the project website at this https URL.

Title: Rethinking Probabilistic Circuit Parameter Learning

Authors: Anji Liu, Guy Van den Broeck
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19982
Pdf URL: https://arxiv.org/pdf/2505.19982
Copy Paste: [[2505.19982]] Rethinking Probabilistic Circuit Parameter Learning(https://arxiv.org/abs/2505.19982)
Keywords: generative
Abstract: Probabilistic Circuits (PCs) offer a computationally scalable framework for generative modeling, supporting exact and efficient inference of a wide range of probabilistic queries. While recent advances have significantly improved the expressiveness and scalability of PCs, effectively training their parameters remains a challenge. In particular, a widely used optimization method, full-batch Expectation-Maximization (EM), requires processing the entire dataset before performing a single update, making it ineffective for large datasets. While empirical extensions to the mini-batch setting have been proposed, it remains unclear what objective these algorithms are optimizing, making it difficult to assess their theoretical soundness. This paper bridges the gap by establishing a novel connection between the general EM objective and the standard full-batch EM algorithm. Building on this, we derive a theoretically grounded generalization to the mini-batch setting and demonstrate its effectiveness through preliminary empirical results.

Title: Structured Initialization for Vision Transformers

Authors: Jianqiao Zheng, Xueqian Li, Hemanth Saratchandran, Simon Lucey
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19985
Pdf URL: https://arxiv.org/pdf/2505.19985
Copy Paste: [[2505.19985]] Structured Initialization for Vision Transformers(https://arxiv.org/abs/2505.19985)
Keywords: transformer
Abstract: Convolutional Neural Networks (CNNs) inherently encode strong inductive biases, enabling effective generalization on small-scale datasets. In this paper, we propose integrating this inductive bias into ViTs, not through an architectural intervention but solely through initialization. The motivation here is to have a ViT that can enjoy strong CNN-like performance when data assets are small, but can still scale to ViT-like performance as the data expands. Our approach is motivated by our empirical results that random impulse filters can achieve commensurate performance to learned filters within a CNN. We improve upon current ViT initialization strategies, which typically rely on empirical heuristics such as using attention weights from pretrained models or focusing on the distribution of attention weights without enforcing structures. Empirical results demonstrate that our method significantly outperforms standard ViT initialization across numerous small and medium-scale benchmarks, including Food-101, CIFAR-10, CIFAR-100, STL-10, Flowers, and Pets, while maintaining comparative performance on large-scale datasets such as ImageNet-1K. Moreover, our initialization strategy can be easily integrated into various transformer-based architectures such as Swin Transformer and MLP-Mixer with consistent improvements in performance.

Title: How Well Do Large Reasoning Models Translate? A Comprehensive Evaluation for Multi-Domain Machine Translation

Authors: Yongshi Ye, Biao Fu, Chongxuan Huang, Yidong Chen, Xiaodong Shi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19987
Pdf URL: https://arxiv.org/pdf/2505.19987
Copy Paste: [[2505.19987]] How Well Do Large Reasoning Models Translate? A Comprehensive Evaluation for Multi-Domain Machine Translation(https://arxiv.org/abs/2505.19987)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated strong performance in general-purpose machine translation, but their effectiveness in complex, domain-sensitive translation tasks remains underexplored. Recent advancements in Large Reasoning Models (LRMs), raise the question of whether structured reasoning can enhance translation quality across diverse domains. In this work, we compare the performance of LRMs with traditional LLMs across 15 representative domains and four translation directions. Our evaluation considers various factors, including task difficulty, input length, and terminology density. We use a combination of automatic metrics and an enhanced MQM-based evaluation hierarchy to assess translation quality. Our findings show that LRMs consistently outperform traditional LLMs in semantically complex domains, especially in long-text and high-difficulty translation scenarios. Moreover, domain-adaptive prompting strategies further improve performance by better leveraging the reasoning capabilities of LRMs. These results highlight the potential of structured reasoning in MDMT tasks and provide valuable insights for optimizing translation systems in domain-sensitive contexts.

Title: Embracing Imperfection: Simulating Students with Diverse Cognitive Levels Using LLM-based Agents

Authors: Tao Wu, Jingyuan Chen, Wang Lin, Mengze Li, Yumeng Zhu, Ang Li, Kun Kuang, Fei Wu
Subjects: cs.LG, cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2505.19997
Pdf URL: https://arxiv.org/pdf/2505.19997
Copy Paste: [[2505.19997]] Embracing Imperfection: Simulating Students with Diverse Cognitive Levels Using LLM-based Agents(https://arxiv.org/abs/2505.19997)
Keywords: large language model
Abstract: Large language models (LLMs) are revolutionizing education, with LLM-based agents playing a key role in simulating student behavior. A major challenge in student simulation is modeling the diverse learning patterns of students at various cognitive levels. However, current LLMs, typically trained as ``helpful assistants'', target at generating perfect responses. As a result, they struggle to simulate students with diverse cognitive abilities, as they often produce overly advanced answers, missing the natural imperfections that characterize student learning and resulting in unrealistic simulations. To address this issue, we propose a training-free framework for student simulation. We begin by constructing a cognitive prototype for each student using a knowledge graph, which captures their understanding of concepts from past learning records. This prototype is then mapped to new tasks to predict student performance. Next, we simulate student solutions based on these predictions and iteratively refine them using a beam search method to better replicate realistic mistakes. To validate our approach, we construct the \texttt{Student\_100} dataset, consisting of $100$ students working on Python programming and $5,000$ learning records. Experimental results show that our method consistently outperforms baseline models, achieving $100\%$ improvement in simulation accuracy.

Title: NEXT: Multi-Grained Mixture of Experts via Text-Modulation for Multi-Modal Object Re-ID

Authors: Shihao Li, Chenglong Li, Aihua Zheng, Andong Lu, Jin Tang, Jixin Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20001
Pdf URL: https://arxiv.org/pdf/2505.20001
Copy Paste: [[2505.20001]] NEXT: Multi-Grained Mixture of Experts via Text-Modulation for Multi-Modal Object Re-ID(https://arxiv.org/abs/2505.20001)
Keywords: large language model
Abstract: Multi-modal object re-identification (ReID) aims to extract identity features across heterogeneous spectral modalities to enable accurate recognition and retrieval in complex real-world scenarios. However, most existing methods rely on implicit feature fusion structures, making it difficult to model fine-grained recognition strategies under varying challenging conditions. Benefiting from the powerful semantic understanding capabilities of Multi-modal Large Language Models (MLLMs), the visual appearance of an object can be effectively translated into descriptive text. In this paper, we propose a reliable multi-modal caption generation method based on attribute confidence, which significantly reduces the unknown recognition rate of MLLMs in multi-modal semantic generation and improves the quality of generated text. Additionally, we propose a novel ReID framework NEXT, the Multi-grained Mixture of Experts via Text-Modulation for Multi-modal Object Re-Identification. Specifically, we decouple the recognition problem into semantic and structural expert branches to separately capture modality-specific appearance and intrinsic structure. For semantic recognition, we propose the Text-Modulated Semantic-sampling Experts (TMSE), which leverages randomly sampled high-quality semantic texts to modulate expert-specific sampling of multi-modal features and mining intra-modality fine-grained semantic cues. Then, to recognize coarse-grained structure features, we propose the Context-Shared Structure-aware Experts (CSSE) that focuses on capturing the holistic object structure across modalities and maintains inter-modality structural consistency through a soft routing mechanism. Finally, we propose the Multi-Modal Feature Aggregation (MMFA), which adopts a unified feature fusion strategy to simply and effectively integrate semantic and structural expert outputs into the final identity representations.

Title: TabPFN: One Model to Rule Them All?

Authors: Qiong Zhang, Yan Shuo Tan, Qinglong Tian, Pengfei Li
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.20003
Pdf URL: https://arxiv.org/pdf/2505.20003
Copy Paste: [[2505.20003]] TabPFN: One Model to Rule Them All?(https://arxiv.org/abs/2505.20003)
Keywords: robust, transformer, large language model
Abstract: Hollmann et al. (Nature 637 (2025) 319-326) recently introduced TabPFN, a transformer-based deep learning model for regression and classification on tabular data, which they claim "outperforms all previous methods on datasets with up to 10,000 samples by a wide margin, using substantially less training time." Furthermore, they have called TabPFN a "foundation model" for tabular data, as it can support "data generation, density estimation, learning reusable embeddings and fine-tuning". If these statements are well-supported, TabPFN may have the potential to supersede existing modeling approaches on a wide range of statistical tasks, mirroring a similar revolution in other areas of artificial intelligence that began with the advent of large language models. In this paper, we provide a tailored explanation of how TabPFN works for a statistics audience, by emphasizing its interpretation as approximate Bayesian inference. We also provide more evidence of TabPFN's "foundation model" capabilities: We show that an out-of-the-box application of TabPFN vastly outperforms specialized state-of-the-art methods for semi-supervised parameter estimation, prediction under covariate shift, and heterogeneous treatment effect estimation. We further show that TabPFN can outperform LASSO at sparse regression and can break a robustness-efficiency trade-off in classification. All experiments can be reproduced using the code provided at this https URL (this https URL).

Title: Mixture of LoRA Experts for Low-Resourced Multi-Accent Automatic Speech Recognition

Authors: Raphaël Bagat, Irina Illina, Emmanuel Vincent
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20006
Pdf URL: https://arxiv.org/pdf/2505.20006
Copy Paste: [[2505.20006]] Mixture of LoRA Experts for Low-Resourced Multi-Accent Automatic Speech Recognition(https://arxiv.org/abs/2505.20006)
Keywords: robust
Abstract: We aim to improve the robustness of Automatic Speech Recognition (ASR) systems against non-native speech, particularly in low-resourced multi-accent settings. We introduce Mixture of Accent-Specific LoRAs (MAS-LoRA), a fine-tuning method that leverages a mixture of Low-Rank Adaptation (LoRA) experts, each specialized in a specific accent. This method can be used when the accent is known or unknown at inference time, without the need to fine-tune the model again. Our experiments, conducted using Whisper on the L2-ARCTIC corpus, demonstrate significant improvements in Word Error Rate compared to regular LoRA and full fine-tuning when the accent is unknown. When the accent is known, the results further improve. Furthermore, MAS-LoRA shows less catastrophic forgetting than the other fine-tuning methods. To the best of our knowledge, this is the first use of a mixture of LoRA experts for non-native multi-accent ASR.

Title: WebCoT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection, Branching, and Rollback

Authors: Minda Hu, Tianqing Fang, Jianshu Zhang, Junyu Ma, Zhisong Zhang, Jingyan Zhou, Hongming Zhang, Haitao Mi, Dong Yu, Irwin King
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20013
Pdf URL: https://arxiv.org/pdf/2505.20013
Copy Paste: [[2505.20013]] WebCoT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection, Branching, and Rollback(https://arxiv.org/abs/2505.20013)
Keywords: robust, large language model
Abstract: Web agents powered by Large Language Models (LLMs) show promise for next-generation AI, but their limited reasoning in uncertain, dynamic web environments hinders robust deployment. In this paper, we identify key reasoning skills essential for effective web agents, i.e., reflection & lookahead, branching, and rollback, and curate trajectory data that exemplifies these abilities by reconstructing the agent's (inference-time) reasoning algorithms into chain-of-thought rationales. We conduct experiments in the agent self-improving benchmark, OpenWebVoyager, and demonstrate that distilling salient reasoning patterns into the backbone LLM via simple fine-tuning can substantially enhance its performance. Our approach yields significant improvements across multiple benchmarks, including WebVoyager, Mind2web-live, and SimpleQA (web search), highlighting the potential of targeted reasoning skill enhancement for web agents.

Title: Does Rationale Quality Matter? Enhancing Mental Disorder Detection via Selective Reasoning Distillation

Authors: Hoyun Song, Huije Lee, Jisu Shin, Sukmin Cho, Changgeon Ko, Jong C. Park
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20014
Pdf URL: https://arxiv.org/pdf/2505.20014
Copy Paste: [[2505.20014]] Does Rationale Quality Matter? Enhancing Mental Disorder Detection via Selective Reasoning Distillation(https://arxiv.org/abs/2505.20014)
Keywords: large language model
Abstract: The detection of mental health problems from social media and the interpretation of these results have been extensively explored. Research has shown that incorporating clinical symptom information into a model enhances domain expertise, improving its detection and interpretation performance. While large language models (LLMs) are shown to be effective for generating explanatory rationales in mental health detection, their substantially large parameter size and high computational cost limit their practicality. Reasoning distillation transfers this ability to smaller language models (SLMs), but inconsistencies in the relevance and domain alignment of LLM-generated rationales pose a challenge. This paper investigates how rationale quality impacts SLM performance in mental health detection and explanation generation. We hypothesize that ensuring high-quality and domain-relevant rationales enhances the distillation. To this end, we propose a framework that selects rationales based on their alignment with expert clinical reasoning. Experiments show that our quality-focused approach significantly enhances SLM performance in both mental disorder detection and rationale generation. This work highlights the importance of rationale quality and offers an insightful framework for knowledge transfer in mental health applications.

Title: Ontology- and LLM-based Data Harmonization for Federated Learning in Healthcare

Authors: Natallia Kokash, Lei Wang, Thomas H. Gillespie, Adam Belloum, Paola Grosso, Sara Quinney, Lang Li, Bernard de Bono
Subjects: cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2505.20020
Pdf URL: https://arxiv.org/pdf/2505.20020
Copy Paste: [[2505.20020]] Ontology- and LLM-based Data Harmonization for Federated Learning in Healthcare(https://arxiv.org/abs/2505.20020)
Keywords: secure, privacy, federate, large language model
Abstract: The rise of electronic health records (EHRs) has unlocked new opportunities for medical research, but privacy regulations and data heterogeneity remain key barriers to large-scale machine learning. Federated learning (FL) enables collaborative modeling without sharing raw data, yet faces challenges in harmonizing diverse clinical datasets. This paper presents a two-step data alignment strategy integrating ontologies and large language models (LLMs) to support secure, privacy-preserving FL in healthcare, demonstrating its effectiveness in a real-world project involving semantic mapping of EHR data.

Title: Training LLM-Based Agents with Synthetic Self-Reflected Trajectories and Partial Masking

Authors: Yihan Chen, Benfeng Xu, Xiaorui Wang, Yongdong Zhang, Zhendong Mao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20023
Pdf URL: https://arxiv.org/pdf/2505.20023
Copy Paste: [[2505.20023]] Training LLM-Based Agents with Synthetic Self-Reflected Trajectories and Partial Masking(https://arxiv.org/abs/2505.20023)
Keywords: large language model
Abstract: Autonomous agents, which perceive environments and take actions to achieve goals, have become increasingly feasible with the advancements in large language models (LLMs). However, current powerful agents often depend on sophisticated prompt engineering combined with closed-source LLMs like GPT-4. Although training open-source LLMs using expert trajectories from teacher models has yielded some improvements in agent capabilities, this approach still faces limitations such as performance plateauing and error propagation. To mitigate these challenges, we propose STeP, a novel method for improving LLM-based agent training. We synthesize self-reflected trajectories that include reflections and corrections of error steps, which enhance the effectiveness of LLM agents in learning from teacher models, enabling them to become agents capable of self-reflecting and correcting. We also introduce partial masking strategy that prevents the LLM from internalizing incorrect or suboptimal steps. Experiments demonstrate that our method improves agent performance across three representative tasks: ALFWorld, WebShop, and SciWorld. For the open-source model LLaMA2-7B-Chat, when trained using self-reflected trajectories constructed with Qwen1.5-110B-Chat as the teacher model, it achieves comprehensive improvements with less training data compared to agents trained exclusively on expert trajectories.

Title: ReasonPlan: Unified Scene Prediction and Decision Reasoning for Closed-loop Autonomous Driving

Authors: Xueyi Liu, Zuodong Zhong, Yuxin Guo, Yun-Fu Liu, Zhiguo Su, Qichao Zhang, Junli Wang, Yinfeng Gao, Yupeng Zheng, Qiao Lin, Huiyong Chen, Dongbin Zhao
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2505.20024
Pdf URL: https://arxiv.org/pdf/2505.20024
Copy Paste: [[2505.20024]] ReasonPlan: Unified Scene Prediction and Decision Reasoning for Closed-loop Autonomous Driving(https://arxiv.org/abs/2505.20024)
Keywords: large language model
Abstract: Due to the powerful vision-language reasoning and generalization abilities, multimodal large language models (MLLMs) have garnered significant attention in the field of end-to-end (E2E) autonomous driving. However, their application to closed-loop systems remains underexplored, and current MLLM-based methods have not shown clear superiority to mainstream E2E imitation learning approaches. In this work, we propose ReasonPlan, a novel MLLM fine-tuning framework designed for closed-loop driving through holistic reasoning with a self-supervised Next Scene Prediction task and supervised Decision Chain-of-Thought process. This dual mechanism encourages the model to align visual representations with actionable driving context, while promoting interpretable and causally grounded decision making. We curate a planning-oriented decision reasoning dataset, namely PDR, comprising 210k diverse and high-quality samples. Our method outperforms the mainstream E2E imitation learning method by a large margin of 19% L2 and 16.1 driving score on Bench2Drive benchmark. Furthermore, ReasonPlan demonstrates strong zero-shot generalization on unseen DOS benchmark, highlighting its adaptability in handling zero-shot corner cases. Code and dataset will be found in this https URL.

Title: Gradient Inversion Transcript: Leveraging Robust Generative Priors to Reconstruct Training Data from Gradient Leakage

Authors: Xinping Chen, Chen Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20026
Pdf URL: https://arxiv.org/pdf/2505.20026
Copy Paste: [[2505.20026]] Gradient Inversion Transcript: Leveraging Robust Generative Priors to Reconstruct Training Data from Gradient Leakage(https://arxiv.org/abs/2505.20026)
Keywords: attack, robust, generative
Abstract: We propose Gradient Inversion Transcript (GIT), a novel generative approach for reconstructing training data from leaked gradients. GIT employs a generative attack model, whose architecture is tailored to align with the structure of the leaked model based on theoretical analysis. Once trained offline, GIT can be deployed efficiently and only relies on the leaked gradients to reconstruct the input data, rendering it applicable under various distributed learning environments. When used as a prior for other iterative optimization-based methods, GIT not only accelerates convergence but also enhances the overall reconstruction quality. GIT consistently outperforms existing methods across multiple datasets and demonstrates strong robustness under challenging conditions, including inaccurate gradients, data distribution shifts and discrepancies in model parameters.

Title: ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers

Authors: Fotios Lygerakis, Ozan Özdenizci, Elmar Rückert
Subjects: cs.CV, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2505.20032
Pdf URL: https://arxiv.org/pdf/2505.20032
Copy Paste: [[2505.20032]] ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers(https://arxiv.org/abs/2505.20032)
Keywords: robust, transformer
Abstract: Tactile sensing provides local essential information that is complementary to visual perception, such as texture, compliance, and force. Despite recent advances in visuotactile representation learning, challenges remain in fusing these modalities and generalizing across tasks and environments without heavy reliance on pre-trained vision-language models. Moreover, existing methods do not study positional encodings, thereby overlooking the multi-scale spatial reasoning needed to capture fine-grained visuotactile correlations. We introduce ViTaPEs, a transformer-based framework that robustly integrates visual and tactile input data to learn task-agnostic representations for visuotactile perception. Our approach exploits a novel multi-scale positional encoding scheme to capture intra-modal structures, while simultaneously modeling cross-modal cues. Unlike prior work, we provide provable guarantees in visuotactile fusion, showing that our encodings are injective, rigid-motion-equivariant, and information-preserving, validating these properties empirically. Experiments on multiple large-scale real-world datasets show that ViTaPEs not only surpasses state-of-the-art baselines across various recognition tasks but also demonstrates zero-shot generalization to unseen, out-of-domain scenarios. We further demonstrate the transfer-learning strength of ViTaPEs in a robotic grasping task, where it outperforms state-of-the-art baselines in predicting grasp success. Project page: this https URL

Title: EmoNet-Face: An Expert-Annotated Benchmark for Synthetic Emotion Recognition

Authors: Christoph Schuhmann, Robert Kaczmarczyk, Gollam Rabby, Maurice Kraus, Felix Friedrich, Huu Nguyen, Krishna Kalyan, Kourosh Nadi, Kristian Kersting, Sören Auer
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20033
Pdf URL: https://arxiv.org/pdf/2505.20033
Copy Paste: [[2505.20033]] EmoNet-Face: An Expert-Annotated Benchmark for Synthetic Emotion Recognition(https://arxiv.org/abs/2505.20033)
Keywords: robust
Abstract: Effective human-AI interaction relies on AI's ability to accurately perceive and interpret human emotions. Current benchmarks for vision and vision-language models are severely limited, offering a narrow emotional spectrum that overlooks nuanced states (e.g., bitterness, intoxication) and fails to distinguish subtle differences between related feelings (e.g., shame vs. embarrassment). Existing datasets also often use uncontrolled imagery with occluded faces and lack demographic diversity, risking significant bias. To address these critical gaps, we introduce EmoNet Face, a comprehensive benchmark suite. EmoNet Face features: (1) A novel 40-category emotion taxonomy, meticulously derived from foundational research to capture finer details of human emotional experiences. (2) Three large-scale, AI-generated datasets (EmoNet HQ, Binary, and Big) with explicit, full-face expressions and controlled demographic balance across ethnicity, age, and gender. (3) Rigorous, multi-expert annotations for training and high-fidelity evaluation. (4) We build Empathic Insight Face, a model achieving human-expert-level performance on our benchmark. The publicly released EmoNet Face suite - taxonomy, datasets, and model - provides a robust foundation for developing and evaluating AI systems with a deeper understanding of human emotions.

Title: Graph Wave Networks

Authors: Juwei Yue, Haikuo Li, Jiawei Sheng, Yihan Guo, Xinghua Zhang, Chuan Zhou, Tingwen Liu, Li Guo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.20034
Pdf URL: https://arxiv.org/pdf/2505.20034
Copy Paste: [[2505.20034]] Graph Wave Networks(https://arxiv.org/abs/2505.20034)
Keywords: diffusion
Abstract: Dynamics modeling has been introduced as a novel paradigm in message passing (MP) of graph neural networks (GNNs). Existing methods consider MP between nodes as a heat diffusion process, and leverage heat equation to model the temporal evolution of nodes in the embedding space. However, heat equation can hardly depict the wave nature of graph signals in graph signal processing. Besides, heat equation is essentially a partial differential equation (PDE) involving a first partial derivative of time, whose numerical solution usually has low stability, and leads to inefficient model training. In this paper, we would like to depict more wave details in MP, since graph signals are essentially wave signals that can be seen as a superposition of a series of waves in the form of eigenvector. This motivates us to consider MP as a wave propagation process to capture the temporal evolution of wave signals in the space. Based on wave equation in physics, we innovatively develop a graph wave equation to leverage the wave propagation on graphs. In details, we demonstrate that the graph wave equation can be connected to traditional spectral GNNs, facilitating the design of graph wave networks based on various Laplacians and enhancing the performance of the spectral GNNs. Besides, the graph wave equation is particularly a PDE involving a second partial derivative of time, which has stronger stability on graphs than the heat equation that involves a first partial derivative of time. Additionally, we theoretically prove that the numerical solution derived from the graph wave equation are constantly stable, enabling to significantly enhance model efficiency while ensuring its performance. Extensive experiments show that GWNs achieve SOTA and efficient performance on benchmark datasets, and exhibit outstanding performance in addressing challenging graph problems, such as over-smoothing and heterophily.

Title: Beyond Simple Concatenation: Fairly Assessing PLM Architectures for Multi-Chain Protein-Protein Interactions Prediction

Authors: Hazem Alsamkary, Mohamed Elshaffei, Mohamed Soudy, Sara Ossman, Abdallah Amr, Nehal Adel Abdelsalam, Mohamed Elkerdawy, Ahmed Elnaggar
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2505.20036
Pdf URL: https://arxiv.org/pdf/2505.20036
Copy Paste: [[2505.20036]] Beyond Simple Concatenation: Fairly Assessing PLM Architectures for Multi-Chain Protein-Protein Interactions Prediction(https://arxiv.org/abs/2505.20036)
Keywords: robust, fair
Abstract: Protein-protein interactions (PPIs) are fundamental to numerous cellular processes, and their characterization is vital for understanding disease mechanisms and guiding drug discovery. While protein language models (PLMs) have demonstrated remarkable success in predicting protein structure and function, their application to sequence-based PPI binding affinity prediction remains relatively underexplored. This gap is often attributed to the scarcity of high-quality, rigorously refined datasets and the reliance on simple strategies for concatenating protein representations. In this work, we address these limitations. First, we introduce a meticulously curated version of the PPB-Affinity dataset of a total of 8,207 unique protein-protein interaction entries, by resolving annotation inconsistencies and duplicate entries for multi-chain protein interactions. This dataset incorporates a stringent, less than or equal to 30%, sequence identity threshold to ensure robust splitting into training, validation, and test sets, minimizing data leakage. Second, we propose and systematically evaluate four architectures for adapting PLMs to PPI binding affinity prediction: embeddings concatenation (EC), sequences concatenation (SC), hierarchical pooling (HP), and pooled attention addition (PAD). These architectures were assessed using two training methods: full fine-tuning and a lightweight approach employing ConvBERT heads over frozen PLM features. Our comprehensive experiments across multiple leading PLMs (ProtT5, ESM2, Ankh, Ankh2, and ESM3) demonstrated that the HP and PAD architectures consistently outperform conventional concatenation methods, achieving up to 12% increase in terms of Spearman correlation. These results highlight the necessity of sophisticated architectural designs to fully exploit the capabilities of PLMs for nuanced PPI binding affinity prediction.

Title: Uncertainty-Aware Attention Heads: Efficient Unsupervised Uncertainty Quantification for LLMs

Authors: Artem Vazhentsev, Lyudmila Rvanova, Gleb Kuzmin, Ekaterina Fadeeva, Ivan Lazichny, Alexander Panchenko, Maxim Panov, Timothy Baldwin, Mrinmaya Sachan, Preslav Nakov, Artem Shelmanov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20045
Pdf URL: https://arxiv.org/pdf/2505.20045
Copy Paste: [[2505.20045]] Uncertainty-Aware Attention Heads: Efficient Unsupervised Uncertainty Quantification for LLMs(https://arxiv.org/abs/2505.20045)
Keywords: transformer, large language model
Abstract: Large language models (LLMs) exhibit impressive fluency, but often produce critical errors known as "hallucinations". Uncertainty quantification (UQ) methods are a promising tool for coping with this fundamental shortcoming. Yet, existing UQ methods face challenges such as high computational overhead or reliance on supervised learning. Here, we aim to bridge this gap. In particular, we propose RAUQ (Recurrent Attention-based Uncertainty Quantification), an unsupervised approach that leverages intrinsic attention patterns in transformers to detect hallucinations efficiently. By analyzing attention weights, we identified a peculiar pattern: drops in attention to preceding tokens are systematically observed during incorrect generations for certain "uncertainty-aware" heads. RAUQ automatically selects such heads, recurrently aggregates their attention weights and token-level confidences, and computes sequence-level uncertainty scores in a single forward pass. Experiments across 4 LLMs and 12 question answering, summarization, and translation tasks demonstrate that RAUQ yields excellent results, outperforming state-of-the-art UQ methods using minimal computational overhead (<1% latency). Moreover, it requires no task-specific labels and no careful hyperparameter tuning, offering plug-and-play real-time hallucination detection in white-box LLMs.

Title: Grammars of Formal Uncertainty: When to Trust LLMs in Automated Reasoning Tasks

Authors: Debargha Ganguly, Vikash Singh, Sreehari Sankar, Biyao Zhang, Xuecen Zhang, Srinivasan Iyengar, Xiaotian Han, Amit Sharma, Shivkumar Kalyanaraman, Vipin Chaudhary
Subjects: cs.CL, cs.AI, cs.LO, cs.SE
Abstract URL: https://arxiv.org/abs/2505.20047
Pdf URL: https://arxiv.org/pdf/2505.20047
Copy Paste: [[2505.20047]] Grammars of Formal Uncertainty: When to Trust LLMs in Automated Reasoning Tasks(https://arxiv.org/abs/2505.20047)
Keywords: large language model
Abstract: Large language models (LLMs) show remarkable promise for democratizing automated reasoning by generating formal specifications. However, a fundamental tension exists: LLMs are probabilistic, while formal verification demands deterministic guarantees. This paper addresses this epistemological gap by comprehensively investigating failure modes and uncertainty quantification (UQ) in LLM-generated formal artifacts. Our systematic evaluation of five frontier LLMs reveals Satisfiability Modulo Theories (SMT) based autoformalization's domain-specific impact on accuracy (from +34.8% on logical tasks to -44.5% on factual ones), with known UQ techniques like the entropy of token probabilities failing to identify these errors. We introduce a probabilistic context-free grammar (PCFG) framework to model LLM outputs, yielding a refined uncertainty taxonomy. We find uncertainty signals are task-dependent (e.g., grammar entropy for logic, AUROC>0.93). Finally, a lightweight fusion of these signals enables selective verification, drastically reducing errors (14-100%) with minimal abstention, transforming LLM-driven formalization into a reliable engineering discipline.

Title: Synthetic Time Series Forecasting with Transformer Architectures: Extensive Simulation Benchmarks

Authors: Ali Forootani, Mohammad Khosravi
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2505.20048
Pdf URL: https://arxiv.org/pdf/2505.20048
Copy Paste: [[2505.20048]] Synthetic Time Series Forecasting with Transformer Architectures: Extensive Simulation Benchmarks(https://arxiv.org/abs/2505.20048)
Keywords: robust, interpretability, transformer
Abstract: Time series forecasting plays a critical role in domains such as energy, finance, and healthcare, where accurate predictions inform decision-making under uncertainty. Although Transformer-based models have demonstrated success in sequential modeling, their adoption for time series remains limited by challenges such as noise sensitivity, long-range dependencies, and a lack of inductive bias for temporal structure. In this work, we present a unified and principled framework for benchmarking three prominent Transformer forecasting architectures-Autoformer, Informer, and Patchtst-each evaluated through three architectural variants: Minimal, Standard, and Full, representing increasing levels of complexity and modeling capacity. We conduct over 1500 controlled experiments on a suite of ten synthetic signals, spanning five patch lengths and five forecast horizons under both clean and noisy conditions. Our analysis reveals consistent patterns across model families. To advance this landscape further, we introduce the Koopman-enhanced Transformer framework, Deep Koopformer, which integrates operator-theoretic latent state modeling to improve stability and interpretability. We demonstrate its efficacy on nonlinear and chaotic dynamical systems. Our results highlight Koopman based Transformer as a promising hybrid approach for robust, interpretable, and theoretically grounded time series forecasting in noisy and complex real-world conditions.

Title: Data-Free Class-Incremental Gesture Recognition with Prototype-Guided Pseudo Feature Replay

Authors: Hongsong Wang, Ao Sun, Jie Gui, Liang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20049
Pdf URL: https://arxiv.org/pdf/2505.20049
Copy Paste: [[2505.20049]] Data-Free Class-Incremental Gesture Recognition with Prototype-Guided Pseudo Feature Replay(https://arxiv.org/abs/2505.20049)
Keywords: robust, data-free
Abstract: Gesture recognition is an important research area in the field of computer vision. Most gesture recognition efforts focus on close-set scenarios, thereby limiting the capacity to effectively handle unseen or novel gestures. We aim to address class-incremental gesture recognition, which entails the ability to accommodate new and previously unseen gestures over time. Specifically, we introduce a Prototype-Guided Pseudo Feature Replay (PGPFR) framework for data-free class-incremental gesture recognition. This framework comprises four components: Pseudo Feature Generation with Batch Prototypes (PFGBP), Variational Prototype Replay (VPR) for old classes, Truncated Cross-Entropy (TCE) for new classes, and Continual Classifier Re-Training (CCRT). To tackle the issue of catastrophic forgetting, the PFGBP dynamically generates a diversity of pseudo features in an online manner, leveraging class prototypes of old classes along with batch class prototypes of new classes. Furthermore, the VPR enforces consistency between the classifier's weights and the prototypes of old classes, leveraging class prototypes and covariance matrices to enhance robustness and generalization capabilities. The TCE mitigates the impact of domain differences of the classifier caused by pseudo features. Finally, the CCRT training strategy is designed to prevent overfitting to new classes and ensure the stability of features extracted from old classes. Extensive experiments conducted on two widely used gesture recognition datasets, namely SHREC 2017 3D and EgoGesture 3D, demonstrate that our approach outperforms existing state-of-the-art methods by 11.8\% and 12.8\% in terms of mean global accuracy, respectively. The code is available on this https URL.

Title: Catoni-Style Change Point Detection for Regret Minimization in Non-Stationary Heavy-Tailed Bandits

Authors: Gianmarco Genalti, Sujay Bhatt, Nicola Gatti, Alberto Maria Metelli
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.20051
Pdf URL: https://arxiv.org/pdf/2505.20051
Copy Paste: [[2505.20051]] Catoni-Style Change Point Detection for Regret Minimization in Non-Stationary Heavy-Tailed Bandits(https://arxiv.org/abs/2505.20051)
Keywords: robust
Abstract: Regret minimization in stochastic non-stationary bandits gained popularity over the last decade, as it can model a broad class of real-world problems, from advertising to recommendation systems. Existing literature relies on various assumptions about the reward-generating process, such as Bernoulli or subgaussian rewards. However, in settings such as finance and telecommunications, heavy-tailed distributions naturally arise. In this work, we tackle the heavy-tailed piecewise-stationary bandit problem. Heavy-tailed bandits, introduced by Bubeck et al., 2013, operate on the minimal assumption that the finite absolute centered moments of maximum order $1+\epsilon$ are uniformly bounded by a constant $v<+\infty$, for some $\epsilon \in (0,1]$. We focus on the most popular non-stationary bandit setting, i.e., the piecewise-stationary setting, in which the mean of reward-generating distributions may change at unknown time steps. We provide a novel Catoni-style change-point detection strategy tailored for heavy-tailed distributions that relies on recent advancements in the theory of sequential estimation, which is of independent interest. We introduce Robust-CPD-UCB, which combines this change-point detection strategy with optimistic algorithms for bandits, providing its regret upper bound and an impossibility result on the minimum attainable regret for any policy. Finally, we validate our approach through numerical experiments on synthetic and real-world datasets.

Title: Ankh3: Multi-Task Pretraining with Sequence Denoising and Completion Enhances Protein Representations

Authors: Hazem Alsamkary, Mohamed Elshaffei, Mohamed Elkerdawy, Ahmed Elnaggar
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2505.20052
Pdf URL: https://arxiv.org/pdf/2505.20052
Copy Paste: [[2505.20052]] Ankh3: Multi-Task Pretraining with Sequence Denoising and Completion Enhances Protein Representations(https://arxiv.org/abs/2505.20052)
Keywords: robust
Abstract: Protein language models (PLMs) have emerged as powerful tools to detect complex patterns of protein sequences. However, the capability of PLMs to fully capture information on protein sequences might be limited by focusing on single pre-training tasks. Although adding data modalities or supervised objectives can improve the performance of PLMs, pre-training often remains focused on denoising corrupted sequences. To push the boundaries of PLMs, our research investigated a multi-task pre-training strategy. We developed Ankh3, a model jointly optimized on two objectives: masked language modeling with multiple masking probabilities and protein sequence completion relying only on protein sequences as input. This multi-task pre-training demonstrated that PLMs can learn richer and more generalizable representations solely from protein sequences. The results demonstrated improved performance in downstream tasks, such as secondary structure prediction, fluorescence, GB1 fitness, and contact prediction. The integration of multiple tasks gave the model a more comprehensive understanding of protein properties, leading to more robust and accurate predictions.

Title: Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion

Authors: Zheqi Lv, Junhao Chen, Qi Tian, Keting Yin, Shengyu Zhang, Fei Wu
Subjects: cs.CV, cs.AI, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2505.20053
Pdf URL: https://arxiv.org/pdf/2505.20053
Copy Paste: [[2505.20053]] Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion(https://arxiv.org/abs/2505.20053)
Keywords: diffusion, generative, large language model
Abstract: Diffusion models have become the mainstream architecture for text-to-image generation, achieving remarkable progress in visual quality and prompt controllability. However, current inference pipelines generally lack interpretable semantic supervision and correction mechanisms throughout the denoising process. Most existing approaches rely solely on post-hoc scoring of the final image, prompt filtering, or heuristic resampling strategies-making them ineffective in providing actionable guidance for correcting the generative trajectory. As a result, models often suffer from object confusion, spatial errors, inaccurate counts, and missing semantic elements, severely compromising prompt-image alignment and image quality. To tackle these challenges, we propose MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD), a novel framework that, for the first time, introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. PPAD performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. The framework supports both inference-only and training-enhanced settings, and performs semantic correction at only extremely few diffusion steps, offering strong generality and scalability. Extensive experiments demonstrate PPAD's significant improvements.

Title: PAMD: Plausibility-Aware Motion Diffusion Model for Long Dance Generation

Authors: Hongsong Wang, Yin Zhu, Qiuxia Lai, Yang Zhang, Guo-Sen Xie, Xin Geng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20056
Pdf URL: https://arxiv.org/pdf/2505.20056
Copy Paste: [[2505.20056]] PAMD: Plausibility-Aware Motion Diffusion Model for Long Dance Generation(https://arxiv.org/abs/2505.20056)
Keywords: diffusion
Abstract: Computational dance generation is crucial in many areas, such as art, human-computer interaction, virtual reality, and digital entertainment, particularly for generating coherent and expressive long dance sequences. Diffusion-based music-to-dance generation has made significant progress, yet existing methods still struggle to produce physically plausible motions. To address this, we propose Plausibility-Aware Motion Diffusion (PAMD), a framework for generating dances that are both musically aligned and physically realistic. The core of PAMD lies in the Plausible Motion Constraint (PMC), which leverages Neural Distance Fields (NDFs) to model the actual pose manifold and guide generated motions toward a physically valid pose manifold. To provide more effective guidance during generation, we incorporate Prior Motion Guidance (PMG), which uses standing poses as auxiliary conditions alongside music features. To further enhance realism for complex movements, we introduce the Motion Refinement with Foot-ground Contact (MRFC) module, which addresses foot-skating artifacts by bridging the gap between the optimization objective in linear joint position space and the data representation in nonlinear rotation space. Extensive experiments show that PAMD significantly improves musical alignment and enhances the physical plausibility of generated motions. This project page is available at: this https URL.

Title: SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety

Authors: Geon-Hyeong Kim, Youngsoo Jang, Yu Jin Kim, Byoungjip Kim, Honglak Lee, Kyunghoon Bae, Moontae Lee
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20065
Pdf URL: https://arxiv.org/pdf/2505.20065
Copy Paste: [[2505.20065]] SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety(https://arxiv.org/abs/2505.20065)
Keywords: large language model
Abstract: As Large Language Models (LLMs) continue to advance and find applications across a growing number of fields, ensuring the safety of LLMs has become increasingly critical. To address safety concerns, recent studies have proposed integrating safety constraints into Reinforcement Learning from Human Feedback (RLHF). However, these approaches tend to be complex, as they encompass complicated procedures in RLHF along with additional steps required by the safety constraints. Inspired by Direct Preference Optimization (DPO), we introduce a new algorithm called SafeDPO, which is designed to directly optimize the safety alignment objective in a single stage of policy learning, without requiring relaxation. SafeDPO introduces only one additional hyperparameter to further enhance safety and requires only minor modifications to standard DPO. As a result, it eliminates the need to fit separate reward and cost models or to sample from the language model during fine-tuning, while still enhancing the safety of LLMs. Finally, we demonstrate that SafeDPO achieves competitive performance compared to state-of-the-art safety alignment algorithms, both in terms of aligning with human preferences and improving safety.

Title: Incentivizing Reasoning from Weak Supervision

Authors: Yige Yuan, Teng Xiao, Shuchang Tao, Xue Wang, Jinyang Gao, Bolin Ding, Bingbing Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20072
Pdf URL: https://arxiv.org/pdf/2505.20072
Copy Paste: [[2505.20072]] Incentivizing Reasoning from Weak Supervision(https://arxiv.org/abs/2505.20072)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated impressive performance on reasoning-intensive tasks, but enhancing their reasoning abilities typically relies on either reinforcement learning (RL) with verifiable signals or supervised fine-tuning (SFT) with high-quality long chain-of-thought (CoT) demonstrations, both of which are expensive. In this paper, we study a novel problem of incentivizing the reasoning capacity of LLMs without expensive high-quality demonstrations and reinforcement learning. We investigate whether the reasoning capabilities of LLMs can be effectively incentivized via supervision from significantly weaker models. We further analyze when and why such weak supervision succeeds in eliciting reasoning abilities in stronger models. Our findings show that supervision from significantly weaker reasoners can substantially improve student reasoning performance, recovering close to 94% of the gains of expensive RL at a fraction of the cost. Experiments across diverse benchmarks and model architectures demonstrate that weak reasoners can effectively incentivize reasoning in stronger student models, consistently improving performance across a wide range of reasoning tasks. Our results suggest that this simple weak-to-strong paradigm is a promising and generalizable alternative to costly methods for incentivizing strong reasoning capabilities at inference-time in LLMs. The code is publicly available at this https URL.

Title: An Out-Of-Distribution Membership Inference Attack Approach for Cross-Domain Graph Attacks

Authors: Jinyan Wang, Liu Yang, Yuecen Wei, Jiaxuan Si, Chenhao Guo, Qingyun Sun, Xianxian Li, Xingcheng Fu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.20074
Pdf URL: https://arxiv.org/pdf/2505.20074
Copy Paste: [[2505.20074]] An Out-Of-Distribution Membership Inference Attack Approach for Cross-Domain Graph Attacks(https://arxiv.org/abs/2505.20074)
Keywords: privacy, attack, membership infer
Abstract: Graph Neural Network-based methods face privacy leakage risks due to the introduction of topological structures about the targets, which allows attackers to bypass the target's prior knowledge of the sensitive attributes and realize membership inference attacks (MIA) by observing and analyzing the topology distribution. As privacy concerns grow, the assumption of MIA, which presumes that attackers can obtain an auxiliary dataset with the same distribution, is increasingly deviating from reality. In this paper, we categorize the distribution diversity issue in real-world MIA scenarios as an Out-Of-Distribution (OOD) problem, and propose a novel Graph OOD Membership Inference Attack (GOOD-MIA) to achieve cross-domain graph attacks. Specifically, we construct shadow subgraphs with distributions from different domains to model the diversity of real-world data. We then explore the stable node representations that remain unchanged under external influences and consider eliminating redundant information from confounding environments and extracting task-relevant key information to more clearly distinguish between the characteristics of training data and unseen data. This OOD-based design makes cross-domain graph attacks possible. Finally, we perform risk extrapolation to optimize the attack's domain adaptability during attack inference to generalize the attack to other domains. Experimental results demonstrate that GOOD-MIA achieves superior attack performance in datasets designed for multiple domains.

Title: Grokking ExPLAIND: Unifying Model, Data, and Training Attribution to Study Model Behavior

Authors: Florian Eichin, Yupei Du, Philipp Mondorf, Barbara Plank, Michael A. Hedderich
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.20076
Pdf URL: https://arxiv.org/pdf/2505.20076
Copy Paste: [[2505.20076]] Grokking ExPLAIND: Unifying Model, Data, and Training Attribution to Study Model Behavior(https://arxiv.org/abs/2505.20076)
Keywords: interpretability, transformer
Abstract: Post-hoc interpretability methods typically attribute a model's behavior to its components, data, or training trajectory in isolation. This leads to explanations that lack a unified view and may miss key interactions. While combining existing methods or applying them at different training stages offers broader insights, these approaches usually lack theoretical support. In this work, we present ExPLAIND, a unified framework that integrates all three perspectives. First, we generalize recent work on gradient path kernels, which reformulate models trained by gradient descent as a kernel machine, to more realistic training settings. Empirically, we find that both a CNN and a Transformer model are replicated accurately by this reformulation. Second, we derive novel parameter- and step-wise influence scores from the kernel feature maps. We show their effectiveness in parameter pruning that is comparable to existing methods, reinforcing their value for model component attribution. Finally, jointly interpreting model components and data over the training process, we leverage ExPLAIND to analyze a Transformer that exhibits Grokking. Among other things, our findings support previously proposed stages of Grokking, while refining the final phase as one of alignment of input embeddings and final layers around a representation pipeline learned after the memorization phase. Overall, ExPLAIND provides a theoretically grounded, unified framework to interpret model behavior and training dynamics.

Title: Inference-time Alignment in Continuous Space

Authors: Yige Yuan, Teng Xiao, Li Yunfan, Bingbing Xu, Shuchang Tao, Yunqi Qiu, Huawei Shen, Xueqi Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20081
Pdf URL: https://arxiv.org/pdf/2505.20081
Copy Paste: [[2505.20081]] Inference-time Alignment in Continuous Space(https://arxiv.org/abs/2505.20081)
Keywords: large language model
Abstract: Aligning large language models with human feedback at inference time has received increasing attention due to its flexibility. Existing methods rely on generating multiple responses from the base policy for search using a reward model, which can be considered as searching in a discrete response space. However, these methods struggle to explore informative candidates when the base policy is weak or the candidate set is small, resulting in limited effectiveness. In this paper, to address this problem, we propose Simple Energy Adaptation ($\textbf{SEA}$), a simple yet effective algorithm for inference-time alignment. In contrast to expensive search over the discrete space, SEA directly adapts original responses from the base policy toward the optimal one via gradient-based sampling in continuous latent space. Specifically, SEA formulates inference as an iterative optimization procedure on an energy function over actions in the continuous space defined by the optimal policy, enabling simple and effective alignment. For instance, despite its simplicity, SEA outperforms the second-best baseline with a relative improvement of up to $ \textbf{77.51%}$ on AdvBench and $\textbf{16.36%}$ on MATH. Our code is publicly available at this https URL

Title: Multi-Domain Explainability of Preferences

Authors: Nitay Calderon, Liat Ein-Dor, Roi Reichart
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20088
Pdf URL: https://arxiv.org/pdf/2505.20088
Copy Paste: [[2505.20088]] Multi-Domain Explainability of Preferences(https://arxiv.org/abs/2505.20088)
Keywords: explainability, large language model
Abstract: Preference mechanisms, such as human preference, LLM-as-a-Judge (LaaJ), and reward models, are central to aligning and evaluating large language models (LLMs). Yet, the underlying concepts that drive these preferences remain poorly understood. In this work, we propose a fully automated end-to-end method for generating local and global concept-based explanations of preferences across multiple domains. Our method employs an LLM to discover concepts that differentiate between chosen and rejected responses and represent them with concept-based vectors. To model the relationships between concepts and preferences, we propose a white-box Hierarchical Multi-Domain Regression model that captures both domain-general and domain-specific effects. To evaluate our method, we curate a dataset spanning eight challenging and diverse domains and explain twelve mechanisms. Our method achieves strong preference prediction performance, outperforming baselines while also being explainable. Additionally, we assess explanations in two novel application-driven settings. First, guiding LLM outputs with concepts from LaaJ explanations yields responses that those judges consistently prefer. Second, prompting LaaJs with concepts explaining humans improves their preference predictions. Together, our work provides a new paradigm for explainability in the era of LLMs.

Title: Spurious Privacy Leakage in Neural Networks

Authors: Chenxiang Zhang, Jun Pang, Sjouke Mauw
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.20095
Pdf URL: https://arxiv.org/pdf/2505.20095
Copy Paste: [[2505.20095]] Spurious Privacy Leakage in Neural Networks(https://arxiv.org/abs/2505.20095)
Keywords: privacy, attack, robust, steal
Abstract: Neural networks are vulnerable to privacy attacks aimed at stealing sensitive data. The risks can be amplified in a real-world scenario, particularly when models are trained on limited and biased data. In this work, we investigate the impact of spurious correlation bias on privacy vulnerability. We introduce \emph{spurious privacy leakage}, a phenomenon where spurious groups are significantly more vulnerable to privacy attacks than non-spurious groups. We further show that group privacy disparity increases in tasks with simpler objectives (e.g. fewer classes) due to the persistence of spurious features. Surprisingly, we find that reducing spurious correlation using spurious robust methods does not mitigate spurious privacy leakage. This leads us to introduce a perspective on privacy disparity based on memorization, where mitigating spurious correlation does not mitigate the memorization of spurious data, and therefore, neither the privacy level. Lastly, we compare the privacy of different model architectures trained with spurious data, demonstrating that, contrary to prior works, architectural choice can affect privacy outcomes.

Title: MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning

Authors: Thang Nguyen, Peter Chin, Yu-Wing Tai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20096
Pdf URL: https://arxiv.org/pdf/2505.20096
Copy Paste: [[2505.20096]] MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning(https://arxiv.org/abs/2505.20096)
Keywords: robust, extraction
Abstract: We present MA-RAG, a Multi-Agent framework for Retrieval-Augmented Generation (RAG) that addresses the inherent ambiguities and reasoning challenges in complex information-seeking tasks. Unlike conventional RAG methods that rely on either end-to-end fine-tuning or isolated component enhancements, MA-RAG orchestrates a collaborative set of specialized AI agents: Planner, Step Definer, Extractor, and QA Agents, to tackle each stage of the RAG pipeline with task-aware reasoning. Ambiguities may arise from underspecified queries, sparse or indirect evidence in retrieved documents, or the need to integrate information scattered across multiple sources. MA-RAG mitigates these challenges by decomposing the problem into subtasks, such as query disambiguation, evidence extraction, and answer synthesis, and dispatching them to dedicated agents equipped with chain-of-thought prompting. These agents communicate intermediate reasoning and progressively refine the retrieval and synthesis process. Our design allows fine-grained control over information flow without any model fine-tuning. Crucially, agents are invoked on demand, enabling a dynamic and efficient workflow that avoids unnecessary computation. This modular and reasoning-driven architecture enables MA-RAG to deliver robust, interpretable results. Experiments on multi-hop and ambiguous QA benchmarks demonstrate that MA-RAG outperforms state-of-the-art training-free baselines and rivals fine-tuned systems, validating the effectiveness of collaborative agent-based reasoning in RAG.

Title: S2LPP: Small-to-Large Prompt Prediction across LLMs

Authors: Liang Cheng, Tianyi LI, Zhaowei Wang, Mark Steedman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20097
Pdf URL: https://arxiv.org/pdf/2505.20097
Copy Paste: [[2505.20097]] S2LPP: Small-to-Large Prompt Prediction across LLMs(https://arxiv.org/abs/2505.20097)
Keywords: robust, large language model
Abstract: The performance of pre-trained Large Language Models (LLMs) is often sensitive to nuances in prompt templates, requiring careful prompt engineering, adding costs in terms of computing and human effort. In this study, we present experiments encompassing multiple LLMs variants of varying sizes aimed at probing their preference with different prompts. Through experiments on Question Answering, we show prompt preference consistency across LLMs of different sizes. We also show that this consistency extends to other tasks, such as Natural Language Inference. Utilizing this consistency, we propose a method to use a smaller model to select effective prompt templates for a larger model. We show that our method substantially reduces the cost of prompt engineering while consistently matching performance with optimal prompts among candidates. More importantly, our experiment shows the efficacy of our strategy across fourteen LLMs and its applicability to a broad range of NLP tasks, highlighting its robustness

Title: Transformer in Protein: A Survey

Authors: Xiaowen Ling, Zhiqiang Li, Yanbin Wang, Zhuhong You
Subjects: cs.LG, cs.CR, q-bio.QM
Abstract URL: https://arxiv.org/abs/2505.20098
Pdf URL: https://arxiv.org/pdf/2505.20098
Copy Paste: [[2505.20098]] Transformer in Protein: A Survey(https://arxiv.org/abs/2505.20098)
Keywords: transformer
Abstract: As protein informatics advances rapidly, the demand for enhanced predictive accuracy, structural analysis, and functional understanding has intensified. Transformer models, as powerful deep learning architectures, have demonstrated unprecedented potential in addressing diverse challenges across protein research. However, a comprehensive review of Transformer applications in this field remains lacking. This paper bridges this gap by surveying over 100 studies, offering an in-depth analysis of practical implementations and research progress of Transformers in protein-related tasks. Our review systematically covers critical domains, including protein structure prediction, function prediction, protein-protein interaction analysis, functional annotation, and drug discovery/target identification. To contextualize these advancements across various protein domains, we adopt a domain-oriented classification system. We first introduce foundational concepts: the Transformer architecture and attention mechanisms, categorize Transformer variants tailored for protein science, and summarize essential protein knowledge. For each research domain, we outline its objectives and background, critically evaluate prior methods and their limitations, and highlight transformative contributions enabled by Transformer models. We also curate and summarize pivotal datasets and open-source code resources to facilitate reproducibility and benchmarking. Finally, we discuss persistent challenges in applying Transformers to protein informatics and propose future research directions. This review aims to provide a consolidated foundation for the synergistic integration of Transformer and protein informatics, fostering further innovation and expanded applications in the field.

Title: Large Language Models Meet Knowledge Graphs for Question Answering: Synthesis and Opportunities

Authors: Chuangtao Ma, Yongrui Chen, Tianxing Wu, Arijit Khan, Haofen Wang
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2505.20099
Pdf URL: https://arxiv.org/pdf/2505.20099
Copy Paste: [[2505.20099]] Large Language Models Meet Knowledge Graphs for Question Answering: Synthesis and Opportunities(https://arxiv.org/abs/2505.20099)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated remarkable performance on question-answering (QA) tasks because of their superior capabilities in natural language understanding and generation. However, LLM-based QA struggles with complex QA tasks due to poor reasoning capacity, outdated knowledge, and hallucinations. Several recent works synthesize LLMs and knowledge graphs (KGs) for QA to address the above challenges. In this survey, we propose a new structured taxonomy that categorizes the methodology of synthesizing LLMs and KGs for QA according to the categories of QA and the KG's role when integrating with LLMs. We systematically survey state-of-the-art advances in synthesizing LLMs and KGs for QA and compare and analyze these approaches in terms of strength, limitations, and KG requirements. We then align the approaches with QA and discuss how these approaches address the main challenges of different complex QA. Finally, we summarize the advancements, evaluation metrics, and benchmark datasets and highlight open challenges and opportunities.

Title: AdaTP: Attention-Debiased Token Pruning for Video Large Language Models

Authors: Fengyuan Sun, Leqi Shen, Hui Chen, Sicheng Zhao, Jungong Han, Guiguang Ding
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20100
Pdf URL: https://arxiv.org/pdf/2505.20100
Copy Paste: [[2505.20100]] AdaTP: Attention-Debiased Token Pruning for Video Large Language Models(https://arxiv.org/abs/2505.20100)
Keywords: large language model
Abstract: Video Large Language Models (Video LLMs) have achieved remarkable results in video understanding tasks. However, they often suffer from heavy computational overhead due to the large number of visual tokens generated from multiple video frames. Existing visual token compression methods often rely on attention scores from language models as guidance. However, these scores exhibit inherent biases: global bias reflects a tendency to focus on the two ends of the visual token sequence, while local bias leads to an over-concentration on the same spatial positions across different frames. To address the issue of attention bias, we propose $\textbf{A}$ttention-$\textbf{D}$ebi$\textbf{a}$sed $\textbf{T}$oken $\textbf{P}$runing for Video Large Language Models ($\textbf{AdaTP}$), a novel token pruning pipeline for Video LLMs. AdaTP integrates two dedicated debiasing modules into the pipeline, targeting global attention bias and local attention bias, respectively. Without the need for additional training, our method significantly reduces the computational overhead of Video LLMs while retaining the performance of vanilla models. Extensive evaluation shows that AdaTP achieves state-of-the-art performance in various commonly used video understanding benchmarks. In particular, on LLaVA-OneVision-7B, AdaTP maintains performance without degradation while using only up to $27.3\%$ FLOPs compared to the vanilla model. Our code will be released soon.

Title: Adaptive Deep Reasoning: Triggering Deep Thinking When Needed

Authors: Yunhao Wang, Yuhao Zhang, Tinghao Yu, Can Xu, Feng Zhang, Fengzong Lian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20101
Pdf URL: https://arxiv.org/pdf/2505.20101
Copy Paste: [[2505.20101]] Adaptive Deep Reasoning: Triggering Deep Thinking When Needed(https://arxiv.org/abs/2505.20101)
Keywords: large language model
Abstract: Large language models (LLMs) have shown impressive capabilities in handling complex tasks through long-chain reasoning. However, the extensive reasoning steps involved can significantly increase computational costs, posing challenges for real-world deployment. Recent efforts have focused on optimizing reasoning efficiency by shortening the Chain-of-Thought (CoT) reasoning processes through various approaches, such as length-aware prompt engineering, supervised fine-tuning on CoT data with variable lengths, and reinforcement learning with length penalties. Although these methods effectively reduce reasoning length, they still necessitate an initial reasoning phase. More recent approaches have attempted to integrate long-chain and short-chain reasoning abilities into a single model, yet they still rely on manual control to toggle between short and long this http URL this work, we propose a novel approach that autonomously switches between short and long reasoning chains based on problem complexity. Our method begins with supervised fine-tuning of the base model to equip both long-chain and short-chain reasoning abilities. We then employ reinforcement learning to further balance short and long CoT generation while maintaining accuracy through two key strategies: first, integrating reinforcement learning with a long-short adaptive group-wise reward strategy to assess prompt complexity and provide corresponding rewards; second, implementing a logit-based reasoning mode switching loss to optimize the model's initial token choice, thereby guiding the selection of the reasoning this http URL on mathematical datasets demonstrate that our model can dynamically switch between long-chain and short-chain reasoning modes without substantially sacrificing performance. This advancement enhances the practicality of reasoning in large language models for real-world applications.

Title: From Data to Modeling: Fully Open-vocabulary Scene Graph Generation

Authors: Zuyao Chen, Jinlin Wu, Zhen Lei, Chang Wen Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20106
Pdf URL: https://arxiv.org/pdf/2505.20106
Copy Paste: [[2505.20106]] From Data to Modeling: Fully Open-vocabulary Scene Graph Generation(https://arxiv.org/abs/2505.20106)
Keywords: transformer
Abstract: We present OvSGTR, a novel transformer-based framework for fully open-vocabulary scene graph generation that overcomes the limitations of traditional closed-set models. Conventional methods restrict both object and relationship recognition to a fixed vocabulary, hindering their applicability to real-world scenarios where novel concepts frequently emerge. In contrast, our approach jointly predicts objects (nodes) and their inter-relationships (edges) beyond predefined categories. OvSGTR leverages a DETR-like architecture featuring a frozen image backbone and text encoder to extract high-quality visual and semantic features, which are then fused via a transformer decoder for end-to-end scene graph prediction. To enrich the model's understanding of complex visual relations, we propose a relation-aware pre-training strategy that synthesizes scene graph annotations in a weakly supervised manner. Specifically, we investigate three pipelines--scene parser-based, LLM-based, and multimodal LLM-based--to generate transferable supervision signals with minimal manual annotation. Furthermore, we address the common issue of catastrophic forgetting in open-vocabulary settings by incorporating a visual-concept retention mechanism coupled with a knowledge distillation strategy, ensuring that the model retains rich semantic cues during fine-tuning. Extensive experiments on the VG150 benchmark demonstrate that OvSGTR achieves state-of-the-art performance across multiple settings, including closed-set, open-vocabulary object detection-based, relation-based, and fully open-vocabulary scenarios. Our results highlight the promise of large-scale relation-aware pre-training and transformer architectures for advancing scene graph generation towards more generalized and reliable visual understanding.

Title: Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning

Authors: Ziyi Zhang, Li Shen, Deheng Ye, Yong Luo, Huangxuan Zhao, Lefei Zhang
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.20107
Pdf URL: https://arxiv.org/pdf/2505.20107
Copy Paste: [[2505.20107]] Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning(https://arxiv.org/abs/2505.20107)
Keywords: diffusion
Abstract: Text-to-multiview (T2MV) generation, which produces coherent multiview images from a single text prompt, remains computationally intensive, while accelerated T2MV methods using few-step diffusion models often sacrifice image fidelity and view consistency. To address this, we propose a novel reinforcement learning (RL) finetuning framework tailored for few-step T2MV diffusion models to jointly optimize per-view fidelity and cross-view consistency. Specifically, we first reformulate T2MV denoising across all views as a single unified Markov decision process, enabling multiview-aware policy optimization driven by a joint-view reward objective. Next, we introduce ZMV-Sampling, a test-time T2MV sampling technique that adds an inversion-denoising pass to reinforce both viewpoint and text conditioning, resulting in improved T2MV generation at the cost of inference time. To internalize its performance gains into the base sampling policy, we develop MV-ZigAL, a novel policy optimization strategy that uses reward advantages of ZMV-Sampling over standard sampling as learning signals for policy updates. Finally, noting that the joint-view reward objective under-optimizes per-view fidelity but naively optimizing single-view metrics neglects cross-view alignment, we reframe RL finetuning for T2MV diffusion models as a constrained optimization problem that maximizes per-view fidelity subject to an explicit joint-view constraint, thereby enabling more efficient and balanced policy updates. By integrating this constrained optimization paradigm with MV-ZigAL, we establish our complete RL finetuning framework, referred to as MVC-ZigAL, which effectively refines the few-step T2MV diffusion baseline in both fidelity and consistency while preserving its few-step efficiency.

Title: Language-Agnostic Suicidal Risk Detection Using Large Language Models

Authors: June-Woo Kim, Wonkyo Oh, Haram Yoon, Sung-Hoon Yoon, Dae-Jin Kim, Dong-Ho Lee, Sang-Yeol Lee, Chan-Mo Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20109
Pdf URL: https://arxiv.org/pdf/2505.20109
Copy Paste: [[2505.20109]] Language-Agnostic Suicidal Risk Detection Using Large Language Models(https://arxiv.org/abs/2505.20109)
Keywords: robust, large language model
Abstract: Suicidal risk detection in adolescents is a critical challenge, yet existing methods rely on language-specific models, limiting scalability and generalization. This study introduces a novel language-agnostic framework for suicidal risk assessment with large language models (LLMs). We generate Chinese transcripts from speech using an ASR model and then employ LLMs with prompt-based queries to extract suicidal risk-related features from these transcripts. The extracted features are retained in both Chinese and English to enable cross-linguistic analysis and then used to fine-tune corresponding pretrained language models independently. Experimental results show that our method achieves performance comparable to direct fine-tuning with ASR results or to models trained solely on Chinese suicidal risk-related features, demonstrating its potential to overcome language constraints and improve the robustness of suicidal risk assessment.

Title: Proxy-Free GFlowNet

Authors: Ruishuo Chen, Xun Wang, Rui Hu, Zhuoran Li, Longbo Huang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20110
Pdf URL: https://arxiv.org/pdf/2505.20110
Copy Paste: [[2505.20110]] Proxy-Free GFlowNet(https://arxiv.org/abs/2505.20110)
Keywords: generative
Abstract: Generative Flow Networks (GFlowNets) are a promising class of generative models designed to sample diverse, high-reward structures by modeling distributions over compositional objects. In many real-world applications, obtaining the reward function for such objects is expensive, time-consuming, or requires human input, making it necessary to train GFlowNets from historical datasets. Most existing methods adopt a model-based approach, learning a proxy model from the dataset to approximate the reward function. However, this strategy inherently ties the quality of the learned policy to the accuracy of the proxy, introducing additional complexity and uncertainty into the training process. To overcome these limitations, we propose \textbf{Trajectory-Distilled GFlowNet (TD-GFN)}, a \emph{proxy-free} training framework that eliminates the need for out-of-dataset reward queries. Our method is motivated by the key observation that different edges in the associated directed acyclic graph (DAG) contribute unequally to effective policy learning. TD-GFN leverages inverse reinforcement learning to estimate edge-level rewards from the offline dataset, which are then used to ingeniously prune the DAG and guide backward trajectory sampling during training. This approach directs the policy toward high-reward regions while reducing the complexity of model fitting. Empirical results across multiple tasks show that TD-GFN trains both efficiently and reliably, significantly outperforming existing baselines in convergence speed and sample quality.

Title: ResSVD: Residual Compensated SVD for Large Language Model Compression

Authors: Haolei Bai, Siyong Jian, Tuo Liang, Yu Yin, Huan Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20112
Pdf URL: https://arxiv.org/pdf/2505.20112
Copy Paste: [[2505.20112]] ResSVD: Residual Compensated SVD for Large Language Model Compression(https://arxiv.org/abs/2505.20112)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated impressive capabilities in a wide range of downstream natural language processing tasks. Nevertheless, their considerable sizes and memory demands hinder practical deployment, underscoring the importance of developing efficient compression strategies. Singular value decomposition (SVD) decomposes a matrix into orthogonal components, enabling efficient low-rank approximation. This is particularly suitable for LLM compression, where weight matrices often exhibit significant redundancy. However, current SVD-based methods neglect the residual matrix from truncation, resulting in significant truncation loss. Additionally, compressing all layers of the model results in severe performance degradation. To overcome these limitations, we propose ResSVD, a new post-training SVD-based LLM compression method. Specifically, we leverage the residual matrix generated during the truncation process to reduce truncation loss. Moreover, under a fixed overall compression ratio, we selectively compress the last few layers of the model, which mitigates error propagation and significantly improves the performance of compressed this http URL evaluations of ResSVD on diverse LLM families and multiple benchmark datasets indicate that ResSVD consistently achieves superior performance over existing counterpart methods, demonstrating its practical effectiveness.

Title: Named Entity Recognition in Historical Italian: The Case of Giacomo Leopardi's Zibaldone

Authors: Cristian Santini, Laura Melosi, Emanuele Frontoni
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20113
Pdf URL: https://arxiv.org/pdf/2505.20113
Copy Paste: [[2505.20113]] Named Entity Recognition in Historical Italian: The Case of Giacomo Leopardi's Zibaldone(https://arxiv.org/abs/2505.20113)
Keywords: robust, extraction, large language model
Abstract: The increased digitization of world's textual heritage poses significant challenges for both computer science and literary studies. Overall, there is an urgent need of computational techniques able to adapt to the challenges of historical texts, such as orthographic and spelling variations, fragmentary structure and digitization errors. The rise of large language models (LLMs) has revolutionized natural language processing, suggesting promising applications for Named Entity Recognition (NER) on historical documents. In spite of this, no thorough evaluation has been proposed for Italian texts. This research tries to fill the gap by proposing a new challenging dataset for entity extraction based on a corpus of 19th century scholarly notes, i.e. Giacomo Leopardi's Zibaldone (1898), containing 2,899 references to people, locations and literary works. This dataset was used to carry out reproducible experiments with both domain-specific BERT-based models and state-of-the-art LLMs such as LLaMa3.1. Results show that instruction-tuned models encounter multiple difficulties handling historical humanistic texts, while fine-tuned NER models offer more robust performance even with challenging entity types such as bibliographic references.

Title: TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent

Authors: Dominik Meier, Jan Philip Wahle, Paul Röttger, Terry Ruas, Bela Gipp
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2505.20118
Pdf URL: https://arxiv.org/pdf/2505.20118
Copy Paste: [[2505.20118]] TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent(https://arxiv.org/abs/2505.20118)
Keywords: privacy, attack, large language model
Abstract: As large language models (LLMs) become integrated into sensitive workflows, concerns grow over their potential to leak confidential information. We propose TrojanStego, a novel threat model in which an adversary fine-tunes an LLM to embed sensitive context information into natural-looking outputs via linguistic steganography, without requiring explicit control over inference inputs. We introduce a taxonomy outlining risk factors for compromised LLMs, and use it to evaluate the risk profile of the threat. To implement TrojanStego, we propose a practical encoding scheme based on vocabulary partitioning learnable by LLMs via fine-tuning. Experimental results show that compromised models reliably transmit 32-bit secrets with 87% accuracy on held-out prompts, reaching over 97% accuracy using majority voting across three generations. Further, they maintain high utility, can evade human detection, and preserve coherence. These results highlight a new class of LLM data exfiltration attacks that are passive, covert, practical, and dangerous.

Title: Understanding Generalization in Diffusion Models via Probability Flow Distance

Authors: Huijie Zhang, Zijian Huang, Siyi Chen, Jinfan Zhou, Zekai Zhang, Peng Wang, Qing Qu
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.20123
Pdf URL: https://arxiv.org/pdf/2505.20123
Copy Paste: [[2505.20123]] Understanding Generalization in Diffusion Models via Probability Flow Distance(https://arxiv.org/abs/2505.20123)
Keywords: diffusion, generative
Abstract: Diffusion models have emerged as a powerful class of generative models, capable of producing high-quality samples that generalize beyond the training data. However, evaluating this generalization remains challenging: theoretical metrics are often impractical for high-dimensional data, while no practical metrics rigorously measure generalization. In this work, we bridge this gap by introducing probability flow distance ($\texttt{PFD}$), a theoretically grounded and computationally efficient metric to measure distributional generalization. Specifically, $\texttt{PFD}$ quantifies the distance between distributions by comparing their noise-to-data mappings induced by the probability flow ODE. Moreover, by using $\texttt{PFD}$ under a teacher-student evaluation protocol, we empirically uncover several key generalization behaviors in diffusion models, including: (1) scaling behavior from memorization to generalization, (2) early learning and double descent training dynamics, and (3) bias-variance decomposition. Beyond these insights, our work lays a foundation for future empirical and theoretical studies on generalization in diffusion models.

Title: TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos

Authors: Fanheng Kong, Jingyuan Zhang, Hongzhi Zhang, Shi Feng, Daling Wang, Linhao Yu, Xingguang Ji, Yu Tian, Qi Wang, Fuzheng Zhang
Subjects: cs.CV, cs.DB, cs.MM
Abstract URL: https://arxiv.org/abs/2505.20124
Pdf URL: https://arxiv.org/pdf/2505.20124
Copy Paste: [[2505.20124]] TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos(https://arxiv.org/abs/2505.20124)
Keywords: robust
Abstract: Videos are unique in their integration of temporal elements, including camera, scene, action, and attribute, along with their dynamic relationships over time. However, existing benchmarks for video understanding often treat these properties separately or narrowly focus on specific aspects, overlooking the holistic nature of video content. To address this, we introduce TUNA, a temporal-oriented benchmark for fine-grained understanding on dense dynamic videos, with two complementary tasks: captioning and QA. Our TUNA features diverse video scenarios and dynamics, assisted by interpretable and robust evaluation criteria. We evaluate several leading models on our benchmark, providing fine-grained performance assessments across various dimensions. This evaluation reveals key challenges in video temporal understanding, such as limited action description, inadequate multi-subject understanding, and insensitivity to camera motion, offering valuable insights for improving video understanding models. The data and code are available at this https URL.

Title: Iterative Self-Incentivization Empowers Large Language Models as Agentic Searchers

Authors: Zhengliang Shi, Lingyong Yan, Dawei Yin, Suzan Verberne, Maarten de Rijke, Zhaochun Ren
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20128
Pdf URL: https://arxiv.org/pdf/2505.20128
Copy Paste: [[2505.20128]] Iterative Self-Incentivization Empowers Large Language Models as Agentic Searchers(https://arxiv.org/abs/2505.20128)
Keywords: large language model
Abstract: Large language models (LLMs) have been widely integrated into information retrieval to advance traditional techniques. However, effectively enabling LLMs to seek accurate knowledge in complex tasks remains a challenge due to the complexity of multi-hop queries as well as the irrelevant retrieved content. To address these limitations, we propose EXSEARCH, an agentic search framework, where the LLM learns to retrieve useful information as the reasoning unfolds through a self-incentivized process. At each step, the LLM decides what to retrieve (thinking), triggers an external retriever (search), and extracts fine-grained evidence (recording) to support next-step reasoning. To enable LLM with this capability, EXSEARCH adopts a Generalized Expectation-Maximization algorithm. In the E-step, the LLM generates multiple search trajectories and assigns an importance weight to each; the M-step trains the LLM on them with a re-weighted loss function. This creates a self-incentivized loop, where the LLM iteratively learns from its own generated data, progressively improving itself for search. We further theoretically analyze this training process, establishing convergence guarantees. Extensive experiments on four knowledge-intensive benchmarks show that EXSEARCH substantially outperforms baselines, e.g., +7.8% improvement on exact match score. Motivated by these promising results, we introduce EXSEARCH-Zoo, an extension that extends our method to broader scenarios, to facilitate future work.

Title: MolEditRL: Structure-Preserving Molecular Editing via Discrete Diffusion and Reinforcement Learning

Authors: Yuanxin Zhuang, Dazhong Shen, Ying Sun
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2505.20131
Pdf URL: https://arxiv.org/pdf/2505.20131
Copy Paste: [[2505.20131]] MolEditRL: Structure-Preserving Molecular Editing via Discrete Diffusion and Reinforcement Learning(https://arxiv.org/abs/2505.20131)
Keywords: diffusion
Abstract: Molecular editing aims to modify a given molecule to optimize desired chemical properties while preserving structural similarity. However, current approaches typically rely on string-based or continuous representations, which fail to adequately capture the discrete, graph-structured nature of molecules, resulting in limited structural fidelity and poor controllability. In this paper, we propose MolEditRL, a molecular editing framework that explicitly integrates structural constraints with precise property optimization. Specifically, MolEditRL consists of two stages: (1) a discrete graph diffusion model pretrained to reconstruct target molecules conditioned on source structures and natural language instructions; (2) an editing-aware reinforcement learning fine-tuning stage that further enhances property alignment and structural preservation by explicitly optimizing editing decisions under graph constraints. For comprehensive evaluation, we construct MolEdit-Instruct, the largest and most property-rich molecular editing dataset, comprising 3 million diverse examples spanning single- and multi-property tasks across 10 chemical attributes. Experimental results demonstrate that MolEditRL significantly outperforms state-of-the-art methods in both property optimization accuracy and structural fidelity, achieving a 74\% improvement in editing success rate while using 98\% fewer parameters.

Title: Tensorization is a powerful but underexplored tool for compression and interpretability of neural networks

Authors: Safa Hamreras, Sukhbinder Singh, Román Orús
Subjects: cs.LG, cs.AI, quant-ph
Abstract URL: https://arxiv.org/abs/2505.20132
Pdf URL: https://arxiv.org/pdf/2505.20132
Copy Paste: [[2505.20132]] Tensorization is a powerful but underexplored tool for compression and interpretability of neural networks(https://arxiv.org/abs/2505.20132)
Keywords: interpretability
Abstract: Tensorizing a neural network involves reshaping some or all of its dense weight matrices into higher-order tensors and approximating them using low-rank tensor network decompositions. This technique has shown promise as a model compression strategy for large-scale neural networks. However, despite encouraging empirical results, tensorized neural networks (TNNs) remain underutilized in mainstream deep learning. In this position paper, we offer a perspective on both the potential and current limitations of TNNs. We argue that TNNs represent a powerful yet underexplored framework for deep learning--one that deserves greater attention from both engineering and theoretical communities. Beyond compression, we highlight the value of TNNs as a flexible class of architectures with distinctive scaling properties and increased interpretability. A central feature of TNNs is the presence of bond indices, which introduce new latent spaces not found in conventional networks. These internal representations may provide deeper insight into the evolution of features across layers, potentially advancing the goals of mechanistic interpretability. We conclude by outlining several key research directions aimed at overcoming the practical barriers to scaling and adopting TNNs in modern deep learning workflows.

Title: SeMe: Training-Free Language Model Merging via Semantic Alignment

Authors: Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20144
Pdf URL: https://arxiv.org/pdf/2505.20144
Copy Paste: [[2505.20144]] SeMe: Training-Free Language Model Merging via Semantic Alignment(https://arxiv.org/abs/2505.20144)
Keywords: robust, data-free
Abstract: Despite the remarkable capabilities of Language Models (LMs) across diverse tasks, no single model consistently outperforms others, necessitating efficient methods to combine their strengths without expensive retraining. Existing model merging techniques, such as parameter averaging and task-guided fusion, often rely on data-dependent computations or fail to preserve internal knowledge, limiting their robustness and scalability. We introduce SeMe (Semantic-based Merging), a novel, data-free, and training-free approach that leverages latent semantic alignment to merge LMs at a fine-grained, layer-wise level. Unlike prior work, SeMe not only preserves model behaviors but also explicitly stabilizes internal knowledge, addressing a critical gap in LM fusion. Through extensive experiments across diverse architectures and tasks, we demonstrate that SeMe outperforms existing methods in both performance and efficiency while eliminating reliance on external data. Our work establishes a new paradigm for knowledge-aware model merging and provides insights into the semantic structure of LMs, paving the way for more scalable and interpretable model composition.

Title: FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities

Authors: Jin Wang, Yao Lai, Aoxue Li, Shifeng Zhang, Jiacheng Sun, Ning Kang, Chengyue Wu, Zhenguo Li, Ping Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20147
Pdf URL: https://arxiv.org/pdf/2505.20147
Copy Paste: [[2505.20147]] FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities(https://arxiv.org/abs/2505.20147)
Keywords: large language model
Abstract: The rapid progress of large language models (LLMs) has catalyzed the emergence of multimodal large language models (MLLMs) that unify visual understanding and image generation within a single framework. However, most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development, such as the raster-scan order in image generation and restricted reasoning abilities in causal context modeling. In this work, we challenge the dominance of AR-based approaches by introducing FUDOKI, a unified multimodal model purely based on discrete flow matching, as an alternative to conventional AR paradigms. By leveraging metric-induced probability paths with kinetic optimal velocities, our framework goes beyond the previous masking-based corruption process, enabling iterative refinement with self-correction capability and richer bidirectional context integration during generation. To mitigate the high cost of training from scratch, we initialize FUDOKI from pre-trained AR-based MLLMs and adaptively transition to the discrete flow matching paradigm. Experimental results show that FUDOKI achieves performance comparable to state-of-the-art AR-based MLLMs across both visual understanding and image generation tasks, highlighting its potential as a foundation for next-generation unified multimodal models. Furthermore, we show that applying test-time scaling techniques to FUDOKI yields significant performance gains, further underscoring its promise for future enhancement through reinforcement learning.

Title: UORA: Uniform Orthogonal Reinitialization Adaptation in Parameter-Efficient Fine-Tuning of Large Models

Authors: Xueyan Zhang, Jinman Zhao, Zhifei Yang, Yibo Zhong, Shuhao Guan, Linbo Cao, Yining Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20154
Pdf URL: https://arxiv.org/pdf/2505.20154
Copy Paste: [[2505.20154]] UORA: Uniform Orthogonal Reinitialization Adaptation in Parameter-Efficient Fine-Tuning of Large Models(https://arxiv.org/abs/2505.20154)
Keywords: large language model
Abstract: This paper introduces Uniform Orthogonal Reinitialization Adaptation (UORA), a novel parameter-efficient fine-tuning (PEFT) approach for Large Language Models (LLMs). UORA achieves state-of-the-art performance and parameter efficiency by leveraging a low-rank approximation method to reduce the number of trainable parameters. Unlike existing methods such as LoRA and VeRA, UORA employs an interpolation-based reparametrization mechanism that selectively reinitializes rows and columns in frozen projection matrices, guided by the vector magnitude heuristic. This results in substantially fewer trainable parameters compared to LoRA and outperforms VeRA in computation and storage efficiency. Comprehensive experiments across various benchmarks demonstrate UORA's superiority in achieving competitive fine-tuning performance with negligible computational overhead. We demonstrate its performance on GLUE and E2E benchmarks and its effectiveness in instruction-tuning large language models and image classification models. Our contributions establish a new paradigm for scalable and resource-efficient fine-tuning of LLMs.

Title: Pangu Light: Weight Re-Initialization for Pruning and Accelerating LLMs

Authors: Hanting Chen, Jiarui Qin, Jialong Guo, Tao Yuan, Yichun Yin, Huiling Zhen, Yasheng Wang, Jinpeng Li, Xiaojun Meng, Meng Zhang, Rongju Ruan, Zheyuan Bai, Yehui Tang, Can Chen, Xinghao Chen, Fisher Yu, Ruiming Tang, Yunhe Wang (and Other Contributors)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20155
Pdf URL: https://arxiv.org/pdf/2505.20155
Copy Paste: [[2505.20155]] Pangu Light: Weight Re-Initialization for Pruning and Accelerating LLMs(https://arxiv.org/abs/2505.20155)
Keywords: large language model
Abstract: Large Language Models (LLMs) deliver state-of-the-art capabilities across numerous tasks, but their immense size and inference costs pose significant computational challenges for practical deployment. While structured pruning offers a promising avenue for model compression, existing methods often struggle with the detrimental effects of aggressive, simultaneous width and depth reductions, leading to substantial performance degradation. This paper argues that a critical, often overlooked, aspect in making such aggressive joint pruning viable is the strategic re-initialization and adjustment of remaining weights to improve the model post-pruning training accuracies. We introduce Pangu Light, a framework for LLM acceleration centered around structured pruning coupled with novel weight re-initialization techniques designed to address this ``missing piece''. Our framework systematically targets multiple axes, including model width, depth, attention heads, and RMSNorm, with its effectiveness rooted in novel re-initialization methods like Cross-Layer Attention Pruning (CLAP) and Stabilized LayerNorm Pruning (SLNP) that mitigate performance drops by providing the network a better training starting point. Further enhancing efficiency, Pangu Light incorporates specialized optimizations such as absorbing Post-RMSNorm computations and tailors its strategies to Ascend NPU characteristics. The Pangu Light models consistently exhibit a superior accuracy-efficiency trade-off, outperforming prominent baseline pruning methods like Nemotron and established LLMs like Qwen3 series. For instance, on Ascend NPUs, Pangu Light-32B's 81.6 average score and 2585 tokens/s throughput exceed Qwen3-32B's 80.9 average score and 2225 tokens/s.

Title: HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters

Authors: Yi Chen, Sen Liang, Zixiang Zhou, Ziyao Huang, Yifeng Ma, Junshu Tang, Qin Lin, Yuan Zhou, Qinglin Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20156
Pdf URL: https://arxiv.org/pdf/2505.20156
Copy Paste: [[2505.20156]] HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters(https://arxiv.org/abs/2505.20156)
Keywords: diffusion, transformer
Abstract: Recent years have witnessed significant progress in audio-driven human animation. However, critical challenges remain in (i) generating highly dynamic videos while preserving character consistency, (ii) achieving precise emotion alignment between characters and audio, and (iii) enabling multi-character audio-driven animation. To address these challenges, we propose HunyuanVideo-Avatar, a multimodal diffusion transformer (MM-DiT)-based model capable of simultaneously generating dynamic, emotion-controllable, and multi-character dialogue videos. Concretely, HunyuanVideo-Avatar introduces three key innovations: (i) A character image injection module is designed to replace the conventional addition-based character conditioning scheme, eliminating the inherent condition mismatch between training and inference. This ensures the dynamic motion and strong character consistency; (ii) An Audio Emotion Module (AEM) is introduced to extract and transfer the emotional cues from an emotion reference image to the target generated video, enabling fine-grained and accurate emotion style control; (iii) A Face-Aware Audio Adapter (FAA) is proposed to isolate the audio-driven character with latent-level face mask, enabling independent audio injection via cross-attention for multi-character scenarios. These innovations empower HunyuanVideo-Avatar to surpass state-of-the-art methods on benchmark datasets and a newly proposed wild dataset, generating realistic avatars in dynamic, immersive scenarios.

Title: Exploring Generative Error Correction for Dysarthric Speech Recognition

Authors: Moreno La Quatra, Alkis Koudounas, Valerio Mario Salerno, Sabato Marco Siniscalchi
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2505.20163
Pdf URL: https://arxiv.org/pdf/2505.20163
Copy Paste: [[2505.20163]] Exploring Generative Error Correction for Dysarthric Speech Recognition(https://arxiv.org/abs/2505.20163)
Keywords: generative
Abstract: Despite the remarkable progress in end-to-end Automatic Speech Recognition (ASR) engines, accurately transcribing dysarthric speech remains a major challenge. In this work, we proposed a two-stage framework for the Speech Accessibility Project Challenge at INTERSPEECH 2025, which combines cutting-edge speech recognition models with LLM-based generative error correction (GER). We assess different configurations of model scales and training strategies, incorporating specific hypothesis selection to improve transcription accuracy. Experiments on the Speech Accessibility Project dataset demonstrate the strength of our approach on structured and spontaneous speech, while highlighting challenges in single-word recognition. Through comprehensive analysis, we provide insights into the complementary roles of acoustic and linguistic modeling in dysarthric speech recognition

Title: Visual Abstract Thinking Empowers Multimodal Reasoning

Authors: Dairu Liu, Ziyue Wang, Minyuan Ruan, Fuwen Luo, Chi Chen, Peng Li, Yang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20164
Pdf URL: https://arxiv.org/pdf/2505.20164
Copy Paste: [[2505.20164]] Visual Abstract Thinking Empowers Multimodal Reasoning(https://arxiv.org/abs/2505.20164)
Keywords: large language model
Abstract: Images usually convey richer detail than text, but often include redundant information which potentially downgrades multimodal reasoning performance. When faced with lengthy or complex messages, humans tend to employ abstract thinking to convert them into simple and concise abstracts. Inspired by this cognitive strategy, we introduce Visual Abstract Thinking (VAT), a novel thinking paradigm that prompts Multimodal Large Language Models (MLLMs) with visual abstract instead of explicit verbal thoughts or elaborate guidance, permitting a more concentrated visual reasoning mechanism. Explicit thinking, such as Chain-of-thought (CoT) or tool-augmented approaches, increases the complexity of reasoning process via inserting verbose intermediate steps, external knowledge or visual information. In contrast, VAT reduces redundant visual information and encourages models to focus their reasoning on more essential visual elements. Experimental results show that VAT consistently empowers different models, and achieves an average gain of 17% over GPT-4o baseline by employing diverse types of visual abstracts, demonstrating that VAT can enhance visual reasoning abilities for MLLMs regarding conceptual, structural and relational reasoning tasks. VAT is also compatible with CoT in knowledge-intensive multimodal reasoning tasks. These findings highlight the effectiveness of visual reasoning via abstract thinking and encourage further exploration of more diverse reasoning paradigms from the perspective of human cognition.

Title: Long-Context State-Space Video World Models

Authors: Ryan Po, Yotam Nitzan, Richard Zhang, Berlin Chen, Tri Dao, Eli Shechtman, Gordon Wetzstein, Xun Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20171
Pdf URL: https://arxiv.org/pdf/2505.20171
Copy Paste: [[2505.20171]] Long-Context State-Space Video World Models(https://arxiv.org/abs/2505.20171)
Keywords: diffusion
Abstract: Video diffusion models have recently shown promise for world modeling through autoregressive frame prediction conditioned on actions. However, they struggle to maintain long-term memory due to the high computational cost associated with processing extended sequences in attention layers. To overcome this limitation, we propose a novel architecture leveraging state-space models (SSMs) to extend temporal memory without compromising computational efficiency. Unlike previous approaches that retrofit SSMs for non-causal vision tasks, our method fully exploits the inherent advantages of SSMs in causal sequence modeling. Central to our design is a block-wise SSM scanning scheme, which strategically trades off spatial consistency for extended temporal memory, combined with dense local attention to ensure coherence between consecutive frames. We evaluate the long-term memory capabilities of our model through spatial retrieval and reasoning tasks over extended horizons. Experiments on Memory Maze and Minecraft datasets demonstrate that our approach surpasses baselines in preserving long-range memory, while maintaining practical inference speeds suitable for interactive applications.

Title: "KAN you hear me?" Exploring Kolmogorov-Arnold Networks for Spoken Language Understanding

Authors: Alkis Koudounas, Moreno La Quatra, Eliana Pastor, Sabato Marco Siniscalchi, Elena Baralis
Subjects: cs.CL, cs.LG, eess.AS
Abstract URL: https://arxiv.org/abs/2505.20176
Pdf URL: https://arxiv.org/pdf/2505.20176
Copy Paste: [[2505.20176]] "KAN you hear me?" Exploring Kolmogorov-Arnold Networks for Spoken Language Understanding(https://arxiv.org/abs/2505.20176)
Keywords: transformer
Abstract: Kolmogorov-Arnold Networks (KANs) have recently emerged as a promising alternative to traditional neural architectures, yet their application to speech processing remains under explored. This work presents the first investigation of KANs for Spoken Language Understanding (SLU) tasks. We experiment with 2D-CNN models on two datasets, integrating KAN layers in five different configurations within the dense block. The best-performing setup, which places a KAN layer between two linear layers, is directly applied to transformer-based models and evaluated on five SLU datasets with increasing complexity. Our results show that KAN layers can effectively replace the linear layers, achieving comparable or superior performance in most cases. Finally, we provide insights into how KAN and linear layers on top of transformers differently attend to input regions of the raw waveforms.

Title: THiNK: Can Large Language Models Think-aloud?

Authors: Yongan Yu, Mengqian Wu, Yiran Lin, Nikki G. Lobczowski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20184
Pdf URL: https://arxiv.org/pdf/2505.20184
Copy Paste: [[2505.20184]] THiNK: Can Large Language Models Think-aloud?(https://arxiv.org/abs/2505.20184)
Keywords: large language model
Abstract: Assessing higher-order thinking skills in large language models (LLMs) remains a fundamental challenge, especially in tasks that go beyond surface-level accuracy. In this work, we propose THiNK (Testing Higher-order Notion of Knowledge), a multi-agent, feedback-driven evaluation framework grounded in Bloom's Taxonomy. THiNK frames reasoning assessment as an iterative task of problem generation, critique, and revision, encouraging LLMs to think-aloud through step-by-step reflection and refinement. This enables a systematic evaluation of both lower-order (e.g., remember, understand) and higher-order (e.g., evaluate, create) thinking skills. We apply THiNK to seven state-of-the-art LLMs and perform a detailed cognitive analysis of their outputs. Results reveal that while models reliably perform lower-order categories well, they struggle with applying knowledge in realistic contexts and exhibit limited abstraction. Structured feedback loops significantly improve reasoning performance, particularly in higher-order thinking. Qualitative evaluations further confirm that THiNK-guided outputs better align with domain logic and problem structure. The code of our framework provides a scalable methodology for probing and enhancing LLM reasoning, offering new directions for evaluation grounded in learning science, which is available at our GitHub repository.

Title: Eradicating the Unseen: Detecting, Exploiting, and Remediating a Path Traversal Vulnerability across GitHub

Authors: Jafar Akhoundali, Hamidreza Hamidi, Kristian Rietveld, Olga Gadyatskaya
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.20186
Pdf URL: https://arxiv.org/pdf/2505.20186
Copy Paste: [[2505.20186]] Eradicating the Unseen: Detecting, Exploiting, and Remediating a Path Traversal Vulnerability across GitHub(https://arxiv.org/abs/2505.20186)
Keywords: secure, attack
Abstract: Vulnerabilities in open-source software can cause cascading effects in the modern digital ecosystem. It is especially worrying if these vulnerabilities repeat across many projects, as once the adversaries find one of them, they can scale up the attack very easily. Unfortunately, since developers frequently reuse code from their own or external code resources, some nearly identical vulnerabilities exist across many open-source projects. We conducted a study to examine the prevalence of a particular vulnerable code pattern that enables path traversal attacks (CWE-22) across open-source GitHub projects. To handle this study at the GitHub scale, we developed an automated pipeline that scans GitHub for the targeted vulnerable pattern, confirms the vulnerability by first running a static analysis and then exploiting the vulnerability in the context of the studied project, assesses its impact by calculating the CVSS score, generates a patch using GPT-4, and reports the vulnerability to the maintainers. Using our pipeline, we identified 1,756 vulnerable open-source projects, some of which are very influential. For many of the affected projects, the vulnerability is critical (CVSS score higher than 9.0), as it can be exploited remotely without any privileges and critically impact the confidentiality and availability of the system. We have responsibly disclosed the vulnerability to the maintainers, and 14\% of the reported vulnerabilities have been remediated. We also investigated the root causes of the vulnerable code pattern and assessed the side effects of the large number of copies of this vulnerable pattern that seem to have poisoned several popular LLMs. Our study highlights the urgent need to help secure the open-source ecosystem by leveraging scalable automated vulnerability management solutions and raising awareness among developers.

Title: FunReason: Enhancing Large Language Models' Function Calling via Self-Refinement Multiscale Loss and Automated Data Refinement

Authors: Bingguang Hao, Maolin Wang, Zengzhuang Xu, Cunyin Peng, Yicheng Chen, Xiangyu Zhao, Jinjie Gu, Chenyi Zhuang
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2505.20192
Pdf URL: https://arxiv.org/pdf/2505.20192
Copy Paste: [[2505.20192]] FunReason: Enhancing Large Language Models' Function Calling via Self-Refinement Multiscale Loss and Automated Data Refinement(https://arxiv.org/abs/2505.20192)
Keywords: large language model
Abstract: The integration of large language models (LLMs) with function calling has emerged as a crucial capability for enhancing their practical utility in real-world applications. However, effectively combining reasoning processes with accurate function execution remains a significant challenge. Traditional training approaches often struggle to balance the detailed reasoning steps with the precision of function calls, leading to suboptimal performance. To address these limitations, we introduce FunReason, a novel framework that enhances LLMs' function calling capabilities through an automated data refinement strategy and a Self-Refinement Multiscale Loss (SRML) approach. FunReason leverages LLMs' natural reasoning abilities to generate high-quality training examples, focusing on query parseability, reasoning coherence, and function call precision. The SRML approach dynamically balances the contribution of reasoning processes and function call accuracy during training, addressing the inherent trade-off between these two critical aspects. FunReason achieves performance comparable to GPT-4o while effectively mitigating catastrophic forgetting during fine-tuning. FunReason provides a comprehensive solution for enhancing LLMs' function calling capabilities by introducing a balanced training methodology and a data refinement pipeline. For code and dataset, please refer to our repository at GitHub this https URL

Title: Adaptive Classifier-Free Guidance via Dynamic Low-Confidence Masking

Authors: Pengxiang Li, Shilin Yan, Joey Tsai, Renrui Zhang, Ruichuan An, Ziyu Guo, Xiaowei Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20199
Pdf URL: https://arxiv.org/pdf/2505.20199
Copy Paste: [[2505.20199]] Adaptive Classifier-Free Guidance via Dynamic Low-Confidence Masking(https://arxiv.org/abs/2505.20199)
Keywords: diffusion, generative
Abstract: Classifier-Free Guidance (CFG) significantly enhances controllability in generative models by interpolating conditional and unconditional predictions. However, standard CFG often employs a static unconditional input, which can be suboptimal for iterative generation processes where model uncertainty varies dynamically. We introduce Adaptive Classifier-Free Guidance (A-CFG), a novel method that tailors the unconditional input by leveraging the model's instantaneous predictive confidence. At each step of an iterative (masked) diffusion language model, A-CFG identifies tokens in the currently generated sequence for which the model exhibits low confidence. These tokens are temporarily re-masked to create a dynamic, localized unconditional input. This focuses CFG's corrective influence precisely on areas of ambiguity, leading to more effective guidance. We integrate A-CFG into a state-of-the-art masked diffusion language model and demonstrate its efficacy. Experiments on diverse language generation benchmarks show that A-CFG yields substantial improvements over standard CFG, achieving, for instance, a 3.9 point gain on GPQA. Our work highlights the benefit of dynamically adapting guidance mechanisms to model uncertainty in iterative generation.

Title: Reasoning Is Not All You Need: Examining LLMs for Multi-Turn Mental Health Conversations

Authors: Mohit Chandra, Siddharth Sriraman, Harneet Singh Khanuja, Yiqiao Jin, Munmun De Choudhury
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20201
Pdf URL: https://arxiv.org/pdf/2505.20201
Copy Paste: [[2505.20201]] Reasoning Is Not All You Need: Examining LLMs for Multi-Turn Mental Health Conversations(https://arxiv.org/abs/2505.20201)
Keywords: large language model
Abstract: Limited access to mental healthcare, extended wait times, and increasing capabilities of Large Language Models (LLMs) has led individuals to turn to LLMs for fulfilling their mental health needs. However, examining the multi-turn mental health conversation capabilities of LLMs remains under-explored. Existing evaluation frameworks typically focus on diagnostic accuracy and win-rates and often overlook alignment with patient-specific goals, values, and personalities required for meaningful conversations. To address this, we introduce MedAgent, a novel framework for synthetically generating realistic, multi-turn mental health sensemaking conversations and use it to create the Mental Health Sensemaking Dialogue (MHSD) dataset, comprising over 2,200 patient-LLM conversations. Additionally, we present MultiSenseEval, a holistic framework to evaluate the multi-turn conversation abilities of LLMs in healthcare settings using human-centric criteria. Our findings reveal that frontier reasoning models yield below-par performance for patient-centric communication and struggle at advanced diagnostic capabilities with average score of 31%. Additionally, we observed variation in model performance based on patient's persona and performance drop with increasing turns in the conversation. Our work provides a comprehensive synthetic data generation framework, a dataset and evaluation framework for assessing LLMs in multi-turn mental health conversations.

Title: PathBench: A comprehensive comparison benchmark for pathology foundation models towards precision oncology

Authors: Jiabo Ma, Yingxue Xu, Fengtao Zhou, Yihui Wang, Cheng Jin, Zhengrui Guo, Jianfeng Wu, On Ki Tang, Huajun Zhou, Xi Wang, Luyang Luo, Zhengyu Zhang, Du Cai, Zizhao Gao, Wei Wang, Yueping Liu, Jiankun He, Jing Cui, Zhenhui Li, Jing Zhang, Feng Gao, Xiuming Zhang, Li Liang, Ronald Cheong Kin Chan, Zhe Wang, Hao Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20202
Pdf URL: https://arxiv.org/pdf/2505.20202
Copy Paste: [[2505.20202]] PathBench: A comprehensive comparison benchmark for pathology foundation models towards precision oncology(https://arxiv.org/abs/2505.20202)
Keywords: robust
Abstract: The emergence of pathology foundation models has revolutionized computational histopathology, enabling highly accurate, generalized whole-slide image analysis for improved cancer diagnosis, and prognosis assessment. While these models show remarkable potential across cancer diagnostics and prognostics, their clinical translation faces critical challenges including variability in optimal model across cancer types, potential data leakage in evaluation, and lack of standardized benchmarks. Without rigorous, unbiased evaluation, even the most advanced PFMs risk remaining confined to research settings, delaying their life-saving applications. Existing benchmarking efforts remain limited by narrow cancer-type focus, potential pretraining data overlaps, or incomplete task coverage. We present PathBench, the first comprehensive benchmark addressing these gaps through: multi-center in-hourse datasets spanning common cancers with rigorous leakage prevention, evaluation across the full clinical spectrum from diagnosis to prognosis, and an automated leaderboard system for continuous model assessment. Our framework incorporates large-scale data, enabling objective comparison of PFMs while reflecting real-world clinical complexity. All evaluation data comes from private medical providers, with strict exclusion of any pretraining usage to avoid data leakage risks. We have collected 15,888 WSIs from 8,549 patients across 10 hospitals, encompassing over 64 diagnosis and prognosis tasks. Currently, our evaluation of 19 PFMs shows that Virchow2 and H-Optimus-1 are the most effective models overall. This work provides researchers with a robust platform for model development and offers clinicians actionable insights into PFM performance across diverse clinical scenarios, ultimately accelerating the translation of these transformative technologies into routine pathology practice.

Title: How to Improve the Robustness of Closed-Source Models on NLI

Authors: Joe Stacey, Lisa Alazraki, Aran Ubhi, Beyza Ermis, Aaron Mueller, Marek Rei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20209
Pdf URL: https://arxiv.org/pdf/2505.20209
Copy Paste: [[2505.20209]] How to Improve the Robustness of Closed-Source Models on NLI(https://arxiv.org/abs/2505.20209)
Keywords: robust, large language model
Abstract: Closed-source Large Language Models (LLMs) have become increasingly popular, with impressive performance across a wide range of natural language tasks. These models can be fine-tuned to further improve performance, but this often results in the models learning from dataset-specific heuristics that reduce their robustness on out-of-distribution (OOD) data. Existing methods to improve robustness either perform poorly, or are non-applicable to closed-source models because they assume access to model internals, or the ability to change the model's training procedure. In this work, we investigate strategies to improve the robustness of closed-source LLMs through data-centric methods that do not require access to model internals. We find that the optimal strategy depends on the complexity of the OOD data. For highly complex OOD datasets, upsampling more challenging training examples can improve robustness by up to 1.5%. For less complex OOD datasets, replacing a portion of the training set with LLM-generated examples can improve robustness by 3.7%. More broadly, we find that large-scale closed-source autoregressive LLMs are substantially more robust than commonly used encoder models, and are a more appropriate choice of baseline going forward.

Title: Parameter-Efficient Fine-Tuning with Column Space Projection

Authors: Junseo Hwang, Wonguk Cho, Taesup Kim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20211
Pdf URL: https://arxiv.org/pdf/2505.20211
Copy Paste: [[2505.20211]] Parameter-Efficient Fine-Tuning with Column Space Projection(https://arxiv.org/abs/2505.20211)
Keywords: large language model
Abstract: Fine-tuning large language models (LLMs) with minimal computational overhead is essential for efficiently adapting them to downstream tasks under resource constraints. Parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), facilitate this by updating only a small subset of parameters. However, recent studies show that LoRA diverges from full fine-tuning (Full FT) in its learning behavior, particularly in terms of spectral properties. Motivated by these findings, we propose PiCa, the first theoretically grounded PEFT method based on the spectral properties of fine-tuned weights. PiCa projects gradients onto the low-rank column subspace of pre-trained weights and exhibits learning patterns more closely aligned with Full FT. Furthermore, we show that combining PiCa with weight sharing drastically reduces the number of trainable parameters without compromising performance, enabling to achieve superior performance than LoRA using 13x fewer trainable parameters. Extensive experiments demonstrate PiCa achieves the state-of-the-art performance compared to existing PEFT methods.

Title: Dependency Parsing is More Parameter-Efficient with Normalization

Authors: Paolo Gajo, Domenic Rosati, Hassan Sajjad, Alberto Barrón-Cedeño
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20215
Pdf URL: https://arxiv.org/pdf/2505.20215
Copy Paste: [[2505.20215]] Dependency Parsing is More Parameter-Efficient with Normalization(https://arxiv.org/abs/2505.20215)
Keywords: transformer
Abstract: Dependency parsing is the task of inferring natural language structure, often approached by modeling word interactions via attention through biaffine scoring. This mechanism works like self-attention in Transformers, where scores are calculated for every pair of words in a sentence. However, unlike Transformer attention, biaffine scoring does not use normalization prior to taking the softmax of the scores. In this paper, we provide theoretical evidence and empirical results revealing that a lack of normalization necessarily results in overparameterized parser models, where the extra parameters compensate for the sharp softmax outputs produced by high variance inputs to the biaffine scoring function. We argue that biaffine scoring can be made substantially more efficient by performing score normalization. We conduct experiments on six datasets for semantic and syntactic dependency parsing using a one-hop parser. We train N-layer stacked BiLSTMs and evaluate the parser's performance with and without normalizing biaffine scores. Normalizing allows us to beat the state of the art on two datasets, with fewer samples and trainable parameters. Code: this https URL

Title: Fine-grained List-wise Alignment for Generative Medication Recommendation

Authors: Chenxiao Fan, Chongming Gao, Wentao Shi, Yaxin Gong, Zihao Zhao, Fuli Feng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.20218
Pdf URL: https://arxiv.org/pdf/2505.20218
Copy Paste: [[2505.20218]] Fine-grained List-wise Alignment for Generative Medication Recommendation(https://arxiv.org/abs/2505.20218)
Keywords: generative, large language model
Abstract: Accurate and safe medication recommendations are critical for effective clinical decision-making, especially in multimorbidity cases. However, existing systems rely on point-wise prediction paradigms that overlook synergistic drug effects and potential adverse drug-drug interactions (DDIs). We propose FLAME, a fine-grained list-wise alignment framework for large language models (LLMs), enabling drug-by-drug generation of drug lists. FLAME formulates recommendation as a sequential decision process, where each step adds or removes a single drug. To provide fine-grained learning signals, we devise step-wise Group Relative Policy Optimization (GRPO) with potential-based reward shaping, which explicitly models DDIs and optimizes the contribution of each drug to the overall prescription. Furthermore, FLAME enhances patient modeling by integrating structured clinical knowledge and collaborative information into the representation space of LLMs. Experiments on benchmark datasets demonstrate that FLAME achieves state-of-the-art performance, delivering superior accuracy, controllable safety-accuracy trade-offs, and strong generalization across diverse clinical scenarios. Our code is available at this https URL.

Title: Gradient Flow Matching for Learning Update Dynamics in Neural Network Training

Authors: Xiao Shou, Yanna Ding, Jianxi Gao
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.20221
Pdf URL: https://arxiv.org/pdf/2505.20221
Copy Paste: [[2505.20221]] Gradient Flow Matching for Learning Update Dynamics in Neural Network Training(https://arxiv.org/abs/2505.20221)
Keywords: transformer
Abstract: Training deep neural networks remains computationally intensive due to the itera2 tive nature of gradient-based optimization. We propose Gradient Flow Matching (GFM), a continuous-time modeling framework that treats neural network training as a dynamical system governed by learned optimizer-aware vector fields. By leveraging conditional flow matching, GFM captures the underlying update rules of optimizers such as SGD, Adam, and RMSprop, enabling smooth extrapolation of weight trajectories toward convergence. Unlike black-box sequence models, GFM incorporates structural knowledge of gradient-based updates into the learning objective, facilitating accurate forecasting of final weights from partial training sequences. Empirically, GFM achieves forecasting accuracy that is competitive with Transformer-based models and significantly outperforms LSTM and other classical baselines. Furthermore, GFM generalizes across neural architectures and initializations, providing a unified framework for studying optimization dynamics and accelerating convergence prediction.

Title: FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models

Authors: Hao Kang, Zichun Yu, Chenyan Xiong
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20225
Pdf URL: https://arxiv.org/pdf/2505.20225
Copy Paste: [[2505.20225]] FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models(https://arxiv.org/abs/2505.20225)
Keywords: large language model
Abstract: Recent large language models such as Gemini-1.5, DeepSeek-V3, and Llama-4 increasingly adopt Mixture-of-Experts (MoE) architectures, which offer strong efficiency-performance trade-offs by activating only a fraction of the model per token. Yet academic researchers still lack a fully open, end-to-end MoE platform for investigating scaling, routing, and expert behavior. We release FLAME-MoE, a completely open-source research suite composed of seven decoder-only models, ranging from 38M to 1.7B active parameters, whose architecture--64 experts with top-8 gating and 2 shared experts--closely reflects modern production LLMs. All training data pipelines, scripts, logs, and checkpoints are publicly available to enable reproducible experimentation. Across six evaluation tasks, FLAME-MoE improves average accuracy by up to 3.4 points over dense baselines trained with identical FLOPs. Leveraging full training trace transparency, we present initial analyses showing that (i) experts increasingly specialize on distinct token subsets, (ii) co-activation matrices remain sparse, reflecting diverse expert usage, and (iii) routing behavior stabilizes early in training. All code, training logs, and model checkpoints are available at this https URL.

Title: From What to How: Attributing CLIP's Latent Components Reveals Unexpected Semantic Reliance

Authors: Maximilian Dreyer, Lorenz Hufe, Jim Berend, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20229
Pdf URL: https://arxiv.org/pdf/2505.20229
Copy Paste: [[2505.20229]] From What to How: Attributing CLIP's Latent Components Reveals Unexpected Semantic Reliance(https://arxiv.org/abs/2505.20229)
Keywords: robust, extraction, interpretability, transformer
Abstract: Transformer-based CLIP models are widely used for text-image probing and feature extraction, making it relevant to understand the internal mechanisms behind their predictions. While recent works show that Sparse Autoencoders (SAEs) yield interpretable latent components, they focus on what these encode and miss how they drive predictions. We introduce a scalable framework that reveals what latent components activate for, how they align with expected semantics, and how important they are to predictions. To achieve this, we adapt attribution patching for instance-wise component attributions in CLIP and highlight key faithfulness limitations of the widely used Logit Lens technique. By combining attributions with semantic alignment scores, we can automatically uncover reliance on components that encode semantically unexpected or spurious concepts. Applied across multiple CLIP variants, our method uncovers hundreds of surprising components linked to polysemous words, compound nouns, visual typography and dataset artifacts. While text embeddings remain prone to semantic ambiguity, they are more robust to spurious correlations compared to linear classifiers trained on image embeddings. A case study on skin lesion detection highlights how such classifiers can amplify hidden shortcuts, underscoring the need for holistic, mechanistic interpretability. We provide code at this https URL.

Title: Multimodal Federated Learning With Missing Modalities through Feature Imputation Network

Authors: Pranav Poudel, Aavash Chhetri, Prashnna Gyawali, Georgios Leontidis, Binod Bhattarai
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.20232
Pdf URL: https://arxiv.org/pdf/2505.20232
Copy Paste: [[2505.20232]] Multimodal Federated Learning With Missing Modalities through Feature Imputation Network(https://arxiv.org/abs/2505.20232)
Keywords: privacy, federate, generative
Abstract: Multimodal federated learning holds immense potential for collaboratively training models from multiple sources without sharing raw data, addressing both data scarcity and privacy concerns, two key challenges in healthcare. A major challenge in training multimodal federated models in healthcare is the presence of missing modalities due to multiple reasons, including variations in clinical practice, cost and accessibility constraints, retrospective data collection, privacy concerns, and occasional technical or human errors. Previous methods typically rely on publicly available real datasets or synthetic data to compensate for missing modalities. However, obtaining real datasets for every disease is impractical, and training generative models to synthesize missing modalities is computationally expensive and prone to errors due to the high dimensionality of medical data. In this paper, we propose a novel, lightweight, low-dimensional feature translator to reconstruct bottleneck features of the missing modalities. Our experiments on three different datasets (MIMIC-CXR, NIH Open-I, and CheXpert), in both homogeneous and heterogeneous settings consistently improve the performance of competitive baselines. The code and implementation details are available at: this https URL

Title: Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models

Authors: Weihao Xuan, Qingcheng Zeng, Heli Qi, Junjue Wang, Naoto Yokoya
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20236
Pdf URL: https://arxiv.org/pdf/2505.20236
Copy Paste: [[2505.20236]] Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models(https://arxiv.org/abs/2505.20236)
Keywords: large language model
Abstract: Uncertainty quantification is essential for assessing the reliability and trustworthiness of modern AI systems. Among existing approaches, verbalized uncertainty, where models express their confidence through natural language, has emerged as a lightweight and interpretable solution in large language models (LLMs). However, its effectiveness in vision-language models (VLMs) remains insufficiently studied. In this work, we conduct a comprehensive evaluation of verbalized confidence in VLMs, spanning three model categories, four task domains, and three evaluation scenarios. Our results show that current VLMs often display notable miscalibration across diverse tasks and settings. Notably, visual reasoning models (i.e., thinking with images) consistently exhibit better calibration, suggesting that modality-specific reasoning is critical for reliable uncertainty estimation. To further address calibration challenges, we introduce Visual Confidence-Aware Prompting, a two-stage prompting strategy that improves confidence alignment in multimodal settings. Overall, our study highlights the inherent miscalibration in VLMs across modalities. More broadly, our findings underscore the fundamental importance of modality alignment and model faithfulness in advancing reliable multimodal systems.

Title: DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning

Authors: Qi Cao, Ruiyi Wang, Ruiyi Zhang, Sai Ashish Somayajula, Pengtao Xie
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20241
Pdf URL: https://arxiv.org/pdf/2505.20241
Copy Paste: [[2505.20241]] DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning(https://arxiv.org/abs/2505.20241)
Keywords: large language model
Abstract: Reasoning has substantially improved the performance of large language models (LLMs) on complicated tasks. Central to the current reasoning studies, Process Reward Models (PRMs) offer a fine-grained evaluation of intermediate reasoning steps and guide the reasoning process. However, extending PRMs to multimodal large language models (MLLMs) introduces challenges. Since multimodal reasoning covers a wider range of tasks compared to text-only scenarios, the resulting distribution shift from the training to testing sets is more severe, leading to greater generalization difficulty. Training a reliable multimodal PRM, therefore, demands large and diverse datasets to ensure sufficient coverage. However, current multimodal reasoning datasets suffer from a marked quality imbalance, which degrades PRM performance and highlights the need for an effective data selection strategy. To address the issues, we introduce DreamPRM, a domain-reweighted training framework for multimodal PRMs which employs bi-level optimization. In the lower-level optimization, DreamPRM performs fine-tuning on multiple datasets with domain weights, allowing the PRM to prioritize high-quality reasoning signals and alleviating the impact of dataset quality imbalance. In the upper-level optimization, the PRM is evaluated on a separate meta-learning dataset; this feedback updates the domain weights through an aggregation loss function, thereby improving the generalization capability of trained PRM. Extensive experiments on multiple multimodal reasoning benchmarks covering both mathematical and general reasoning show that test-time scaling with DreamPRM consistently improves the performance of state-of-the-art MLLMs. Further comparisons reveal that DreamPRM's domain-reweighting strategy surpasses other data selection methods and yields higher accuracy gains than existing test-time scaling approaches.

Title: RedAHD: Reduction-Based End-to-End Automatic Heuristic Design with Large Language Models

Authors: Nguyen Thach, Aida Riahifar, Nathan Huynh, Hau Chan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.20242
Pdf URL: https://arxiv.org/pdf/2505.20242
Copy Paste: [[2505.20242]] RedAHD: Reduction-Based End-to-End Automatic Heuristic Design with Large Language Models(https://arxiv.org/abs/2505.20242)
Keywords: large language model
Abstract: Solving NP-hard combinatorial optimization problems (COPs) (e.g., traveling salesman problems (TSPs) and capacitated vehicle routing problems (CVRPs)) in practice traditionally involves handcrafting heuristics or specifying a search space for finding effective heuristics. The main challenges from these approaches, however, are the sheer amount of domain knowledge and implementation efforts required from human experts. Recently, significant progress has been made to address these challenges, particularly by using large language models (LLMs) to design heuristics within some predetermined generalized algorithmic framework (GAF, e.g., ant colony optimization and guided local search) for building key functions/components (e.g., a priori information on how promising it is to include each edge in a solution for TSP and CVRP). Although existing methods leveraging this idea have shown to yield impressive optimization performance, they are not fully end-to-end and still require considerable manual interventions. In this paper, we propose a novel end-to-end framework, named RedAHD, that enables these LLM-based heuristic design methods to operate without the need of GAFs. More specifically, RedAHD employs LLMs to automate the process of reduction, i.e., transforming the COP at hand into similar COPs that are better-understood, from which LLM-based heuristic design methods can design effective heuristics for directly solving the transformed COPs and, in turn, indirectly solving the original COP. Our experimental results, evaluated on six COPs, show that RedAHD is capable of designing heuristics with competitive or improved results over the state-of-the-art methods with minimal human involvement.

Title: It's High Time: A Survey of Temporal Information Retrieval and Question Answering

Authors: Bhawna Piryani, Abdelrahman Abdullah, Jamshid Mozafari, Avishek Anand, Adam Jatowt
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.20243
Pdf URL: https://arxiv.org/pdf/2505.20243
Copy Paste: [[2505.20243]] It's High Time: A Survey of Temporal Information Retrieval and Question Answering(https://arxiv.org/abs/2505.20243)
Keywords: robust, transformer, large language model
Abstract: Time plays a critical role in how information is generated, retrieved, and interpreted. In this survey, we provide a comprehensive overview of Temporal Information Retrieval and Temporal Question Answering, two research areas aimed at handling and understanding time-sensitive information. As the amount of time-stamped content from sources like news articles, web archives, and knowledge bases increases, systems must address challenges such as detecting temporal intent, normalizing time expressions, ordering events, and reasoning over evolving or ambiguous facts. These challenges are critical across many dynamic and time-sensitive domains, from news and encyclopedias to science, history, and social media. We review both traditional approaches and modern neural methods, including those that use transformer models and Large Language Models (LLMs). We also review recent advances in temporal language modeling, multi-hop reasoning, and retrieval-augmented generation (RAG), alongside benchmark datasets and evaluation strategies that test temporal robustness, recency awareness, and generalization.

Title: KnowTrace: Bootstrapping Iterative Retrieval-Augmented Generation with Structured Knowledge Tracing

Authors: Rui Li, Quanyu Dai, Zeyu Zhang, Xu Chen, Zhenhua Dong, Ji-Rong Wen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20245
Pdf URL: https://arxiv.org/pdf/2505.20245
Copy Paste: [[2505.20245]] KnowTrace: Bootstrapping Iterative Retrieval-Augmented Generation with Structured Knowledge Tracing(https://arxiv.org/abs/2505.20245)
Keywords: large language model
Abstract: Recent advances in retrieval-augmented generation (RAG) furnish large language models (LLMs) with iterative retrievals of relevant information to handle complex multi-hop questions. These methods typically alternate between LLM reasoning and retrieval to accumulate external information into the LLM's context. However, the ever-growing context inherently imposes an increasing burden on the LLM to perceive connections among critical information pieces, with futile reasoning steps further exacerbating this overload issue. In this paper, we present KnowTrace, an elegant RAG framework to (1) mitigate the context overload and (2) bootstrap higher-quality multi-step reasoning. Instead of simply piling the retrieved contents, KnowTrace autonomously traces out desired knowledge triplets to organize a specific knowledge graph relevant to the input question. Such a structured workflow not only empowers the LLM with an intelligible context for inference, but also naturally inspires a reflective mechanism of knowledge backtracing to identify contributive LLM generations as process supervision data for self-bootstrapping. Extensive experiments show that KnowTrace consistently surpasses existing methods across three multi-hop question answering benchmarks, and the bootstrapped version further amplifies the gains.

Title: WXImpactBench: A Disruptive Weather Impact Understanding Benchmark for Evaluating Large Language Models

Authors: Yongan Yu, Qingchen Hu, Xianda Du, Jiayin Wang, Fengran Mo, Renee Sieber
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20249
Pdf URL: https://arxiv.org/pdf/2505.20249
Copy Paste: [[2505.20249]] WXImpactBench: A Disruptive Weather Impact Understanding Benchmark for Evaluating Large Language Models(https://arxiv.org/abs/2505.20249)
Keywords: protect, large language model
Abstract: Climate change adaptation requires the understanding of disruptive weather impacts on society, where large language models (LLMs) might be applicable. However, their effectiveness is under-explored due to the difficulty of high-quality corpus collection and the lack of available benchmarks. The climate-related events stored in regional newspapers record how communities adapted and recovered from disasters. However, the processing of the original corpus is non-trivial. In this study, we first develop a disruptive weather impact dataset with a four-stage well-crafted construction pipeline. Then, we propose WXImpactBench, the first benchmark for evaluating the capacity of LLMs on disruptive weather impacts. The benchmark involves two evaluation tasks, multi-label classification and ranking-based question answering. Extensive experiments on evaluating a set of LLMs provide first-hand analysis of the challenges in developing disruptive weather impact understanding and climate change adaptation systems. The constructed dataset and the code for the evaluation framework are available to help society protect against vulnerabilities from disasters.

Title: Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs

Authors: Xiangchen Song, Aashiq Muhamed, Yujia Zheng, Lingjing Kong, Zeyu Tang, Mona T. Diab, Virginia Smith, Kun Zhang
Subjects: cs.LG, cs.AI, cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2505.20254
Pdf URL: https://arxiv.org/pdf/2505.20254
Copy Paste: [[2505.20254]] Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs(https://arxiv.org/abs/2505.20254)
Keywords: robust, interpretability
Abstract: Sparse Autoencoders (SAEs) are a prominent tool in mechanistic interpretability (MI) for decomposing neural network activations into interpretable features. However, the aspiration to identify a canonical set of features is challenged by the observed inconsistency of learned SAE features across different training runs, undermining the reliability and efficiency of MI research. This position paper argues that mechanistic interpretability should prioritize feature consistency in SAEs -- the reliable convergence to equivalent feature sets across independent runs. We propose using the Pairwise Dictionary Mean Correlation Coefficient (PW-MCC) as a practical metric to operationalize consistency and demonstrate that high levels are achievable (0.80 for TopK SAEs on LLM activations) with appropriate architectural choices. Our contributions include detailing the benefits of prioritizing consistency; providing theoretical grounding and synthetic validation using a model organism, which verifies PW-MCC as a reliable proxy for ground-truth recovery; and extending these findings to real-world LLM data, where high feature consistency strongly correlates with the semantic similarity of learned feature explanations. We call for a community-wide shift towards systematically measuring feature consistency to foster robust cumulative progress in MI.

Title: AniCrafter: Customizing Realistic Human-Centric Animation via Avatar-Background Conditioning in Video Diffusion Models

Authors: Muyao Niu, Mingdeng Cao, Yifan Zhan, Qingtian Zhu, Mingze Ma, Jiancheng Zhao, Yanhong Zeng, Zhihang Zhong, Xiao Sun, Yinqiang Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20255
Pdf URL: https://arxiv.org/pdf/2505.20255
Copy Paste: [[2505.20255]] AniCrafter: Customizing Realistic Human-Centric Animation via Avatar-Background Conditioning in Video Diffusion Models(https://arxiv.org/abs/2505.20255)
Keywords: diffusion
Abstract: Recent advances in video diffusion models have significantly improved character animation techniques. However, current approaches rely on basic structural conditions such as DWPose or SMPL-X to animate character images, limiting their effectiveness in open-domain scenarios with dynamic backgrounds or challenging human poses. In this paper, we introduce $\textbf{AniCrafter}$, a diffusion-based human-centric animation model that can seamlessly integrate and animate a given character into open-domain dynamic backgrounds while following given human motion sequences. Built on cutting-edge Image-to-Video (I2V) diffusion architectures, our model incorporates an innovative "avatar-background" conditioning mechanism that reframes open-domain human-centric animation as a restoration task, enabling more stable and versatile animation outputs. Experimental results demonstrate the superior performance of our method. Codes will be available at this https URL.

Title: Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration

Authors: Hao Zhong, Muzhi Zhu, Zongze Du, Zheng Huang, Canyu Zhao, Mingyu Liu, Wen Wang, Hao Chen, Chunhua Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20256
Pdf URL: https://arxiv.org/pdf/2505.20256
Copy Paste: [[2505.20256]] Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration(https://arxiv.org/abs/2505.20256)
Keywords: segmentation
Abstract: Long-horizon video-audio reasoning and fine-grained pixel understanding impose conflicting requirements on omnimodal models: dense temporal coverage demands many low-resolution frames, whereas precise grounding calls for high-resolution inputs. We tackle this trade-off with a two-system architecture: a Global Reasoning System selects informative keyframes and rewrites the task at low spatial cost, while a Detail Understanding System performs pixel-level grounding on the selected high-resolution snippets. Because ``optimal'' keyframe selection and reformulation are ambiguous and hard to supervise, we formulate them as a reinforcement learning (RL) problem and present Omni-R1, an end-to-end RL framework built on Group Relative Policy Optimization. Omni-R1 trains the Global Reasoning System through hierarchical rewards obtained via online collaboration with the Detail Understanding System, requiring only one epoch of RL on small task splits. Experiments on two challenging benchmarks, namely Referring Audio-Visual Segmentation (RefAVS) and Reasoning Video Object Segmentation (REVOS), show that Omni-R1 not only surpasses strong supervised baselines but also outperforms specialized state-of-the-art models, while substantially improving out-of-domain generalization and mitigating multimodal hallucination. Our results demonstrate the first successful application of RL to large-scale omnimodal reasoning and highlight a scalable path toward universally foundation models.

Title: Lifelong Safety Alignment for Language Models

Authors: Haoyu Wang, Zeyu Qin, Yifei Zhao, Chao Du, Min Lin, Xueqian Wang, Tianyu Pang
Subjects: cs.CR, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20259
Pdf URL: https://arxiv.org/pdf/2505.20259
Copy Paste: [[2505.20259]] Lifelong Safety Alignment for Language Models(https://arxiv.org/abs/2505.20259)
Keywords: defense, attack, robust
Abstract: LLMs have made impressive progress, but their growing capabilities also expose them to highly flexible jailbreaking attacks designed to bypass safety alignment. While many existing defenses focus on known types of attacks, it is more critical to prepare LLMs for unseen attacks that may arise during deployment. To address this, we propose a lifelong safety alignment framework that enables LLMs to continuously adapt to new and evolving jailbreaking strategies. Our framework introduces a competitive setup between two components: a Meta-Attacker, trained to actively discover novel jailbreaking strategies, and a Defender, trained to resist them. To effectively warm up the Meta-Attacker, we first leverage the GPT-4o API to extract key insights from a large collection of jailbreak-related research papers. Through iterative training, the first iteration Meta-Attacker achieves a 73% attack success rate (ASR) on RR and a 57% transfer ASR on LAT using only single-turn attacks. Meanwhile, the Defender progressively improves its robustness and ultimately reduces the Meta-Attacker's success rate to just 7%, enabling safer and more reliable deployment of LLMs in open-ended environments. The code is available at this https URL.

Title: HaloGS: Loose Coupling of Compact Geometry and Gaussian Splats for 3D Scenes

Authors: Changjian Jiang, Kerui Ren, Linning Xu, Jiong Chen, Jiangmiao Pang, Yu Zhang, Bo Dai, Mulin Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20267
Pdf URL: https://arxiv.org/pdf/2505.20267
Copy Paste: [[2505.20267]] HaloGS: Loose Coupling of Compact Geometry and Gaussian Splats for 3D Scenes(https://arxiv.org/abs/2505.20267)
Keywords: robust
Abstract: High fidelity 3D reconstruction and rendering hinge on capturing precise geometry while preserving photo realistic detail. Most existing methods either fuse these goals into a single cumbersome model or adopt hybrid schemes whose uniform primitives lead to a trade off between efficiency and fidelity. In this paper, we introduce HaloGS, a dual representation that loosely couples coarse triangles for geometry with Gaussian primitives for appearance, motivated by the lightweight classic geometry representations and their proven efficiency in real world applications. Our design yields a compact yet expressive model capable of photo realistic rendering across both indoor and outdoor environments, seamlessly adapting to varying levels of scene complexity. Experiments on multiple benchmark datasets demonstrate that our method yields both compact, accurate geometry and high fidelity renderings, especially in challenging scenarios where robust geometric structure make a clear difference.

Title: In-Context Brush: Zero-shot Customized Subject Insertion with Context-Aware Latent Space Manipulation

Authors: Yu Xu, Fan Tang, You Wu, Lin Gao, Oliver Deussen, Hongbin Yan, Jintao Li, Juan Cao, Tong-Yee Lee
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2505.20271
Pdf URL: https://arxiv.org/pdf/2505.20271
Copy Paste: [[2505.20271]] In-Context Brush: Zero-shot Customized Subject Insertion with Context-Aware Latent Space Manipulation(https://arxiv.org/abs/2505.20271)
Keywords: diffusion
Abstract: Recent advances in diffusion models have enhanced multimodal-guided visual generation, enabling customized subject insertion that seamlessly "brushes" user-specified objects into a given image guided by textual prompts. However, existing methods often struggle to insert customized subjects with high fidelity and align results with the user's intent through textual prompts. In this work, we propose "In-Context Brush", a zero-shot framework for customized subject insertion by reformulating the task within the paradigm of in-context learning. Without loss of generality, we formulate the object image and the textual prompts as cross-modal demonstrations, and the target image with the masked region as the query. The goal is to inpaint the target image with the subject aligning textual prompts without model tuning. Building upon a pretrained MMDiT-based inpainting network, we perform test-time enhancement via dual-level latent space manipulation: intra-head "latent feature shifting" within each attention head that dynamically shifts attention outputs to reflect the desired subject semantics and inter-head "attention reweighting" across different heads that amplifies prompt controllability through differential attention prioritization. Extensive experiments and applications demonstrate that our approach achieves superior identity preservation, text alignment, and image quality compared to existing state-of-the-art methods, without requiring dedicated training or additional data collection.

Title: Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning

Authors: Meng Cao, Haoze Zhao, Can Zhang, Xiaojun Chang, Ian Reid, Xiaodan Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20272
Pdf URL: https://arxiv.org/pdf/2505.20272
Copy Paste: [[2505.20272]] Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning(https://arxiv.org/abs/2505.20272)
Keywords: interpretability
Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive general capabilities across a wide range of multi-modal tasks. However, the reasoning processes of LVLMs often suffer from unreliable outputs and limited interpretability. To address this, grounded visual reasoning has emerged as a promising paradigm that enforces responses anchored on salient visual evidence regions. However, existing approaches typically rely on costly supervision such as bounding box annotations, chain-of-thought rationale or external tool calls, limiting their scalability. In this work, we propose Ground-R1, a reinforcement learning framework that enables grounded visual reasoning without requiring explicit evidence or rationale annotations. Ground-R1 consists of a grounding phase that generates evidence region rollouts based on format constraints, and an answering phase that produces responses guided by both answer correctness and format adherence rewards. Extensive experiments across multiple visual reasoning benchmarks manifest that Ground-R1 achieves superior performance and exhibits emergent cognitive behaviors such as uncertainty awareness, spatial perception, and iterative refinement, offering a scalable and interpretable alternative to existing approaches.

Title: ImgEdit: A Unified Image Editing Dataset and Benchmark

Authors: Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, Li Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20275
Pdf URL: https://arxiv.org/pdf/2505.20275
Copy Paste: [[2505.20275]] ImgEdit: A Unified Image Editing Dataset and Benchmark(https://arxiv.org/abs/2505.20275)
Keywords: generative, segmentation
Abstract: Recent advancements in generative models have enabled high-fidelity text-to-image generation. However, open-source image-editing models still lag behind their proprietary counterparts, primarily due to limited high-quality data and insufficient benchmarks. To overcome these limitations, we introduce ImgEdit, a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated edit pairs, which contain both novel and complex single-turn edits, as well as challenging multi-turn tasks. To ensure the data quality, we employ a multi-stage pipeline that integrates a cutting-edge vision-language model, a detection model, a segmentation model, alongside task-specific in-painting procedures and strict post-processing. ImgEdit surpasses existing datasets in both task novelty and data quality. Using ImgEdit, we train ImgEdit-E1, an editing model using Vision Language Model to process the reference image and editing prompt, which outperforms existing open-source models on multiple tasks, highlighting the value of ImgEdit and model design. For comprehensive evaluation, we introduce ImgEdit-Bench, a benchmark designed to evaluate image editing performance in terms of instruction adherence, editing quality, and detail preservation. It includes a basic testsuite, a challenging single-turn suite, and a dedicated multi-turn suite. We evaluate both open-source and proprietary models, as well as ImgEdit-E1, providing deep analysis and actionable insights into the current behavior of image-editing models. The source data are publicly available on this https URL.

Title: Does quantization affect models' performance on long-context tasks?

Authors: Anmol Mekala, Anirudh Atmakuru, Yixiao Song, Marzena Karpinska, Mohit Iyyer
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20276
Pdf URL: https://arxiv.org/pdf/2505.20276
Copy Paste: [[2505.20276]] Does quantization affect models' performance on long-context tasks?(https://arxiv.org/abs/2505.20276)
Keywords: robust, large language model
Abstract: Large language models (LLMs) now support context windows exceeding 128K tokens, but this comes with significant memory requirements and high inference latency. Quantization can mitigate these costs, but may degrade performance. In this work, we present the first systematic evaluation of quantized LLMs on tasks with long-inputs (>64K tokens) and long-form outputs. Our evaluation spans 9.7K test examples, five quantization methods (FP8, GPTQ-int8, AWQ-int4, GPTQ-int4, BNB-nf4), and five models (Llama-3.1 8B and 70B; Qwen-2.5 7B, 32B, and 72B). We find that, on average, 8-bit quantization preserves accuracy (~0.8% drop), whereas 4-bit methods lead to substantial losses, especially for tasks involving long context inputs (drops of up to 59%). This degradation tends to worsen when the input is in a language other than English. Crucially, the effects of quantization depend heavily on the quantization method, model, and task. For instance, while Qwen-2.5 72B remains robust under BNB-nf4, Llama-3.1 70B experiences a 32% performance drop on the same task. These findings highlight the importance of a careful, task-specific evaluation before deploying quantized LLMs, particularly in long-context scenarios and with languages other than English.

Title: OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction

Authors: Haonan Zhang, Run Luo, Xiong Liu, Yuchuan Wu, Ting-En Lin, Pengpeng Zeng, Qiang Qu, Feiteng Fang, Min Yang, Lianli Gao, Jingkuan Song, Fei Huang, Yongbin Li
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.20277
Pdf URL: https://arxiv.org/pdf/2505.20277
Copy Paste: [[2505.20277]] OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction(https://arxiv.org/abs/2505.20277)
Keywords: large language model
Abstract: Role-Playing Agents (RPAs), benefiting from large language models, is an emerging interactive AI system that simulates roles or characters with diverse personalities. However, existing methods primarily focus on mimicking dialogues among roles in textual form, neglecting the role's voice traits (e.g., voice style and emotions) as playing a crucial effect in interaction, which tends to be more immersive experiences in realistic scenarios. Towards this goal, we propose OmniCharacter, a first seamless speech-language personality interaction model to achieve immersive RPAs with low latency. Specifically, OmniCharacter enables agents to consistently exhibit role-specific personality traits and vocal traits throughout the interaction, enabling a mixture of speech and language responses. To align the model with speech-language scenarios, we construct a dataset named OmniCharacter-10K, which involves more distinctive characters (20), richly contextualized multi-round dialogue (10K), and dynamic speech response (135K). Experimental results showcase that our method yields better responses in terms of both content and style compared to existing RPAs and mainstream speech-language models, with a response latency as low as 289ms. Code and dataset are available at this https URL.

Title: The Coverage Principle: A Framework for Understanding Compositional Generalization

Authors: Hoyeon Chang, Jinho Park, Hanseul Cho, Sohee Yang, Miyoung Ko, Hyeonbin Hwang, Seungpil Won, Dohaeng Lee, Youbin Ahn, Minjoon Seo
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.20278
Pdf URL: https://arxiv.org/pdf/2505.20278
Copy Paste: [[2505.20278]] The Coverage Principle: A Framework for Understanding Compositional Generalization(https://arxiv.org/abs/2505.20278)
Keywords: transformer, large language model
Abstract: Large language models excel at pattern matching, yet often fall short in systematic compositional generalization. We propose the coverage principle: a data-centric framework showing that models relying primarily on pattern matching for compositional tasks cannot reliably generalize beyond substituting fragments that yield identical results when used in the same contexts. We demonstrate that this framework has a strong predictive power for the generalization capabilities of Transformers. First, we derive and empirically confirm that the training data required for two-hop generalization grows at least quadratically with the token set size, and the training data efficiency does not improve with 20x parameter scaling. Second, for compositional tasks with path ambiguity where one variable affects the output through multiple computational paths, we show that Transformers learn context-dependent state representations that undermine both performance and interoperability. Third, Chain-of-Thought supervision improves training data efficiency for multi-hop tasks but still struggles with path ambiguity. Finally, we outline a \emph{mechanism-based} taxonomy that distinguishes three ways neural networks can generalize: structure-based (bounded by coverage), property-based (leveraging algebraic invariances), and shared-operator (through function reuse). This conceptual lens contextualizes our results and highlights where new architectural ideas are needed to achieve systematic compositionally. Overall, the coverage principle provides a unified lens for understanding compositional reasoning, and underscores the need for fundamental architectural or training innovations to achieve truly systematic compositionality.

Title: VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

Authors: Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Tianlong Chen, Jiachen Li, Zhengzhong Tu, Zhangyang Wang, Rakesh Ranjan
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.20279
Pdf URL: https://arxiv.org/pdf/2505.20279
Copy Paste: [[2505.20279]] VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction(https://arxiv.org/abs/2505.20279)
Keywords: robust
Abstract: The rapid advancement of Large Multimodal Models (LMMs) for 2D images and videos has motivated extending these models to understand 3D scenes, aiming for human-like visual-spatial intelligence. Nevertheless, achieving deep spatial understanding comparable to human capabilities poses significant challenges in model encoding and data acquisition. Existing methods frequently depend on external depth sensors for geometry capture or utilize off-the-shelf algorithms for pre-constructing 3D maps, thereby limiting their scalability, especially with prevalent monocular video inputs and for time-sensitive applications. In this work, we introduce VLM-3R, a unified framework for Vision-Language Models (VLMs) that incorporates 3D Reconstructive instruction tuning. VLM-3R processes monocular video frames by employing a geometry encoder to derive implicit 3D tokens that represent spatial understanding. Leveraging our Spatial-Visual-View Fusion and over 200K curated 3D reconstructive instruction tuning question-answer (QA) pairs, VLM-3R effectively aligns real-world spatial context with language instructions. This enables monocular 3D spatial assistance and embodied reasoning. To facilitate the evaluation of temporal reasoning, we introduce the Vision-Spatial-Temporal Intelligence benchmark, featuring over 138.6K QA pairs across five distinct tasks focused on evolving spatial relationships. Extensive experiments demonstrate that our model, VLM-3R, not only facilitates robust visual-spatial reasoning but also enables the understanding of temporal 3D context changes, excelling in both accuracy and scalability.

Title: One-shot Entropy Minimization

Authors: Zitian Gao, Lynx Chen, Joey Zhou, Bryan Dai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20282
Pdf URL: https://arxiv.org/pdf/2505.20282
Copy Paste: [[2505.20282]] One-shot Entropy Minimization(https://arxiv.org/abs/2505.20282)
Keywords: large language model
Abstract: We trained 13,440 large language models and found that entropy minimization requires only a single unlabeled data and 10 steps optimization to achieve performance improvements comparable to or even greater than those obtained using thousands of data and carefully designed rewards in rule-based reinforcement learning. This striking result may prompt a rethinking of post-training paradigms for large language models. Our code is avaliable at this https URL.

Title: MASKSEARCH: A Universal Pre-Training Framework to Enhance Agentic Search Capability

Authors: Weiqi Wu, Xin Guan, Shen Huang, Yong Jiang, Pengjun Xie, Fei Huang, Jiuxin Cao, Hai Zhao, Jingren Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20285
Pdf URL: https://arxiv.org/pdf/2505.20285
Copy Paste: [[2505.20285]] MASKSEARCH: A Universal Pre-Training Framework to Enhance Agentic Search Capability(https://arxiv.org/abs/2505.20285)
Keywords: generative, large language model
Abstract: Retrieval-Augmented Language Models (RALMs) represent a classic paradigm where models enhance generative capabilities using external knowledge retrieved via a specialized module. Recent advancements in Agent techniques enable Large Language Models (LLMs) to autonomously utilize tools for retrieval, planning, and reasoning. While existing training-based methods show promise, their agentic abilities are limited by inherent characteristics of the task-specific data used during training. To further enhance the universal search capability of agents, we propose a novel pre-training framework, MASKSEARCH. In the pre-training stage, we introduce the Retrieval Augmented Mask Prediction (RAMP) task, where the model learns to leverage search tools to fill masked spans on a large number of pre-training data, thus acquiring universal retrieval and reasoning capabilities for LLMs. After that, the model is trained on downstream tasks to achieve further improvement. We apply both Supervised Fine-tuning (SFT) and Reinforcement Learning (RL) for training. For SFT, we combine agent-based and distillation-based methods to generate training data, starting with a multi-agent system consisting of a planner, rewriter, observer, and followed by a self-evolving teacher model. While for RL, we employ DAPO as the training framework and adopt a hybrid reward system consisting of answer rewards and format rewards. Additionally, we introduce a curriculum learning approach that allows the model to learn progressively from easier to more challenging instances based on the number of masked spans. We evaluate the effectiveness of our framework in the scenario of open-domain multi-hop question answering. Through extensive experiments, we demonstrate that MASKSEARCH significantly enhances the performance of LLM-based search agents on both in-domain and out-of-domain downstream tasks.

Title: Hierarchical Masked Autoregressive Models with Low-Resolution Token Pivots

Authors: Guangting Zheng, Yehao Li, Yingwei Pan, Jiajun Deng, Ting Yao, Yanyong Zhang, Tao Mei
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2505.20288
Pdf URL: https://arxiv.org/pdf/2505.20288
Copy Paste: [[2505.20288]] Hierarchical Masked Autoregressive Models with Low-Resolution Token Pivots(https://arxiv.org/abs/2505.20288)
Keywords: diffusion, transformer, generative
Abstract: Autoregressive models have emerged as a powerful generative paradigm for visual generation. The current de-facto standard of next token prediction commonly operates over a single-scale sequence of dense image tokens, and is incapable of utilizing global context especially for early tokens prediction. In this paper, we introduce a new autoregressive design to model a hierarchy from a few low-resolution image tokens to the typical dense image tokens, and delve into a thorough hierarchical dependency across multi-scale image tokens. Technically, we present a Hierarchical Masked Autoregressive models (Hi-MAR) that pivot on low-resolution image tokens to trigger hierarchical autoregressive modeling in a multi-phase manner. Hi-MAR learns to predict a few image tokens in low resolution, functioning as intermediary pivots to reflect global structure, in the first phase. Such pivots act as the additional guidance to strengthen the next autoregressive modeling phase by shaping global structural awareness of typical dense image tokens. A new Diffusion Transformer head is further devised to amplify the global context among all tokens for mask token prediction. Extensive evaluations on both class-conditional and text-to-image generation tasks demonstrate that Hi-MAR outperforms typical AR baselines, while requiring fewer computational costs. Code is available at this https URL.

Title: OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation

Authors: Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Chongyang Ma, Jiebo Luo, Li Yuan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20292
Pdf URL: https://arxiv.org/pdf/2505.20292
Copy Paste: [[2505.20292]] OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation(https://arxiv.org/abs/2505.20292)
Keywords: robust
Abstract: Subject-to-Video (S2V) generation aims to create videos that faithfully incorporate reference content, providing enhanced flexibility in the production of videos. To establish the infrastructure for S2V generation, we propose OpenS2V-Nexus, consisting of (i) OpenS2V-Eval, a fine-grained benchmark, and (ii) OpenS2V-5M, a million-scale dataset. In contrast to existing S2V benchmarks inherited from VBench that focus on global and coarse-grained assessment of generated videos, OpenS2V-Eval focuses on the model's ability to generate subject-consistent videos with natural subject appearance and identity fidelity. For these purposes, OpenS2V-Eval introduces 180 prompts from seven major categories of S2V, which incorporate both real and synthetic test data. Furthermore, to accurately align human preferences with S2V benchmarks, we propose three automatic metrics, NexusScore, NaturalScore and GmeScore, to separately quantify subject consistency, naturalness, and text relevance in generated videos. Building on this, we conduct a comprehensive evaluation of 16 representative S2V models, highlighting their strengths and weaknesses across different content. Moreover, we create the first open-source large-scale S2V generation dataset OpenS2V-5M, which consists of five million high-quality 720P subject-text-video triples. Specifically, we ensure subject-information diversity in our dataset by (1) segmenting subjects and building pairing information via cross-video associations and (2) prompting GPT-Image-1 on raw frames to synthesize multi-view representations. Through OpenS2V-Nexus, we deliver a robust infrastructure to accelerate future S2V generation research.

Title: Enhancing the Comprehensibility of Text Explanations via Unsupervised Concept Discovery

Authors: Yifan Sun, Danding Wang, Qiang Sheng, Juan Cao, Jintao Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20293
Pdf URL: https://arxiv.org/pdf/2505.20293
Copy Paste: [[2505.20293]] Enhancing the Comprehensibility of Text Explanations via Unsupervised Concept Discovery(https://arxiv.org/abs/2505.20293)
Keywords: large language model
Abstract: Concept-based explainable approaches have emerged as a promising method in explainable AI because they can interpret models in a way that aligns with human reasoning. However, their adaption in the text domain remains limited. Most existing methods rely on predefined concept annotations and cannot discover unseen concepts, while other methods that extract concepts without supervision often produce explanations that are not intuitively comprehensible to humans, potentially diminishing user trust. These methods fall short of discovering comprehensible concepts automatically. To address this issue, we propose \textbf{ECO-Concept}, an intrinsically interpretable framework to discover comprehensible concepts with no concept annotations. ECO-Concept first utilizes an object-centric architecture to extract semantic concepts automatically. Then the comprehensibility of the extracted concepts is evaluated by large language models. Finally, the evaluation result guides the subsequent model fine-tuning to obtain more understandable explanations. Experiments show that our method achieves superior performance across diverse tasks. Further concept evaluations validate that the concepts learned by ECO-Concept surpassed current counterparts in comprehensibility.

Title: Self-reflective Uncertainties: Do LLMs Know Their Internal Answer Distribution?

Authors: Michael Kirchhof, Luca Füger, Adam Goliński, Eeshan Gunesh Dhekane, Arno Blaas, Sinead Williamson
Subjects: cs.CL, cs.AI, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.20295
Pdf URL: https://arxiv.org/pdf/2505.20295
Copy Paste: [[2505.20295]] Self-reflective Uncertainties: Do LLMs Know Their Internal Answer Distribution?(https://arxiv.org/abs/2505.20295)
Keywords: large language model
Abstract: To reveal when a large language model (LLM) is uncertain about a response, uncertainty quantification commonly produces percentage numbers along with the output. But is this all we can do? We argue that in the output space of LLMs, the space of strings, exist strings expressive enough to summarize the distribution over output strings the LLM deems possible. We lay a foundation for this new avenue of uncertainty explication and present SelfReflect, a theoretically-motivated metric to assess how faithfully a string summarizes an LLM's internal answer distribution. We show that SelfReflect is able to discriminate even subtle differences of candidate summary strings and that it aligns with human judgement, outperforming alternative metrics such as LLM judges and embedding comparisons. With SelfReflect, we investigate a number of self-summarization methods and find that even state-of-the-art reasoning models struggle to explicate their internal uncertainty. But we find that faithful summarizations can be generated by sampling and summarizing. Our metric enables future works towards this universal form of LLM uncertainties.

Title: Reasoning LLMs are Wandering Solution Explorers

Authors: Jiahao Lu, Ziwei Xu, Mohan Kankanhalli
Subjects: cs.CL, cs.AI, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2505.20296
Pdf URL: https://arxiv.org/pdf/2505.20296
Copy Paste: [[2505.20296]] Reasoning LLMs are Wandering Solution Explorers(https://arxiv.org/abs/2505.20296)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated impressive reasoning abilities through test-time computation (TTC) techniques such as chain-of-thought prompting and tree-based reasoning. However, we argue that current reasoning LLMs (RLLMs) lack the ability to systematically explore the solution space. This paper formalizes what constitutes systematic problem solving and identifies common failure modes that reveal reasoning LLMs to be wanderers rather than systematic explorers. Through qualitative and quantitative analysis across multiple state-of-the-art LLMs, we uncover persistent issues: invalid reasoning steps, redundant explorations, hallucinated or unfaithful conclusions, and so on. Our findings suggest that current models' performance can appear to be competent on simple tasks yet degrade sharply as complexity increases. Based on the findings, we advocate for new metrics and tools that evaluate not just final outputs but the structure of the reasoning process itself.

Title: DiSA: Diffusion Step Annealing in Autoregressive Image Generation

Authors: Qinyu Zhao, Jaskirat Singh, Ming Xu, Akshay Asthana, Stephen Gould, Liang Zheng
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.20297
Pdf URL: https://arxiv.org/pdf/2505.20297
Copy Paste: [[2505.20297]] DiSA: Diffusion Step Annealing in Autoregressive Image Generation(https://arxiv.org/abs/2505.20297)
Keywords: diffusion
Abstract: An increasing number of autoregressive models, such as MAR, FlowAR, xAR, and Harmon adopt diffusion sampling to improve the quality of image generation. However, this strategy leads to low inference efficiency, because it usually takes 50 to 100 steps for diffusion to sample a token. This paper explores how to effectively address this issue. Our key motivation is that as more tokens are generated during the autoregressive process, subsequent tokens follow more constrained distributions and are easier to sample. To intuitively explain, if a model has generated part of a dog, the remaining tokens must complete the dog and thus are more constrained. Empirical evidence supports our motivation: at later generation stages, the next tokens can be well predicted by a multilayer perceptron, exhibit low variance, and follow closer-to-straight-line denoising paths from noise to tokens. Based on our finding, we introduce diffusion step annealing (DiSA), a training-free method which gradually uses fewer diffusion steps as more tokens are generated, e.g., using 50 steps at the beginning and gradually decreasing to 5 steps at later stages. Because DiSA is derived from our finding specific to diffusion in autoregressive models, it is complementary to existing acceleration methods designed for diffusion alone. DiSA can be implemented in only a few lines of code on existing models, and albeit simple, achieves $5-10\times$ faster inference for MAR and Harmon and $1.4-2.5\times$ for FlowAR and xAR, while maintaining the generation quality.