2025-04-15

Title: Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams

Authors: Ruoxin Xiong, Yanyu Wang, Suat Gunhan, Yimin Zhu, Charles Berryman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08779
Pdf URL: https://arxiv.org/pdf/2504.08779
Copy Paste: [[2504.08779]] Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams(https://arxiv.org/abs/2504.08779)
Keywords: large language model
Abstract: The growing complexity of construction management (CM) projects, coupled with challenges such as strict regulatory requirements and labor shortages, requires specialized analytical tools that streamline project workflow and enhance performance. Although large language models (LLMs) have demonstrated exceptional performance in general reasoning tasks, their effectiveness in tackling CM-specific challenges, such as precise quantitative analysis and regulatory interpretation, remains inadequately explored. To bridge this gap, this study introduces CMExamSet, a comprehensive benchmarking dataset comprising 689 authentic multiple-choice questions sourced from four nationally accredited CM certification exams. Our zero-shot evaluation assesses overall accuracy, subject areas (e.g., construction safety), reasoning complexity (single-step and multi-step), and question formats (text-only, figure-referenced, and table-referenced). The results indicate that GPT-4o and Claude 3.7 surpass typical human pass thresholds (70%), with average accuracies of 82% and 83%, respectively. Additionally, both models performed better on single-step tasks, with accuracies of 85.7% (GPT-4o) and 86.7% (Claude 3.7). Multi-step tasks were more challenging, reducing performance to 76.5% and 77.6%, respectively. Furthermore, both LLMs show significant limitations on figure-referenced questions, with accuracies dropping to approximately 40%. Our error pattern analysis further reveals that conceptual misunderstandings are the most common (44.4% and 47.9%), underscoring the need for enhanced domain-specific reasoning models. These findings underscore the potential of LLMs as valuable supplementary analytical tools in CM, while highlighting the need for domain-specific refinements and sustained human oversight in complex decision making.

Title: Efficient Evaluation of Large Language Models via Collaborative Filtering

Authors: Xu-Xiang Zhong, Chao Yi, Han-Jia Ye
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2504.08781
Pdf URL: https://arxiv.org/pdf/2504.08781
Copy Paste: [[2504.08781]] Efficient Evaluation of Large Language Models via Collaborative Filtering(https://arxiv.org/abs/2504.08781)
Keywords: large language model
Abstract: With the development of Large Language Models (LLMs), numerous benchmarks have been proposed to measure and compare the capabilities of different LLMs. However, evaluating LLMs is costly due to the large number of test instances and their slow inference speed. In this paper, we aim to explore how to efficiently estimate a model's real performance on a given benchmark based on its evaluation results on a small number of instances sampled from the benchmark. Inspired by Collaborative Filtering (CF) in Recommendation Systems (RS), we treat LLMs as users and test instances as items and propose a two-stage method. In the first stage, we treat instance selection as recommending products to users to choose instances that can easily distinguish model performance. In the second stage, we see performance prediction as rating prediction problem in RS to predict the target LLM's behavior on unselected instances. Experiments on multiple LLMs and datasets imply that our method can accurately estimate the target model's performance while largely reducing its inference overhead.

Title: Embedding Hidden Adversarial Capabilities in Pre-Trained Diffusion Models

Authors: Lucas Beerens, Desmond J. Higham
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2504.08782
Pdf URL: https://arxiv.org/pdf/2504.08782
Copy Paste: [[2504.08782]] Embedding Hidden Adversarial Capabilities in Pre-Trained Diffusion Models(https://arxiv.org/abs/2504.08782)
Keywords: security, defense, attack, robust, steal, diffusion, generative
Abstract: We introduce a new attack paradigm that embeds hidden adversarial capabilities directly into diffusion models via fine-tuning, without altering their observable behavior or requiring modifications during inference. Unlike prior approaches that target specific images or adjust the generation process to produce adversarial outputs, our method integrates adversarial functionality into the model itself. The resulting tampered model generates high-quality images indistinguishable from those of the original, yet these images cause misclassification in downstream classifiers at a high rate. The misclassification can be targeted to specific output classes. Users can employ this compromised model unaware of its embedded adversarial nature, as it functions identically to a standard diffusion model. We demonstrate the effectiveness and stealthiness of our approach, uncovering a covert attack vector that raises new security concerns. These findings expose a risk arising from the use of externally-supplied models and highlight the urgent need for robust model verification and defense mechanisms against hidden threats in generative models. The code is available at this https URL .

Title: Enhancing NER Performance in Low-Resource Pakistani Languages using Cross-Lingual Data Augmentation

Authors: Toqeer Ehsan, Thamar Solorio
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2504.08792
Pdf URL: https://arxiv.org/pdf/2504.08792
Copy Paste: [[2504.08792]] Enhancing NER Performance in Low-Resource Pakistani Languages using Cross-Lingual Data Augmentation(https://arxiv.org/abs/2504.08792)
Keywords: generative, large language model
Abstract: Named Entity Recognition (NER), a fundamental task in Natural Language Processing (NLP), has shown significant advancements for high-resource languages. However, due to a lack of annotated datasets and limited representation in Pre-trained Language Models (PLMs), it remains understudied and challenging for low-resource languages. To address these challenges, we propose a data augmentation technique that generates culturally plausible sentences and experiments on four low-resource Pakistani languages; Urdu, Shahmukhi, Sindhi, and Pashto. By fine-tuning multilingual masked Large Language Models (LLMs), our approach demonstrates significant improvements in NER performance for Shahmukhi and Pashto. We further explore the capability of generative LLMs for NER and data augmentation using few-shot learning.

Title: Exploring Gradient-Guided Masked Language Model to Detect Textual Adversarial Attacks

Authors: Xiaomei Zhang, Zhaoxi Zhang, Yanjun Zhang, Xufei Zheng, Leo Yu Zhang, Shengshan Hu, Shirui Pan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.08798
Pdf URL: https://arxiv.org/pdf/2504.08798
Copy Paste: [[2504.08798]] Exploring Gradient-Guided Masked Language Model to Detect Textual Adversarial Attacks(https://arxiv.org/abs/2504.08798)
Keywords: attack
Abstract: Textual adversarial examples pose serious threats to the reliability of natural language processing systems. Recent studies suggest that adversarial examples tend to deviate from the underlying manifold of normal texts, whereas pre-trained masked language models can approximate the manifold of normal data. These findings inspire the exploration of masked language models for detecting textual adversarial attacks. We first introduce Masked Language Model-based Detection (MLMD), leveraging the mask and unmask operations of the masked language modeling (MLM) objective to induce the difference in manifold changes between normal and adversarial texts. Although MLMD achieves competitive detection performance, its exhaustive one-by-one masking strategy introduces significant computational overhead. Our posterior analysis reveals that a significant number of non-keywords in the input are not important for detection but consume resources. Building on this, we introduce Gradient-guided MLMD (GradMLMD), which leverages gradient information to identify and skip non-keywords during detection, significantly reducing resource consumption without compromising detection performance.

Title: Learnable Multi-Scale Wavelet Transformer: A Novel Alternative to Self-Attention

Authors: Andrew Kiruluta, Priscilla Burity, Samantha Williams
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2504.08801
Pdf URL: https://arxiv.org/pdf/2504.08801
Copy Paste: [[2504.08801]] Learnable Multi-Scale Wavelet Transformer: A Novel Alternative to Self-Attention(https://arxiv.org/abs/2504.08801)
Keywords: interpretability, transformer
Abstract: Transformer architectures, underpinned by the self-attention mechanism, have achieved state-of-the-art results across numerous natural language processing (NLP) tasks by effectively modeling long-range dependencies. However, the computational complexity of self-attention, scaling quadratically with input sequence length, presents significant challenges for processing very long sequences or operating under resource constraints. This paper introduces the Learnable Multi-Scale Wavelet Transformer (LMWT), a novel architecture that replaces the standard dot-product self-attention with a learnable multi-scale Haar wavelet transform module. Leveraging the intrinsic multi-resolution properties of wavelets, the LMWT efficiently captures both local details and global context. Crucially, the parameters of the wavelet transform, including scale-specific coefficients, are learned end-to-end during training, allowing the model to adapt its decomposition strategy to the data and task. We present the detailed mathematical formulation of the learnable Haar wavelet module and its integration into the transformer framework, supplemented by an architectural diagram. We conduct a comprehensive experimental evaluation on a standard machine translation benchmark (WMT16 En-De), comparing the LMWT against a baseline self-attention transformer using metrics like BLEU score, perplexity, and token accuracy. Furthermore, we analyze the computational complexity, highlighting the linear scaling of our approach, discuss its novelty in the context of related work, and explore the interpretability offered by visualizing the learned Haar coefficients. Our results indicate that the LMWT achieves competitive performance while offering substantial computational advantages, positioning it as a promising and novel alternative for efficient sequence modeling.

Title: InfoGain Wavelets: Furthering the Design of Diffusion Wavelets for Graph-Structured Data

Authors: David R. Johnson, Smita Krishnaswamy, Michael Perlmutter
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2504.08802
Pdf URL: https://arxiv.org/pdf/2504.08802
Copy Paste: [[2504.08802]] InfoGain Wavelets: Furthering the Design of Diffusion Wavelets for Graph-Structured Data(https://arxiv.org/abs/2504.08802)
Keywords: diffusion
Abstract: Diffusion wavelets extract information from graph signals at different scales of resolution by utilizing graph diffusion operators raised to various powers, known as diffusion scales. Traditionally, the diffusion scales are chosen to be dyadic integers, $\mathbf{2^j}$. Here, we propose a novel, unsupervised method for selecting the diffusion scales based on ideas from information theory. We then show that our method can be incorporated into wavelet-based GNNs via graph classification experiments.

Title: A temporal scale transformer framework for precise remaining useful life prediction in fuel cells

Authors: Zezhi Tang, Xiaoyu Chen, Xin Jin, Benyuan Zhang, Wenyu Liang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.08803
Pdf URL: https://arxiv.org/pdf/2504.08803
Copy Paste: [[2504.08803]] A temporal scale transformer framework for precise remaining useful life prediction in fuel cells(https://arxiv.org/abs/2504.08803)
Keywords: extraction, transformer
Abstract: In exploring Predictive Health Management (PHM) strategies for Proton Exchange Membrane Fuel Cells (PEMFC), the Transformer model, widely used in data-driven approaches, excels in many fields but struggles with time series analysis due to its self-attention mechanism, which yields a complexity of the input sequence squared and low computational efficiency. It also faces challenges in capturing both global long-term dependencies and local details effectively. To tackle this, we propose the Temporal Scale Transformer (TSTransformer), an enhanced version of the inverted Transformer (iTransformer). Unlike traditional Transformers that treat each timestep as an input token, TSTransformer maps sequences of varying lengths into tokens at different stages for inter-sequence modeling, using attention to capture multivariate correlations and feed-forward networks (FFN) to encode sequence representations. By integrating a one-dimensional convolutional layer into the multivariate attention for multi-level scaling of K and V matrices, it improves local feature extraction, captures temporal scale characteristics, and reduces token count and computational costs. Experiments comparing TSTransformer with models like Long Short-Term Memory, iTransformer, and Transformer demonstrate its potential as a powerful tool for advancing PHM in renewable energy, effectively addressing the limitations of pure Transformer models in data-driven time series tasks.

Title: Generative AI in Live Operations: Evidence of Productivity Gains in Cybersecurity and Endpoint Management

Authors: James Bono, Justin Grana, Kleanthis Karakolios, Pruthvi Hanumanthapura Ramakrishna, Ankit Srivastava
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2504.08805
Pdf URL: https://arxiv.org/pdf/2504.08805
Copy Paste: [[2504.08805]] Generative AI in Live Operations: Evidence of Productivity Gains in Cybersecurity and Endpoint Management(https://arxiv.org/abs/2504.08805)
Keywords: security, protect, robust, generative
Abstract: We measure the association between generative AI (GAI) tool adoption and four metrics spanning security operations, information protection, and endpoint management: 1) number of security alerts per incident, 2) probability of security incident reopenings, 3) time to classify a data loss prevention alert, and 4) time to resolve device policy conflicts. We find that GAI is associated with robust and statistically and practically significant improvements in the four metrics. Although unobserved confounders inhibit causal identification, these results are among the first to use observational data from live operations to investigate the relationship between GAI adoption and security operations, data loss prevention, and device policy management.

Title: Exploring the Effectiveness and Interpretability of Texts in LLM-based Time Series Models

Authors: Zhengke Sun, Hangwei Qian, Ivor Tsang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08808
Pdf URL: https://arxiv.org/pdf/2504.08808
Copy Paste: [[2504.08808]] Exploring the Effectiveness and Interpretability of Texts in LLM-based Time Series Models(https://arxiv.org/abs/2504.08808)
Keywords: interpretability, large language model
Abstract: Large Language Models (LLMs) have been applied to time series forecasting tasks, leveraging pre-trained language models as the backbone and incorporating textual data to purportedly enhance the comprehensive capabilities of LLMs for time series. However, are these texts really helpful for interpretation? This study seeks to investigate the actual efficacy and interpretability of such textual incorporations. Through a series of empirical experiments on textual prompts and textual prototypes, our findings reveal that the misalignment between two modalities exists, and the textual information does not significantly improve time series forecasting performance in many cases. Furthermore, visualization analysis indicates that the textual representations learned by existing frameworks lack sufficient interpretability when applied to time series data. We further propose a novel metric named Semantic Matching Index (SMI) to better evaluate the matching degree between time series and texts during our post hoc interpretability investigation. Our analysis reveals the misalignment and limited interpretability of texts in current time-series LLMs, and we hope this study can raise awareness of the interpretability of texts for time series. The code is available at this https URL.

Title: Decoupling Contrastive Decoding: Robust Hallucination Mitigation in Multimodal Large Language Models

Authors: Wei Chen, Xin Yan, Bin Wen, Fan Yang, Tingting Gao, Di Zhang, Long Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.08809
Pdf URL: https://arxiv.org/pdf/2504.08809
Copy Paste: [[2504.08809]] Decoupling Contrastive Decoding: Robust Hallucination Mitigation in Multimodal Large Language Models(https://arxiv.org/abs/2504.08809)
Keywords: robust, large language model
Abstract: Although multimodal large language models (MLLMs) exhibit remarkable reasoning capabilities on complex multimodal understanding tasks, they still suffer from the notorious hallucination issue: generating outputs misaligned with obvious visual or factual evidence. Currently, training-based solutions, like direct preference optimization (DPO), leverage paired preference data to suppress hallucinations. However, they risk sacrificing general reasoning capabilities due to the likelihood displacement. Meanwhile, training-free solutions, like contrastive decoding, achieve this goal by subtracting the estimated hallucination pattern from a distorted input. Yet, these handcrafted perturbations (e.g., add noise to images) may poorly capture authentic hallucination patterns. To avoid these weaknesses of existing methods, and realize robust hallucination mitigation (i.e., maintaining general reasoning performance), we propose a novel framework: Decoupling Contrastive Decoding (DCD). Specifically, DCD decouples the learning of positive and negative samples in preference datasets, and trains separate positive and negative image projections within the MLLM. The negative projection implicitly models real hallucination patterns, which enables vision-aware negative images in the contrastive decoding inference stage. Our DCD alleviates likelihood displacement by avoiding pairwise optimization and generalizes robustly without handcrafted degradation. Extensive ablations across hallucination benchmarks and general reasoning tasks demonstrate the effectiveness of DCD, i.e., it matches DPO's hallucination suppression while preserving general capabilities and outperforms the handcrafted contrastive decoding methods.

Title: PriM: Principle-Inspired Material Discovery through Multi-Agent Collaboration

Authors: Zheyuan Lai, Yingming Pu
Subjects: cs.LG, cond-mat.mtrl-sci, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08810
Pdf URL: https://arxiv.org/pdf/2504.08810
Copy Paste: [[2504.08810]] PriM: Principle-Inspired Material Discovery through Multi-Agent Collaboration(https://arxiv.org/abs/2504.08810)
Keywords: interpretability
Abstract: Complex chemical space and limited knowledge scope with biases holds immense challenge for human scientists, yet in automated materials discovery. Existing intelligent methods relies more on numerical computation, leading to inefficient exploration and results with hard-interpretability. To bridge this gap, we introduce a principles-guided material discovery system powered by language inferential multi-agent system (MAS), namely PriM. Our framework integrates automated hypothesis generation with experimental validation in a roundtable system of MAS, enabling systematic exploration while maintaining scientific rigor. Based on our framework, the case study of nano helix demonstrates higher materials exploration rate and property value while providing transparent reasoning pathways. This approach develops an automated-and-transparent paradigm for material discovery, with broad implications for rational design of functional materials. Code is publicly available at our \href{this https URL}{GitHub}.

Title: Analogical Learning for Cross-Scenario Generalization: Framework and Application to Intelligent Localization

Authors: Zirui Chen, Zhaoyang Zhang, Ziqing Xing, Ridong Li, Zhaohui Yang, Richeng Jin, Chongwen Huang, Yuzhi Yang, Mérouane Debbah
Subjects: cs.LG, cs.CE, eess.SP
Abstract URL: https://arxiv.org/abs/2504.08811
Pdf URL: https://arxiv.org/pdf/2504.08811
Copy Paste: [[2504.08811]] Analogical Learning for Cross-Scenario Generalization: Framework and Application to Intelligent Localization(https://arxiv.org/abs/2504.08811)
Keywords: robust
Abstract: Existing learning models often exhibit poor generalization when deployed across diverse scenarios. It is mainly due to that the underlying reference frame of the data varies with the deployment environment and settings. However, despite the data of each scenario has its distinct reference frame, its generation generally follows the same underlying physical rule. Based on these findings, this article proposes a brand-new universal deep learning framework named analogical learning (AL), which provides a highly efficient way to implicitly retrieve the reference frame information associated with a scenario and then to make accurate prediction by relative analogy across scenarios. Specifically, an elegant bipartite neural network architecture called Mateformer is designed, the first part of which calculates the relativity within multiple feature spaces between the input data and a small amount of embedded data from the current scenario, while the second part uses these relativity to guide the nonlinear analogy. We apply AL to the typical multi-scenario learning problem of intelligent wireless localization in cellular networks. Extensive experiments show that AL achieves state-of-the-art accuracy, stable transferability and robust adaptation to new scenarios without any tuning, and outperforming conventional methods with a precision improvement of nearly two orders of magnitude. All data and code are available at this https URL.

Title: SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models

Authors: Junfeng Fang, Yukai Wang, Ruipeng Wang, Zijun Yao, Kun Wang, An Zhang, Xiang Wang, Tat-Seng Chua
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2504.08813
Pdf URL: https://arxiv.org/pdf/2504.08813
Copy Paste: [[2504.08813]] SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models(https://arxiv.org/abs/2504.08813)
Keywords: attack
Abstract: The rapid advancement of multi-modal large reasoning models (MLRMs) -- enhanced versions of multimodal language models (MLLMs) equipped with reasoning capabilities -- has revolutionized diverse applications. However, their safety implications remain underexplored. While prior work has exposed critical vulnerabilities in unimodal reasoning models, MLRMs introduce distinct risks from cross-modal reasoning pathways. This work presents the first systematic safety analysis of MLRMs through large-scale empirical studies comparing MLRMs with their base MLLMs. Our experiments reveal three critical findings: (1) The Reasoning Tax: Acquiring reasoning capabilities catastrophically degrades inherited safety alignment. MLRMs exhibit 37.44% higher jailbreaking success rates than base MLLMs under adversarial attacks. (2) Safety Blind Spots: While safety degradation is pervasive, certain scenarios (e.g., Illegal Activity) suffer 25 times higher attack rates -- far exceeding the average 3.4 times increase, revealing scenario-specific vulnerabilities with alarming cross-model and datasets consistency. (3) Emergent Self-Correction: Despite tight reasoning-answer safety coupling, MLRMs demonstrate nascent self-correction -- 16.9% of jailbroken reasoning steps are overridden by safe answers, hinting at intrinsic safeguards. These findings underscore the urgency of scenario-aware safety auditing and mechanisms to amplify MLRMs' self-correction potential. To catalyze research, we open-source OpenSafeMLRM, the first toolkit for MLRM safety evaluation, providing unified interface for mainstream models, datasets, and jailbreaking methods. Our work calls for immediate efforts to harden reasoning-augmented AI, ensuring its transformative potential aligns with ethical safeguards.

Title: From Text to Time? Rethinking the Effectiveness of the Large Language Model for Time Series Forecasting

Authors: Xinyu Zhang, Shanshan Feng, Xutao Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08818
Pdf URL: https://arxiv.org/pdf/2504.08818
Copy Paste: [[2504.08818]] From Text to Time? Rethinking the Effectiveness of the Large Language Model for Time Series Forecasting(https://arxiv.org/abs/2504.08818)
Keywords: large language model
Abstract: Using pre-trained large language models (LLMs) as the backbone for time series prediction has recently gained significant research interest. However, the effectiveness of LLM backbones in this domain remains a topic of debate. Based on thorough empirical analyses, we observe that training and testing LLM-based models on small datasets often leads to the Encoder and Decoder becoming overly adapted to the dataset, thereby obscuring the true predictive capabilities of the LLM backbone. To investigate the genuine potential of LLMs in time series prediction, we introduce three pre-training models with identical architectures but different pre-training strategies. Thereby, large-scale pre-training allows us to create unbiased Encoder and Decoder components tailored to the LLM backbone. Through controlled experiments, we evaluate the zero-shot and few-shot prediction performance of the LLM, offering insights into its capabilities. Extensive experiments reveal that although the LLM backbone demonstrates some promise, its forecasting performance is limited. Our source code is publicly available in the anonymous repository: this https URL.

Title: CAReDiO: Cultural Alignment of LLM via Representativeness and Distinctiveness Guided Data Optimization

Authors: Jing Yao, Xiaoyuan Yi, Jindong Wang, Zhicheng Dou, Xing Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08820
Pdf URL: https://arxiv.org/pdf/2504.08820
Copy Paste: [[2504.08820]] CAReDiO: Cultural Alignment of LLM via Representativeness and Distinctiveness Guided Data Optimization(https://arxiv.org/abs/2504.08820)
Keywords: large language model
Abstract: As Large Language Models (LLMs) more deeply integrate into human life across various regions, aligning them with pluralistic cultures is crucial for improving user experience and mitigating cultural conflicts. Existing approaches develop culturally aligned LLMs primarily through fine-tuning with massive carefully curated culture-specific corpora. Nevertheless, inspired by culture theories, we identify two key challenges faced by these datasets: (1) Representativeness: These corpora fail to fully capture the target culture's core characteristics with redundancy, causing computation waste; (2) Distinctiveness: They struggle to distinguish the unique nuances of a given culture from shared patterns across other relevant ones, hindering precise cultural modeling. To handle these challenges, we introduce CAReDiO, a novel cultural data construction framework. Specifically, CAReDiO utilizes powerful LLMs to automatically generate cultural conversation data, where both the queries and responses are further optimized by maximizing representativeness and distinctiveness. Using CAReDiO, we construct a small yet effective dataset, covering five cultures, and compare it with several recent cultural corpora. Extensive experiments demonstrate that our method generates more effective data and enables cultural alignment with as few as 100 training samples, enhancing both performance and efficiency.

Title: Probabilistic QoS Metric Forecasting in Delay-Tolerant Networks Using Conditional Diffusion Models on Latent Dynamics

Authors: Enming Zhang, Zheng Liu, Yu Xiang, Yanwen Qu
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2504.08821
Pdf URL: https://arxiv.org/pdf/2504.08821
Copy Paste: [[2504.08821]] Probabilistic QoS Metric Forecasting in Delay-Tolerant Networks Using Conditional Diffusion Models on Latent Dynamics(https://arxiv.org/abs/2504.08821)
Keywords: diffusion
Abstract: Active QoS metric prediction, commonly employed in the maintenance and operation of DTN, could enhance network performance regarding latency, throughput, energy consumption, and dependability. Naturally formulated as a multivariate time series forecasting problem, it attracts substantial research efforts. Traditional mean regression methods for time series forecasting cannot capture the data complexity adequately, resulting in deteriorated performance in operational tasks in DTNs such as routing. This paper formulates the prediction of QoS metrics in DTN as a probabilistic forecasting problem on multivariate time series, where one could quantify the uncertainty of forecasts by characterizing the distribution of these samples. The proposed approach hires diffusion models and incorporates the latent temporal dynamics of non-stationary and multi-mode data into them. Extensive experiments demonstrate the efficacy of the proposed approach by showing that it outperforms the popular probabilistic time series forecasting methods.

Title: FM-LoRA: Factorized Low-Rank Meta-Prompting for Continual Learning

Authors: Xiaobing Yu, Jin Yang, Xiao Wu, Peijie Qiu, Xiaofeng Liu
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2504.08823
Pdf URL: https://arxiv.org/pdf/2504.08823
Copy Paste: [[2504.08823]] FM-LoRA: Factorized Low-Rank Meta-Prompting for Continual Learning(https://arxiv.org/abs/2504.08823)
Keywords: robust, transformer
Abstract: How to adapt a pre-trained model continuously for sequential tasks with different prediction class labels and domains and finally learn a generalizable model across diverse tasks is a long-lasting challenge. Continual learning (CL) has emerged as a promising approach to leverage pre-trained models (e.g., Transformers) for sequential tasks. While many existing CL methods incrementally store additional learned structures, such as Low-Rank Adaptation (LoRA) adapters or prompts and sometimes even preserve features from previous samples to maintain performance. This leads to unsustainable parameter growth and escalating storage costs as the number of tasks increases. Moreover, current approaches often lack task similarity awareness, which further hinders the models ability to effectively adapt to new tasks without interfering with previously acquired knowledge. To address these challenges, we propose FM-LoRA, a novel and efficient low-rank adaptation method that integrates both a dynamic rank selector (DRS) and dynamic meta-prompting (DMP). This framework allocates model capacity more effectively across tasks by leveraging a shared low-rank subspace critical for preserving knowledge, thereby avoiding continual parameter expansion. Extensive experiments on various CL benchmarks, including ImageNet-R, CIFAR100, and CUB200 for class-incremental learning (CIL), and DomainNet for domain-incremental learning (DIL), with Transformers backbone demonstrate that FM-LoRA effectively mitigates catastrophic forgetting while delivering robust performance across a diverse range of tasks and domains.

Title: PatchTrAD: A Patch-Based Transformer focusing on Patch-Wise Reconstruction Error for Time Series Anomaly Detection

Authors: Samy-Melwan Vilhes (LITIS), Gilles Gasso (LITIS), Mokhtar Z Alaya (LMAC)
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08827
Pdf URL: https://arxiv.org/pdf/2504.08827
Copy Paste: [[2504.08827]] PatchTrAD: A Patch-Based Transformer focusing on Patch-Wise Reconstruction Error for Time Series Anomaly Detection(https://arxiv.org/abs/2504.08827)
Keywords: transformer
Abstract: Time series anomaly detection (TSAD) focuses on identifying whether observations in streaming data deviate significantly from normal patterns. With the prevalence of connected devices, anomaly detection on time series has become paramount, as it enables real-time monitoring and early detection of irregular behaviors across various application domains. In this work, we introduce PatchTrAD, a Patch-based Transformer model for time series anomaly detection. Our approach leverages a Transformer encoder along with the use of patches under a reconstructionbased framework for anomaly detection. Empirical evaluations on multiple benchmark datasets show that PatchTrAD is on par, in terms of detection performance, with state-of-the-art deep learning models for anomaly detection while being time efficient during inference.

Title: Datum-wise Transformer for Synthetic Tabular Data Detection in the Wild

Authors: G. Charbel N. Kindji (IRISA, MALT), Elisa Fromont (MALT, IRISA), Lina Maria Rojas-Barahona, Tanguy Urvoy
Subjects: cs.LG, cs.AI, cs.DB, cs.NE
Abstract URL: https://arxiv.org/abs/2504.08829
Pdf URL: https://arxiv.org/pdf/2504.08829
Copy Paste: [[2504.08829]] Datum-wise Transformer for Synthetic Tabular Data Detection in the Wild(https://arxiv.org/abs/2504.08829)
Keywords: robust, transformer, generative
Abstract: The growing power of generative models raises major concerns about the authenticity of published content. To address this problem, several synthetic content detection methods have been proposed for uniformly structured media such as image or text. However, little work has been done on the detection of synthetic tabular data, despite its importance in industry and government. This form of data is complex to handle due to the diversity of its structures: the number and types of the columns may vary wildly from one table to another. We tackle the tough problem of detecting synthetic tabular data ''in the wild'', i.e. when the model is deployed on table structures it has never seen before. We introduce a novel datum-wise transformer architecture and show that it outperforms existing models. Furthermore, we investigate the application of domain adaptation techniques to enhance the effectiveness of our model, thereby providing a more robust data-forgery detection solution.

Title: SD$^2$: Self-Distilled Sparse Drafters

Authors: Mike Lasby, Nish Sinnadurai, Valavan Manohararajah, Sean Lie, Vithursan Thangarasa
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08838
Pdf URL: https://arxiv.org/pdf/2504.08838
Copy Paste: [[2504.08838]] SD$^2$: Self-Distilled Sparse Drafters(https://arxiv.org/abs/2504.08838)
Keywords: large language model
Abstract: Speculative decoding is a powerful technique for reducing the latency of Large Language Models (LLMs), offering a fault-tolerant framework that enables the use of highly compressed draft models. In this work, we introduce Self-Distilled Sparse Drafters (SD$^2$), a novel methodology that leverages self-data distillation and fine-grained weight sparsity to produce highly efficient and well-aligned draft models. SD$^2$ systematically enhances draft token acceptance rates while significantly reducing Multiply-Accumulate operations (MACs), even in the Universal Assisted Generation (UAG) setting, where draft and target models originate from different model families. On a Llama-3.1-70B target model, SD$^2$ provides a $\times$1.59 higher Mean Accepted Length (MAL) compared to layer-pruned draft models and reduces MACs by over 43.87% with a 8.36% reduction in MAL compared to a dense draft models. Our results highlight the potential of sparsity-aware fine-tuning and compression strategies to improve LLM inference efficiency while maintaining alignment with target models.

Title: Adaptive Shrinkage Estimation For Personalized Deep Kernel Regression In Modeling Brain Trajectories

Authors: Vasiliki Tassopoulou, Haochang Shou, Christos Davatzikos
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2504.08840
Pdf URL: https://arxiv.org/pdf/2504.08840
Copy Paste: [[2504.08840]] Adaptive Shrinkage Estimation For Personalized Deep Kernel Regression In Modeling Brain Trajectories(https://arxiv.org/abs/2504.08840)
Keywords: robust
Abstract: Longitudinal biomedical studies monitor individuals over time to capture dynamics in brain development, disease progression, and treatment effects. However, estimating trajectories of brain biomarkers is challenging due to biological variability, inconsistencies in measurement protocols (e.g., differences in MRI scanners), scarcity, and irregularity in longitudinal measurements. Herein, we introduce a novel personalized deep kernel regression framework for forecasting brain biomarkers, with application to regional volumetric measurements. Our approach integrates two key components: a population model that captures brain trajectories from a large and diverse cohort, and a subject-specific model that captures individual trajectories. To optimally combine these, we propose Adaptive Shrinkage Estimation, which effectively balances population and subject-specific models. We assess our model's performance through predictive accuracy metrics, uncertainty quantification, and validation against external clinical studies. Benchmarking against state-of-the-art statistical and machine learning models -- including linear mixed effects models, generalized additive models, and deep learning methods -- demonstrates the superior predictive performance of our approach. Additionally, we apply our method to predict trajectories of composite neuroimaging biomarkers, which highlights the versatility of our approach in modeling the progression of longitudinal neuroimaging biomarkers. Furthermore, validation on three external neuroimaging studies confirms the robustness of our method across different clinical contexts. We make the code available at this https URL.

Title: Towards Combinatorial Interpretability of Neural Computation

Authors: Micah Adler, Dan Alistarh, Nir Shavit
Subjects: cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2504.08842
Pdf URL: https://arxiv.org/pdf/2504.08842
Copy Paste: [[2504.08842]] Towards Combinatorial Interpretability of Neural Computation(https://arxiv.org/abs/2504.08842)
Keywords: interpretability
Abstract: We introduce combinatorial interpretability, a methodology for understanding neural computation by analyzing the combinatorial structures in the sign-based categorization of a network's weights and biases. We demonstrate its power through feature channel coding, a theory that explains how neural networks compute Boolean expressions and potentially underlies other categories of neural network computation. According to this theory, features are computed via feature channels: unique cross-neuron encodings shared among the inputs the feature operates on. Because different feature channels share neurons, the neurons are polysemantic and the channels interfere with one another, making the computation appear inscrutable. We show how to decipher these computations by analyzing a network's feature channel coding, offering complete mechanistic interpretations of several small neural networks that were trained with gradient descent. Crucially, this is achieved via static combinatorial analysis of the weight matrices, without examining activations or training new autoencoding networks. Feature channel coding reframes the superposition hypothesis, shifting the focus from neuron activation directionality in high-dimensional space to the combinatorial structure of codes. It also allows us for the first time to exactly quantify and explain the relationship between a network's parameter size and its computational capacity (i.e. the set of features it can compute with low error), a relationship that is implicitly at the core of many modern scaling laws. Though our initial studies of feature channel coding are restricted to Boolean functions, we believe they provide a rich, controlled, and informative research space, and that the path we propose for combinatorial interpretation of neural computation can provide a basis for understanding both artificial and biological neural circuits.

Title: X-Guard: Multilingual Guard Agent for Content Moderation

Authors: Bibek Upadhayay, Vahid Behzadan, Ph.D
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08848
Pdf URL: https://arxiv.org/pdf/2504.08848
Copy Paste: [[2504.08848]] X-Guard: Multilingual Guard Agent for Content Moderation(https://arxiv.org/abs/2504.08848)
Keywords: protect, attack, robust, large language model
Abstract: Large Language Models (LLMs) have rapidly become integral to numerous applications in critical domains where reliability is paramount. Despite significant advances in safety frameworks and guardrails, current protective measures exhibit crucial vulnerabilities, particularly in multilingual contexts. Existing safety systems remain susceptible to adversarial attacks in low-resource languages and through code-switching techniques, primarily due to their English-centric design. Furthermore, the development of effective multilingual guardrails is constrained by the scarcity of diverse cross-lingual training data. Even recent solutions like Llama Guard-3, while offering multilingual support, lack transparency in their decision-making processes. We address these challenges by introducing X-Guard agent, a transparent multilingual safety agent designed to provide content moderation across diverse linguistic contexts. X-Guard effectively defends against both conventional low-resource language attacks and sophisticated code-switching attacks. Our approach includes: curating and enhancing multiple open-source safety datasets with explicit evaluation rationales; employing a jury of judges methodology to mitigate individual judge LLM provider biases; creating a comprehensive multilingual safety dataset spanning 132 languages with 5 million data points; and developing a two-stage architecture combining a custom-finetuned mBART-50 translation module with an evaluation X-Guard 3B model trained through supervised finetuning and GRPO training. Our empirical evaluations demonstrate X-Guard's effectiveness in detecting unsafe content across multiple languages while maintaining transparency throughout the safety evaluation process. Our work represents a significant advancement in creating robust, transparent, and linguistically inclusive safety systems for LLMs and its integrated systems.

Title: Mimic In-Context Learning for Multimodal Tasks

Authors: Yuchu Jiang, Jiale Fu, Chenduo Hao, Xinting Hu, Yingzhe Peng, Xin Geng, Xu Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08851
Pdf URL: https://arxiv.org/pdf/2504.08851
Copy Paste: [[2504.08851]] Mimic In-Context Learning for Multimodal Tasks(https://arxiv.org/abs/2504.08851)
Keywords: transformer
Abstract: Recently, In-context Learning (ICL) has become a significant inference paradigm in Large Multimodal Models (LMMs), utilizing a few in-context demonstrations (ICDs) to prompt LMMs for new tasks. However, the synergistic effects in multimodal data increase the sensitivity of ICL performance to the configurations of ICDs, stimulating the need for a more stable and general mapping function. Mathematically, in Transformer-based models, ICDs act as ``shift vectors'' added to the hidden states of query tokens. Inspired by this, we introduce Mimic In-Context Learning (MimIC) to learn stable and generalizable shift effects from ICDs. Specifically, compared with some previous shift vector-based methods, MimIC more strictly approximates the shift effects by integrating lightweight learnable modules into LMMs with four key enhancements: 1) inserting shift vectors after attention layers, 2) assigning a shift vector to each attention head, 3) making shift magnitude query-dependent, and 4) employing a layer-wise alignment loss. Extensive experiments on two LMMs (Idefics-9b and Idefics2-8b-base) across three multimodal tasks (VQAv2, OK-VQA, Captioning) demonstrate that MimIC outperforms existing shift vector-based methods. The code is available at this https URL.

Title: ML For Hardware Design Interpretability: Challenges and Opportunities

Authors: Raymond Baartmans, Andrew Ensinger, Victor Agostinelli, Lizhong Chen
Subjects: cs.LG, cs.AI, cs.AR
Abstract URL: https://arxiv.org/abs/2504.08852
Pdf URL: https://arxiv.org/pdf/2504.08852
Copy Paste: [[2504.08852]] ML For Hardware Design Interpretability: Challenges and Opportunities(https://arxiv.org/abs/2504.08852)
Keywords: interpretability, large language model
Abstract: The increasing size and complexity of machine learning (ML) models have driven the growing need for custom hardware accelerators capable of efficiently supporting ML workloads. However, the design of such accelerators remains a time-consuming process, heavily relying on engineers to manually ensure design interpretability through clear documentation and effective communication. Recent advances in large language models (LLMs) offer a promising opportunity to automate these design interpretability tasks, particularly the generation of natural language descriptions for register-transfer level (RTL) code, what we refer to as "RTL-to-NL tasks." In this paper, we examine how design interpretability, particularly in RTL-to-NL tasks, influences the efficiency of the hardware design process. We review existing work adapting LLMs for these tasks, highlight key challenges that remain unaddressed, including those related to data, computation, and model development, and identify opportunities to address them. By doing so, we aim to guide future research in leveraging ML to automate RTL-to-NL tasks and improve hardware design interpretability, thereby accelerating the hardware design process and meeting the increasing demand for custom hardware accelerators in machine learning and beyond.

Title: Hardware Design and Security Needs Attention: From Survey to Path Forward

Authors: Sujan Ghimire, Muhtasim Alam Chowdhury, Banafsheh Saber Latibari, Muntasir Mamun, Jaeden Wolf Carpenter, Benjamin Tan, Hammond Pearce, Pratik Satam, Soheil Salehi
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.08854
Pdf URL: https://arxiv.org/pdf/2504.08854
Copy Paste: [[2504.08854]] Hardware Design and Security Needs Attention: From Survey to Path Forward(https://arxiv.org/abs/2504.08854)
Keywords: security, large language model
Abstract: Recent advances in attention-based artificial intelligence (AI) models have unlocked vast potential to automate digital hardware design while enhancing and strengthening security measures against various threats. This rapidly emerging field leverages Large Language Models (LLMs) to generate HDL code, identify vulnerabilities, and sometimes mitigate them. The state of the art in this design automation space utilizes optimized LLMs with HDL datasets, creating automated systems for register-transfer level (RTL) generation, verification, and debugging, and establishing LLM-driven design environments for streamlined logic designs. Additionally, attention-based models like graph attention have shown promise in chip design applications, including floorplanning. This survey investigates the integration of these models into hardware-related domains, emphasizing logic design and hardware security, with or without the use of IP libraries. This study explores the commercial and academic landscape, highlighting technical hurdles and future prospects for automating hardware design and security. Moreover, it provides new insights into the study of LLM-driven design systems, advances in hardware security mechanisms, and the impact of influential works on industry practices. Through the examination of 30 representative approaches and illustrative case studies, this paper underscores the transformative potential of attention-based models in revolutionizing hardware design while addressing the challenges that lie ahead in this interdisciplinary domain.

Title: On Transfer-based Universal Attacks in Pure Black-box Setting

Authors: Mohammad A.A.K. Jalwana, Naveed Akhtar, Ajmal Mian, Nazanin Rahnavard, Mubarak Shah
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08866
Pdf URL: https://arxiv.org/pdf/2504.08866
Copy Paste: [[2504.08866]] On Transfer-based Universal Attacks in Pure Black-box Setting(https://arxiv.org/abs/2504.08866)
Keywords: attack
Abstract: Despite their impressive performance, deep visual models are susceptible to transferable black-box adversarial attacks. Principally, these attacks craft perturbations in a target model-agnostic manner. However, surprisingly, we find that existing methods in this domain inadvertently take help from various priors that violate the black-box assumption such as the availability of the dataset used to train the target model, and the knowledge of the number of classes in the target model. Consequently, the literature fails to articulate the true potency of transferable black-box attacks. We provide an empirical study of these biases and propose a framework that aids in a prior-free transparent study of this paradigm. Using our framework, we analyze the role of prior knowledge of the target model data and number of classes in attack performance. We also provide several interesting insights based on our analysis, and demonstrate that priors cause overestimation in transferability scores. Finally, we extend our framework to query-based attacks. This extension inspires a novel image-blending technique to prepare data for effective surrogate model training.

Title: An LLM Framework For Cryptography Over Chat Channels

Authors: Danilo Gligoroski, Mayank Raikwar, Sonu Kumar Jha
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08871
Pdf URL: https://arxiv.org/pdf/2504.08871
Copy Paste: [[2504.08871]] An LLM Framework For Cryptography Over Chat Channels(https://arxiv.org/abs/2504.08871)
Keywords: secure, large language model
Abstract: Recent advancements in Large Language Models (LLMs) have transformed communication, yet their role in secure messaging remains underexplored, especially in surveillance-heavy environments. At the same time, many governments all over the world are proposing legislation to detect, backdoor, or even ban encrypted communication. That emphasizes the need for alternative ways to communicate securely and covertly over open channels. We propose a novel cryptographic embedding framework that enables covert Public Key or Symmetric Key encrypted communication over public chat channels with humanlike produced texts. Some unique properties of our framework are: 1. It is LLM agnostic, i.e., it allows participants to use different local LLM models independently; 2. It is pre- or post-quantum agnostic; 3. It ensures indistinguishability from human-like chat-produced texts. Thus, it offers a viable alternative where traditional encryption is detectable and restricted.

Title: Personalizing Federated Learning for Hierarchical Edge Networks with Non-IID Data

Authors: Seunghyun Lee, Omid Tavallaie, Shuaijun Chen, Kanchana Thilakarathna, Suranga Seneviratne, Adel Nadjaran Toosi, Albert Y. Zomaya
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2504.08872
Pdf URL: https://arxiv.org/pdf/2504.08872
Copy Paste: [[2504.08872]] Personalizing Federated Learning for Hierarchical Edge Networks with Non-IID Data(https://arxiv.org/abs/2504.08872)
Keywords: privacy, robust, federate
Abstract: Accommodating edge networks between IoT devices and the cloud server in Hierarchical Federated Learning (HFL) enhances communication efficiency without compromising data privacy. However, devices connected to the same edge often share geographic or contextual similarities, leading to varying edge-level data heterogeneity with different subsets of labels per edge, on top of device-level heterogeneity. This hierarchical non-Independent and Identically Distributed (non-IID) nature, which implies that each edge has its own optimization goal, has been overlooked in HFL research. Therefore, existing edge-accommodated HFL demonstrates inconsistent performance across edges in various hierarchical non-IID scenarios. To ensure robust performance with diverse edge-level non-IID data, we propose a Personalized Hierarchical Edge-enabled Federated Learning (PHE-FL), which personalizes each edge model to perform well on the unique class distributions specific to each edge. We evaluated PHE-FL across 4 scenarios with varying levels of edge-level non-IIDness, with extreme IoT device level non-IIDness. To accurately assess the effectiveness of our personalization approach, we deployed test sets on each edge server instead of the cloud server, and used both balanced and imbalanced test sets. Extensive experiments show that PHE-FL achieves up to 83 percent higher accuracy compared to existing federated learning approaches that incorporate edge networks, given the same number of training rounds. Moreover, PHE-FL exhibits improved stability, as evidenced by reduced accuracy fluctuations relative to the state-of-the-art FedAvg with two-level (edge and cloud) aggregation.

Title: Distilling and exploiting quantitative insights from Large Language Models for enhanced Bayesian optimization of chemical reactions

Authors: Roshan Patel, Saeed Moayedpour, Louis De Lescure, Lorenzo Kogler-Anele, Alan Cherney, Sven Jager, Yasser Jangjou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08874
Pdf URL: https://arxiv.org/pdf/2504.08874
Copy Paste: [[2504.08874]] Distilling and exploiting quantitative insights from Large Language Models for enhanced Bayesian optimization of chemical reactions(https://arxiv.org/abs/2504.08874)
Keywords: large language model
Abstract: Machine learning and Bayesian optimization (BO) algorithms can significantly accelerate the optimization of chemical reactions. Transfer learning can bolster the effectiveness of BO algorithms in low-data regimes by leveraging pre-existing chemical information or data outside the direct optimization task (i.e., source data). Large language models (LLMs) have demonstrated that chemical information present in foundation training data can give them utility for processing chemical data. Furthermore, they can be augmented with and help synthesize potentially multiple modalities of source chemical data germane to the optimization task. In this work, we examine how chemical information from LLMs can be elicited and used for transfer learning to accelerate the BO of reaction conditions to maximize yield. Specifically, we show that a survey-like prompting scheme and preference learning can be used to infer a utility function which models prior chemical information embedded in LLMs over a chemical parameter space; we find that the utility function shows modest correlation to true experimental measurements (yield) over the parameter space despite operating in a zero-shot setting. Furthermore, we show that the utility function can be leveraged to focus BO efforts in promising regions of the parameter space, improving the yield of the initial BO query and enhancing optimization in 4 of the 6 datasets studied. Overall, we view this work as a step towards bridging the gap between the chemistry knowledge embedded in LLMs and the capabilities of principled BO methods to accelerate reaction optimization.

Title: Knowledge Graph-extended Retrieval Augmented Generation for Question Answering

Authors: Jasper Linders, Jakub M. Tomczak
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.08893
Pdf URL: https://arxiv.org/pdf/2504.08893
Copy Paste: [[2504.08893]] Knowledge Graph-extended Retrieval Augmented Generation for Question Answering(https://arxiv.org/abs/2504.08893)
Keywords: robust, explainability, large language model
Abstract: Large Language Models (LLMs) and Knowledge Graphs (KGs) offer a promising approach to robust and explainable Question Answering (QA). While LLMs excel at natural language understanding, they suffer from knowledge gaps and hallucinations. KGs provide structured knowledge but lack natural language interaction. Ideally, an AI system should be both robust to missing facts as well as easy to communicate with. This paper proposes such a system that integrates LLMs and KGs without requiring training, ensuring adaptability across different KGs with minimal human effort. The resulting approach can be classified as a specific form of a Retrieval Augmented Generation (RAG) with a KG, thus, it is dubbed Knowledge Graph-extended Retrieval Augmented Generation (KG-RAG). It includes a question decomposition module to enhance multi-hop information retrieval and answer explainability. Using In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting, it generates explicit reasoning chains processed separately to improve truthfulness. Experiments on the MetaQA benchmark show increased accuracy for multi-hop questions, though with a slight trade-off in single-hop performance compared to LLM with KG baselines. These findings demonstrate KG-RAG's potential to improve transparency in QA by bridging unstructured language understanding with structured knowledge retrieval.

Title: Position: Beyond Euclidean -- Foundation Models Should Embrace Non-Euclidean Geometries

Authors: Neil He, Jiahong Liu, Buze Zhang, Ngoc Bui, Ali Maatouk, Menglin Yang, Irwin King, Melanie Weber, Rex Ying
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08896
Pdf URL: https://arxiv.org/pdf/2504.08896
Copy Paste: [[2504.08896]] Position: Beyond Euclidean -- Foundation Models Should Embrace Non-Euclidean Geometries(https://arxiv.org/abs/2504.08896)
Keywords: large language model
Abstract: In the era of foundation models and Large Language Models (LLMs), Euclidean space has been the de facto geometric setting for machine learning architectures. However, recent literature has demonstrated that this choice comes with fundamental limitations. At a large scale, real-world data often exhibit inherently non-Euclidean structures, such as multi-way relationships, hierarchies, symmetries, and non-isotropic scaling, in a variety of domains, such as languages, vision, and the natural sciences. It is challenging to effectively capture these structures within the constraints of Euclidean spaces. This position paper argues that moving beyond Euclidean geometry is not merely an optional enhancement but a necessity to maintain the scaling law for the next-generation of foundation models. By adopting these geometries, foundation models could more efficiently leverage the aforementioned structures. Task-aware adaptability that dynamically reconfigures embeddings to match the geometry of downstream applications could further enhance efficiency and expressivity. Our position is supported by a series of theoretical and empirical investigations of prevalent foundation this http URL, we outline a roadmap for integrating non-Euclidean geometries into foundation models, including strategies for building geometric foundation models via fine-tuning, training from scratch, and hybrid approaches.

Title: Toward Spiking Neural Network Local Learning Modules Resistant to Adversarial Attacks

Authors: Jiaqi Lin, Abhronil Sengupta
Subjects: cs.LG, cs.CR, cs.NE
Abstract URL: https://arxiv.org/abs/2504.08897
Pdf URL: https://arxiv.org/pdf/2504.08897
Copy Paste: [[2504.08897]] Toward Spiking Neural Network Local Learning Modules Resistant to Adversarial Attacks(https://arxiv.org/abs/2504.08897)
Keywords: attack, robust
Abstract: Recent research has shown the vulnerability of Spiking Neural Networks (SNNs) under adversarial examples that are nearly indistinguishable from clean data in the context of frame-based and event-based information. The majority of these studies are constrained in generating adversarial examples using Backpropagation Through Time (BPTT), a gradient-based method which lacks biological plausibility. In contrast, local learning methods, which relax many of BPTT's constraints, remain under-explored in the context of adversarial attacks. To address this problem, we examine adversarial robustness in SNNs through the framework of four types of training algorithms. We provide an in-depth analysis of the ineffectiveness of gradient-based adversarial attacks to generate adversarial instances in this scenario. To overcome these limitations, we introduce a hybrid adversarial attack paradigm that leverages the transferability of adversarial instances. The proposed hybrid approach demonstrates superior performance, outperforming existing adversarial attack methods. Furthermore, the generalizability of the method is assessed under multi-step adversarial attacks, adversarial attacks in black-box FGSM scenarios, and within the non-spiking domain.

Title: LookingGlass: Generative Anamorphoses via Laplacian Pyramid Warping

Authors: Pascal Chang, Sergio Sancho, Jingwei Tang, Markus Gross, Vinicius C. Azevedo
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.08902
Pdf URL: https://arxiv.org/pdf/2504.08902
Copy Paste: [[2504.08902]] LookingGlass: Generative Anamorphoses via Laplacian Pyramid Warping(https://arxiv.org/abs/2504.08902)
Keywords: generative
Abstract: Anamorphosis refers to a category of images that are intentionally distorted, making them unrecognizable when viewed directly. Their true form only reveals itself when seen from a specific viewpoint, which can be through some catadioptric device like a mirror or a lens. While the construction of these mathematical devices can be traced back to as early as the 17th century, they are only interpretable when viewed from a specific vantage point and tend to lose meaning when seen normally. In this paper, we revisit these famous optical illusions with a generative twist. With the help of latent rectified flow models, we propose a method to create anamorphic images that still retain a valid interpretation when viewed directly. To this end, we introduce Laplacian Pyramid Warping, a frequency-aware image warping technique key to generating high-quality visuals. Our work extends Visual Anagrams (arXiv:2311.17919) to latent space models and to a wider range of spatial transforms, enabling the creation of novel generative perceptual illusions.

Title: Robust SAM: On the Adversarial Robustness of Vision Foundation Models

Authors: Jiahuan Long, Zhengqin Xu, Tingsong Jiang, Wen Yao, Shuai Jia, Chao Ma, Xiaoqian Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08906
Pdf URL: https://arxiv.org/pdf/2504.08906
Copy Paste: [[2504.08906]] Robust SAM: On the Adversarial Robustness of Vision Foundation Models(https://arxiv.org/abs/2504.08906)
Keywords: defense, attack, robust, segmentation
Abstract: The Segment Anything Model (SAM) is a widely used vision foundation model with diverse applications, including image segmentation, detection, and tracking. Given SAM's wide applications, understanding its robustness against adversarial attacks is crucial for real-world deployment. However, research on SAM's robustness is still in its early stages. Existing attacks often overlook the role of prompts in evaluating SAM's robustness, and there has been insufficient exploration of defense methods to balance the robustness and accuracy. To address these gaps, this paper proposes an adversarial robustness framework designed to evaluate and enhance the robustness of SAM. Specifically, we introduce a cross-prompt attack method to enhance the attack transferability across different prompt types. Besides attacking, we propose a few-parameter adaptation strategy to defend SAM against various adversarial attacks. To balance robustness and accuracy, we use the singular value decomposition (SVD) to constrain the space of trainable parameters, where only singular values are adaptable. Experiments demonstrate that our cross-prompt attack method outperforms previous approaches in terms of attack success rate on both SAM and SAM 2. By adapting only 512 parameters, we achieve at least a 15\% improvement in mean intersection over union (mIoU) against various adversarial attacks. Compared to previous defense methods, our approach enhances the robustness of SAM while maximally maintaining its original performance.

Title: HyperCore: The Core Framework for Building Hyperbolic Foundation Models with Comprehensive Modules

Authors: Neil He, Menglin Yang, Rex Ying
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08912
Pdf URL: https://arxiv.org/pdf/2504.08912
Copy Paste: [[2504.08912]] HyperCore: The Core Framework for Building Hyperbolic Foundation Models with Comprehensive Modules(https://arxiv.org/abs/2504.08912)
Keywords: transformer
Abstract: Hyperbolic neural networks have emerged as a powerful tool for modeling hierarchical data across diverse modalities. Recent studies show that token distributions in foundation models exhibit scale-free properties, suggesting that hyperbolic space is a more suitable ambient space than Euclidean space for many pre-training and downstream tasks. However, existing tools lack essential components for building hyperbolic foundation models, making it difficult to leverage recent advancements. We introduce HyperCore, a comprehensive open-source framework that provides core modules for constructing hyperbolic foundation models across multiple modalities. HyperCore's modules can be effortlessly combined to develop novel hyperbolic foundation models, eliminating the need to extensively modify Euclidean modules from scratch and possible redundant research efforts. To demonstrate its versatility, we build and test the first fully hyperbolic vision transformers (LViT) with a fine-tuning pipeline, the first fully hyperbolic multimodal CLIP model (L-CLIP), and a hybrid Graph RAG with a hyperbolic graph encoder. Our experiments demonstrate that LViT outperforms its Euclidean counterpart. Additionally, we benchmark and reproduce experiments across hyperbolic GNNs, CNNs, Transformers, and vision Transformers to highlight HyperCore's advantages.

Title: An Adaptive Vector Index Partitioning Scheme for Low-Latency RAG Pipeline

Authors: Junkyum Kim, Divya Mahajan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.08930
Pdf URL: https://arxiv.org/pdf/2504.08930
Copy Paste: [[2504.08930]] An Adaptive Vector Index Partitioning Scheme for Low-Latency RAG Pipeline(https://arxiv.org/abs/2504.08930)
Keywords: large language model
Abstract: Retrieval Augmented Generation (RAG) systems enhance response quality by integrating Large Language Models (LLMs) with vector databases, enabling external knowledge retrieval to support language model reasoning. While RAG enables efficient question answering with smaller LLMs, existing optimizations for vector search and LLM serving have largely been developed in isolation. As a result, their integration often leads to suboptimal end-to-end performance. ... This paper introduces VectorLiteRAG, an optimized vector index partitioning mechanism designed for RAG systems that enhances the responsiveness of the system by jointly optimizing vector search and LLM serving across CPU and GPU system. A key challenge is to determine which indices and how much of the vector index should reside on the GPU and adjusting LLM batch sizes to balance the pipeline for lower Time-To-First-Token (TTFT) and meeting user-defined Service-Level Objectives (SLOs). To address this, we leverage the insight that cluster access in vector databases exhibits access skew, where a subset of clusters are queried significantly more frequently than others. VectorLiteRAG exploits this property through an optimized memory distribution strategy, dynamically allocating the minimum number of vector indices corresponding to frequently accessed clusters onto the GPU HBM to ensure a balanced pipeline with the LLM for high responsiveness. This adaptive partitioning scheme is guided by a statistical model that informs memory allocation and workload distribution. Our evaluation demonstrates that VectorLiteRAG improves vector search responsiveness by 2x, significantly reduces end-to-end TTFT in RAG systems by intelligently balancing memory resources between vector search and LLM execution.

Title: Long Context In-Context Compression by Getting to the Gist of Gisting

Authors: Aleksandar Petrov, Mark Sandler, Andrey Zhmoginov, Nolan Miller, Max Vladymyrov
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08934
Pdf URL: https://arxiv.org/pdf/2504.08934
Copy Paste: [[2504.08934]] Long Context In-Context Compression by Getting to the Gist of Gisting(https://arxiv.org/abs/2504.08934)
Keywords: transformer
Abstract: Long context processing is critical for the adoption of LLMs, but existing methods often introduce architectural complexity that hinders their practical adoption. Gisting, an in-context compression method with no architectural modification to the decoder transformer, is a promising approach due to its simplicity and compatibility with existing frameworks. While effective for short instructions, we demonstrate that gisting struggles with longer contexts, with significant performance drops even at minimal compression rates. Surprisingly, a simple average pooling baseline consistently outperforms gisting. We analyze the limitations of gisting, including information flow interruptions, capacity limitations and the inability to restrict its attention to subsets of the context. Motivated by theoretical insights into the performance gap between gisting and average pooling, and supported by extensive experimentation, we propose GistPool, a new in-context compression method. GistPool preserves the simplicity of gisting, while significantly boosting its performance on long context compression tasks.

Title: Generating Planning Feedback for Open-Ended Programming Exercises with LLMs

Authors: Mehmet Arif Demirtaş, Claire Zheng, Max Fowler, Kathryn Cunningham
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08958
Pdf URL: https://arxiv.org/pdf/2504.08958
Copy Paste: [[2504.08958]] Generating Planning Feedback for Open-Ended Programming Exercises with LLMs(https://arxiv.org/abs/2504.08958)
Keywords: large language model
Abstract: To complete an open-ended programming exercise, students need to both plan a high-level solution and implement it using the appropriate syntax. However, these problems are often autograded on the correctness of the final submission through test cases, and students cannot get feedback on their planning process. Large language models (LLM) may be able to generate this feedback by detecting the overall code structure even for submissions with syntax errors. To this end, we propose an approach that detects which high-level goals and patterns (i.e. programming plans) exist in a student program with LLMs. We show that both the full GPT-4o model and a small variant (GPT-4o-mini) can detect these plans with remarkable accuracy, outperforming baselines inspired by conventional approaches to code analysis. We further show that the smaller, cost-effective variant (GPT-4o-mini) achieves results on par with state-of-the-art (GPT-4o) after fine-tuning, creating promising implications for smaller models for real-time grading. These smaller models can be incorporated into autograders for open-ended code-writing exercises to provide feedback for students' implicit planning skills, even when their program is syntactically incorrect. Furthermore, LLMs may be useful in providing feedback for problems in other domains where students start with a set of high-level solution steps and iteratively compute the output, such as math and physics problems.

Title: MotionDreamer: One-to-Many Motion Synthesis with Localized Generative Masked Transformer

Authors: Yilin Wang, Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Xinxin Zuo, Juwei Lu, Hai Jiang, Li Cheng
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2504.08959
Pdf URL: https://arxiv.org/pdf/2504.08959
Copy Paste: [[2504.08959]] MotionDreamer: One-to-Many Motion Synthesis with Localized Generative Masked Transformer(https://arxiv.org/abs/2504.08959)
Keywords: robust, diffusion, transformer, generative
Abstract: Generative masked transformers have demonstrated remarkable success across various content generation tasks, primarily due to their ability to effectively model large-scale dataset distributions with high consistency. However, in the animation domain, large datasets are not always available. Applying generative masked modeling to generate diverse instances from a single MoCap reference may lead to overfitting, a challenge that remains unexplored. In this work, we present MotionDreamer, a localized masked modeling paradigm designed to learn internal motion patterns from a given motion with arbitrary topology and duration. By embedding the given motion into quantized tokens with a novel distribution regularization method, MotionDreamer constructs a robust and informative codebook for local motion patterns. Moreover, a sliding window local attention is introduced in our masked transformer, enabling the generation of natural yet diverse animations that closely resemble the reference motion patterns. As demonstrated through comprehensive experiments, MotionDreamer outperforms the state-of-the-art methods that are typically GAN or Diffusion-based in both faithfulness and diversity. Thanks to the consistency and robustness of the quantization-based approach, MotionDreamer can also effectively perform downstream tasks such as temporal motion editing, \textcolor{update}{crowd animation}, and beat-aligned dance generation, all using a single reference motion. Visit our project page: this https URL

Title: A Fully Automated Pipeline for Conversational Discourse Annotation: Tree Scheme Generation and Labeling with Large Language Models

Authors: Kseniia Petukhova, Ekaterina Kochmar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08961
Pdf URL: https://arxiv.org/pdf/2504.08961
Copy Paste: [[2504.08961]] A Fully Automated Pipeline for Conversational Discourse Annotation: Tree Scheme Generation and Labeling with Large Language Models(https://arxiv.org/abs/2504.08961)
Keywords: large language model
Abstract: Recent advances in Large Language Models (LLMs) have shown promise in automating discourse annotation for conversations. While manually designing tree annotation schemes significantly improves annotation quality for humans and models, their creation remains time-consuming and requires expert knowledge. We propose a fully automated pipeline that uses LLMs to construct such schemes and perform annotation. We evaluate our approach on speech functions (SFs) and the Switchboard-DAMSL (SWBD-DAMSL) taxonomies. Our experiments compare various design choices, and we show that frequency-guided decision trees, paired with an advanced LLM for annotation, can outperform previously manually designed trees and even match or surpass human annotators while significantly reducing the time required for annotation. We release all code and resultant schemes and annotations to facilitate future research on discourse annotation.

Title: Bidirectional Linear Recurrent Models for Sequence-Level Multisource Fusion

Authors: Qisai Liu, Zhanhong Jiang, Joshua R. Waite, Chao Liu, Aditya Balu, Soumik Sarkar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.08964
Pdf URL: https://arxiv.org/pdf/2504.08964
Copy Paste: [[2504.08964]] Bidirectional Linear Recurrent Models for Sequence-Level Multisource Fusion(https://arxiv.org/abs/2504.08964)
Keywords: transformer
Abstract: Sequence modeling is a critical yet challenging task with wide-ranging applications, especially in time series forecasting for domains like weather prediction, temperature monitoring, and energy load forecasting. Transformers, with their attention mechanism, have emerged as state-of-the-art due to their efficient parallel training, but they suffer from quadratic time complexity, limiting their scalability for long sequences. In contrast, recurrent neural networks (RNNs) offer linear time complexity, spurring renewed interest in linear RNNs for more computationally efficient sequence modeling. In this work, we introduce BLUR (Bidirectional Linear Unit for Recurrent network), which uses forward and backward linear recurrent units (LRUs) to capture both past and future dependencies with high computational efficiency. BLUR maintains the linear time complexity of traditional RNNs, while enabling fast parallel training through LRUs. Furthermore, it offers provably stable training and strong approximation capabilities, making it highly effective for modeling long-term dependencies. Extensive experiments on sequential image and time series datasets reveal that BLUR not only surpasses transformers and traditional RNNs in accuracy but also significantly reduces computational costs, making it particularly suitable for real-world forecasting tasks. Our code is available here.

Title: RAG-Based Fuzzing of Cross-Architecture Compilers

Authors: Rana Elnaggar, Brian Delgado, Jason M. Fung
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.08967
Pdf URL: https://arxiv.org/pdf/2504.08967
Copy Paste: [[2504.08967]] RAG-Based Fuzzing of Cross-Architecture Compilers(https://arxiv.org/abs/2504.08967)
Keywords: security
Abstract: OneAPI is an open standard that supports cross-architecture software development with minimal effort from developers. It brings DPC++ and C++ compilers which need to be thoroughly tested to verify their correctness, reliability, and security. Compilers have numerous code flows and optimization features. This process requires developers with deep understanding of the different compiler flows to craft testcases specific to target paths in the compiler. This testcase creation is a time-consuming and costly process. In this paper, we propose a large-language model (LLM)-based compiler fuzzing tool that integrates the concept of retrieval-augmented generation (RAG). This tool automates the testcase generation task and relieves experienced compiler developers from investing time to craft testcase generation patterns. We test our proposed approach on the Intel DPC++/C++ compiler. This compiler compiles SYCL code and allows developers to offload it to different architectures, e.g. GPUs and CPUs from different vendors. Using this tool, we managed to identify 87 SYCL code test cases that lead to output value mismatch or compiler runtime errors when compiled using Intel DPC++ and clang++ compilers and run on different architectures. The testcases and the identified unexpected behaviors of the compilers under test were obtained within only few hours with no prior background on the compiler passes under tests. This tool facilitates efficient compiler fuzzing with reduced developer time requirements via the dynamic testcase creation capability provided by an LLM with RAG.

Title: Robust Steganography from Large Language Models

Authors: Neil Perry, Sanket Gupte, Nishant Pitta, Lior Rotem
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.08977
Pdf URL: https://arxiv.org/pdf/2504.08977
Copy Paste: [[2504.08977]] Robust Steganography from Large Language Models(https://arxiv.org/abs/2504.08977)
Keywords: attack, robust, large language model
Abstract: Recent steganographic schemes, starting with Meteor (CCS'21), rely on leveraging large language models (LLMs) to resolve a historically-challenging task of disguising covert communication as ``innocent-looking'' natural-language communication. However, existing methods are vulnerable to ``re-randomization attacks,'' where slight changes to the communicated text, that might go unnoticed, completely destroy any hidden message. This is also a vulnerability in more traditional encryption-based stegosystems, where adversaries can modify the randomness of an encryption scheme to destroy the hidden message while preserving an acceptable covertext to ordinary users. In this work, we study the problem of robust steganography. We introduce formal definitions of weak and strong robust LLM-based steganography, corresponding to two threat models in which natural language serves as a covertext channel resistant to realistic re-randomization attacks. We then propose two constructions satisfying these notions. We design and implement our steganographic schemes that embed arbitrary secret messages into natural language text generated by LLMs, ensuring recoverability even under adversarial paraphrasing and rewording attacks. To support further research and real-world deployment, we release our implementation and datasets for public use.

Title: AGENT: An Aerial Vehicle Generation and Design Tool Using Large Language Models

Authors: Colin Samplawski, Adam D. Cobb, Susmit Jha
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08981
Pdf URL: https://arxiv.org/pdf/2504.08981
Copy Paste: [[2504.08981]] AGENT: An Aerial Vehicle Generation and Design Tool Using Large Language Models(https://arxiv.org/abs/2504.08981)
Keywords: large language model
Abstract: Computer-aided design (CAD) is a promising application area for emerging artificial intelligence methods. Traditional workflows for cyberphysical systems create detailed digital models which can be evaluated by physics simulators in order to narrow the search space before creating physical prototypes. A major bottleneck of this approach is that the simulators are often computationally expensive and slow. Recent advancements in AI methods offer the possibility to accelerate these pipelines. We use the recently released AircraftVerse dataset, which is especially suited for developing and evaluating large language models for designs. AircraftVerse contains a diverse set of UAV designs represented via textual design trees together with detailed physics simulation results. Following the recent success of large language models (LLMs), we propose AGENT (Aircraft GENeraTor). AGENT is a comprehensive design tool built on the CodeT5+ LLM which learns powerful representations of aircraft textual designs directly from JSON files. We develop a curriculum of training tasks which imbues a single model with a suite of useful features. AGENT is able to generate designs conditioned on properties of flight dynamics (hover time, maximum speed, etc.). Additionally, AGENT can issue evaluations of designs allowing it to act as a surrogate model of the physics simulation that underlies the AircraftVerse dataset. We present a series of experiments which demonstrate our system's abilities. We are able to achieve strong performance using the smallest member of the CodeT5+ family (220M parameters). This allows for a flexible and powerful system which can be executed on a single GPU enabling a clear path toward future deployment.

Title: Adaptive Additive Parameter Updates of Vision Transformers for Few-Shot Continual Learning

Authors: Kyle Stein, Andrew Arash Mahyari, Guillermo Francia III, Eman El-Sheikh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08982
Pdf URL: https://arxiv.org/pdf/2504.08982
Copy Paste: [[2504.08982]] Adaptive Additive Parameter Updates of Vision Transformers for Few-Shot Continual Learning(https://arxiv.org/abs/2504.08982)
Keywords: robust, transformer
Abstract: Integrating new class information without losing previously acquired knowledge remains a central challenge in artificial intelligence, often referred to as catastrophic forgetting. Few-shot class incremental learning (FSCIL) addresses this by first training a model on a robust dataset of base classes and then incrementally adapting it in successive sessions using only a few labeled examples per novel class. However, this approach is prone to overfitting on the limited new data, which can compromise overall performance and exacerbate forgetting. In this work, we propose a simple yet effective novel FSCIL framework that leverages a frozen Vision Transformer (ViT) backbone augmented with parameter-efficient additive updates. Our approach freezes the pre-trained ViT parameters and selectively injects trainable weights into the self-attention modules via an additive update mechanism. This design updates only a small subset of parameters to accommodate new classes without sacrificing the representations learned during the base session. By fine-tuning a limited number of parameters, our method preserves the generalizable features in the frozen ViT while reducing the risk of overfitting. Furthermore, as most parameters remain fixed, the model avoids overwriting previously learned knowledge when small novel data batches are introduced. Extensive experiments on benchmark datasets demonstrate that our approach yields state-of-the-art performance compared to baseline FSCIL methods.

Title: MCP Bridge: A Lightweight, LLM-Agnostic RESTful Proxy for Model Context Protocol Servers

Authors: Arash Ahmadi, Sarah Sharif, Yaser M. Banad
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08999
Pdf URL: https://arxiv.org/pdf/2504.08999
Copy Paste: [[2504.08999]] MCP Bridge: A Lightweight, LLM-Agnostic RESTful Proxy for Model Context Protocol Servers(https://arxiv.org/abs/2504.08999)
Keywords: security, large language model
Abstract: Large Language Models (LLMs) are increasingly augmented with external tools through standardized interfaces like the Model Context Protocol (MCP). However, current MCP implementations face critical limitations: they typically require local process execution through STDIO transports, making them impractical for resource-constrained environments like mobile devices, web browsers, and edge computing. We present MCP Bridge, a lightweight RESTful proxy that connects to multiple MCP servers and exposes their capabilities through a unified API. Unlike existing solutions, MCP Bridge is fully LLM-agnostic, supporting any backend regardless of vendor. The system implements a risk-based execution model with three security levels standard execution, confirmation workflow, and Docker isolation while maintaining backward compatibility with standard MCP clients. Complementing this server-side infrastructure is a Python based MCP Gemini Agent that facilitates natural language interaction with MCP tools. The evaluation demonstrates that MCP Bridge successfully addresses the constraints of direct MCP connections while providing enhanced security controls and cross-platform compatibility, enabling sophisticated LLM-powered applications in previously inaccessible environments

Title: Detecting Instruction Fine-tuning Attack on Language Models with Influence Function

Authors: Jiawei Li
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2504.09026
Pdf URL: https://arxiv.org/pdf/2504.09026
Copy Paste: [[2504.09026]] Detecting Instruction Fine-tuning Attack on Language Models with Influence Function(https://arxiv.org/abs/2504.09026)
Keywords: security, attack, large language model
Abstract: Instruction fine-tuning attacks pose a significant threat to large language models (LLMs) by subtly embedding poisoned data in fine-tuning datasets, which can trigger harmful or unintended responses across a range of tasks. This undermines model alignment and poses security risks in real-world deployment. In this work, we present a simple and effective approach to detect and mitigate such attacks using influence functions, a classical statistical tool adapted for machine learning interpretation. Traditionally, the high computational costs of influence functions have limited their application to large models and datasets. The recent Eigenvalue-Corrected Kronecker-Factored Approximate Curvature (EK-FAC) approximation method enables efficient influence score computation, making it feasible for large-scale analysis. We are the first to apply influence functions for detecting language model instruction fine-tuning attacks on large-scale datasets, as both the instruction fine-tuning attack on language models and the influence calculation approximation technique are relatively new. Our large-scale empirical evaluation of influence functions on 50,000 fine-tuning examples and 32 tasks reveals a strong association between influence scores and sentiment. Building on this, we introduce a novel sentiment transformation combined with influence functions to detect and remove critical poisons -- poisoned data points that skew model predictions. Removing these poisons (only 1% of total data) recovers model performance to near-clean levels, demonstrating the effectiveness and efficiency of our approach. Artifact is available at this https URL. WARNING: This paper contains offensive data examples.

Title: Associating transportation planning-related measures with Mild Cognitive Impairment

Authors: Souradeep Chattopadhyay, Guillermo Basulto-Elias, Jun Ha Chang, Matthew Rizzo, Shauna Hallmark, Anuj Sharma, Soumik Sarkar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.09027
Pdf URL: https://arxiv.org/pdf/2504.09027
Copy Paste: [[2504.09027]] Associating transportation planning-related measures with Mild Cognitive Impairment(https://arxiv.org/abs/2504.09027)
Keywords: robust
Abstract: Understanding the relationship between mild cognitive impairment and driving behavior is essential to improve road safety, especially among older adults. In this study, we computed certain variables that reflect daily driving habits, such as trips to specific locations (e.g., home, work, medical, social, and errands) of older drivers in Nebraska using geohashing. The computed variables were then analyzed using a two-fold approach involving data visualization and machine learning models (C5.0, Random Forest, Support Vector Machines) to investigate the efficiency of the computed variables in predicting whether a driver is cognitively impaired or unimpaired. The C5.0 model demonstrated robust and stable performance with a median recall of 74\%, indicating that our methodology was able to identify cognitive impairment in drivers 74\% of the time correctly. This highlights our model's effectiveness in minimizing false negatives which is an important consideration given the cost of missing impaired drivers could be potentially high. Our findings highlight the potential of life space variables in understanding and predicting cognitive decline, offering avenues for early intervention and tailored support for affected individuals.

Title: Sculpting Memory: Multi-Concept Forgetting in Diffusion Models via Dynamic Mask and Concept-Aware Optimization

Authors: Gen Li, Yang Xiao, Jie Ji, Kaiyuan Deng, Bo Hui, Linke Guo, Xiaolong Ma
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.09039
Pdf URL: https://arxiv.org/pdf/2504.09039
Copy Paste: [[2504.09039]] Sculpting Memory: Multi-Concept Forgetting in Diffusion Models via Dynamic Mask and Concept-Aware Optimization(https://arxiv.org/abs/2504.09039)
Keywords: diffusion, generative
Abstract: Text-to-image (T2I) diffusion models have achieved remarkable success in generating high-quality images from textual prompts. However, their ability to store vast amounts of knowledge raises concerns in scenarios where selective forgetting is necessary, such as removing copyrighted content, reducing biases, or eliminating harmful concepts. While existing unlearning methods can remove certain concepts, they struggle with multi-concept forgetting due to instability, residual knowledge persistence, and generation quality degradation. To address these challenges, we propose \textbf{Dynamic Mask coupled with Concept-Aware Loss}, a novel unlearning framework designed for multi-concept forgetting in diffusion models. Our \textbf{Dynamic Mask} mechanism adaptively updates gradient masks based on current optimization states, allowing selective weight modifications that prevent interference with unrelated knowledge. Additionally, our \textbf{Concept-Aware Loss} explicitly guides the unlearning process by enforcing semantic consistency through superclass alignment, while a regularization loss based on knowledge distillation ensures that previously unlearned concepts remain forgotten during sequential unlearning. We conduct extensive experiments to evaluate our approach. Results demonstrate that our method outperforms existing unlearning techniques in forgetting effectiveness, output fidelity, and semantic coherence, particularly in multi-concept scenarios. Our work provides a principled and flexible framework for stable and high-fidelity unlearning in generative models. The code will be released publicly.

Title: From Punchlines to Predictions: A Metric to Assess LLM Performance in Identifying Humor in Stand-Up Comedy

Authors: Adrianna Romanowski, Pedro H. V. Valois, Kazuhiro Fukui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09049
Pdf URL: https://arxiv.org/pdf/2504.09049
Copy Paste: [[2504.09049]] From Punchlines to Predictions: A Metric to Assess LLM Performance in Identifying Humor in Stand-Up Comedy(https://arxiv.org/abs/2504.09049)
Keywords: large language model
Abstract: Comedy serves as a profound reflection of the times we live in and is a staple element of human interactions. In light of the widespread adoption of Large Language Models (LLMs), the intersection of humor and AI has become no laughing matter. Advancements in the naturalness of human-computer interaction correlates with improvements in AI systems' abilities to understand humor. In this study, we assess the ability of models in accurately identifying humorous quotes from a stand-up comedy transcript. Stand-up comedy's unique comedic narratives make it an ideal dataset to improve the overall naturalness of comedic understanding. We propose a novel humor detection metric designed to evaluate LLMs amongst various prompts on their capability to extract humorous punchlines. The metric has a modular structure that offers three different scoring methods - fuzzy string matching, sentence embedding, and subspace similarity - to provide an overarching assessment of a model's performance. The model's results are compared against those of human evaluators on the same task. Our metric reveals that regardless of prompt engineering, leading models, ChatGPT, Claude, and DeepSeek, achieve scores of at most 51% in humor detection. Notably, this performance surpasses that of humans who achieve a score of 41%. The analysis of human evaluators and LLMs reveals variability in agreement, highlighting the subjectivity inherent in humor and the complexities involved in extracting humorous quotes from live performance transcripts. Code available at this https URL.

Title: Multimodal 3D Genome Pre-training

Authors: Minghao Yang, Pengteng Li, Yan Liang, Qianyi Cai, Zhihang Zheng, Shichen Zhang, Pengfei Zhang, Zhi-An Huang, Hui Xiong
Subjects: cs.LG, cs.AI, q-bio.GN
Abstract URL: https://arxiv.org/abs/2504.09060
Pdf URL: https://arxiv.org/pdf/2504.09060
Copy Paste: [[2504.09060]] Multimodal 3D Genome Pre-training(https://arxiv.org/abs/2504.09060)
Keywords: robust
Abstract: Deep learning techniques have driven significant progress in various analytical tasks within 3D genomics in computational biology. However, a holistic understanding of 3D genomics knowledge remains underexplored. Here, we propose MIX-HIC, the first multimodal foundation model of 3D genome that integrates both 3D genome structure and epigenomic tracks, which obtains unified and comprehensive semantics. For accurate heterogeneous semantic fusion, we design the cross-modal interaction and mapping blocks for robust unified representation, yielding the accurate aggregation of 3D genome knowledge. Besides, we introduce the first large-scale dataset comprising over 1 million pairwise samples of Hi-C contact maps and epigenomic tracks for high-quality pre-training, enabling the exploration of functional implications in 3D genomics. Extensive experiments show that MIX-HIC can significantly surpass existing state-of-the-art methods in diverse downstream tasks. This work provides a valuable resource for advancing 3D genomics research.

Title: Hyperlocal disaster damage assessment using bi-temporal street-view imagery and pre-trained vision models

Authors: Yifan Yang, Lei Zou, Bing Zhou, Daoyang Li, Binbin Lin, Joynal Abedin, Mingzheng Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09066
Pdf URL: https://arxiv.org/pdf/2504.09066
Copy Paste: [[2504.09066]] Hyperlocal disaster damage assessment using bi-temporal street-view imagery and pre-trained vision models(https://arxiv.org/abs/2504.09066)
Keywords: transformer
Abstract: Street-view images offer unique advantages for disaster damage estimation as they capture impacts from a visual perspective and provide detailed, on-the-ground insights. Despite several investigations attempting to analyze street-view images for damage estimation, they mainly focus on post-disaster images. The potential of time-series street-view images remains underexplored. Pre-disaster images provide valuable benchmarks for accurate damage estimations at building and street levels. These images could aid annotators in objectively labeling post-disaster impacts, improving the reliability of labeled data sets for model training, and potentially enhancing the model performance in damage evaluation. The goal of this study is to estimate hyperlocal, on-the-ground disaster damages using bi-temporal street-view images and advanced pre-trained vision models. Street-view images before and after 2024 Hurricane Milton in Horseshoe Beach, Florida, were collected for experiments. The objectives are: (1) to assess the performance gains of incorporating pre-disaster street-view images as a no-damage category in fine-tuning pre-trained models, including Swin Transformer and ConvNeXt, for damage level classification; (2) to design and evaluate a dual-channel algorithm that reads pair-wise pre- and post-disaster street-view images for hyperlocal damage assessment. The results indicate that incorporating pre-disaster street-view images and employing a dual-channel processing framework can significantly enhance damage assessment accuracy. The accuracy improves from 66.14% with the Swin Transformer baseline to 77.11% with the dual-channel Feature-Fusion ConvNeXt model. This research enables rapid, operational damage assessments at hyperlocal spatial resolutions, providing valuable insights to support effective decision-making in disaster management and resilience planning.

Title: Exploring Synergistic Ensemble Learning: Uniting CNNs, MLP-Mixers, and Vision Transformers to Enhance Image Classification

Authors: Mk Bashar, Ocean Monjur, Samia Islam, Mohammad Galib Shams, Niamul Quader
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09076
Pdf URL: https://arxiv.org/pdf/2504.09076
Copy Paste: [[2504.09076]] Exploring Synergistic Ensemble Learning: Uniting CNNs, MLP-Mixers, and Vision Transformers to Enhance Image Classification(https://arxiv.org/abs/2504.09076)
Keywords: transformer
Abstract: In recent years, Convolutional Neural Networks (CNNs), MLP-mixers, and Vision Transformers have risen to prominence as leading neural architectures in image classification. Prior research has underscored the distinct advantages of each architecture, and there is growing evidence that combining modules from different architectures can boost performance. In this study, we build upon and improve previous work exploring the complementarity between different architectures. Instead of heuristically merging modules from various architectures through trial and error, we preserve the integrity of each architecture and combine them using ensemble techniques. By maintaining the distinctiveness of each architecture, we aim to explore their inherent complementarity more deeply and with implicit isolation. This approach provides a more systematic understanding of their individual strengths. In addition to uncovering insights into architectural complementarity, we showcase the effectiveness of even basic ensemble methods that combine models from diverse architectures. These methods outperform ensembles comprised of similar architectures. Our straightforward ensemble framework serves as a foundational strategy for blending complementary architectures, offering a solid starting point for further investigations into the unique strengths and synergies among different architectures and their ensembles in image classification. A direct outcome of this work is the creation of an ensemble of classification networks that surpasses the accuracy of the previous state-of-the-art single classification network on ImageNet, setting a new benchmark, all while requiring less overall latency.

Title: A Visual Self-attention Mechanism Facial Expression Recognition Network beyond Convnext

Authors: Bingyu Nan, Feng Liu, Xuezhong Qian, Wei Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09077
Pdf URL: https://arxiv.org/pdf/2504.09077
Copy Paste: [[2504.09077]] A Visual Self-attention Mechanism Facial Expression Recognition Network beyond Convnext(https://arxiv.org/abs/2504.09077)
Keywords: extraction
Abstract: Facial expression recognition is an important research direction in the field of artificial intelligence. Although new breakthroughs have been made in recent years, the uneven distribution of datasets and the similarity between different categories of facial expressions, as well as the differences within the same category among different subjects, remain challenges. This paper proposes a visual facial expression signal feature processing network based on truncated ConvNeXt approach(Conv-cut), to improve the accuracy of FER under challenging conditions. The network uses a truncated ConvNeXt-Base as the feature extractor, and then we designed a Detail Extraction Block to extract detailed features, and introduced a Self-Attention mechanism to enable the network to learn the extracted features more effectively. To evaluate the proposed Conv-cut approach, we conducted experiments on the RAF-DB and FERPlus datasets, and the results show that our model has achieved state-of-the-art performance. Our code could be accessed at Github.

Title: crowd-hpo: Realistic Hyperparameter Optimization and Benchmarking for Learning from Crowds with Noisy Labels

Authors: Marek Herde, Lukas Lührs, Denis Huseljic, Bernhard Sick
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.09085
Pdf URL: https://arxiv.org/pdf/2504.09085
Copy Paste: [[2504.09085]] crowd-hpo: Realistic Hyperparameter Optimization and Benchmarking for Learning from Crowds with Noisy Labels(https://arxiv.org/abs/2504.09085)
Keywords: fair
Abstract: Crowdworking is a cost-efficient solution to acquire class labels. Since these labels are subject to noise, various approaches to learning from crowds have been proposed. Typically, these approaches are evaluated with default hyperparameters, resulting in suboptimal performance, or with hyperparameters tuned using a validation set with ground truth class labels, representing an often unrealistic scenario. Moreover, both experimental setups can produce different rankings of approaches, complicating comparisons between studies. Therefore, we introduce crowd-hpo as a realistic benchmark and experimentation protocol including hyperparameter optimization under noisy crowd-labeled data. At its core, crowd-hpo investigates model selection criteria to identify well-performing hyperparameter configurations only with access to noisy crowd-labeled validation data. Extensive experimental evaluations with neural networks show that these criteria are effective for optimizing hyperparameters in learning from crowds approaches. Accordingly, incorporating such criteria into experimentation protocols is essential for enabling more realistic and fair benchmarking.

Title: Privacy Preservation in Gen AI Applications

Authors: Swetha S, Ram Sundhar K Shaju, Rakshana M, Ganesh R, Balavedhaa S, Thiruvaazhi U
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09095
Pdf URL: https://arxiv.org/pdf/2504.09095
Copy Paste: [[2504.09095]] Privacy Preservation in Gen AI Applications(https://arxiv.org/abs/2504.09095)
Keywords: secure, security, privacy, protect, attack, extraction, membership infer, generative, large language model
Abstract: The ability of machines to comprehend and produce language that is similar to that of humans has revolutionized sectors like customer service, healthcare, and finance thanks to the quick advances in Natural Language Processing (NLP), which are fueled by Generative Artificial Intelligence (AI) and Large Language Models (LLMs). However, because LLMs trained on large datasets may unintentionally absorb and reveal Personally Identifiable Information (PII) from user interactions, these capabilities also raise serious privacy concerns. Deep neural networks' intricacy makes it difficult to track down or stop the inadvertent storing and release of private information, which raises serious concerns about the privacy and security of AI-driven data. This study tackles these issues by detecting Generative AI weaknesses through attacks such as data extraction, model inversion, and membership inference. A privacy-preserving Generative AI application that is resistant to these assaults is then developed. It ensures privacy without sacrificing functionality by using methods to identify, alter, or remove PII before to dealing with LLMs. In order to determine how well cloud platforms like Microsoft Azure, Google Cloud, and AWS provide privacy tools for protecting AI applications, the study also examines these technologies. In the end, this study offers a fundamental privacy paradigm for generative AI systems, focusing on data security and moral AI implementation, and opening the door to a more secure and conscientious use of these tools.

Title: BIGS: Bimanual Category-agnostic Interaction Reconstruction from Monocular Videos via 3D Gaussian Splatting

Authors: Jeongwan On, Kyeonghwan Gwak, Gunyoung Kang, Junuk Cha, Soohyun Hwang, Hyein Hwang, Seungryul Baek
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09097
Pdf URL: https://arxiv.org/pdf/2504.09097
Copy Paste: [[2504.09097]] BIGS: Bimanual Category-agnostic Interaction Reconstruction from Monocular Videos via 3D Gaussian Splatting(https://arxiv.org/abs/2504.09097)
Keywords: robust, diffusion
Abstract: Reconstructing 3Ds of hand-object interaction (HOI) is a fundamental problem that can find numerous applications. Despite recent advances, there is no comprehensive pipeline yet for bimanual class-agnostic interaction reconstruction from a monocular RGB video, where two hands and an unknown object are interacting with each other. Previous works tackled the limited hand-object interaction case, where object templates are pre-known or only one hand is involved in the interaction. The bimanual interaction reconstruction exhibits severe occlusions introduced by complex interactions between two hands and an object. To solve this, we first introduce BIGS (Bimanual Interaction 3D Gaussian Splatting), a method that reconstructs 3D Gaussians of hands and an unknown object from a monocular video. To robustly obtain object Gaussians avoiding severe occlusions, we leverage prior knowledge of pre-trained diffusion model with score distillation sampling (SDS) loss, to reconstruct unseen object parts. For hand Gaussians, we exploit the 3D priors of hand model (i.e., MANO) and share a single Gaussian for two hands to effectively accumulate hand 3D information, given limited views. To further consider the 3D alignment between hands and objects, we include the interacting-subjects optimization step during Gaussian optimization. Our method achieves the state-of-the-art accuracy on two challenging datasets, in terms of 3D hand pose estimation (MPJPE), 3D object reconstruction (CDh, CDo, F10), and rendering quality (PSNR, SSIM, LPIPS), respectively.

Title: Synthetic Aircraft Trajectory Generation Using Time-Based VQ-VAE

Authors: Abdulmajid Murad, Massimiliano Ruocco
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09101
Pdf URL: https://arxiv.org/pdf/2504.09101
Copy Paste: [[2504.09101]] Synthetic Aircraft Trajectory Generation Using Time-Based VQ-VAE(https://arxiv.org/abs/2504.09101)
Keywords: protect, transformer
Abstract: In modern air traffic management, generating synthetic flight trajectories has emerged as a promising solution for addressing data scarcity, protecting sensitive information, and supporting large-scale analyses. In this paper, we propose a novel method for trajectory synthesis by adapting the Time-Based Vector Quantized Variational Autoencoder (TimeVQVAE). Our approach leverages time-frequency domain processing, vector quantization, and transformer-based priors to capture both global and local dynamics in flight data. By discretizing the latent space and integrating transformer priors, the model learns long-range spatiotemporal dependencies and preserves coherence across entire flight paths. We evaluate the adapted TimeVQVAE using an extensive suite of quality, statistical, and distributional metrics, as well as a flyability assessment conducted in an open-source air traffic simulator. Results indicate that TimeVQVAE outperforms a temporal convolutional VAE baseline, generating synthetic trajectories that mirror real flight data in terms of spatial accuracy, temporal consistency, and statistical properties. Furthermore, the simulator-based assessment shows that most generated trajectories maintain operational feasibility, although occasional outliers underscore the potential need for additional domain-specific constraints. Overall, our findings underscore the importance of multi-scale representation learning for capturing complex flight behaviors and demonstrate the promise of TimeVQVAE in producing representative synthetic trajectories for downstream tasks such as model training, airspace design, and air traffic forecasting.

Title: Shrinkage Initialization for Smooth Learning of Neural Networks

Authors: Miao Cheng, Feiyan Zhou, Hongwei Zou, Limin Wang
Subjects: cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2504.09107
Pdf URL: https://arxiv.org/pdf/2504.09107
Copy Paste: [[2504.09107]] Shrinkage Initialization for Smooth Learning of Neural Networks(https://arxiv.org/abs/2504.09107)
Keywords: robust
Abstract: The successes of intelligent systems have quite relied on the artificial learning of information, which lead to the broad applications of neural learning solutions. As a common sense, the training of neural networks can be largely improved by specifically defined initialization, neuron layers as well as the activation functions. Though there are sequential layer based initialization available, the generalized solution to initial stages is still desired. In this work, an improved approach to initialization of neural learning is presented, which adopts the shrinkage approach to initialize the transformation of each layer of networks. It can be universally adapted for the structures of any networks with random layers, while stable performance can be attained. Furthermore, the smooth learning of networks is adopted in this work, due to the diverse influence on neural learning. Experimental results on several artificial data sets demonstrate that, the proposed method is able to present robust results with the shrinkage initialization, and competent for smooth learning of neural networks.

Title: Probability Distribution Alignment and Low-Rank Weight Decomposition for Source-Free Domain Adaptive Brain Decoding

Authors: Ganxi Xu, Jinyi Long, Hanrui Wu, Jia Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09109
Pdf URL: https://arxiv.org/pdf/2504.09109
Copy Paste: [[2504.09109]] Probability Distribution Alignment and Low-Rank Weight Decomposition for Source-Free Domain Adaptive Brain Decoding(https://arxiv.org/abs/2504.09109)
Keywords: privacy
Abstract: Brain decoding currently faces significant challenges in individual differences, modality alignment, and high-dimensional embeddings. To address individual differences, researchers often use source subject data, which leads to issues such as privacy leakage and heavy data storage burdens. In modality alignment, current works focus on aligning the softmax probability distribution but neglect the alignment of marginal probability distributions, resulting in modality misalignment. Additionally, images and text are aligned separately with fMRI without considering the complex interplay between images and text, leading to poor image reconstruction. Finally, the enormous dimensionality of CLIP embeddings causes significant computational costs. Although the dimensionality of CLIP embeddings can be reduced by ignoring the number of patches obtained from images and the number of tokens acquired from text, this comes at the cost of a significant drop in model performance, creating a dilemma. To overcome these limitations, we propose a source-free domain adaptation-based brain decoding framework

Title: Deploying Large AI Models on Resource-Limited Devices with Split Federated Learning

Authors: Xianke Qiang, Hongda Liu, Xinran Zhang, Zheng Chang, Ying-Chang Liang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.09114
Pdf URL: https://arxiv.org/pdf/2504.09114
Copy Paste: [[2504.09114]] Deploying Large AI Models on Resource-Limited Devices with Split Federated Learning(https://arxiv.org/abs/2504.09114)
Keywords: privacy, federate
Abstract: Large Artificial Intelligence Models (LAMs) powered by massive datasets, extensive parameter scales, and extensive computational resources, leading to significant transformations across various industries. Yet, their practical deployment on resource-limited mobile edge devices is hindered by critical challenges such as data privacy, constrained resources, and high overhead costs. Addressing this gap, this paper proposes a novel framework, named Quantized Split Federated Fine-Tuning Large AI Model (SFLAM). By partitioning the training load between edge devices and servers using a split learning paradigm, SFLAM can facilitate the operation of large models on devices and significantly lowers the memory requirements on edge devices. Additionally, SFLAM incorporates quantization management, power control, and bandwidth allocation strategies to enhance training efficiency while concurrently reducing energy consumption and communication latency. A theoretical analysis exploring the latency-energy trade-off is presented, and the framework's efficacy is validated via comprehensive simulations. The findings indicate that SFLAM achieves superior performance in terms of learning efficiency and scalability compared to conventional methods, thereby providing a valuable approach for enabling advanced AI services in resource-constrained scenarios.

Title: CAShift: Benchmarking Log-Based Cloud Attack Detection under Normality Shift

Authors: Jiongchi Yu, Xiaofei Xie, Qiang Hu, Bowen Zhang, Ziming Zhao, Yun Lin, Lei Ma, Ruitao Feng, Frank Liau
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2504.09115
Pdf URL: https://arxiv.org/pdf/2504.09115
Copy Paste: [[2504.09115]] CAShift: Benchmarking Log-Based Cloud Attack Detection under Normality Shift(https://arxiv.org/abs/2504.09115)
Keywords: attack, robust
Abstract: With the rapid advancement of cloud-native computing, securing cloud environments has become an important task. Log-based Anomaly Detection (LAD) is the most representative technique used in different systems for attack detection and safety guarantee, where multiple LAD methods and relevant datasets have been proposed. However, even though some of these datasets are specifically prepared for cloud systems, they only cover limited cloud behaviors and lack information from a whole-system perspective. Besides, another critical issue to consider is normality shift, which implies the test distribution could differ from the training distribution and highly affects the performance of LAD. Unfortunately, existing works only focus on simple shift types such as chronological changes, while other important and cloud-specific shift types are ignored, e.g., the distribution shift introduced by different deployed cloud architectures. Therefore, creating a new dataset that covers diverse behaviors of cloud systems and normality shift types is necessary. To fill the gap in evaluating LAD under real-world conditions, we present CAShift, the first normality shift-aware dataset for cloud systems. CAShift captures three shift types, including application, version, and cloud architecture shifts, and includes 20 diverse attack scenarios across various cloud components. Using CAShift, we conduct an empirical study showing that (1) all LAD methods are significantly affected by normality shifts, with performance drops of up to 34%, and (2) continuous learning techniques can improve F1-scores by up to 27%, depending on data usage and algorithm choice. Based on our findings, we offer valuable implications for future research in designing more robust LAD models and methods for LAD shift adaptation.

Title: Self-Supervised Autoencoder Network for Robust Heart Rate Extraction from Noisy Photoplethysmogram: Applying Blind Source Separation to Biosignal Analysis

Authors: Matthew B. Webster, Dongheon Lee, Joonnyong Lee
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2504.09132
Pdf URL: https://arxiv.org/pdf/2504.09132
Copy Paste: [[2504.09132]] Self-Supervised Autoencoder Network for Robust Heart Rate Extraction from Noisy Photoplethysmogram: Applying Blind Source Separation to Biosignal Analysis(https://arxiv.org/abs/2504.09132)
Keywords: robust, extraction
Abstract: Biosignals can be viewed as mixtures measuring particular physiological events, and blind source separation (BSS) aims to extract underlying source signals from mixtures. This paper proposes a self-supervised multi-encoder autoencoder (MEAE) to separate heartbeat-related source signals from photoplethysmogram (PPG), enhancing heart rate (HR) detection in noisy PPG data. The MEAE is trained on PPG signals from a large open polysomnography database without any pre-processing or data selection. The trained network is then applied to a noisy PPG dataset collected during the daily activities of nine subjects. The extracted heartbeat-related source signal significantly improves HR detection as compared to the original PPG. The absence of pre-processing and the self-supervised nature of the proposed method, combined with its strong performance, highlight the potential of BSS in biosignal analysis.

Title: Efficient and Asymptotically Unbiased Constrained Decoding for Large Language Models

Authors: Haotian Ye, Himanshu Jain, Chong You, Ananda Theertha Suresh, Haowei Lin, James Zou, Felix Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09135
Pdf URL: https://arxiv.org/pdf/2504.09135
Copy Paste: [[2504.09135]] Efficient and Asymptotically Unbiased Constrained Decoding for Large Language Models(https://arxiv.org/abs/2504.09135)
Keywords: large language model
Abstract: In real-world applications of large language models, outputs are often required to be confined: selecting items from predefined product or document sets, generating phrases that comply with safety standards, or conforming to specialized formatting styles. To control the generation, constrained decoding has been widely adopted. However, existing prefix-tree-based constrained decoding is inefficient under GPU-based model inference paradigms, and it introduces unintended biases into the output distribution. This paper introduces Dynamic Importance Sampling for Constrained Decoding (DISC) with GPU-based Parallel Prefix-Verification (PPV), a novel algorithm that leverages dynamic importance sampling to achieve theoretically guaranteed asymptotic unbiasedness and overcomes the inefficiency of prefix-tree. Extensive experiments demonstrate the superiority of our method over existing methods in both efficiency and output quality. These results highlight the potential of our methods to improve constrained generation in applications where adherence to specific constraints is essential.

Title: Kernel-Based Enhanced Oversampling Method for Imbalanced Classification

Authors: Wenjie Li, Sibo Zhu, Zhijian Li, Hanlin Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.09147
Pdf URL: https://arxiv.org/pdf/2504.09147
Copy Paste: [[2504.09147]] Kernel-Based Enhanced Oversampling Method for Imbalanced Classification(https://arxiv.org/abs/2504.09147)
Keywords: robust
Abstract: This paper introduces a novel oversampling technique designed to improve classification performance on imbalanced datasets. The proposed method enhances the traditional SMOTE algorithm by incorporating convex combination and kernel-based weighting to generate synthetic samples that better represent the minority class. Through experiments on multiple real-world datasets, we demonstrate that the new technique outperforms existing methods in terms of F1-score, G-mean, and AUC, providing a robust solution for handling imbalanced datasets in classification tasks.

Title: MatWheel: Addressing Data Scarcity in Materials Science Through Synthetic Data

Authors: Wentao Li, Yizhe Chen, Jiangjie Qiu, Xiaonan Wang
Subjects: cs.LG, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2504.09152
Pdf URL: https://arxiv.org/pdf/2504.09152
Copy Paste: [[2504.09152]] MatWheel: Addressing Data Scarcity in Materials Science Through Synthetic Data(https://arxiv.org/abs/2504.09152)
Keywords: generative
Abstract: Data scarcity and the high cost of annotation have long been persistent challenges in the field of materials science. Inspired by its potential in other fields like computer vision, we propose the MatWheel framework, which train the material property prediction model using the synthetic data generated by the conditional generative model. We explore two scenarios: fully-supervised and semi-supervised learning. Using CGCNN for property prediction and Con-CDVAE as the conditional generative model, experiments on two data-scarce material property datasets from Matminer database are conducted. Results show that synthetic data has potential in extreme data-scarce scenarios, achieving performance close to or exceeding that of real samples in all two tasks. We also find that pseudo-labels have little impact on generated data quality. Future work will integrate advanced models and optimize generation conditions to boost the effectiveness of the materials data flywheel.

Title: Secure Physical Layer Communications for Low-Altitude Economy Networking: A Survey

Authors: Lingyi Cai, Jiacheng Wang, Ruichen Zhang, Yu Zhang, Tao Jiang, Dusit Niyato, Xianbin Wang, Abbas Jamalipour, Xuemin Shen
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.09153
Pdf URL: https://arxiv.org/pdf/2504.09153
Copy Paste: [[2504.09153]] Secure Physical Layer Communications for Low-Altitude Economy Networking: A Survey(https://arxiv.org/abs/2504.09153)
Keywords: secure, security, protect, defense
Abstract: The Low-Altitude Economy Networking (LAENet) is emerging as a transformative paradigm that enables an integrated and sophisticated communication infrastructure to support aerial vehicles in carrying out a wide range of economic activities within low-altitude airspace. However, the physical layer communications in the LAENet face growing security threats due to inherent characteristics of aerial communication environments, such as signal broadcast nature and channel openness. These challenges highlight the urgent need for safeguarding communication confidentiality, availability, and integrity. In view of the above, this survey comprehensively reviews existing secure countermeasures for physical layer communication in the LAENet. We explore core methods focusing on anti-eavesdropping and authentication for ensuring communication confidentiality. Subsequently, availability-enhancing techniques are thoroughly discussed for anti-jamming and spoofing defense. Then, we review approaches for safeguarding integrity through anomaly detection and injection protection. Furthermore, we discuss future research directions, emphasizing energy-efficient physical layer security, multi-drone collaboration for secure communication, AI-driven security defense strategy, space-air-ground integrated security architecture, and 6G-enabled secure UAV communication. This survey may provide valuable references and new insights for researchers in the field of secure physical layer communication for the LAENet.

Title: Evolved Hierarchical Masking for Self-Supervised Learning

Authors: Zhanzhou Feng, Shiliang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09155
Pdf URL: https://arxiv.org/pdf/2504.09155
Copy Paste: [[2504.09155]] Evolved Hierarchical Masking for Self-Supervised Learning(https://arxiv.org/abs/2504.09155)
Keywords: segmentation
Abstract: Existing Masked Image Modeling methods apply fixed mask patterns to guide the self-supervised training. As those mask patterns resort to different criteria to depict image contents, sticking to a fixed pattern leads to a limited vision cues modeling this http URL paper introduces an evolved hierarchical masking method to pursue general visual cues modeling in self-supervised learning. The proposed method leverages the vision model being trained to parse the input visual cues into a hierarchy structure, which is hence adopted to generate masks accordingly. The accuracy of hierarchy is on par with the capability of the model being trained, leading to evolved mask patterns at different training stages. Initially, generated masks focus on low-level visual cues to grasp basic textures, then gradually evolve to depict higher-level cues to reinforce the learning of more complicated object semantics and contexts. Our method does not require extra pre-trained models or annotations and ensures training efficiency by evolving the training difficulty. We conduct extensive experiments on seven downstream tasks including partial-duplicate image retrieval relying on low-level details, as well as image classification and semantic segmentation that require semantic parsing capability. Experimental results demonstrate that it substantially boosts performance across these tasks. For instance, it surpasses the recent MAE by 1.1\% in imageNet-1K classification and 1.4\% in ADE20K segmentation with the same training epochs. We also align the proposed method with the current research focus on LLMs. The proposed approach bridges the gap with large-scale pre-training on semantic demanding tasks and enhances intricate detail perception in tasks requiring low-level feature recognition.

Title: LEREL: Lipschitz Continuity-Constrained Emotion Recognition Ensemble Learning For Electroencephalography

Authors: Shengyu Gong, Yueyang Li, Zijian Kang, Weiming Zeng, Hongjie Yan, Wai Ting Siok, Nizhuan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09156
Pdf URL: https://arxiv.org/pdf/2504.09156
Copy Paste: [[2504.09156]] LEREL: Lipschitz Continuity-Constrained Emotion Recognition Ensemble Learning For Electroencephalography(https://arxiv.org/abs/2504.09156)
Keywords: robust
Abstract: Accurate and efficient perception of emotional states in oneself and others is crucial, as emotion-related disorders are associated with severe psychosocial impairments. While electroencephalography (EEG) offers a powerful tool for emotion detection, current EEG-based emotion recognition (EER) methods face key limitations: insufficient model stability, limited accuracy in processing high-dimensional nonlinear EEG signals, and poor robustness against intra-subject variability and signal noise. To address these challenges, we propose LEREL (Lipschitz continuity-constrained Emotion Recognition Ensemble Learning), a novel framework that significantly enhances both the accuracy and robustness of emotion recognition performance. The LEREL framework employs Lipschitz continuity constraints to enhance model stability and generalization in EEG emotion recognition, reducing signal variability and noise susceptibility while maintaining strong performance on small-sample datasets. The ensemble learning strategy reduces single-model bias and variance through multi-classifier decision fusion, further optimizing overall performance. Experimental results on three public benchmark datasets (EAV, FACED and SEED) demonstrate LEREL's effectiveness, achieving average recognition accuracies of 76.43%, 83.00% and 89.22%, respectively.

Title: Can postgraduate translation students identify machine-generated text?

Authors: Michael Farrell
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09164
Pdf URL: https://arxiv.org/pdf/2504.09164
Copy Paste: [[2504.09164]] Can postgraduate translation students identify machine-generated text?(https://arxiv.org/abs/2504.09164)
Keywords: generative
Abstract: Given the growing use of generative artificial intelligence as a tool for creating multilingual content and bypassing both machine and traditional translation methods, this study explores the ability of linguistically trained individuals to discern machine-generated output from human-written text (HT). After brief training sessions on the textual anomalies typically found in synthetic text (ST), twenty-three postgraduate translation students analysed excerpts of Italian prose and assigned likelihood scores to indicate whether they believed they were human-written or AI-generated (ChatGPT-4o). The results show that, on average, the students struggled to distinguish between HT and ST, with only two participants achieving notable accuracy. Closer analysis revealed that the students often identified the same textual anomalies in both HT and ST, although features such as low burstiness and self-contradiction were more frequently associated with ST. These findings suggest the need for improvements in the preparatory training. Moreover, the study raises questions about the necessity of editing synthetic text to make it sound more human-like and recommends further research to determine whether AI-generated text is already sufficiently natural-sounding not to require further refinement.

Title: Langformers: Unified NLP Pipelines for Language Models

Authors: Rabindra Lamsal, Maria Rodriguez Read, Shanika Karunasekera
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09170
Pdf URL: https://arxiv.org/pdf/2504.09170
Copy Paste: [[2504.09170]] Langformers: Unified NLP Pipelines for Language Models(https://arxiv.org/abs/2504.09170)
Keywords: transformer, large language model
Abstract: Transformer-based language models have revolutionized the field of natural language processing (NLP). However, using these models often involves navigating multiple frameworks and tools, as well as writing repetitive boilerplate code. This complexity can discourage non-programmers and beginners, and even slow down prototyping for experienced developers. To address these challenges, we introduce Langformers, an open-source Python library designed to streamline NLP pipelines through a unified, factory-based interface for large language model (LLM) and masked language model (MLM) tasks. Langformers integrates conversational AI, MLM pretraining, text classification, sentence embedding/reranking, data labelling, semantic search, and knowledge distillation into a cohesive API, supporting popular platforms such as Hugging Face and Ollama. Key innovations include: (1) task-specific factories that abstract training, inference, and deployment complexities; (2) built-in memory and streaming for conversational agents; and (3) lightweight, modular design that prioritizes ease of use. Documentation: this https URL

Title: A Confounding Factors-Inhibition Adversarial Learning Framework for Multi-site fMRI Mental Disorder Identification

Authors: Xin Wen, Shijie Guo, Wenbo Ning, Rui Cao, Yan Niu, Bin Wan, Peng Wei, Xiaobo Liu, Jie Xiang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09179
Pdf URL: https://arxiv.org/pdf/2504.09179
Copy Paste: [[2504.09179]] A Confounding Factors-Inhibition Adversarial Learning Framework for Multi-site fMRI Mental Disorder Identification(https://arxiv.org/abs/2504.09179)
Keywords: extraction
Abstract: In open data sets of functional magnetic resonance imaging (fMRI), the heterogeneity of the data is typically attributed to a combination of factors, including differences in scanning procedures, the presence of confounding effects, and population diversities between multiple sites. These factors contribute to the diminished effectiveness of representation learning, which in turn affects the overall efficacy of subsequent classification procedures. To address these limitations, we propose a novel multi-site adversarial learning network (MSalNET) for fMRI-based mental disorder detection. Firstly, a representation learning module is introduced with a node information assembly (NIA) mechanism to better extract features from functional connectivity (FC). This mechanism aggregates edge information from both horizontal and vertical directions, effectively assembling node information. Secondly, to generalize the feature across sites, we proposed a site-level feature extraction module that can learn from individual FC data, which circumvents additional prior information. Lastly, an adversarial learning network is proposed as a means of balancing the trade-off between individual classification and site regression tasks, with the introduction of a novel loss function. The proposed method was evaluated on two multi-site fMRI datasets, i.e., Autism Brain Imaging Data Exchange (ABIDE) and ADHD-200. The results indicate that the proposed method achieves a better performance than other related algorithms with the accuracy of 75.56 and 68.92 in ABIDE and ADHD-200 datasets, respectively. Furthermore, the result of the site regression indicates that the proposed method reduces site variability from a data-driven perspective. The most discriminative brain regions revealed by NIA are consistent with statistical findings, uncovering the "black box" of deep learning to a certain extent.

Title: A Multi-Layered Security Analysis of Blockchain Systems: From Attack Vectors to Defense and System Hardening

Authors: Yuhuan Yang, Shipeng Ye, Xiaoqi Li
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.09181
Pdf URL: https://arxiv.org/pdf/2504.09181
Copy Paste: [[2504.09181]] A Multi-Layered Security Analysis of Blockchain Systems: From Attack Vectors to Defense and System Hardening(https://arxiv.org/abs/2504.09181)
Keywords: security, defense, attack
Abstract: The application of Bitcoin enables people to understand blockchain technology gradually. Bitcoin is a decentralized currency that does not rely on third-party credit institutions, and the core of Bitcoin's underlying technology is blockchain. With the increasing value of Bitcoin and the vigorous development of decentralization, people's research on blockchain is also increasing day by day. Today's blockchain technology has not only made great achievements in the application of Bitcoin, but has also been preliminarily applied in other fields, such as finance, medical treatment, the Internet of Things, and so on. However, with the initial application of blockchain technology on the Internet, the security of blockchain technology has also been widely concerned by people in the industry. For example, whether currency trading platforms, smart contracts, blockchain consensus mechanisms, and other technologies are vulnerable to attacks, and how we can defend against these attacks digitally and optimize the blockchain system is exactly the subject we want to study. For the security of appeal blockchain, this paper first analyzes the security threats faced by the application digital currency trading platform of the blockchain system, then analyzes the security problems of smart contract closely related to blockchain 2.0, and then analyzes and studies the security threats of blockchain public chain, consensus mechanism, and P2P. Finally, combined with the security problems at all levels of the blockchain system we analyze and study how to optimize the security of the blockchain system.

Title: Feature-Aware Malicious Output Detection and Mitigation

Authors: Weilong Dong, Peiguang Li, Yu Tian, Xinyi Zeng, Fengdi Li, Sirui Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09191
Pdf URL: https://arxiv.org/pdf/2504.09191
Copy Paste: [[2504.09191]] Feature-Aware Malicious Output Detection and Mitigation(https://arxiv.org/abs/2504.09191)
Keywords: defense, attack, large language model
Abstract: The rapid advancement of large language models (LLMs) has brought significant benefits to various domains while introducing substantial risks. Despite being fine-tuned through reinforcement learning, LLMs lack the capability to discern malicious content, limiting their defense against jailbreak. To address these safety concerns, we propose a feature-aware method for harmful response rejection (FMM), which detects the presence of malicious features within the model's feature space and adaptively adjusts the model's rejection mechanism. By employing a simple discriminator, we detect potential malicious traits during the decoding phase. Upon detecting features indicative of toxic tokens, FMM regenerates the current token. By employing activation patching, an additional rejection vector is incorporated during the subsequent token generation, steering the model towards a refusal response. Experimental results demonstrate the effectiveness of our approach across multiple language models and diverse attack techniques, while crucially maintaining the models' standard generation capabilities.

Title: Towards More Efficient, Robust, Instance-adaptive, and Generalizable Online Learning

Authors: Zhiyong Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.09192
Pdf URL: https://arxiv.org/pdf/2504.09192
Copy Paste: [[2504.09192]] Towards More Efficient, Robust, Instance-adaptive, and Generalizable Online Learning(https://arxiv.org/abs/2504.09192)
Keywords: robust, large language model
Abstract: The primary goal of my Ph.D. study is to develop provably efficient and practical algorithms for data-driven online sequential decision-making under uncertainty. My work focuses on reinforcement learning (RL), multi-armed bandits, and their applications, including recommendation systems, computer networks, video analytics, and large language models (LLMs). Online learning methods, such as bandits and RL, have demonstrated remarkable success - ranging from outperforming human players in complex games like Atari and Go to advancing robotics, recommendation systems, and fine-tuning LLMs. Despite these successes, many established algorithms rely on idealized models that can fail under model misspecifications or adversarial perturbations, particularly in settings where accurate prior knowledge of the underlying model class is unavailable or where malicious users operate within dynamic systems. These challenges are pervasive in real-world applications, where robust and adaptive solutions are critical. Furthermore, while worst-case guarantees provide theoretical reliability, they often fail to capture instance-dependent performance, which can lead to more efficient and practical solutions. Another key challenge lies in generalizing to new, unseen environments, a crucial requirement for deploying these methods in dynamic and unpredictable settings. To address these limitations, my research aims to develop more efficient, robust, instance-adaptive, and generalizable online learning algorithms for both reinforcement learning and bandits. Towards this end, I focus on developing more efficient, robust, instance-adaptive, and generalizable for both general reinforcement learning (RL) and bandits.

Title: ReferGPT: Towards Zero-Shot Referring Multi-Object Tracking

Authors: Tzoulio Chamiti, Leandro Di Bella, Adrian Munteanu, Nikos Deligiannis
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09195
Pdf URL: https://arxiv.org/pdf/2504.09195
Copy Paste: [[2504.09195]] ReferGPT: Towards Zero-Shot Referring Multi-Object Tracking(https://arxiv.org/abs/2504.09195)
Keywords: robust, large language model
Abstract: Tracking multiple objects based on textual queries is a challenging task that requires linking language understanding with object association across frames. Previous works typically train the whole process end-to-end or integrate an additional referring text module into a multi-object tracker, but they both require supervised training and potentially struggle with generalization to open-set queries. In this work, we introduce ReferGPT, a novel zero-shot referring multi-object tracking framework. We provide a multi-modal large language model (MLLM) with spatial knowledge enabling it to generate 3D-aware captions. This enhances its descriptive capabilities and supports a more flexible referring vocabulary without training. We also propose a robust query-matching strategy, leveraging CLIP-based semantic encoding and fuzzy matching to associate MLLM generated captions with user queries. Extensive experiments on Refer-KITTI, Refer-KITTIv2 and Refer-KITTI+ demonstrate that ReferGPT achieves competitive performance against trained methods, showcasing its robustness and zero-shot capabilities in autonomous driving. The codes are available on this https URL

Title: RT-DATR:Real-time Unsupervised Domain Adaptive Detection Transformer with Adversarial Feature Learning

Authors: Feng Lv, Chunlong Xia, Shuo Wang, Huo Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09196
Pdf URL: https://arxiv.org/pdf/2504.09196
Copy Paste: [[2504.09196]] RT-DATR:Real-time Unsupervised Domain Adaptive Detection Transformer with Adversarial Feature Learning(https://arxiv.org/abs/2504.09196)
Keywords: transformer
Abstract: Despite domain-adaptive object detectors based on CNN and transformers have made significant progress in cross-domain detection tasks, it is regrettable that domain adaptation for real-time transformer-based detectors has not yet been explored. Directly applying existing domain adaptation algorithms has proven to be suboptimal. In this paper, we propose RT-DATR, a simple and efficient real-time domain adaptive detection transformer. Building on RT-DETR as our base detector, we first introduce a local object-level feature alignment module to significantly enhance the feature representation of domain invariance during object transfer. Additionally, we introduce a scene semantic feature alignment module designed to boost cross-domain detection performance by aligning scene semantic features. Finally, we introduced a domain query and decoupled it from the object query to further align the instance feature distribution within the decoder layer, reduce the domain gap, and maintain discriminative ability. Experimental results on various benchmarks demonstrate that our method outperforms current state-of-the-art approaches. Our code will be released soon.

Title: Illusion Worlds: Deceptive UI Attacks in Social VR

Authors: Junhee Lee, Hwanjo Heo, Seungwon Woo, Minseok Kim, Jongseop Kim, Jinwoo Kim
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.09199
Pdf URL: https://arxiv.org/pdf/2504.09199
Copy Paste: [[2504.09199]] Illusion Worlds: Deceptive UI Attacks in Social VR(https://arxiv.org/abs/2504.09199)
Keywords: security, attack
Abstract: Social Virtual Reality (VR) platforms have surged in popularity, yet their security risks remain underexplored. This paper presents four novel UI attacks that covertly manipulate users into performing harmful actions through deceptive virtual content. Implemented on VRChat and validated in an IRB-approved study with 30 participants, these attacks demonstrate how deceptive elements can mislead users into malicious actions without their awareness. To address these vulnerabilities, we propose MetaScanner, a proactive countermeasure that rapidly analyzes objects and scripts in virtual worlds, detecting suspicious elements within seconds.

Title: From Visual Explanations to Counterfactual Explanations with Latent Diffusion

Authors: Tung Luu, Nam Le, Duc Le, Bac Le
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09202
Pdf URL: https://arxiv.org/pdf/2504.09202
Copy Paste: [[2504.09202]] From Visual Explanations to Counterfactual Explanations with Latent Diffusion(https://arxiv.org/abs/2504.09202)
Keywords: attack, robust, diffusion
Abstract: Visual counterfactual explanations are ideal hypothetical images that change the decision-making of the classifier with high confidence toward the desired class while remaining visually plausible and close to the initial image. In this paper, we propose a new approach to tackle two key challenges in recent prominent works: i) determining which specific counterfactual features are crucial for distinguishing the "concept" of the target class from the original class, and ii) supplying valuable explanations for the non-robust classifier without relying on the support of an adversarially robust model. Our method identifies the essential region for modification through algorithms that provide visual explanations, and then our framework generates realistic counterfactual explanations by combining adversarial attacks based on pruning the adversarial gradient of the target classifier and the latent diffusion model. The proposed method outperforms previous state-of-the-art results on various evaluation criteria on ImageNet and CelebA-HQ datasets. In general, our method can be applied to arbitrary classifiers, highlight the strong association between visual and counterfactual explanations, make semantically meaningful changes from the target classifier, and provide observers with subtle counterfactual images.

Title: AerOSeg: Harnessing SAM for Open-Vocabulary Segmentation in Remote Sensing Images

Authors: Saikat Dutta, Akhil Vasim, Siddhant Gole, Hamid Rezatofighi, Biplab Banerjee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09203
Pdf URL: https://arxiv.org/pdf/2504.09203
Copy Paste: [[2504.09203]] AerOSeg: Harnessing SAM for Open-Vocabulary Segmentation in Remote Sensing Images(https://arxiv.org/abs/2504.09203)
Keywords: robust, segmentation
Abstract: Image segmentation beyond predefined categories is a key challenge in remote sensing, where novel and unseen classes often emerge during inference. Open-vocabulary image Segmentation addresses these generalization issues in traditional supervised segmentation models while reducing reliance on extensive per-pixel annotations, which are both expensive and labor-intensive to obtain. Most Open-Vocabulary Segmentation (OVS) methods are designed for natural images but struggle with remote sensing data due to scale variations, orientation changes, and complex scene compositions. This necessitates the development of OVS approaches specifically tailored for remote sensing. In this context, we propose AerOSeg, a novel OVS approach for remote sensing data. First, we compute robust image-text correlation features using multiple rotated versions of the input image and domain-specific prompts. These features are then refined through spatial and class refinement blocks. Inspired by the success of the Segment Anything Model (SAM) in diverse domains, we leverage SAM features to guide the spatial refinement of correlation features. Additionally, we introduce a semantic back-projection module and loss to ensure the seamless propagation of SAM's semantic information throughout the segmentation pipeline. Finally, we enhance the refined correlation features using a multi-scale attention-aware decoder to produce the final segmentation map. We validate our SAM-guided Open-Vocabulary Remote Sensing Segmentation model on three benchmark remote sensing datasets: iSAID, DLRSD, and OpenEarthMap. Our model outperforms state-of-the-art open-vocabulary segmentation methods, achieving an average improvement of 2.54 h-mIoU.

Title: Query-based Knowledge Transfer for Heterogeneous Learning Environments

Authors: Norah Alballa, Wenxuan Zhang, Ziquan Liu, Ahmed M. Abdelmoniem, Mohamed Elhoseiny, Marco Canini
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.09205
Pdf URL: https://arxiv.org/pdf/2504.09205
Copy Paste: [[2504.09205]] Query-based Knowledge Transfer for Heterogeneous Learning Environments(https://arxiv.org/abs/2504.09205)
Keywords: privacy, federate, data-free
Abstract: Decentralized collaborative learning under data heterogeneity and privacy constraints has rapidly advanced. However, existing solutions like federated learning, ensembles, and transfer learning, often fail to adequately serve the unique needs of clients, especially when local data representation is limited. To address this issue, we propose a novel framework called Query-based Knowledge Transfer (QKT) that enables tailored knowledge acquisition to fulfill specific client needs without direct data exchange. QKT employs a data-free masking strategy to facilitate communication-efficient query-focused knowledge transfer while refining task-specific parameters to mitigate knowledge interference and forgetting. Our experiments, conducted on both standard and clinical benchmarks, show that QKT significantly outperforms existing collaborative learning methods by an average of 20.91\% points in single-class query settings and an average of 14.32\% points in multi-class query scenarios. Further analysis and ablation studies reveal that QKT effectively balances the learning of new and existing knowledge, showing strong potential for its application in decentralized learning.

Title: FairACE: Achieving Degree Fairness in Graph Neural Networks via Contrastive and Adversarial Group-Balanced Training

Authors: Jiaxin Liu, Xiaoqian Jiang, Cangqi Zhou, Jing Zhang
Subjects: cs.LG, cs.AI, cs.SI, stat.ML
Abstract URL: https://arxiv.org/abs/2504.09210
Pdf URL: https://arxiv.org/pdf/2504.09210
Copy Paste: [[2504.09210]] FairACE: Achieving Degree Fairness in Graph Neural Networks via Contrastive and Adversarial Group-Balanced Training(https://arxiv.org/abs/2504.09210)
Keywords: fair
Abstract: Fairness has been a significant challenge in graph neural networks (GNNs) since degree biases often result in un-equal prediction performance among nodes with varying degrees. Existing GNN models focus on prediction accuracy, frequently overlooking fairness across different degree groups. To addressthis issue, we propose a novel GNN framework, namely Fairness- Aware Asymmetric Contrastive Ensemble (FairACE), which inte-grates asymmetric contrastive learning with adversarial training to improve degree fairness. FairACE captures one-hop local neighborhood information and two-hop monophily similarity to create fairer node representations and employs a degree fairness regulator to balance performance between high-degree and low-degree nodes. During model training, a novel group-balanced fairness loss is proposed to minimize classification disparities across degree groups. In addition, we also propose a novel fairness metric, the Accuracy Distribution Gap (ADG), which can quantitatively assess and ensure equitable performance across different degree-based node groups. Experimental results on both synthetic and real-world datasets demonstrate that FairACE significantly improves degree fairness metrics while maintaining competitive accuracy in comparison to the state-of-the-art GNN models.

Title: Accurate Diagnosis of Respiratory Viruses Using an Explainable Machine Learning with Mid-Infrared Biomolecular Fingerprinting of Nasopharyngeal Secretions

Authors: Wenwen Zhang, Zhouzhuo Tang, Yingmei Feng, Xia Yu, Qi Jie Wang, Zhiping Lin
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2504.09211
Pdf URL: https://arxiv.org/pdf/2504.09211
Copy Paste: [[2504.09211]] Accurate Diagnosis of Respiratory Viruses Using an Explainable Machine Learning with Mid-Infrared Biomolecular Fingerprinting of Nasopharyngeal Secretions(https://arxiv.org/abs/2504.09211)
Keywords: interpretability, transformer
Abstract: Accurate identification of respiratory viruses (RVs) is critical for outbreak control and public health. This study presents a diagnostic system that combines Attenuated Total Reflectance Fourier Transform Infrared Spectroscopy (ATR-FTIR) from nasopharyngeal secretions with an explainable Rotary Position Embedding-Sparse Attention Transformer (RoPE-SAT) model to accurately identify multiple RVs within 10 minutes. Spectral data (4000-00 cm-1) were collected, and the bio-fingerprint region (1800-900 cm-1) was employed for analysis. Standard normal variate (SNV) normalization and second-order derivation were applied to reduce scattering and baseline drift. Gradient-weighted class activation mapping (Grad-CAM) was employed to generate saliency maps, highlighting spectral regions most relevant to classification and enhancing the interpretability of model outputs. Two independent cohorts from Beijing Youan Hospital, processed with different viral transport media (VTMs) and drying methods, were evaluated, with one including influenza B, SARS-CoV-2, and healthy controls, and the other including mycoplasma, SARS-CoV-2, and healthy controls. The model achieved sensitivity and specificity above 94.40% across both cohorts. By correlating model-selected infrared regions with known biomolecular signatures, we verified that the system effectively recognizes virus-specific spectral fingerprints, including lipids, Amide I, Amide II, Amide III, nucleic acids, and carbohydrates, and leverages their weighted contributions for accurate classification.

Title: Multi-scale Activation, Refinement, and Aggregation: Exploring Diverse Cues for Fine-Grained Bird Recognition

Authors: Zhicheng Zhang, Hao Tang, Jinhui Tang
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2504.09215
Pdf URL: https://arxiv.org/pdf/2504.09215
Copy Paste: [[2504.09215]] Multi-scale Activation, Refinement, and Aggregation: Exploring Diverse Cues for Fine-Grained Bird Recognition(https://arxiv.org/abs/2504.09215)
Keywords: transformer
Abstract: Given the critical role of birds in ecosystems, Fine-Grained Bird Recognition (FGBR) has gained increasing attention, particularly in distinguishing birds within similar subcategories. Although Vision Transformer (ViT)-based methods often outperform Convolutional Neural Network (CNN)-based methods in FGBR, recent studies reveal that the limited receptive field of plain ViT model hinders representational richness and makes them vulnerable to scale variance. Thus, enhancing the multi-scale capabilities of existing ViT-based models to overcome this bottleneck in FGBR is a worthwhile pursuit. In this paper, we propose a novel framework for FGBR, namely Multi-scale Diverse Cues Modeling (MDCM), which explores diverse cues at different scales across various stages of a multi-scale Vision Transformer (MS-ViT) in an "Activation-Selection-Aggregation" paradigm. Specifically, we first propose a multi-scale cue activation module to ensure the discriminative cues learned at different stage are mutually different. Subsequently, a multi-scale token selection mechanism is proposed to remove redundant noise and highlight discriminative, scale-specific cues at each stage. Finally, the selected tokens from each stage are independently utilized for bird recognition, and the recognition results from multiple stages are adaptively fused through a multi-scale dynamic aggregation mechanism for final model decisions. Both qualitative and quantitative results demonstrate the effectiveness of our proposed MDCM, which outperforms CNN- and ViT-based models on several widely-used FGBR benchmarks.

Title: DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models

Authors: Wenjin Ke, Zhe Li, Dong Li, Lu Tian, Emad Barsoum
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.09223
Pdf URL: https://arxiv.org/pdf/2504.09223
Copy Paste: [[2504.09223]] DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models(https://arxiv.org/abs/2504.09223)
Keywords: large language model
Abstract: Improving the efficiency of inference in Large Language Models (LLMs) is a critical area of research. Post-training Quantization (PTQ) is a popular technique, but it often faces challenges at low-bit levels, particularly in downstream tasks. Quantization-aware Training (QAT) can alleviate this problem, but it requires significantly more computational resources. To tackle this, we introduced Weight-Decomposed Low-Rank Quantization-Aware Training (DL-QAT), which merges the advantages of QAT while training only less than 1% of the total parameters. Specifically, we introduce a group-specific quantization magnitude to adjust the overall scale of each quantization group. Within each quantization group, we use LoRA matrices to update the weight size and direction in the quantization space. We validated the effectiveness of our method on the LLaMA and LLaMA2 model families. The results show significant improvements over our baseline method across different quantization granularities. For instance, for LLaMA-7B, our approach outperforms the previous state-of-the-art method by 4.2% in MMLU on 3-bit LLaMA-7B model. Additionally, our quantization results on pre-trained models also surpass previous QAT methods, demonstrating the superior performance and efficiency of our approach.

Title: Learning Occlusion-Robust Vision Transformers for Real-Time UAV Tracking

Authors: You Wu, Xucheng Wang, Xiangyang Yang, Mengyuan Liu, Dan Zeng, Hengzhou Ye, Shuiwang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09228
Pdf URL: https://arxiv.org/pdf/2504.09228
Copy Paste: [[2504.09228]] Learning Occlusion-Robust Vision Transformers for Real-Time UAV Tracking(https://arxiv.org/abs/2504.09228)
Keywords: robust, transformer
Abstract: Single-stream architectures using Vision Transformer (ViT) backbones show great potential for real-time UAV tracking recently. However, frequent occlusions from obstacles like buildings and trees expose a major drawback: these models often lack strategies to handle occlusions effectively. New methods are needed to enhance the occlusion resilience of single-stream ViT models in aerial tracking. In this work, we propose to learn Occlusion-Robust Representations (ORR) based on ViTs for UAV tracking by enforcing an invariance of the feature representation of a target with respect to random masking operations modeled by a spatial Cox process. Hopefully, this random masking approximately simulates target occlusions, thereby enabling us to learn ViTs that are robust to target occlusion for UAV tracking. This framework is termed ORTrack. Additionally, to facilitate real-time applications, we propose an Adaptive Feature-Based Knowledge Distillation (AFKD) method to create a more compact tracker, which adaptively mimics the behavior of the teacher model ORTrack according to the task's difficulty. This student model, dubbed ORTrack-D, retains much of ORTrack's performance while offering higher efficiency. Extensive experiments on multiple benchmarks validate the effectiveness of our method, demonstrating its state-of-the-art performance. Codes is available at this https URL.

Title: Type-Constrained Code Generation with Language Models

Authors: Niels Mündler, Jingxuan He, Hao Wang, Koushik Sen, Dawn Song, Martin Vechev
Subjects: cs.LG, cs.PL
Abstract URL: https://arxiv.org/abs/2504.09246
Pdf URL: https://arxiv.org/pdf/2504.09246
Copy Paste: [[2504.09246]] Type-Constrained Code Generation with Language Models(https://arxiv.org/abs/2504.09246)
Keywords: large language model
Abstract: Large language models (LLMs) have achieved notable success in code generation. However, they still frequently produce uncompilable output because their next-token inference procedure does not model formal aspects of code. Although constrained decoding is a promising approach to alleviate this issue, it has only been applied to handle either domain-specific languages or syntactic language features. This leaves typing errors, which are beyond the domain of syntax and generally hard to adequately constrain. To address this challenge, we introduce a type-constrained decoding approach that leverages type systems to guide code generation. We develop novel prefix automata for this purpose and introduce a sound approach to enforce well-typedness based on type inference and a search over inhabitable types. We formalize our approach on a simply-typed language and extend it to TypeScript to demonstrate practicality. Our evaluation on HumanEval shows that our approach reduces compilation errors by more than half and increases functional correctness in code synthesis, translation, and repair tasks across LLMs of various sizes and model families, including SOTA open-weight models with more than 30B parameters.

Title: Head-Aware KV Cache Compression for Efficient Visual Autoregressive Modeling

Authors: Ziran Qin, Youru Lv, Mingbao Lin, Zeren Zhang, Danping Zou, Weiyao Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09261
Pdf URL: https://arxiv.org/pdf/2504.09261
Copy Paste: [[2504.09261]] Head-Aware KV Cache Compression for Efficient Visual Autoregressive Modeling(https://arxiv.org/abs/2504.09261)
Keywords: large language model
Abstract: Visual Autoregressive (VAR) models have emerged as a powerful approach for multi-modal content creation, offering high efficiency and quality across diverse multimedia applications. However, they face significant memory bottlenecks due to extensive KV cache accumulation during inference. Existing KV cache compression techniques for large language models are suboptimal for VAR models due to, as we identify in this paper, two distinct categories of attention heads in VAR models: Structural Heads, which preserve spatial coherence through diagonal attention patterns, and Contextual Heads, which maintain semantic consistency through vertical attention patterns. These differences render single-strategy KV compression techniques ineffective for VAR models. To address this, we propose HACK, a training-free Head-Aware Compression method for KV cache. HACK allocates asymmetric cache budgets and employs pattern-specific compression strategies tailored to the essential characteristics of each head category. Experiments on Infinity-2B, Infinity-8B, and VAR-d30 demonstrate its effectiveness in text-to-image and class-conditional generation tasks. HACK can hack down up to 50\% and 70\% of cache with minimal performance degradation for VAR-d30 and Infinity-8B, respectively. Even with 70\% and 90\% KV cache compression in VAR-d30 and Infinity-8B, HACK still maintains high-quality generation while reducing memory usage by 44.2\% and 58.9\%, respectively.

Title: Mixture of Group Experts for Learning Invariant Representations

Authors: Lei Kang, Jia Li, Mi Tian, Hua Huang
Subjects: cs.LG, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2504.09265
Pdf URL: https://arxiv.org/pdf/2504.09265
Copy Paste: [[2504.09265]] Mixture of Group Experts for Learning Invariant Representations(https://arxiv.org/abs/2504.09265)
Keywords: transformer
Abstract: Sparsely activated Mixture-of-Experts (MoE) models effectively increase the number of parameters while maintaining consistent computational costs per token. However, vanilla MoE models often suffer from limited diversity and specialization among experts, constraining their performance and scalability, especially as the number of experts increases. In this paper, we present a novel perspective on vanilla MoE with top-$k$ routing inspired by sparse representation. This allows us to bridge established theoretical insights from sparse representation into MoE models. Building on this foundation, we propose a group sparse regularization approach for the input of top-$k$ routing, termed Mixture of Group Experts (MoGE). MoGE indirectly regularizes experts by imposing structural constraints on the routing inputs, while preserving the original MoE architecture. Furthermore, we organize the routing input into a 2D topographic map, spatially grouping neighboring elements. This structure enables MoGE to capture representations invariant to minor transformations, thereby significantly enhancing expert diversity and specialization. Comprehensive evaluations across various Transformer models for image classification and language modeling tasks demonstrate that MoGE substantially outperforms its MoE counterpart, with minimal additional memory and computation overhead. Our approach provides a simple yet effective solution to scale the number of experts and reduce redundancy among them. The source code is included in the supplementary material and will be publicly released.

Title: VideoAds for Fast-Paced Video Understanding: Where Opensource Foundation Models Beat GPT-4o & Gemini-1.5 Pro

Authors: Zheyuan Zhang, Monica Dou, Linkai Peng, Hongyi Pan, Ulas Bagci, Boqing Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09282
Pdf URL: https://arxiv.org/pdf/2504.09282
Copy Paste: [[2504.09282]] VideoAds for Fast-Paced Video Understanding: Where Opensource Foundation Models Beat GPT-4o & Gemini-1.5 Pro(https://arxiv.org/abs/2504.09282)
Keywords: large language model
Abstract: Advertisement videos serve as a rich and valuable source of purpose-driven information, encompassing high-quality visual, textual, and contextual cues designed to engage viewers. They are often more complex than general videos of similar duration due to their structured narratives and rapid scene transitions, posing significant challenges to multi-modal large language models (MLLMs). In this work, we introduce VideoAds, the first dataset tailored for benchmarking the performance of MLLMs on advertisement videos. VideoAds comprises well-curated advertisement videos with complex temporal structures, accompanied by \textbf{manually} annotated diverse questions across three core tasks: visual finding, video summary, and visual reasoning. We propose a quantitative measure to compare VideoAds against existing benchmarks in terms of video complexity. Through extensive experiments, we find that Qwen2.5-VL-72B, an opensource MLLM, achieves 73.35\% accuracy on VideoAds, outperforming GPT-4o (66.82\%) and Gemini-1.5 Pro (69.66\%); the two proprietary models especially fall behind the opensource model in video summarization and reasoning, but perform the best in visual finding. Notably, human experts easily achieve a remarkable accuracy of 94.27\%. These results underscore the necessity of advancing MLLMs' temporal modeling capabilities and highlight VideoAds as a potentially pivotal benchmark for future research in understanding video that requires high FPS sampling. The dataset and evaluation code will be publicly available at this https URL.

Title: A Lightweight Moment Retrieval System with Global Re-Ranking and Robust Adaptive Bidirectional Temporal Search

Authors: Tinh-Anh Nguyen-Nhu, Huu-Loc Tran, Nguyen-Khang Le, Minh-Nhat Nguyen, Tien-Huy Nguyen, Hoang-Long Nguyen-Huu, Huu-Phong Phan-Nguyen, Huy-Thach Pham, Quan Nguyen, Hoang M. Le, Quang-Vinh Dinh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09298
Pdf URL: https://arxiv.org/pdf/2504.09298
Copy Paste: [[2504.09298]] A Lightweight Moment Retrieval System with Global Re-Ranking and Robust Adaptive Bidirectional Temporal Search(https://arxiv.org/abs/2504.09298)
Keywords: robust, extraction
Abstract: The exponential growth of digital video content has posed critical challenges in moment-level video retrieval, where existing methodologies struggle to efficiently localize specific segments within an expansive video corpus. Current retrieval systems are constrained by computational inefficiencies, temporal context limitations, and the intrinsic complexity of navigating video content. In this paper, we address these limitations through a novel Interactive Video Corpus Moment Retrieval framework that integrates a SuperGlobal Reranking mechanism and Adaptive Bidirectional Temporal Search (ABTS), strategically optimizing query similarity, temporal stability, and computational resources. By preprocessing a large corpus of videos using a keyframe extraction model and deduplication technique through image hashing, our approach provides a scalable solution that significantly reduces storage requirements while maintaining high localization precision across diverse video repositories.

Title: Enhancing Contrastive Demonstration Selection with Semantic Diversity for Robust In-Context Machine Translation

Authors: Owen Patterson, Chee Ng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09305
Pdf URL: https://arxiv.org/pdf/2504.09305
Copy Paste: [[2504.09305]] Enhancing Contrastive Demonstration Selection with Semantic Diversity for Robust In-Context Machine Translation(https://arxiv.org/abs/2504.09305)
Keywords: robust, large language model
Abstract: In-Context Learning (ICL) empowers large language models to perform tasks by conditioning on a few input-output examples. However, the performance of ICL is highly sensitive to the selection of these demonstrations. While existing methods focus on similarity or contrastive selection, they often overlook the importance of diversity among the chosen examples. In this paper, we propose DiverseConE (Diversity-Enhanced Contrastive Example Selection), a novel approach for demonstration selection in in-context learning for machine translation. Our method builds upon contrastive selection by incorporating a diversity enhancement step based on embedding space dissimilarity. We conduct extensive experiments on the Llama2-7b model across four language pairs (English-Chinese, Chinese-English, Russian-German, German-Russian) in 1-shot and 3-shot settings, using COMET20 and COMET22 for evaluation. Our results demonstrate that DiverseConE consistently outperforms strong baseline methods, including random selection, BM25, TopK, and a state-of-the-art contrastive selection method. Further analysis, including diversity metrics and human evaluation, validates the effectiveness of our approach and highlights the benefits of considering demonstration diversity for improved translation quality.

Title: Improving the Accuracy and Efficiency of Legal Document Tagging with Large Language Models and Instruction Prompts

Authors: Emily Johnson, Xavier Holt, Noah Wilson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09309
Pdf URL: https://arxiv.org/pdf/2504.09309
Copy Paste: [[2504.09309]] Improving the Accuracy and Efficiency of Legal Document Tagging with Large Language Models and Instruction Prompts(https://arxiv.org/abs/2504.09309)
Keywords: transformer, large language model
Abstract: Legal multi-label classification is a critical task for organizing and accessing the vast amount of legal documentation. Despite its importance, it faces challenges such as the complexity of legal language, intricate label dependencies, and significant label imbalance. In this paper, we propose Legal-LLM, a novel approach that leverages the instruction-following capabilities of Large Language Models (LLMs) through fine-tuning. We reframe the multi-label classification task as a structured generation problem, instructing the LLM to directly output the relevant legal categories for a given document. We evaluate our method on two benchmark datasets, POSTURE50K and EURLEX57K, using micro-F1 and macro-F1 scores. Our experimental results demonstrate that Legal-LLM outperforms a range of strong baseline models, including traditional methods and other Transformer-based approaches. Furthermore, ablation studies and human evaluations validate the effectiveness of our approach, particularly in handling label imbalance and generating relevant and accurate legal labels.

Title: SmartShift: A Secure and Efficient Approach to Smart Contract Migration

Authors: Tahrim Hossain, Faisal Haque Bappy, Tarannum Shaila Zaman, Raiful Hasan, Tariqul Islam
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2504.09315
Pdf URL: https://arxiv.org/pdf/2504.09315
Copy Paste: [[2504.09315]] SmartShift: A Secure and Efficient Approach to Smart Contract Migration(https://arxiv.org/abs/2504.09315)
Keywords: secure, security, robust
Abstract: Blockchain and smart contracts have emerged as revolutionary technologies transforming distributed computing. While platform evolution and smart contracts' inherent immutability necessitate migrations both across and within chains, migrating the vast amounts of critical data in these contracts while maintaining data integrity and minimizing operational disruption presents a significant challenge. To address these challenges, we present SmartShift, a framework that enables secure and efficient smart contract migrations through intelligent state partitioning and progressive function activation, preserving operational continuity during transitions. Our comprehensive evaluation demonstrates that SmartShift significantly reduces migration downtime while ensuring robust security, establishing a foundation for efficient and secure smart contract migration systems.

Title: CrossLink: A Decentralized Framework for Secure Cross-Chain Smart Contract Execution

Authors: Tahrim Hossain, Faisal Haque Bappy, Tarannum Shaila Zaman, Tariqul Islam
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.09319
Pdf URL: https://arxiv.org/pdf/2504.09319
Copy Paste: [[2504.09319]] CrossLink: A Decentralized Framework for Secure Cross-Chain Smart Contract Execution(https://arxiv.org/abs/2504.09319)
Keywords: secure, security, attack, robust
Abstract: This paper introduces CrossLink, a decentralized framework for secure cross-chain smart contract execution that effectively addresses the inherent limitations of contemporary solutions, which primarily focus on asset transfers and rely on potentially vulnerable centralized intermediaries. Recognizing the escalating demand for seamless interoperability among decentralized applications, CrossLink provides a trustless mechanism for smart contracts across disparate blockchain networks to communicate and interact. At its core, CrossLink utilizes a compact chain for selectively storing authorized contract states and employs a secure inter-chain messaging mechanism to ensure atomic execution and data consistency. By implementing a deposit/collateral fee system and efficient state synchronization, CrossLink enhances security and mitigates vulnerabilities, offering a novel approach to seamless, secure, and decentralized cross-chain interoperability. A formal security analysis further validates CrossLink's robustness against unauthorized modifications and denial-of-service attacks.

Title: MedIL: Implicit Latent Spaces for Generating Heterogeneous Medical Images at Arbitrary Resolutions

Authors: Tyler Spears, Shen Zhu, Yinzhu Jin, Aman Shrivastava, P. Thomas Fletcher
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.09322
Pdf URL: https://arxiv.org/pdf/2504.09322
Copy Paste: [[2504.09322]] MedIL: Implicit Latent Spaces for Generating Heterogeneous Medical Images at Arbitrary Resolutions(https://arxiv.org/abs/2504.09322)
Keywords: diffusion, generative
Abstract: In this work, we introduce MedIL, a first-of-its-kind autoencoder built for encoding medical images with heterogeneous sizes and resolutions for image generation. Medical images are often large and heterogeneous, where fine details are of vital clinical importance. Image properties change drastically when considering acquisition equipment, patient demographics, and pathology, making realistic medical image generation challenging. Recent work in latent diffusion models (LDMs) has shown success in generating images resampled to a fixed-size. However, this is a narrow subset of the resolutions native to image acquisition, and resampling discards fine anatomical details. MedIL utilizes implicit neural representations to treat images as continuous signals, where encoding and decoding can be performed at arbitrary resolutions without prior resampling. We quantitatively and qualitatively show how MedIL compresses and preserves clinically-relevant features over large multi-site, multi-resolution datasets of both T1w brain MRIs and lung CTs. We further demonstrate how MedIL can influence the quality of images generated with a diffusion model, and discuss how MedIL can enhance generative models to resemble raw clinical acquisitions.

Title: Text To 3D Object Generation For Scalable Room Assembly

Authors: Sonia Laguna, Alberto Garcia-Garcia, Marie-Julie Rakotosaona, Stylianos Moschoglou, Leonhard Helminger, Sergio Orts-Escolano
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.09328
Pdf URL: https://arxiv.org/pdf/2504.09328
Copy Paste: [[2504.09328]] Text To 3D Object Generation For Scalable Room Assembly(https://arxiv.org/abs/2504.09328)
Keywords: robust, diffusion
Abstract: Modern machine learning models for scene understanding, such as depth estimation and object tracking, rely on large, high-quality datasets that mimic real-world deployment scenarios. To address data scarcity, we propose an end-to-end system for synthetic data generation for scalable, high-quality, and customizable 3D indoor scenes. By integrating and adapting text-to-image and multi-view diffusion models with Neural Radiance Field-based meshing, this system generates highfidelity 3D object assets from text prompts and incorporates them into pre-defined floor plans using a rendering tool. By introducing novel loss functions and training strategies into existing methods, the system supports on-demand scene generation, aiming to alleviate the scarcity of current available data, generally manually crafted by artists. This system advances the role of synthetic data in addressing machine learning training limitations, enabling more robust and generalizable models for real-world applications.

Title: Efficient Implementation of Reinforcement Learning over Homomorphic Encryption

Authors: Jihoon Suh, Takashi Tanaka
Subjects: cs.LG, cs.CR, eess.SY
Abstract URL: https://arxiv.org/abs/2504.09335
Pdf URL: https://arxiv.org/pdf/2504.09335
Copy Paste: [[2504.09335]] Efficient Implementation of Reinforcement Learning over Homomorphic Encryption(https://arxiv.org/abs/2504.09335)
Keywords: secure, privacy
Abstract: We investigate encrypted control policy synthesis over the cloud. While encrypted control implementations have been studied previously, we focus on the less explored paradigm of privacy-preserving control synthesis, which can involve heavier computations ideal for cloud outsourcing. We classify control policy synthesis into model-based, simulator-driven, and data-driven approaches and examine their implementation over fully homomorphic encryption (FHE) for privacy enhancements. A key challenge arises from comparison operations (min or max) in standard reinforcement learning algorithms, which are difficult to execute over encrypted data. This observation motivates our focus on Relative-Entropy-regularized reinforcement learning (RL) problems, which simplifies encrypted evaluation of synthesis algorithms due to their comparison-free structures. We demonstrate how linearly solvable value iteration, path integral control, and Z-learning can be readily implemented over FHE. We conduct a case study of our approach through numerical simulations of encrypted Z-learning in a grid world environment using the CKKS encryption scheme, showing convergence with acceptable approximation error. Our work suggests the potential for secure and efficient cloud-based reinforcement learning.

Title: Towards Optimal Differentially Private Regret Bounds in Linear MDPs

Authors: Sharan Sahu
Subjects: cs.LG, cs.DS, stat.ML
Abstract URL: https://arxiv.org/abs/2504.09339
Pdf URL: https://arxiv.org/pdf/2504.09339
Copy Paste: [[2504.09339]] Towards Optimal Differentially Private Regret Bounds in Linear MDPs(https://arxiv.org/abs/2504.09339)
Keywords: privacy
Abstract: We study regret minimization under privacy constraints in episodic inhomogeneous linear Markov Decision Processes (MDPs), motivated by the growing use of reinforcement learning (RL) in personalized decision-making systems that rely on sensitive user data. In this setting, both transition probabilities and reward functions are assumed to be linear in a feature mapping $\phi(s, a)$, and we aim to ensure privacy through joint differential privacy (JDP), a relaxation of differential privacy suited to online learning. Prior work has established suboptimal regret bounds by privatizing the LSVI-UCB algorithm, which achieves $\widetilde{O}(\sqrt{d^3 H^4 K})$ regret in the non-private setting. Building on recent advances that improve this to minimax optimal regret $\widetilde{O}(HD\sqrt{K})$ via LSVI-UCB++ with Bernstein-style bonuses, we design a new differentially private algorithm by privatizing LSVI-UCB++ and adapting techniques for variance-aware analysis from offline RL. Our algorithm achieves a regret bound of $\widetilde{O}(d \sqrt{H^3 K} + H^{4.5} d^{7/6} K^{1/2} / \epsilon)$, improving over previous private methods. Empirical results show that our algorithm retains near-optimal utility compared to non-private baselines, indicating that privacy can be achieved with minimal performance degradation in this setting.

Title: Context-Aware Adaptive Sampling for Intelligent Data Acquisition Systems Using DQN

Authors: Weiqiang Huang, Juecen Zhan, Yumeng Sun, Xu Han, Tai An, Nan Jiang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.09344
Pdf URL: https://arxiv.org/pdf/2504.09344
Copy Paste: [[2504.09344]] Context-Aware Adaptive Sampling for Intelligent Data Acquisition Systems Using DQN(https://arxiv.org/abs/2504.09344)
Keywords: robust
Abstract: Multi-sensor systems are widely used in the Internet of Things, environmental monitoring, and intelligent manufacturing. However, traditional fixed-frequency sampling strategies often lead to severe data redundancy, high energy consumption, and limited adaptability, failing to meet the dynamic sensing needs of complex environments. To address these issues, this paper proposes a DQN-based multi-sensor adaptive sampling optimization method. By leveraging a reinforcement learning framework to learn the optimal sampling strategy, the method balances data quality, energy consumption, and redundancy. We first model the multi-sensor sampling task as a Markov Decision Process (MDP), then employ a Deep Q-Network to optimize the sampling policy. Experiments on the Intel Lab Data dataset confirm that, compared with fixed-frequency sampling, threshold-triggered sampling, and other reinforcement learning approaches, DQN significantly improves data quality while lowering average energy consumption and redundancy rates. Moreover, in heterogeneous multi-sensor environments, DQN-based adaptive sampling shows enhanced robustness, maintaining superior data collection performance even in the presence of interference factors. These findings demonstrate that DQN-based adaptive sampling can enhance overall data acquisition efficiency in multi-sensor systems, providing a new solution for efficient and intelligent sensing.

Title: REMEMBER: Retrieval-based Explainable Multimodal Evidence-guided Modeling for Brain Evaluation and Reasoning in Zero- and Few-shot Neurodegenerative Diagnosis

Authors: Duy-Cat Can, Quang-Huy Tang, Huong Ha, Binh T. Nguyen, Oliver Y. Chén
Subjects: cs.CV, cs.AI, cs.CL, cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2504.09354
Pdf URL: https://arxiv.org/pdf/2504.09354
Copy Paste: [[2504.09354]] REMEMBER: Retrieval-based Explainable Multimodal Evidence-guided Modeling for Brain Evaluation and Reasoning in Zero- and Few-shot Neurodegenerative Diagnosis(https://arxiv.org/abs/2504.09354)
Keywords: robust, generative
Abstract: Timely and accurate diagnosis of neurodegenerative disorders, such as Alzheimer's disease, is central to disease management. Existing deep learning models require large-scale annotated datasets and often function as "black boxes". Additionally, datasets in clinical practice are frequently small or unlabeled, restricting the full potential of deep learning methods. Here, we introduce REMEMBER -- Retrieval-based Explainable Multimodal Evidence-guided Modeling for Brain Evaluation and Reasoning -- a new machine learning framework that facilitates zero- and few-shot Alzheimer's diagnosis using brain MRI scans through a reference-based reasoning process. Specifically, REMEMBER first trains a contrastively aligned vision-text model using expert-annotated reference data and extends pseudo-text modalities that encode abnormality types, diagnosis labels, and composite clinical descriptions. Then, at inference time, REMEMBER retrieves similar, human-validated cases from a curated dataset and integrates their contextual information through a dedicated evidence encoding module and attention-based inference head. Such an evidence-guided design enables REMEMBER to imitate real-world clinical decision-making process by grounding predictions in retrieved imaging and textual context. Specifically, REMEMBER outputs diagnostic predictions alongside an interpretable report, including reference images and explanations aligned with clinical workflows. Experimental results demonstrate that REMEMBER achieves robust zero- and few-shot performance and offers a powerful and explainable framework to neuroimaging-based diagnosis in the real world, especially under limited data.

Title: PapMOT: Exploring Adversarial Patch Attack against Multiple Object Tracking

Authors: Jiahuan Long, Tingsong Jiang, Wen Yao, Shuai Jia, Weijia Zhang, Weien Zhou, Chao Ma, Xiaoqian Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09361
Pdf URL: https://arxiv.org/pdf/2504.09361
Copy Paste: [[2504.09361]] PapMOT: Exploring Adversarial Patch Attack against Multiple Object Tracking(https://arxiv.org/abs/2504.09361)
Keywords: attack, robust
Abstract: Tracking multiple objects in a continuous video stream is crucial for many computer vision tasks. It involves detecting and associating objects with their respective identities across successive frames. Despite significant progress made in multiple object tracking (MOT), recent studies have revealed the vulnerability of existing MOT methods to adversarial attacks. Nevertheless, all of these attacks belong to digital attacks that inject pixel-level noise into input images, and are therefore ineffective in physical scenarios. To fill this gap, we propose PapMOT, which can generate physical adversarial patches against MOT for both digital and physical scenarios. Besides attacking the detection mechanism, PapMOT also optimizes a printable patch that can be detected as new targets to mislead the identity association process. Moreover, we introduce a patch enhancement strategy to further degrade the temporal consistency of tracking results across video frames, resulting in more aggressive attacks. We further develop new evaluation metrics to assess the robustness of MOT against such attacks. Extensive evaluations on multiple datasets demonstrate that our PapMOT can successfully attack various architectures of MOT trackers in digital scenarios. We also validate the effectiveness of PapMOT for physical attacks by deploying printed adversarial patches in the real world.

Title: Machine Learning-Based Cyberattack Detection and Identification for Automatic Generation Control Systems Considering Nonlinearities

Authors: Nour M. Shabar, Ahmad Mohammad Saber, Deepa Kundur
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2504.09363
Pdf URL: https://arxiv.org/pdf/2504.09363
Copy Paste: [[2504.09363]] Machine Learning-Based Cyberattack Detection and Identification for Automatic Generation Control Systems Considering Nonlinearities(https://arxiv.org/abs/2504.09363)
Keywords: attack
Abstract: Automatic generation control (AGC) systems play a crucial role in maintaining system frequency across power grids. However, AGC systems' reliance on communicated measurements exposes them to false data injection attacks (FDIAs), which can compromise the overall system stability. This paper proposes a machine learning (ML)-based detection framework that identifies FDIAs and determines the compromised measurements. The approach utilizes an ML model trained offline to accurately detect attacks and classify the manipulated signals based on a comprehensive set of statistical and time-series features extracted from AGC measurements before and after disturbances. For the proposed approach, we compare the performance of several powerful ML algorithms. Our results demonstrate the efficacy of the proposed method in detecting FDIAs while maintaining a low false alarm rate, with an F1-score of up to 99.98%, outperforming existing approaches.

Title: QUDsim: Quantifying Discourse Similarities in LLM-Generated Text

Authors: Ramya Namuduri, Yating Wu, Anshun Asher Zheng, Manya Wadhwa, Greg Durrett, Junyi Jessy Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09373
Pdf URL: https://arxiv.org/pdf/2504.09373
Copy Paste: [[2504.09373]] QUDsim: Quantifying Discourse Similarities in LLM-Generated Text(https://arxiv.org/abs/2504.09373)
Keywords: large language model
Abstract: As large language models become increasingly capable at various writing tasks, their weakness at generating unique and creative content becomes a major liability. Although LLMs have the ability to generate text covering diverse topics, there is an overall sense of repetitiveness across texts that we aim to formalize and quantify via a similarity metric. The familiarity between documents arises from the persistence of underlying discourse structures. However, existing similarity metrics dependent on lexical overlap and syntactic patterns largely capture $\textit{content}$ overlap, thus making them unsuitable for detecting $\textit{structural}$ similarities. We introduce an abstraction based on linguistic theories in Questions Under Discussion (QUD) and question semantics to help quantify differences in discourse progression. We then use this framework to build $\textbf{QUDsim}$, a similarity metric that can detect discursive parallels between documents. Using QUDsim, we find that LLMs often reuse discourse structures (more so than humans) across samples, even when content differs. Furthermore, LLMs are not only repetitive and structurally uniform, but are also divergent from human authors in the types of structures they use.

Title: Beyond Degradation Conditions: All-in-One Image Restoration via HOG Transformers

Authors: Jiawei Wu, Zhifei Yang, Zhe Wang, Zhi Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09377
Pdf URL: https://arxiv.org/pdf/2504.09377
Copy Paste: [[2504.09377]] Beyond Degradation Conditions: All-in-One Image Restoration via HOG Transformers(https://arxiv.org/abs/2504.09377)
Keywords: transformer
Abstract: All-in-one image restoration, which aims to address diverse degradations within a unified framework, is critical for practical applications. However, existing methods rely on predicting and integrating degradation conditions, which can misactivate degradation-specific features in complex scenarios, limiting their restoration performance. To address this issue, we propose a novel all-in-one image restoration framework guided by Histograms of Oriented Gradients (HOG), named HOGformer. By leveraging the degradation-discriminative capability of HOG descriptors, HOGformer employs a dynamic self-attention mechanism that adaptively attends to long-range spatial dependencies based on degradation-aware HOG cues. To enhance the degradation sensitivity of attention inputs, we design a HOG-guided local dynamic-range convolution module that captures long-range degradation similarities while maintaining awareness of global structural information. Furthermore, we propose a dynamic interaction feed-forward module, efficiently increasing the model capacity to adapt to different degradations through channel-spatial interactions. Extensive experiments across diverse benchmarks, including adverse weather and natural degradations, demonstrate that HOGformer achieves state-of-the-art performance and generalizes effectively to complex real-world degradations. Code is available at this https URL.

Title: Can you map it to English? The Role of Cross-Lingual Alignment in Multilingual Performance of LLMs

Authors: Kartik Ravisankar, Hyojung Han, Marine Carpuat
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09378
Pdf URL: https://arxiv.org/pdf/2504.09378
Copy Paste: [[2504.09378]] Can you map it to English? The Role of Cross-Lingual Alignment in Multilingual Performance of LLMs(https://arxiv.org/abs/2504.09378)
Keywords: large language model
Abstract: Large language models (LLMs) pre-trained predominantly on English text exhibit surprising multilingual capabilities, yet the mechanisms driving cross-lingual generalization remain poorly understood. This work investigates how the alignment of representations for text written in different languages correlates with LLM performance on natural language understanding tasks and translation tasks, both at the language and the instance level. For this purpose, we introduce cross-lingual alignment metrics such as the Discriminative Alignment Index (DALI) to quantify the alignment at an instance level for discriminative tasks. Through experiments on three natural language understanding tasks (Belebele, XStoryCloze, XCOPA), and machine translation, we find that while cross-lingual alignment metrics strongly correlate with task accuracy at the language level, the sample-level alignment often fails to distinguish correct from incorrect predictions, exposing alignment as a necessary but insufficient condition for success.

Title: Contour Flow Constraint: Preserving Global Shape Similarity for Deep Learning based Image Segmentation

Authors: Shengzhe Chen, Zhaoxuan Dong, Jun Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09384
Pdf URL: https://arxiv.org/pdf/2504.09384
Copy Paste: [[2504.09384]] Contour Flow Constraint: Preserving Global Shape Similarity for Deep Learning based Image Segmentation(https://arxiv.org/abs/2504.09384)
Keywords: robust, segmentation
Abstract: For effective image segmentation, it is crucial to employ constraints informed by prior knowledge about the characteristics of the areas to be segmented to yield favorable segmentation outcomes. However, the existing methods have primarily focused on priors of specific properties or shapes, lacking consideration of the general global shape similarity from a Contour Flow (CF) perspective. Furthermore, naturally integrating this contour flow prior image segmentation model into the activation functions of deep convolutional networks through mathematical methods is currently unexplored. In this paper, we establish a concept of global shape similarity based on the premise that two shapes exhibit comparable contours. Furthermore, we mathematically derive a contour flow constraint that ensures the preservation of global shape similarity. We propose two implementations to integrate the constraint with deep neural networks. Firstly, the constraint is converted to a shape loss, which can be seamlessly incorporated into the training phase for any learning-based segmentation framework. Secondly, we add the constraint into a variational segmentation model and derive its iterative schemes for solution. The scheme is then unrolled to get the architecture of the proposed CFSSnet. Validation experiments on diverse datasets are conducted on classic benchmark deep network segmentation models. The results indicate a great improvement in segmentation accuracy and shape similarity for the proposed shape loss, showcasing the general adaptability of the proposed loss term regardless of specific network architectures. CFSSnet shows robustness in segmenting noise-contaminated images, and inherent capability to preserve global shape similarity.

Title: Beyond Memorization: Mapping the Originality-Quality Frontier of Language Models

Authors: Vishakh Padmakumar, Chen Yueh-Han, Jane Pan, Valerie Chen, He He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09389
Pdf URL: https://arxiv.org/pdf/2504.09389
Copy Paste: [[2504.09389]] Beyond Memorization: Mapping the Originality-Quality Frontier of Language Models(https://arxiv.org/abs/2504.09389)
Keywords: large language model
Abstract: As large language models (LLMs) are increasingly used for ideation and scientific discovery, it is important to evaluate their ability to generate novel output. Prior work evaluates novelty as the originality with respect to training data, but original outputs can be low quality. In contrast, non-expert judges may favor high-quality but memorized outputs, limiting the reliability of human preference as a metric. We propose a new novelty metric for LLM generations that balances originality and quality -- the harmonic mean of the fraction of \ngrams unseen during training and a task-specific quality score. We evaluate the novelty of generations from two families of open-data models (OLMo and Pythia) on three creative tasks: story completion, poetry writing, and creative tool use. We find that LLM generated text is less novel than human written text. To elicit more novel outputs, we experiment with various inference-time methods, which reveals a trade-off between originality and quality. While these methods can boost novelty, they do so by increasing originality at the expense of quality. In contrast, increasing model size or applying post-training reliably shifts the Pareto frontier, highlighting that starting with a stronger base model is a more effective way to improve novelty.

Title: Vision Transformers Exhibit Human-Like Biases: Evidence of Orientation and Color Selectivity, Categorical Perception, and Phase Transitions

Authors: Nooshin Bahador
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.09393
Pdf URL: https://arxiv.org/pdf/2504.09393
Copy Paste: [[2504.09393]] Vision Transformers Exhibit Human-Like Biases: Evidence of Orientation and Color Selectivity, Categorical Perception, and Phase Transitions(https://arxiv.org/abs/2504.09393)
Keywords: transformer
Abstract: This study explored whether Vision Transformers (ViTs) developed orientation and color biases similar to those observed in the human brain. Using synthetic datasets with controlled variations in noise levels, angles, lengths, widths, and colors, we analyzed the behavior of ViTs fine-tuned with LoRA. Our findings revealed four key insights: First, ViTs exhibited an oblique effect showing the lowest angle prediction errors at 180 deg (horizontal) across all conditions. Second, angle prediction errors varied by color. Errors were highest for bluish hues and lowest for yellowish ones. Additionally, clustering analysis of angle prediction errors showed that ViTs grouped colors in a way that aligned with human perceptual categories. In addition to orientation and color biases, we observed phase transition phenomena. While two phase transitions occurred consistently across all conditions, the training loss curves exhibited delayed transitions when color was incorporated as an additional data attribute. Finally, we observed that attention heads in certain layers inherently develop specialized capabilities, functioning as task-agnostic feature extractors regardless of the downstream task. These observations suggest that biases and properties arise primarily from pre-training on the original dataset which shapes the model's foundational representations and the inherent architectural constraints of the vision transformer, rather than being solely determined by downstream data statistics.

Title: Adaptive Insurance Reserving with CVaR-Constrained Reinforcement Learning under Macroeconomic Regimes

Authors: Stella C. Dong, James R. Finlay
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2504.09396
Pdf URL: https://arxiv.org/pdf/2504.09396
Copy Paste: [[2504.09396]] Adaptive Insurance Reserving with CVaR-Constrained Reinforcement Learning under Macroeconomic Regimes(https://arxiv.org/abs/2504.09396)
Keywords: robust
Abstract: This paper proposes a reinforcement learning (RL) framework for insurance reserving that integrates tail-risk sensitivity, macroeconomic regime modeling, and regulatory compliance. The reserving problem is formulated as a finite-horizon Markov Decision Process (MDP), in which reserve adjustments are optimized using Proximal Policy Optimization (PPO) subject to Conditional Value-at-Risk (CVaR) constraints. To enhance policy robustness across varying economic conditions, the agent is trained using a regime-aware curriculum that progressively increases volatility exposure. The reward structure penalizes reserve shortfall, capital inefficiency, and solvency floor violations, with design elements informed by Solvency II and Own Risk and Solvency Assessment (ORSA) frameworks. Empirical evaluations on two industry datasets--Workers' Compensation, and Other Liability--demonstrate that the RL-CVaR agent achieves superior performance relative to classical reserving methods across multiple criteria, including tail-risk control (CVaR$_{0.95}$), capital efficiency, and regulatory violation rate. The framework also accommodates fixed-shock stress testing and regime-stratified analysis, providing a principled and extensible approach to reserving under uncertainty.

Title: Question Tokens Deserve More Attention: Enhancing Large Language Models without Training through Step-by-Step Reading and Question Attention Recalibration

Authors: Feijiang Han, Licheng Guo, Hengtao Cui, Zhiyuan Lyu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09402
Pdf URL: https://arxiv.org/pdf/2504.09402
Copy Paste: [[2504.09402]] Question Tokens Deserve More Attention: Enhancing Large Language Models without Training through Step-by-Step Reading and Question Attention Recalibration(https://arxiv.org/abs/2504.09402)
Keywords: large language model
Abstract: Large Language Models (LLMs) often struggle with tasks that require a deep understanding of complex questions, especially when faced with long-range dependencies or multi-step reasoning. This work investigates the limitations of current LLMs in question comprehension and identifies three insights: (1) repeating question tokens improves comprehension by increasing attention to question regions, (2) increased backward dependencies negatively affect performance due to unidirectional attentional constraints, and (3) recalibrating attentional mechanisms to prioritize question-relevant regions improves performance. Based on these findings, we first propose a family of prompt-based strategies - Step-by-Step Reading (SSR), SSR+, and SSR++ - that guide LLMs to incrementally process question tokens and align their reasoning with the input structure. These methods significantly improve performance, with SSR++ achieving state-of-the-art results on several benchmarks: 96.66% on GSM8K, 94.61% on ASDiv, and 76.28% on AQuA. Second, we introduce a training-free attention recalibration mechanism that dynamically adjusts attention distributions during inference to emphasize question-relevant regions. This method improves the accuracy of LLaMA 3.1-8B on AQuA by 5.17% without changing model parameters or input prompts. Taken together, our results highlight the importance of structured prompt design and attention optimization in improving LLM comprehension, providing lightweight yet effective tools for improving performance in various NLP tasks.

Title: UXAgent: A System for Simulating Usability Testing of Web Design with LLM Agents

Authors: Yuxuan Lu, Bingsheng Yao, Hansu Gu, Jing Huang, Jessie Wang, Yang Li, Jiri Gesi, Qi He, Toby Jia-Jun Li, Dakuo Wang
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2504.09407
Pdf URL: https://arxiv.org/pdf/2504.09407
Copy Paste: [[2504.09407]] UXAgent: A System for Simulating Usability Testing of Web Design with LLM Agents(https://arxiv.org/abs/2504.09407)
Keywords: large language model
Abstract: Usability testing is a fundamental research method that user experience (UX) researchers use to evaluate and iterate a web design, but\textbf{ how to evaluate and iterate the usability testing study design } itself? Recent advances in Large Language Model-simulated Agent (\textbf{LLM Agent}) research inspired us to design \textbf{UXAgent} to support UX researchers in evaluating and reiterating their usability testing study design before they conduct the real human-subject study. Our system features a Persona Generator module, an LLM Agent module, and a Universal Browser Connector module to automatically generate thousands of simulated users to interactively test the target website. The system also provides an Agent Interview Interface and a Video Replay Interface so that the UX researchers can easily review and analyze the generated qualitative and quantitative log data. Through a heuristic evaluation, five UX researcher participants praised the innovation of our system but also expressed concerns about the future of LLM Agent usage in UX studies.

Title: Nash Equilibrium Between Consumer Electronic Devices and DoS Attacker for Distributed IoT-enabled RSE Systems

Authors: Gengcan Chen, Donghong Cai, Zahid Khan, Jawad Ahmad, Wadii Boulila
Subjects: cs.LG, cs.GT
Abstract URL: https://arxiv.org/abs/2504.09415
Pdf URL: https://arxiv.org/pdf/2504.09415
Copy Paste: [[2504.09415]] Nash Equilibrium Between Consumer Electronic Devices and DoS Attacker for Distributed IoT-enabled RSE Systems(https://arxiv.org/abs/2504.09415)
Keywords: attack, robust
Abstract: In electronic consumer Internet of Things (IoT), consumer electronic devices as edge devices require less computational overhead and the remote state estimation (RSE) of consumer electronic devices is always at risk of denial-of-service (DoS) attacks. Therefore, the adversarial strategy between consumer electronic devices and DoS attackers is critical. This paper focuses on the adversarial strategy between consumer electronic devices and DoS attackers in IoT-enabled RSE Systems. We first propose a remote joint estimation model for distributed measurements to effectively reduce consumer electronic device workload and minimize data leakage risks. The Kalman filter is deployed on the remote estimator, and the DoS attacks with open-loop as well as closed-loop are considered. We further introduce advanced reinforcement learning techniques, including centralized and distributed Minimax-DQN, to address high-dimensional decision-making challenges in both open-loop and closed-loop scenarios. Especially, the Q-network instead of the Q-table is used in the proposed approaches, which effectively solves the challenge of Q-learning. Moreover, the proposed distributed Minimax-DQN reduces the action space to expedite the search for Nash Equilibrium (NE). The experimental results validate that the proposed model can expeditiously restore the RSE error covariance to a stable state in the presence of DoS attacks, exhibiting notable attack robustness. The proposed centralized and distributed Minimax-DQN effectively resolves the NE in both open and closed-loop case, showcasing remarkable performance in terms of convergence. It reveals that substantial advantages in both efficiency and stability are achieved compared with the state-of-the-art methods.

Title: SaRO: Enhancing LLM Safety through Reasoning-based Alignment

Authors: Yutao Mou, Yuxiao Luo, Shikun Zhang, Wei Ye
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09420
Pdf URL: https://arxiv.org/pdf/2504.09420
Copy Paste: [[2504.09420]] SaRO: Enhancing LLM Safety through Reasoning-based Alignment(https://arxiv.org/abs/2504.09420)
Keywords: attack, large language model
Abstract: Current safety alignment techniques for large language models (LLMs) face two key challenges: (1) under-generalization, which leaves models vulnerable to novel jailbreak attacks, and (2) over-alignment, which leads to the excessive refusal of benign instructions. Our preliminary investigation reveals semantic overlap between jailbreak/harmful queries and normal prompts in embedding space, suggesting that more effective safety alignment requires a deeper semantic understanding. This motivates us to incorporate safety-policy-driven reasoning into the alignment process. To this end, we propose the Safety-oriented Reasoning Optimization Framework (SaRO), which consists of two stages: (1) Reasoning-style Warmup (RW) that enables LLMs to internalize long-chain reasoning through supervised fine-tuning, and (2) Safety-oriented Reasoning Process Optimization (SRPO) that promotes safety reflection via direct preference optimization (DPO). Extensive experiments demonstrate the superiority of SaRO over traditional alignment methods.

Title: ClinicalGPT-R1: Pushing reasoning capability of generalist disease diagnosis with large language model

Authors: Wuyang Lan, Wenzheng Wang, Changwei Ji, Guoxing Yang, Yongbo Zhang, Xiaohong Liu, Song Wu, Guangyu Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09421
Pdf URL: https://arxiv.org/pdf/2504.09421
Copy Paste: [[2504.09421]] ClinicalGPT-R1: Pushing reasoning capability of generalist disease diagnosis with large language model(https://arxiv.org/abs/2504.09421)
Keywords: large language model
Abstract: Recent advances in reasoning with large language models (LLMs)has shown remarkable reasoning capabilities in domains such as mathematics and coding, yet their application to clinical diagnosis remains underexplored. Here, we introduce ClinicalGPT-R1, a reasoning enhanced generalist large language model for disease diagnosis. Trained on a dataset of 20,000 real-world clinical records, ClinicalGPT-R1 leverages diverse training strategies to enhance diagnostic reasoning. To benchmark performance, we curated MedBench-Hard, a challenging dataset spanning seven major medical specialties and representative diseases. Experimental results demonstrate that ClinicalGPT-R1 outperforms GPT-4o in Chinese diagnostic tasks and achieves comparable performance to GPT-4 in English settings. This comparative study effectively validates the superior performance of ClinicalGPT-R1 in disease diagnosis tasks. Resources are available at this https URL.

Title: BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning

Authors: Shengao Wang, Arjun Chandra, Aoming Liu, Venkatesh Saligrama, Boqing Gong
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2504.09426
Pdf URL: https://arxiv.org/pdf/2504.09426
Copy Paste: [[2504.09426]] BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning(https://arxiv.org/abs/2504.09426)
Keywords: robust
Abstract: Human infants rapidly develop visual reasoning skills from minimal input, suggesting that developmentally inspired pretraining could significantly enhance the efficiency of vision-language models (VLMs). Although recent efforts have leveraged infant-inspired datasets like SAYCam, existing evaluation benchmarks remain misaligned--they are either too simplistic, narrowly scoped, or tailored for large-scale pretrained models. Additionally, training exclusively on infant data overlooks the broader, diverse input from which infants naturally learn. To address these limitations, we propose BabyVLM, a novel framework comprising comprehensive in-domain evaluation benchmarks and a synthetic training dataset created via child-directed transformations of existing datasets. We demonstrate that VLMs trained with our synthetic dataset achieve superior performance on BabyVLM tasks compared to models trained solely on SAYCam or general-purpose data of the SAYCam size. BabyVLM thus provides a robust, developmentally aligned evaluation tool and illustrates how compact models trained on carefully curated data can generalize effectively, opening pathways toward data-efficient vision-language learning paradigms.

Title: Ensemble-Enhanced Graph Autoencoder with GAT and Transformer-Based Encoders for Robust Fault Diagnosis

Authors: Moirangthem Tiken Singh
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2504.09427
Pdf URL: https://arxiv.org/pdf/2504.09427
Copy Paste: [[2504.09427]] Ensemble-Enhanced Graph Autoencoder with GAT and Transformer-Based Encoders for Robust Fault Diagnosis(https://arxiv.org/abs/2504.09427)
Keywords: robust, transformer, segmentation
Abstract: Fault classification in industrial machinery is vital for enhancing reliability and reducing downtime, yet it remains challenging due to the variability of vibration patterns across diverse operating conditions. This study introduces a novel graph-based framework for fault classification, converting time-series vibration data from machinery operating at varying horsepower levels into a graph representation. We utilize Shannon's entropy to determine the optimal window size for data segmentation, ensuring each segment captures significant temporal patterns, and employ Dynamic Time Warping (DTW) to define graph edges based on segment similarity. A Graph Auto Encoder (GAE) with a deep graph transformer encoder, decoder, and ensemble classifier is developed to learn latent graph representations and classify faults across various categories. The GAE's performance is evaluated on the Case Western Reserve University (CWRU) dataset, with cross-dataset generalization assessed on the HUST dataset. Results show that GAE achieves a mean F1-score of 0.99 on the CWRU dataset, significantly outperforming baseline models-CNN, LSTM, RNN, GRU, and Bi-LSTM (F1-scores: 0.94-0.97, p < 0.05, Wilcoxon signed-rank test for Bi-LSTM: p < 0.05) -- particularly in challenging classes (e.g., Class 8: 0.99 vs. 0.71 for Bi-LSTM). Visualization of dataset characteristics reveals that datasets with amplified vibration patterns and diverse fault dynamics enhance generalization. This framework provides a robust solution for fault diagnosis under varying conditions, offering insights into dataset impacts on model performance.

Title: Constants of motion network revisited

Authors: Wenqi Fang, Chao Chen, Yongkui Yang, Zheng Wang
Subjects: cs.LG, physics.class-ph, stat.ML
Abstract URL: https://arxiv.org/abs/2504.09434
Pdf URL: https://arxiv.org/pdf/2504.09434
Copy Paste: [[2504.09434]] Constants of motion network revisited(https://arxiv.org/abs/2504.09434)
Keywords: robust
Abstract: Discovering constants of motion is meaningful in helping understand the dynamical systems, but inevitably needs proficient mathematical skills and keen analytical capabilities. With the prevalence of deep learning, methods employing neural networks, such as Constant Of Motion nETwork (COMET), are promising in handling this scientific problem. Although the COMET method can produce better predictions on dynamics by exploiting the discovered constants of motion, there is still plenty of room to sharpen it. In this paper, we propose a novel neural network architecture, built using the singular-value-decomposition (SVD) technique, and a two-phase training algorithm to improve the performance of COMET. Extensive experiments show that our approach not only retains the advantages of COMET, such as applying to non-Hamiltonian systems and indicating the number of constants of motion, but also can be more lightweight and noise-robust than COMET.

Title: PLS-Assisted Offloading for Edge Computing-Enabled Post-Quantum Security in Resource-Constrained Devices

Authors: Hamid Amiriara, Mahtab Mirmohseni, Rahim Tafazolli
Subjects: cs.CR, eess.SP
Abstract URL: https://arxiv.org/abs/2504.09437
Pdf URL: https://arxiv.org/pdf/2504.09437
Copy Paste: [[2504.09437]] PLS-Assisted Offloading for Edge Computing-Enabled Post-Quantum Security in Resource-Constrained Devices(https://arxiv.org/abs/2504.09437)
Keywords: secure, security
Abstract: With the advent of post-quantum cryptography (PQC) standards, it has become imperative for resource-constrained devices (RCDs) in the Internet of Things (IoT) to adopt these quantum-resistant protocols. However, the high computational overhead and the large key sizes associated with PQC make direct deployment on such devices impractical. To address this challenge, we propose an edge computing-enabled PQC framework that leverages a physical-layer security (PLS)-assisted offloading strategy, allowing devices to either offload intensive cryptographic tasks to a post-quantum edge server (PQES) or perform them locally. Furthermore, to ensure data confidentiality within the edge domain, our framework integrates two PLS techniques: offloading RCDs employ wiretap coding to secure data transmission, while non-offloading RCDs serve as friendly jammers by broadcasting artificial noise to disrupt potential eavesdroppers. Accordingly, we co-design the computation offloading and PLS strategy by jointly optimizing the device transmit power, PQES computation resource allocation, and offloading decisions to minimize overall latency under resource constraints. Numerical results demonstrate significant latency reductions compared to baseline schemes, confirming the scalability and efficiency of our approach for secure PQC operations in IoT networks.

Title: Structure-Accurate Medical Image Translation based on Dynamic Frequency Balance and Knowledge Guidance

Authors: Jiahua Xu, Dawei Zhou, Lei Hu, Zaiyi Liu, Nannan Wang, Xinbo Gao
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2504.09441
Pdf URL: https://arxiv.org/pdf/2504.09441
Copy Paste: [[2504.09441]] Structure-Accurate Medical Image Translation based on Dynamic Frequency Balance and Knowledge Guidance(https://arxiv.org/abs/2504.09441)
Keywords: diffusion
Abstract: Multimodal medical images play a crucial role in the precise and comprehensive clinical diagnosis. Diffusion model is a powerful strategy to synthesize the required medical images. However, existing approaches still suffer from the problem of anatomical structure distortion due to the overfitting of high-frequency information and the weakening of low-frequency information. Thus, we propose a novel method based on dynamic frequency balance and knowledge guidance. Specifically, we first extract the low-frequency and high-frequency components by decomposing the critical features of the model using wavelet transform. Then, a dynamic frequency balance module is designed to adaptively adjust frequency for enhancing global low-frequency features and effective high-frequency details as well as suppressing high-frequency noise. To further overcome the challenges posed by the large differences between different medical modalities, we construct a knowledge-guided mechanism that fuses the prior clinical knowledge from a visual language model with visual features, to facilitate the generation of accurate anatomical structures. Experimental evaluations on multiple datasets show the proposed method achieves significant improvements in qualitative and quantitative assessments, verifying its effectiveness and superiority.

Title: aweSOM: a CPU/GPU-accelerated Self-organizing Map and Statistically Combined Ensemble Framework for Machine-learning Clustering Analysis

Authors: Trung Ha, Joonas Nättilä, Jordy Davelaar
Subjects: cs.LG, astro-ph.IM, stat.ML
Abstract URL: https://arxiv.org/abs/2504.09449
Pdf URL: https://arxiv.org/pdf/2504.09449
Copy Paste: [[2504.09449]] aweSOM: a CPU/GPU-accelerated Self-organizing Map and Statistically Combined Ensemble Framework for Machine-learning Clustering Analysis(https://arxiv.org/abs/2504.09449)
Keywords: robust
Abstract: We introduce aweSOM, an open-source Python package for machine learning (ML) clustering and classification, using a Self-organizing Maps (SOM) algorithm that incorporates CPU/GPU acceleration to accommodate large ($N > 10^6$, where $N$ is the number of data points), multidimensional datasets. aweSOM consists of two main modules, one that handles the initialization and training of the SOM, and another that stacks the results of multiple SOM realizations to obtain more statistically robust clusters. Existing Python-based SOM implementations (e.g., POPSOM, Yuan (2018); MiniSom, Vettigli (2018); sklearn-som) primarily serve as proof-of-concept demonstrations, optimized for smaller datasets, but lacking scalability for large, multidimensional data. aweSOM provides a solution for this gap in capability, with good performance scaling up to $\sim 10^8$ individual points, and capable of utilizing multiple features per point. We compare the code performance against the legacy implementations it is based on, and find a 10-100x speed up, as well as significantly improved memory efficiency, due to several built-in optimizations.

Title: FractalForensics: Proactive Deepfake Detection and Localization via Fractal Watermarks

Authors: Tianyi Wang, Harry Cheng, Ming-Hui Liu, Mohan Kankanhalli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09451
Pdf URL: https://arxiv.org/pdf/2504.09451
Copy Paste: [[2504.09451]] FractalForensics: Proactive Deepfake Detection and Localization via Fractal Watermarks(https://arxiv.org/abs/2504.09451)
Keywords: robust, explainability, watermark
Abstract: Proactive Deepfake detection via robust watermarks has been raised ever since passive Deepfake detectors encountered challenges in identifying high-quality synthetic images. However, while demonstrating reasonable detection performance, they lack localization functionality and explainability in detection results. Additionally, the unstable robustness of watermarks can significantly affect the detection performance accordingly. In this study, we propose novel fractal watermarks for proactive Deepfake detection and localization, namely FractalForensics. Benefiting from the characteristics of fractals, we devise a parameter-driven watermark generation pipeline that derives fractal-based watermarks and conducts one-way encryption regarding the parameters selected. Subsequently, we propose a semi-fragile watermarking framework for watermark embedding and recovery, trained to be robust against benign image processing operations and fragile when facing Deepfake manipulations in a black-box setting. Meanwhile, we introduce an entry-to-patch strategy that implicitly embeds the watermark matrix entries into image patches at corresponding positions, achieving localization of Deepfake manipulations. Extensive experiments demonstrate satisfactory robustness and fragility of our approach against common image processing operations and Deepfake manipulations, outperforming state-of-the-art semi-fragile watermarking algorithms and passive detectors for Deepfake detection. Furthermore, by highlighting the areas manipulated, our method provides explainability for the proactive Deepfake detection results.

Title: D$^2$iT: Dynamic Diffusion Transformer for Accurate Image Generation

Authors: Weinan Jia, Mengqi Huang, Nan Chen, Lei Zhang, Zhendong Mao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09454
Pdf URL: https://arxiv.org/pdf/2504.09454
Copy Paste: [[2504.09454]] D$^2$iT: Dynamic Diffusion Transformer for Accurate Image Generation(https://arxiv.org/abs/2504.09454)
Keywords: diffusion, transformer
Abstract: Diffusion models are widely recognized for their ability to generate high-fidelity images. Despite the excellent performance and scalability of the Diffusion Transformer (DiT) architecture, it applies fixed compression across different image regions during the diffusion process, disregarding the naturally varying information densities present in these regions. However, large compression leads to limited local realism, while small compression increases computational complexity and compromises global consistency, ultimately impacting the quality of generated images. To address these limitations, we propose dynamically compressing different image regions by recognizing the importance of different regions, and introduce a novel two-stage framework designed to enhance the effectiveness and efficiency of image generation: (1) Dynamic VAE (DVAE) at first stage employs a hierarchical encoder to encode different image regions at different downsampling rates, tailored to their specific information densities, thereby providing more accurate and natural latent codes for the diffusion process. (2) Dynamic Diffusion Transformer (D$^2$iT) at second stage generates images by predicting multi-grained noise, consisting of coarse-grained (less latent code in smooth regions) and fine-grained (more latent codes in detailed regions), through an novel combination of the Dynamic Grain Transformer and the Dynamic Content Transformer. The strategy of combining rough prediction of noise with detailed regions correction achieves a unification of global consistency and local realism. Comprehensive experiments on various generation tasks validate the effectiveness of our approach. Code will be released at this https URL.

Title: Measuring Leakage in Concept-Based Methods: An Information Theoretic Approach

Authors: Mikael Makonnen, Moritz Vandenhirtz, Sonia Laguna, Julia E Vogt
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09459
Pdf URL: https://arxiv.org/pdf/2504.09459
Copy Paste: [[2504.09459]] Measuring Leakage in Concept-Based Methods: An Information Theoretic Approach(https://arxiv.org/abs/2504.09459)
Keywords: interpretability
Abstract: Concept Bottleneck Models (CBMs) aim to enhance interpretability by structuring predictions around human-understandable concepts. However, unintended information leakage, where predictive signals bypass the concept bottleneck, compromises their transparency. This paper introduces an information-theoretic measure to quantify leakage in CBMs, capturing the extent to which concept embeddings encode additional, unintended information beyond the specified concepts. We validate the measure through controlled synthetic experiments, demonstrating its effectiveness in detecting leakage trends across various configurations. Our findings highlight that feature and concept dimensionality significantly influence leakage, and that classifier choice impacts measurement stability, with XGBoost emerging as the most reliable estimator. Additionally, preliminary investigations indicate that the measure exhibits the anticipated behavior when applied to soft joint CBMs, suggesting its reliability in leakage quantification beyond fully synthetic settings. While this study rigorously evaluates the measure in controlled synthetic experiments, future work can extend its application to real-world datasets.

Title: AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender

Authors: Weixiang Zhao, Jiahe Guo, Yulin Hu, Yang Deng, An Zhang, Xingyu Sui, Xinyang Han, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu
Subjects: cs.CR, cs.CL
Abstract URL: https://arxiv.org/abs/2504.09466
Pdf URL: https://arxiv.org/pdf/2504.09466
Copy Paste: [[2504.09466]] AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender(https://arxiv.org/abs/2504.09466)
Keywords: protect, defense, attack, robust, large language model
Abstract: Despite extensive efforts in safety alignment, large language models (LLMs) remain vulnerable to jailbreak attacks. Activation steering offers a training-free defense method but relies on fixed steering coefficients, resulting in suboptimal protection and increased false rejections of benign inputs. To address this, we propose AdaSteer, an adaptive activation steering method that dynamically adjusts model behavior based on input characteristics. We identify two key properties: Rejection Law (R-Law), which shows that stronger steering is needed for jailbreak inputs opposing the rejection direction, and Harmfulness Law (H-Law), which differentiates adversarial and benign inputs. AdaSteer steers input representations along both the Rejection Direction (RD) and Harmfulness Direction (HD), with adaptive coefficients learned via logistic regression, ensuring robust jailbreak defense while preserving benign input handling. Experiments on LLaMA-3.1, Gemma-2, and Qwen2.5 show that AdaSteer outperforms baseline methods across multiple jailbreak attacks with minimal impact on utility. Our results highlight the potential of interpretable model internals for real-time, flexible safety enforcement in LLMs.

Title: CamMimic: Zero-Shot Image To Camera Motion Personalized Video Generation Using Diffusion Models

Authors: Pooja Guhan, Divya Kothandaraman, Tsung-Wei Huang, Guan-Ming Su, Dinesh Manocha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09472
Pdf URL: https://arxiv.org/pdf/2504.09472
Copy Paste: [[2504.09472]] CamMimic: Zero-Shot Image To Camera Motion Personalized Video Generation Using Diffusion Models(https://arxiv.org/abs/2504.09472)
Keywords: diffusion
Abstract: We introduce CamMimic, an innovative algorithm tailored for dynamic video editing needs. It is designed to seamlessly transfer the camera motion observed in a given reference video onto any scene of the user's choice in a zero-shot manner without requiring any additional data. Our algorithm achieves this using a two-phase strategy by leveraging a text-to-video diffusion model. In the first phase, we develop a multi-concept learning method using a combination of LoRA layers and an orthogonality loss to capture and understand the underlying spatial-temporal characteristics of the reference video as well as the spatial features of the user's desired scene. The second phase proposes a unique homography-based refinement strategy to enhance the temporal and spatial alignment of the generated video. We demonstrate the efficacy of our method through experiments conducted on a dataset containing combinations of diverse scenes and reference videos containing a variety of camera motions. In the absence of an established metric for assessing camera motion transfer between unrelated scenes, we propose CameraScore, a novel metric that utilizes homography representations to measure camera motion similarity between the reference and generated videos. Extensive quantitative and qualitative evaluations demonstrate that our approach generates high-quality, motion-enhanced videos. Additionally, a user study reveals that 70.31% of participants preferred our method for scene preservation, while 90.45% favored it for motion transfer. We hope this work lays the foundation for future advancements in camera motion transfer across different scenes.

Title: Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation

Authors: Yongchao Feng, Yajie Liu, Shuai Yang, Wenrui Cai, Jinqing Zhang, Qiqi Zhan, Ziyue Huang, Hongxi Yan, Qiao Wan, Chenguang Liu, Junzhe Wang, Jiahui Lv, Ziqi Liu, Tengyuan Shi, Qingjie Liu, Yunhong Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09480
Pdf URL: https://arxiv.org/pdf/2504.09480
Copy Paste: [[2504.09480]] Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation(https://arxiv.org/abs/2504.09480)
Keywords: segmentation
Abstract: Vision-Language Model (VLM) have gained widespread adoption in Open-Vocabulary (OV) object detection and segmentation tasks. Despite they have shown promise on OV-related tasks, their effectiveness in conventional vision tasks has thus far been unevaluated. In this work, we present the systematic review of VLM-based detection and segmentation, view VLM as the foundational model and conduct comprehensive evaluations across multiple downstream tasks for the first time: 1) The evaluation spans eight detection scenarios (closed-set detection, domain adaptation, crowded objects, etc.) and eight segmentation scenarios (few-shot, open-world, small object, etc.), revealing distinct performance advantages and limitations of various VLM architectures across tasks. 2) As for detection tasks, we evaluate VLMs under three finetuning granularities: \textit{zero prediction}, \textit{visual fine-tuning}, and \textit{text prompt}, and further analyze how different finetuning strategies impact performance under varied task. 3) Based on empirical findings, we provide in-depth analysis of the correlations between task characteristics, model architectures, and training methodologies, offering insights for future VLM design. 4) We believe that this work shall be valuable to the pattern recognition experts working in the fields of computer vision, multimodal learning, and vision foundation models by introducing them to the problem, and familiarizing them with the current status of the progress while providing promising directions for future research. A project associated with this review and evaluation has been created at this https URL.

Title: HalluShift: Measuring Distribution Shifts towards Hallucination Detection in LLMs

Authors: Sharanya Dasgupta, Sujoy Nath, Arkaprabha Basu, Pourya Shamsolmoali, Swagatam Das
Subjects: cs.CL, cs.AI, cs.ET
Abstract URL: https://arxiv.org/abs/2504.09482
Pdf URL: https://arxiv.org/pdf/2504.09482
Copy Paste: [[2504.09482]] HalluShift: Measuring Distribution Shifts towards Hallucination Detection in LLMs(https://arxiv.org/abs/2504.09482)
Keywords: large language model
Abstract: Large Language Models (LLMs) have recently garnered widespread attention due to their adeptness at generating innovative responses to the given prompts across a multitude of domains. However, LLMs often suffer from the inherent limitation of hallucinations and generate incorrect information while maintaining well-structured and coherent responses. In this work, we hypothesize that hallucinations stem from the internal dynamics of LLMs. Our observations indicate that, during passage generation, LLMs tend to deviate from factual accuracy in subtle parts of responses, eventually shifting toward misinformation. This phenomenon bears a resemblance to human cognition, where individuals may hallucinate while maintaining logical coherence, embedding uncertainty within minor segments of their speech. To investigate this further, we introduce an innovative approach, HalluShift, designed to analyze the distribution shifts in the internal state space and token probabilities of the LLM-generated responses. Our method attains superior performance compared to existing baselines across various benchmark datasets. Our codebase is available at this https URL.

Title: An overview of condensation phenomenon in deep learning

Authors: Zhi-Qin John Xu, Yaoyu Zhang, Zhangchen Zhou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.09484
Pdf URL: https://arxiv.org/pdf/2504.09484
Copy Paste: [[2504.09484]] An overview of condensation phenomenon in deep learning(https://arxiv.org/abs/2504.09484)
Keywords: transformer
Abstract: In this paper, we provide an overview of a common phenomenon, condensation, observed during the nonlinear training of neural networks: During the nonlinear training of neural networks, neurons in the same layer tend to condense into groups with similar outputs. Empirical observations suggest that the number of condensed clusters of neurons in the same layer typically increases monotonically as training progresses. Neural networks with small weight initializations or Dropout optimization can facilitate this condensation process. We also examine the underlying mechanisms of condensation from the perspectives of training dynamics and the structure of the loss landscape. The condensation phenomenon offers valuable insights into the generalization abilities of neural networks and correlates to stronger reasoning abilities in transformer-based language models.

Title: GenEDA: Unleashing Generative Reasoning on Netlist via Multimodal Encoder-Decoder Aligned Foundation Model

Authors: Wenji Fang, Jing Wang, Yao Lu, Shang Liu, Zhiyao Xie
Subjects: cs.LG, cs.AR
Abstract URL: https://arxiv.org/abs/2504.09485
Pdf URL: https://arxiv.org/pdf/2504.09485
Copy Paste: [[2504.09485]] GenEDA: Unleashing Generative Reasoning on Netlist via Multimodal Encoder-Decoder Aligned Foundation Model(https://arxiv.org/abs/2504.09485)
Keywords: generative, large language model
Abstract: The success of foundation AI has motivated the research of circuit foundation models, which are customized to assist the integrated circuit (IC) design process. However, existing pre-trained circuit models are typically limited to standalone encoders for predictive tasks or decoders for generative tasks. These two model types are developed independently, operate on different circuit modalities, and reside in separate latent spaces, which restricts their ability to complement each other for more advanced applications. In this work, we present GenEDA, the first framework that aligns circuit encoders with decoders within a shared latent space. GenEDA bridges the gap between graph-based circuit representations and text-based large language models (LLMs), enabling communication between their respective latent spaces. To achieve the alignment, we propose two paradigms that support both open-source trainable LLMs and commercial frozen LLMs. Built on this aligned architecture, GenEDA enables three unprecedented generative reasoning tasks over netlists, where the model reversely generates the high-level functionality from low-level netlists in different granularities. These tasks extend traditional gate-type prediction to direct generation of full-circuit functionality. Experiments demonstrate that GenEDA significantly boosts advanced LLMs' (e.g., GPT-4o and DeepSeek-V3) performance in all tasks.

Title: Kongzi: A Historical Large Language Model with Fact Enhancement

Authors: Jiashu Yang, Ningning Wang, Yian Zhao, Chaoran Feng, Junjia Du, Hao Pang, Zhirui Fang, Xuxin Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09488
Pdf URL: https://arxiv.org/pdf/2504.09488
Copy Paste: [[2504.09488]] Kongzi: A Historical Large Language Model with Fact Enhancement(https://arxiv.org/abs/2504.09488)
Keywords: large language model
Abstract: The capabilities of the latest large language models (LLMs) have been extended from pure natural language understanding to complex reasoning tasks. However, current reasoning models often exhibit factual inaccuracies in longer reasoning chains, which poses challenges for historical reasoning and limits the potential of LLMs in complex, knowledge-intensive tasks. Historical studies require not only the accurate presentation of factual information but also the ability to establish cross-temporal correlations and derive coherent conclusions from fragmentary and often ambiguous sources. To address these challenges, we propose Kongzi, a large language model specifically designed for historical analysis. Through the integration of curated, high-quality historical data and a novel fact-reinforcement learning strategy, Kongzi demonstrates strong factual alignment and sophisticated reasoning depth. Extensive experiments on tasks such as historical question answering and narrative generation demonstrate that Kongzi outperforms existing models in both factual accuracy and reasoning depth. By effectively addressing the unique challenges inherent in historical texts, Kongzi sets a new standard for the development of accurate and reliable LLMs in professional domains.

Title: Federated Prototype Graph Learning

Authors: Zhengyu Wu, Xunkai Li, Yinlin Zhu, Rong-Hua Li, Guoren Wang, Chenghu Zhou
Subjects: cs.LG, cs.AI, cs.DB, cs.SI
Abstract URL: https://arxiv.org/abs/2504.09493
Pdf URL: https://arxiv.org/pdf/2504.09493
Copy Paste: [[2504.09493]] Federated Prototype Graph Learning(https://arxiv.org/abs/2504.09493)
Keywords: privacy, federate
Abstract: In recent years, Federated Graph Learning (FGL) has gained significant attention for its distributed training capabilities in graph-based machine intelligence applications, mitigating data silos while offering a new perspective for privacy-preserve large-scale graph learning. However, multi-level FGL heterogeneity presents various client-server collaboration challenges: (1) Model-level: The variation in clients for expected performance and scalability necessitates the deployment of heterogeneous models. Unfortunately, most FGL methods rigidly demand identical client models due to the direct model weight aggregation on the server. (2) Data-level: The intricate nature of graphs, marked by the entanglement of node profiles and topology, poses an optimization dilemma. This implies that models obtained by federated training struggle to achieve superior performance. (3) Communication-level: Some FGL methods attempt to increase message sharing among clients or between clients and the server to improve training, which inevitably leads to high communication costs. In this paper, we propose FedPG as a general prototype-guided optimization method for the above multi-level FGL heterogeneity. Specifically, on the client side, we integrate multi-level topology-aware prototypes to capture local graph semantics. Subsequently, on the server side, leveraging the uploaded prototypes, we employ topology-guided contrastive learning and personalized technology to tailor global prototypes for each client, broadcasting them to improve local training. Experiments demonstrate that FedPG outperforms SOTA baselines by an average of 3.57\% in accuracy while reducing communication costs by 168x.

Title: EasyREG: Easy Depth-Based Markerless Registration and Tracking using Augmented Reality Device for Surgical Guidance

Authors: Yue Yang, Christoph Leuze, Brian Hargreaves, Bruce Daniel, Fred Baik
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09498
Pdf URL: https://arxiv.org/pdf/2504.09498
Copy Paste: [[2504.09498]] EasyREG: Easy Depth-Based Markerless Registration and Tracking using Augmented Reality Device for Surgical Guidance(https://arxiv.org/abs/2504.09498)
Keywords: robust
Abstract: The use of Augmented Reality (AR) devices for surgical guidance has gained increasing traction in the medical field. Traditional registration methods often rely on external fiducial markers to achieve high accuracy and real-time performance. However, these markers introduce cumbersome calibration procedures and can be challenging to deploy in clinical settings. While commercial solutions have attempted real-time markerless tracking using the native RGB cameras of AR devices, their accuracy remains questionable for medical guidance, primarily due to occlusions and significant outliers between the live sensor data and the preoperative target anatomy point cloud derived from MRI or CT scans. In this work, we present a markerless framework that relies only on the depth sensor of AR devices and consists of two modules: a registration module for high-precision, outlier-robust target anatomy localization, and a tracking module for real-time pose estimation. The registration module integrates depth sensor error correction, a human-in-the-loop region filtering technique, and a robust global alignment with curvature-aware feature sampling, followed by local ICP refinement, for markerless alignment of preoperative models with patient anatomy. The tracking module employs a fast and robust registration algorithm that uses the initial pose from the registration module to estimate the target pose in real-time. We comprehensively evaluated the performance of both modules through simulation and real-world measurements. The results indicate that our markerless system achieves superior performance for registration and comparable performance for tracking to industrial solutions. The two-module design makes our system a one-stop solution for surgical procedures where the target anatomy moves or stays static during surgery.

Title: PCM-SAR: Physics-Driven Contrastive Mutual Learning for SAR Classification

Authors: Pengfei Wang, Hao Zheng, Zhigang Hu, Aikun Xu, Meiguang Zheng, Liu Yang (School of Computer Science and Engineering, Central South University, Changsha, China)
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.09502
Pdf URL: https://arxiv.org/pdf/2504.09502
Copy Paste: [[2504.09502]] PCM-SAR: Physics-Driven Contrastive Mutual Learning for SAR Classification(https://arxiv.org/abs/2504.09502)
Keywords: extraction
Abstract: Existing SAR image classification methods based on Contrastive Learning often rely on sample generation strategies designed for optical images, failing to capture the distinct semantic and physical characteristics of SAR data. To address this, we propose Physics-Driven Contrastive Mutual Learning for SAR Classification (PCM-SAR), which incorporates domain-specific physical insights to improve sample generation and feature extraction. PCM-SAR utilizes the gray-level co-occurrence matrix (GLCM) to simulate realistic noise patterns and applies semantic detection for unsupervised local sampling, ensuring generated samples accurately reflect SAR imaging properties. Additionally, a multi-level feature fusion mechanism based on mutual learning enables collaborative refinement of feature representations. Notably, PCM-SAR significantly enhances smaller models by refining SAR feature representations, compensating for their limited capacity. Experimental results show that PCM-SAR consistently outperforms SOTA methods across diverse datasets and SAR classification tasks.

Title: MADLLM: Multivariate Anomaly Detection via Pre-trained LLMs

Authors: Wei Tao, Xiaoyang Qu, Kai Lu, Jiguang Wan, Guokuan Li, Jianzong Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09504
Pdf URL: https://arxiv.org/pdf/2504.09504
Copy Paste: [[2504.09504]] MADLLM: Multivariate Anomaly Detection via Pre-trained LLMs(https://arxiv.org/abs/2504.09504)
Keywords: large language model
Abstract: When applying pre-trained large language models (LLMs) to address anomaly detection tasks, the multivariate time series (MTS) modality of anomaly detection does not align with the text modality of LLMs. Existing methods simply transform the MTS data into multiple univariate time series sequences, which can cause many problems. This paper introduces MADLLM, a novel multivariate anomaly detection method via pre-trained LLMs. We design a new triple encoding technique to align the MTS modality with the text modality of LLMs. Specifically, this technique integrates the traditional patch embedding method with two novel embedding approaches: Skip Embedding, which alters the order of patch processing in traditional methods to help LLMs retain knowledge of previous features, and Feature Embedding, which leverages contrastive learning to allow the model to better understand the correlations between different features. Experimental results demonstrate that our method outperforms state-of-the-art methods in various public anomaly detection datasets.

Title: FVOS for MOSE Track of 4th PVUW Challenge: 3rd Place Solution

Authors: Mengjiao Wang, Junpei Zhang, Xu Liu, Yuting Yang, Mengru Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09507
Pdf URL: https://arxiv.org/pdf/2504.09507
Copy Paste: [[2504.09507]] FVOS for MOSE Track of 4th PVUW Challenge: 3rd Place Solution(https://arxiv.org/abs/2504.09507)
Keywords: segmentation
Abstract: Video Object Segmentation (VOS) is one of the most fundamental and challenging tasks in computer vision and has a wide range of applications. Most existing methods rely on spatiotemporal memory networks to extract frame-level features and have achieved promising results on commonly used datasets. However, these methods often struggle in more complex real-world scenarios. This paper addresses this issue, aiming to achieve accurate segmentation of video objects in challenging scenes. We propose fine-tuning VOS (FVOS), optimizing existing methods for specific datasets through tailored training. Additionally, we introduce a morphological post-processing strategy to address the issue of excessively large gaps between adjacent objects in single-model predictions. Finally, we apply a voting-based fusion method on multi-scale segmentation results to generate the final output. Our approach achieves J&F scores of 76.81% and 83.92% during the validation and testing stages, respectively, securing third place overall in the MOSE Track of the 4th PVUW challenge 2025.

Title: DiffuMural: Restoring Dunhuang Murals with Multi-scale Diffusion

Authors: Puyu Han, Jiaju Kang, Yuhang Pan, Erting Pan, Zeyu Zhang, Qunchao Jin, Juntao Jiang, Zhichen Liu, Luqi Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09513
Pdf URL: https://arxiv.org/pdf/2504.09513
Copy Paste: [[2504.09513]] DiffuMural: Restoring Dunhuang Murals with Multi-scale Diffusion(https://arxiv.org/abs/2504.09513)
Keywords: diffusion
Abstract: Large-scale pre-trained diffusion models have produced excellent results in the field of conditional image generation. However, restoration of ancient murals, as an important downstream task in this field, poses significant challenges to diffusion model-based restoration methods due to its large defective area and scarce training samples. Conditional restoration tasks are more concerned with whether the restored part meets the aesthetic standards of mural restoration in terms of overall style and seam detail, and such metrics for evaluating heuristic image complements are lacking in current research. We therefore propose DiffuMural, a combined Multi-scale convergence and Collaborative Diffusion mechanism with ControlNet and cyclic consistency loss to optimise the matching between the generated images and the conditional control. DiffuMural demonstrates outstanding capabilities in mural restoration, leveraging training data from 23 large-scale Dunhuang murals that exhibit consistent visual aesthetics. The model excels in restoring intricate details, achieving a coherent overall appearance, and addressing the unique challenges posed by incomplete murals lacking factual grounding. Our evaluation framework incorporates four key metrics to quantitatively assess incomplete murals: factual accuracy, textural detail, contextual semantics, and holistic visual coherence. Furthermore, we integrate humanistic value assessments to ensure the restored murals retain their cultural and artistic significance. Extensive experiments validate that our method outperforms state-of-the-art (SOTA) approaches in both qualitative and quantitative metrics.

Title: How new data permeates LLM knowledge and how to dilute it

Authors: Chen Sun, Renat Aksitov, Andrey Zhmoginov, Nolan Andrew Miller, Max Vladymyrov, Ulrich Rueckert, Been Kim, Mark Sandler
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09522
Pdf URL: https://arxiv.org/pdf/2504.09522
Copy Paste: [[2504.09522]] How new data permeates LLM knowledge and how to dilute it(https://arxiv.org/abs/2504.09522)
Keywords: robust, large language model
Abstract: Large language models learn and continually learn through the accumulation of gradient-based updates, but how individual pieces of new information affect existing knowledge, leading to both beneficial generalization and problematic hallucination, remains poorly understood. We demonstrate that when learning new information, LLMs exhibit a "priming" effect: learning a new fact can cause the model to inappropriately apply that knowledge in unrelated contexts. To systematically study this phenomenon, we introduce "Outlandish," a carefully curated dataset of 1320 diverse text samples designed to probe how new knowledge permeates through an LLM's existing knowledge base. Using this dataset, we show that the degree of priming after learning new information can be predicted by measuring the token probability of key words before learning. This relationship holds robustly across different model architectures (PALM-2, Gemma, Llama), sizes, and training stages. Finally, we develop two novel techniques to modulate how new knowledge affects existing model behavior: (1) a ``stepping-stone'' text augmentation strategy and (2) an ``ignore-k'' update pruning method. These approaches reduce undesirable priming effects by 50-95\% while preserving the model's ability to learn new information. Our findings provide both empirical insights into how LLMs learn and practical tools for improving the specificity of knowledge insertion in language models. Further materials: this https URL

Title: A Secure Communication Protocol for Remote Keyless Entry System with Adaptive Adjustment of Transmission Parameters

Authors: Jingjing Guo, Bo Tang, Jiayuan Xu, Qingyi Li, Yuyuan Qin, Xinghua Li
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.09527
Pdf URL: https://arxiv.org/pdf/2504.09527
Copy Paste: [[2504.09527]] A Secure Communication Protocol for Remote Keyless Entry System with Adaptive Adjustment of Transmission Parameters(https://arxiv.org/abs/2504.09527)
Keywords: secure, security, attack
Abstract: Remote Keyless Entry (RKE) systems have become a standard feature in modern vehicles, yet their unidirectional fixed-frequency radio communication renders them vulnerable to replay attacks, impersonation attacks, cryptanalysis, and intentional interference. Existing cryptographic authentication methods enhance security but often fail to address real-world constraints such as computational efficiency and radio interference. To mitigate these threats, we designed the Adaptive Frequency-Hopping Algorithm and the Adaptive TXP and PHY Mode Control Algorithm that can dynamically optimize channel selection, transmission power, and PHY modes based on real-time channel quality assessment. To enhance the security and reliability of RKE systems, we propose the Lightweight Vehicle-Key Authentication Protocol. In addition, a prototype of the proposed scheme was implemented to verify its effectiveness in mitigating interference and preventing unauthorized this http URL results show that our scheme significantly enhances communication security and reliability while maintaining low computational overhead. Under mild interference conditions, the packet delivery rate (PDR) of the adaptive scheme increases from 93% to 99.23%, and under strong interference, it improves from 85% to 99.01%. Additionally, the scheme effectively prevents replay and impersonation attacks, ensuring secure vehicle access control by dynamically optimizing communication parameters to maintain stable and reliable transmission.

Title: AeroLite: Tag-Guided Lightweight Generation of Aerial Image Captions

Authors: Xing Zi, Tengjun Ni, Xianjing Fan, Xian Tao, Jun Li, Ali Braytee, Mukesh Prasad
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2504.09528
Pdf URL: https://arxiv.org/pdf/2504.09528
Copy Paste: [[2504.09528]] AeroLite: Tag-Guided Lightweight Generation of Aerial Image Captions(https://arxiv.org/abs/2504.09528)
Keywords: robust, large language model
Abstract: Accurate and automated captioning of aerial imagery is crucial for applications like environmental monitoring, urban planning, and disaster management. However, this task remains challenging due to complex spatial semantics and domain variability. To address these issues, we introduce \textbf{AeroLite}, a lightweight, tag-guided captioning framework designed to equip small-scale language models (1--3B parameters) with robust and interpretable captioning capabilities specifically for remote sensing images. \textbf{AeroLite} leverages GPT-4o to generate a large-scale, semantically rich pseudo-caption dataset by integrating multiple remote sensing benchmarks, including DLRSD, iSAID, LoveDA, WHU, and RSSCN7. To explicitly capture key semantic elements such as orientation and land-use types, AeroLite employs natural language processing techniques to extract relevant semantic tags. These tags are then learned by a dedicated multi-label CLIP encoder, ensuring precise semantic predictions. To effectively fuse visual and semantic information, we propose a novel bridging multilayer perceptron (MLP) architecture, aligning semantic tags with visual embeddings while maintaining minimal computational overhead. AeroLite's flexible design also enables seamless integration with various pretrained large language models. We adopt a two-stage LoRA-based training approach: the initial stage leverages our pseudo-caption dataset to capture broad remote sensing semantics, followed by fine-tuning on smaller, curated datasets like UCM and Sydney Captions to refine domain-specific alignment. Experimental evaluations demonstrate that AeroLite surpasses significantly larger models (e.g., 13B parameters) in standard captioning metrics, including BLEU and METEOR, while maintaining substantially lower computational costs.

Title: Trajectory-guided Motion Perception for Facial Expression Quality Assessment in Neurological Disorders

Authors: Shuchao Duan, Amirhossein Dadashzadeh, Alan Whone, Majid Mirmehdi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09530
Pdf URL: https://arxiv.org/pdf/2504.09530
Copy Paste: [[2504.09530]] Trajectory-guided Motion Perception for Facial Expression Quality Assessment in Neurological Disorders(https://arxiv.org/abs/2504.09530)
Keywords: transformer
Abstract: Automated facial expression quality assessment (FEQA) in neurological disorders is critical for enhancing diagnostic accuracy and improving patient care, yet effectively capturing the subtle motions and nuances of facial muscle movements remains a challenge. We propose to analyse facial landmark trajectories, a compact yet informative representation, that encodes these subtle motions from a high-level structural perspective. Hence, we introduce Trajectory-guided Motion Perception Transformer (TraMP-Former), a novel FEQA framework that fuses landmark trajectory features for fine-grained motion capture with visual semantic cues from RGB frames, ultimately regressing the combined features into a quality score. Extensive experiments demonstrate that TraMP-Former achieves new state-of-the-art performance on benchmark datasets with neurological disorders, including PFED5 (up by 6.51%) and an augmented Toronto NeuroFace (up by 7.62%). Our ablation studies further validate the efficiency and effectiveness of landmark trajectories in FEQA. Our code is available at this https URL.

Title: SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification

Authors: Xiang Hu, Pingping Zhang, Yuhao Wang, Bin Yan, Huchuan Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09549
Pdf URL: https://arxiv.org/pdf/2504.09549
Copy Paste: [[2504.09549]] SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification(https://arxiv.org/abs/2504.09549)
Keywords: robust, diffusion, generative
Abstract: Aerial-Ground Person Re-IDentification (AG-ReID) aims to retrieve specific persons across cameras with different viewpoints. Previous works focus on designing discriminative ReID models to maintain identity consistency despite drastic changes in camera viewpoints. The core idea behind these methods is quite natural, but designing a view-robust network is a very challenging task. Moreover, they overlook the contribution of view-specific features in enhancing the model's capability to represent persons. To address these issues, we propose a novel two-stage feature learning framework named SD-ReID for AG-ReID, which takes advantage of the powerful understanding capacity of generative models, e.g., Stable Diffusion (SD), to generate view-specific features between different viewpoints. In the first stage, we train a simple ViT-based model to extract coarse-grained representations and controllable conditions. Then, in the second stage, we fine-tune the SD model to learn complementary representations guided by the controllable conditions. Furthermore, we propose the View-Refine Decoder (VRD) to obtain additional controllable conditions to generate missing cross-view features. Finally, we use the coarse-grained representations and all-view features generated by SD to retrieve target persons. Extensive experiments on the AG-ReID benchmarks demonstrate the effectiveness of our proposed SD-ReID. The source code will be available upon acceptance.

Title: Mitigating Long-tail Distribution in Oracle Bone Inscriptions: Dataset, Model, and Benchmark

Authors: Jinhao Li, Zijian Chen, Runze Dong, Tingzhu Chen, Changbo Wang, Guangtao Zhai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09555
Pdf URL: https://arxiv.org/pdf/2504.09555
Copy Paste: [[2504.09555]] Mitigating Long-tail Distribution in Oracle Bone Inscriptions: Dataset, Model, and Benchmark(https://arxiv.org/abs/2504.09555)
Keywords: diffusion, generative
Abstract: The oracle bone inscription (OBI) recognition plays a significant role in understanding the history and culture of ancient China. However, the existing OBI datasets suffer from a long-tail distribution problem, leading to biased performance of OBI recognition models across majority and minority classes. With recent advancements in generative models, OBI synthesis-based data augmentation has become a promising avenue to expand the sample size of minority classes. Unfortunately, current OBI datasets lack large-scale structure-aligned image pairs for generative model training. To address these problems, we first present the Oracle-P15K, a structure-aligned OBI dataset for OBI generation and denoising, consisting of 14,542 images infused with domain knowledge from OBI experts. Second, we propose a diffusion model-based pseudo OBI generator, called OBIDiff, to achieve realistic and controllable OBI generation. Given a clean glyph image and a target rubbing-style image, it can effectively transfer the noise style of the original rubbing to the glyph image. Extensive experiments on OBI downstream tasks and user preference studies show the effectiveness of the proposed Oracle-P15K dataset and demonstrate that OBIDiff can accurately preserve inherent glyph structures while transferring authentic rubbing styles effectively.

Title: Syzygy of Thoughts: Improving LLM CoT with the Minimal Free Resolution

Authors: Chenghao Li, Chaoning Zhang, Yi Lu, Jiaquan Zhang, Qigan Sun, Xudong Wang, Jiwei Wei, Guoqing Wang, Yang Yang, Heng Tao Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09566
Pdf URL: https://arxiv.org/pdf/2504.09566
Copy Paste: [[2504.09566]] Syzygy of Thoughts: Improving LLM CoT with the Minimal Free Resolution(https://arxiv.org/abs/2504.09566)
Keywords: robust, large language model
Abstract: Chain-of-Thought (CoT) prompting enhances the reasoning of large language models (LLMs) by decomposing problems into sequential steps, mimicking human logic and reducing errors. However, complex tasks with vast solution spaces and vague constraints often exceed the capacity of a single reasoning chain. Inspired by Minimal Free Resolution (MFR) in commutative algebra and algebraic geometry, we propose Syzygy of Thoughts (SoT)-a novel framework that extends CoT by introducing auxiliary, interrelated reasoning paths. SoT captures deeper logical dependencies, enabling more robust and structured problem-solving. MFR decomposes a module into a sequence of free modules with minimal rank, providing a structured analytical approach to complex systems. This method introduces the concepts of "Module", "Betti numbers","Freeness", "Mapping", "Exactness" and "Minimality", enabling the systematic decomposition of the original complex problem into logically complete minimal subproblems while preserving key problem features and reducing reasoning length. We tested SoT across diverse datasets (e.g., GSM8K, MATH) and models (e.g., GPT-4o-mini, Qwen2.5), achieving inference accuracy that matches or surpasses mainstream CoTs standards. Additionally, by aligning the sampling process with algebraic constraints, our approach enhances the scalability of inference time in LLMs, ensuring both transparent reasoning and high performance. Our code will be publicly available at this https URL.

Title: LLMs Can Achieve High-quality Simultaneous Machine Translation as Efficiently as Offline

Authors: Biao Fu, Minpeng Liao, Kai Fan, Chengxi Li, Liang Zhang, Yidong Chen, Xiaodong Shi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09570
Pdf URL: https://arxiv.org/pdf/2504.09570
Copy Paste: [[2504.09570]] LLMs Can Achieve High-quality Simultaneous Machine Translation as Efficiently as Offline(https://arxiv.org/abs/2504.09570)
Keywords: large language model
Abstract: When the complete source sentence is provided, Large Language Models (LLMs) perform excellently in offline machine translation even with a simple prompt "Translate the following sentence from [src lang] into [tgt lang]:". However, in many real scenarios, the source tokens arrive in a streaming manner and simultaneous machine translation (SiMT) is required, then the efficiency and performance of decoder-only LLMs are significantly limited by their auto-regressive nature. To enable LLMs to achieve high-quality SiMT as efficiently as offline translation, we propose a novel paradigm that includes constructing supervised fine-tuning (SFT) data for SiMT, along with new training and inference strategies. To replicate the token input/output stream in SiMT, the source and target tokens are rearranged into an interleaved sequence, separated by special tokens according to varying latency requirements. This enables powerful LLMs to learn read and write operations adaptively, based on varying latency prompts, while still maintaining efficient auto-regressive decoding. Experimental results show that, even with limited SFT data, our approach achieves state-of-the-art performance across various SiMT benchmarks, and preserves the original abilities of offline translation. Moreover, our approach generalizes well to document-level SiMT setting without requiring specific fine-tuning, even beyond the offline translation model.

Title: Eccfrog512ck2: An Enhanced 512-bit Weierstrass Elliptic Curve

Authors: Víctor Duarte Melo, William J. Buchanan
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.09584
Pdf URL: https://arxiv.org/pdf/2504.09584
Copy Paste: [[2504.09584]] Eccfrog512ck2: An Enhanced 512-bit Weierstrass Elliptic Curve(https://arxiv.org/abs/2504.09584)
Keywords: security, attack
Abstract: Whilst many key exchange and digital signature methods use the NIST P256 (secp256r1) and secp256k1 curves, there is often a demand for increased security. With these curves, we have a 128-bit security. These security levels can be increased to 256-bit security with NIST P-521 Curve 448 and Brainpool-P512. This paper outlines a new curve - Eccfrog512ck2 - and which provides 256-bit security and enhanced performance over NIST P-521. Along with this, it has side-channel resistance and is designed to avoid weaknesses such as related to the MOV attack. It shows that Eccfrog512ck2 can have a 61.5% speed-up on scalar multiplication and a 33.3% speed-up on point generation over the NIST P-521 curve.

Title: Short-Path Prompting in LLMs: Analyzing Reasoning Instability and Solutions for Robust Performance

Authors: Zuoli Tang, Junjie Ou, Kaiqin Hu, Chunwei Wu, Zhaoxin Huan, Chilin Fu, Xiaolu Zhang, Jun Zhou, Chenliang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09586
Pdf URL: https://arxiv.org/pdf/2504.09586
Copy Paste: [[2504.09586]] Short-Path Prompting in LLMs: Analyzing Reasoning Instability and Solutions for Robust Performance(https://arxiv.org/abs/2504.09586)
Keywords: robust, large language model
Abstract: Recent years have witnessed significant progress in large language models' (LLMs) reasoning, which is largely due to the chain-of-thought (CoT) approaches, allowing models to generate intermediate reasoning steps before reaching the final answer. Building on these advances, state-of-the-art LLMs are instruction-tuned to provide long and detailed CoT pathways when responding to reasoning-related questions. However, human beings are naturally cognitive misers and will prompt language models to give rather short responses, thus raising a significant conflict with CoT reasoning. In this paper, we delve into how LLMs' reasoning performance changes when users provide short-path prompts. The results and analysis reveal that language models can reason effectively and robustly without explicit CoT prompts, while under short-path prompting, LLMs' reasoning ability drops significantly and becomes unstable, even on grade-school problems. To address this issue, we propose two approaches: an instruction-guided approach and a fine-tuning approach, both designed to effectively manage the conflict. Experimental results show that both methods achieve high accuracy, providing insights into the trade-off between instruction adherence and reasoning accuracy in current models.

Title: TextSplat: Text-Guided Semantic Fusion for Generalizable Gaussian Splatting

Authors: Zhicong Wu, Hongbin Xu, Gang Xu, Ping Nie, Zhixin Yan, Jinkai Zheng, Liangqiong Qu, Ming Li, Liqiang Nie
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09588
Pdf URL: https://arxiv.org/pdf/2504.09588
Copy Paste: [[2504.09588]] TextSplat: Text-Guided Semantic Fusion for Generalizable Gaussian Splatting(https://arxiv.org/abs/2504.09588)
Keywords: robust, diffusion, segmentation
Abstract: Recent advancements in Generalizable Gaussian Splatting have enabled robust 3D reconstruction from sparse input views by utilizing feed-forward Gaussian Splatting models, achieving superior cross-scene generalization. However, while many methods focus on geometric consistency, they often neglect the potential of text-driven guidance to enhance semantic understanding, which is crucial for accurately reconstructing fine-grained details in complex scenes. To address this limitation, we propose TextSplat--the first text-driven Generalizable Gaussian Splatting framework. By employing a text-guided fusion of diverse semantic cues, our framework learns robust cross-modal feature representations that improve the alignment of geometric and semantic information, producing high-fidelity 3D reconstructions. Specifically, our framework employs three parallel modules to obtain complementary representations: the Diffusion Prior Depth Estimator for accurate depth information, the Semantic Aware Segmentation Network for detailed semantic information, and the Multi-View Interaction Network for refined cross-view features. Then, in the Text-Guided Semantic Fusion Module, these representations are integrated via the text-guided and attention-based feature aggregation mechanism, resulting in enhanced 3D Gaussian parameters enriched with detailed semantic cues. Experimental results on various benchmark datasets demonstrate improved performance compared to existing methods across multiple evaluation metrics, validating the effectiveness of our framework. The code will be publicly available.

Title: ControlNET: A Firewall for RAG-based LLM System

Authors: Hongwei Yao, Haoran Shi, Yidou Chen, Yixin Jiang, Cong Wang, Zhan Qin, Kui Ren, Chun Chen
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2504.09593
Pdf URL: https://arxiv.org/pdf/2504.09593
Copy Paste: [[2504.09593]] ControlNET: A Firewall for RAG-based LLM System(https://arxiv.org/abs/2504.09593)
Keywords: secure, security, privacy, defense, attack, robust, large language model
Abstract: Retrieval-Augmented Generation (RAG) has significantly enhanced the factual accuracy and domain adaptability of Large Language Models (LLMs). This advancement has enabled their widespread deployment across sensitive domains such as healthcare, finance, and enterprise applications. RAG mitigates hallucinations by integrating external knowledge, yet introduces privacy risk and security risk, notably data breaching risk and data poisoning risk. While recent studies have explored prompt injection and poisoning attacks, there remains a significant gap in comprehensive research on controlling inbound and outbound query flows to mitigate these threats. In this paper, we propose an AI firewall, ControlNET, designed to safeguard RAG-based LLM systems from these vulnerabilities. ControlNET controls query flows by leveraging activation shift phenomena to detect adversarial queries and mitigate their impact through semantic divergence. We conduct comprehensive experiments on four different benchmark datasets including Msmarco, HotpotQA, FinQA, and MedicalSys using state-of-the-art open source LLMs (Llama3, Vicuna, and Mistral). Our results demonstrate that ControlNET achieves over 0.909 AUROC in detecting and mitigating security threats while preserving system harmlessness. Overall, ControlNET offers an effective, robust, harmless defense mechanism, marking a significant advancement toward the secure deployment of RAG-based LLM systems.

Title: Mixture-of-Shape-Experts (MoSE): End-to-End Shape Dictionary Framework to Prompt SAM for Generalizable Medical Segmentation

Authors: Jia Wei, Xiaoqi Zhao, Jonghye Woo, Jinsong Ouyang, Georges El Fakhri, Qingyu Chen, Xiaofeng Liu
Subjects: cs.CV, cs.LG, cs.MM, eess.IV, physics.med-ph
Abstract URL: https://arxiv.org/abs/2504.09601
Pdf URL: https://arxiv.org/pdf/2504.09601
Copy Paste: [[2504.09601]] Mixture-of-Shape-Experts (MoSE): End-to-End Shape Dictionary Framework to Prompt SAM for Generalizable Medical Segmentation(https://arxiv.org/abs/2504.09601)
Keywords: robust, segmentation
Abstract: Single domain generalization (SDG) has recently attracted growing attention in medical image segmentation. One promising strategy for SDG is to leverage consistent semantic shape priors across different imaging protocols, scanner vendors, and clinical sites. However, existing dictionary learning methods that encode shape priors often suffer from limited representational power with a small set of offline computed shape elements, or overfitting when the dictionary size grows. Moreover, they are not readily compatible with large foundation models such as the Segment Anything Model (SAM). In this paper, we propose a novel Mixture-of-Shape-Experts (MoSE) framework that seamlessly integrates the idea of mixture-of-experts (MoE) training into dictionary learning to efficiently capture diverse and robust shape priors. Our method conceptualizes each dictionary atom as a shape expert, which specializes in encoding distinct semantic shape information. A gating network dynamically fuses these shape experts into a robust shape map, with sparse activation guided by SAM encoding to prevent overfitting. We further provide this shape map as a prompt to SAM, utilizing the powerful generalization capability of SAM through bidirectional integration. All modules, including the shape dictionary, are trained in an end-to-end manner. Extensive experiments on multiple public datasets demonstrate its effectiveness.

Title: Mitigating Many-Shot Jailbreaking

Authors: Christopher M. Ackerman, Nina Panickssery
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2504.09604
Pdf URL: https://arxiv.org/pdf/2504.09604
Copy Paste: [[2504.09604]] Mitigating Many-Shot Jailbreaking(https://arxiv.org/abs/2504.09604)
Keywords: attack
Abstract: Many-shot jailbreaking (MSJ) is an adversarial technique that exploits the long context windows of modern LLMs to circumvent model safety training by including in the prompt many examples of a ``fake'' assistant responding inappropriately before the final request. With enough examples, the model's in-context learning abilities override its safety training, and it responds as if it were the ``fake'' assistant. In this work, we probe the effectiveness of different fine tuning and input sanitization approaches on mitigating MSJ attacks, alone and in combination. We find incremental mitigation effectiveness for each, and we show that the combined techniques significantly reduce the effectiveness of MSJ attacks, while retaining model performance in benign in-context learning and conversational tasks. We suggest that our approach could meaningfully ameliorate this vulnerability if incorporated into model safety post-training.

Title: Early-Bird Diffusion: Investigating and Leveraging Timestep-Aware Early-Bird Tickets in Diffusion Models for Efficient Training

Authors: Lexington Whalen, Zhenbang Du, Haoran You, Chaojian Li, Sixu Li, Yingyan (Celine)Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09606
Pdf URL: https://arxiv.org/pdf/2504.09606
Copy Paste: [[2504.09606]] Early-Bird Diffusion: Investigating and Leveraging Timestep-Aware Early-Bird Tickets in Diffusion Models for Efficient Training(https://arxiv.org/abs/2504.09606)
Keywords: diffusion, generative
Abstract: Training diffusion models (DMs) requires substantial computational resources due to multiple forward and backward passes across numerous timesteps, motivating research into efficient training techniques. In this paper, we propose EB-Diff-Train, a new efficient DM training approach that is orthogonal to other methods of accelerating DM training, by investigating and leveraging Early-Bird (EB) tickets -- sparse subnetworks that manifest early in the training process and maintain high generation quality. We first investigate the existence of traditional EB tickets in DMs, enabling competitive generation quality without fully training a dense model. Then, we delve into the concept of diffusion-dedicated EB tickets, drawing on insights from varying importance of different timestep regions. These tickets adapt their sparsity levels according to the importance of corresponding timestep regions, allowing for aggressive sparsity during non-critical regions while conserving computational resources for crucial timestep regions. Building on this, we develop an efficient DM training technique that derives timestep-aware EB tickets, trains them in parallel, and combines them during inference for image generation. Extensive experiments validate the existence of both traditional and timestep-aware EB tickets, as well as the effectiveness of our proposed EB-Diff-Train method. This approach can significantly reduce training time both spatially and temporally -- achieving 2.9$\times$ to 5.8$\times$ speedups over training unpruned dense models, and up to 10.3$\times$ faster training compared to standard train-prune-finetune pipelines -- without compromising generative quality. Our code is available at this https URL.

Title: Tokenize Image Patches: Global Context Fusion for Effective Haze Removal in Large Images

Authors: Jiuchen Chen, Xinyu Yan, Qizhi Xu, Kaiqi Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09621
Pdf URL: https://arxiv.org/pdf/2504.09621
Copy Paste: [[2504.09621]] Tokenize Image Patches: Global Context Fusion for Effective Haze Removal in Large Images(https://arxiv.org/abs/2504.09621)
Keywords: extraction
Abstract: Global contextual information and local detail features are essential for haze removal tasks. Deep learning models perform well on small, low-resolution images, but they encounter difficulties with large, high-resolution ones due to GPU memory limitations. As a compromise, they often resort to image slicing or downsampling. The former diminishes global information, while the latter discards high-frequency details. To address these challenges, we propose DehazeXL, a haze removal method that effectively balances global context and local feature extraction, enabling end-to-end modeling of large images on mainstream GPU hardware. Additionally, to evaluate the efficiency of global context utilization in haze removal performance, we design a visual attribution method tailored to the characteristics of haze removal tasks. Finally, recognizing the lack of benchmark datasets for haze removal in large images, we have developed an ultra-high-resolution haze removal dataset (8KDehaze) to support model training and testing. It includes 10000 pairs of clear and hazy remote sensing images, each sized at 8192 $\times$ 8192 pixels. Extensive experiments demonstrate that DehazeXL can infer images up to 10240 $\times$ 10240 pixels with only 21 GB of memory, achieving state-of-the-art results among all evaluated methods. The source code and experimental dataset are available at this https URL.

Title: Quantization Error Propagation: Revisiting Layer-Wise Post-Training Quantization

Authors: Yamato Arai, Yuma Ichikawa
Subjects: cs.LG, stat.AP, stat.ME, stat.ML
Abstract URL: https://arxiv.org/abs/2504.09629
Pdf URL: https://arxiv.org/pdf/2504.09629
Copy Paste: [[2504.09629]] Quantization Error Propagation: Revisiting Layer-Wise Post-Training Quantization(https://arxiv.org/abs/2504.09629)
Keywords: large language model
Abstract: Layer-wise post-training quantization has emerged as a widely used technique for compressing large language models (LLMs) without retraining. However, recent progress in this line of research is saturating, underscoring the need to revisit its core limitation and explore further improvements. This study identifies a critical bottleneck in existing layer-wise PTQ methods: the accumulation of quantization errors across layers significantly degrades performance, particularly in low-bit regimes. To address this, we propose Quantization Error Propagation (QEP), a lightweight and general framework that enhances layer-wise PTQ by explicitly propagating the quantization error which enable compensating for accumulated quantization errors. Additionally, we introduce a tunable propagation mechanism that allows for control over both propagation strength and computational overhead, making the framework adaptable to various architectures and resource constraints. Empirical evaluation on LLaMA2 models (7B, 13B, 70B) demonstrate that incorporating QEP into standard layer-wise PTQ pipelines outperforms standard PTQ methods. Notably, QEP yields substantial performance improvements under extreme low-bit quantization settings.

Title: Leveraging Reasoning Model Answers to Enhance Non-Reasoning Model Capability

Authors: Haotian Wang, Han Zhao, Shuaiting Chen, Xiaoyu Tian, Sitong Zhao, Yunjie Ji, Yiping Peng, Xiangang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09639
Pdf URL: https://arxiv.org/pdf/2504.09639
Copy Paste: [[2504.09639]] Leveraging Reasoning Model Answers to Enhance Non-Reasoning Model Capability(https://arxiv.org/abs/2504.09639)
Keywords: large language model
Abstract: Recent advancements in large language models (LLMs), such as DeepSeek-R1 and OpenAI-o1, have demonstrated the significant effectiveness of test-time scaling, achieving substantial performance gains across various benchmarks. These advanced models utilize deliberate "thinking" steps to systematically enhance answer quality. In this paper, we propose leveraging these high-quality outputs generated by reasoning-intensive models to improve less computationally demanding, non-reasoning models. We explore and compare methodologies for utilizing the answers produced by reasoning models to train and improve non-reasoning models. Through straightforward Supervised Fine-Tuning (SFT) experiments on established benchmarks, we demonstrate consistent improvements across various benchmarks, underscoring the potential of this approach for advancing the ability of models to answer questions directly.

Title: Iterative Self-Training for Code Generation via Reinforced Re-Ranking

Authors: Nikita Sorokin, Ivan Sedykh, Valentin Malykh
Subjects: cs.CL, cs.IR, cs.SE
Abstract URL: https://arxiv.org/abs/2504.09643
Pdf URL: https://arxiv.org/pdf/2504.09643
Copy Paste: [[2504.09643]] Iterative Self-Training for Code Generation via Reinforced Re-Ranking(https://arxiv.org/abs/2504.09643)
Keywords: robust, generative
Abstract: Generating high-quality code that solves complex programming tasks is challenging, especially with current decoder-based models that produce highly stochastic outputs. In code generation, even minor errors can easily break the entire solution. Leveraging multiple sampled solutions can significantly improve the overall output quality. One effective way to enhance code generation is by pairing a code generation model with a reranker model, which selects the best solution from the generated samples. We propose a novel iterative self-training approach for self-training reranker models using Proximal Policy Optimization (PPO), aimed at improving both reranking accuracy and the overall code generation process. Unlike traditional PPO approaches, where the focus is on optimizing a generative model with a reward model, our approach emphasizes the development of a robust reward/reranking model. This model improves the quality of generated code through reranking and addresses problems and errors that the reward model might overlook during PPO alignment with the reranker. Our method iteratively refines the training dataset by re-evaluating outputs, identifying high-scoring negative examples, and incorporating them into the training loop, that boosting model performance. Our evaluation on the MultiPL-E dataset demonstrates that our 13.4B parameter model outperforms a 33B model in code generation quality while being three times faster. Moreover, it achieves performance comparable to GPT-4 and surpasses it in one programming language.

Title: SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model

Authors: Kaiyu Li, Zepeng Xin, Li Pang, Chao Pang, Yupeng Deng, Jing Yao, Guisong Xia, Deyu Meng, Zhi Wang, Xiangyong Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09644
Pdf URL: https://arxiv.org/pdf/2504.09644
Copy Paste: [[2504.09644]] SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model(https://arxiv.org/abs/2504.09644)
Keywords: large language model, segmentation
Abstract: Remote sensing has become critical for understanding environmental dynamics, urban planning, and disaster management. However, traditional remote sensing workflows often rely on explicit segmentation or detection methods, which struggle to handle complex, implicit queries that require reasoning over spatial context, domain knowledge, and implicit user intent. Motivated by this, we introduce a new task, \ie, geospatial pixel reasoning, which allows implicit querying and reasoning and generates the mask of the target region. To advance this task, we construct and release the first large-scale benchmark dataset called EarthReason, which comprises 5,434 manually annotated image masks with over 30,000 implicit question-answer pairs. Moreover, we propose SegEarth-R1, a simple yet effective language-guided segmentation baseline that integrates a hierarchical visual encoder, a large language model (LLM) for instruction parsing, and a tailored mask generator for spatial correlation. The design of SegEarth-R1 incorporates domain-specific adaptations, including aggressive visual token compression to handle ultra-high-resolution remote sensing images, a description projection module to fuse language and multi-scale features, and a streamlined mask prediction pipeline that directly queries description embeddings. Extensive experiments demonstrate that SegEarth-R1 achieves state-of-the-art performance on both reasoning and referring segmentation tasks, significantly outperforming traditional and LLM-based segmentation methods. Our data and code will be released at this https URL.

Title: Myanmar XNLI: Building a Dataset and Exploring Low-resource Approaches to Natural Language Inference with Myanmar

Authors: Aung Kyaw Htet, Mark Dras
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09645
Pdf URL: https://arxiv.org/pdf/2504.09645
Copy Paste: [[2504.09645]] Myanmar XNLI: Building a Dataset and Exploring Low-resource Approaches to Natural Language Inference with Myanmar(https://arxiv.org/abs/2504.09645)
Keywords: large language model
Abstract: Despite dramatic recent progress in NLP, it is still a major challenge to apply Large Language Models (LLM) to low-resource languages. This is made visible in benchmarks such as Cross-Lingual Natural Language Inference (XNLI), a key task that demonstrates cross-lingual capabilities of NLP systems across a set of 15 languages. In this paper, we extend the XNLI task for one additional low-resource language, Myanmar, as a proxy challenge for broader low-resource languages, and make three core contributions. First, we build a dataset called Myanmar XNLI (myXNLI) using community crowd-sourced methods, as an extension to the existing XNLI corpus. This involves a two-stage process of community-based construction followed by expert verification; through an analysis, we demonstrate and quantify the value of the expert verification stage in the context of community-based construction for low-resource languages. We make the myXNLI dataset available to the community for future research. Second, we carry out evaluations of recent multilingual language models on the myXNLI benchmark, as well as explore data-augmentation methods to improve model performance. Our data-augmentation methods improve model accuracy by up to 2 percentage points for Myanmar, while uplifting other languages at the same time. Third, we investigate how well these data-augmentation methods generalise to other low-resource languages in the XNLI dataset.

Title: RANSAC Revisited: An Improved Algorithm for Robust Subspace Recovery under Adversarial and Noisy Corruptions

Authors: Guixian Chen, Jianhao Ma, Salar Fattahi
Subjects: cs.LG, cs.CV, stat.CO
Abstract URL: https://arxiv.org/abs/2504.09648
Pdf URL: https://arxiv.org/pdf/2504.09648
Copy Paste: [[2504.09648]] RANSAC Revisited: An Improved Algorithm for Robust Subspace Recovery under Adversarial and Noisy Corruptions(https://arxiv.org/abs/2504.09648)
Keywords: robust
Abstract: In this paper, we study the problem of robust subspace recovery (RSR) in the presence of both strong adversarial corruptions and Gaussian noise. Specifically, given a limited number of noisy samples -- some of which are tampered by an adaptive and strong adversary -- we aim to recover a low-dimensional subspace that approximately contains a significant fraction of the uncorrupted samples, up to an error that scales with the Gaussian noise. Existing approaches to this problem often suffer from high computational costs or rely on restrictive distributional assumptions, limiting their applicability in truly adversarial settings. To address these challenges, we revisit the classical random sample consensus (RANSAC) algorithm, which offers strong robustness to adversarial outliers, but sacrifices efficiency and robustness against Gaussian noise and model misspecification in the process. We propose a two-stage algorithm, RANSAC+, that precisely pinpoints and remedies the failure modes of standard RANSAC. Our method is provably robust to both Gaussian and adversarial corruptions, achieves near-optimal sample complexity without requiring prior knowledge of the subspace dimension, and is more efficient than existing RANSAC-type methods.

Title: Bridging Immutability with Flexibility: A Scheme for Secure and Efficient Smart Contract Upgrades

Authors: Tahrim Hossain, Sakib Hassan, Faisal Haque Bappy, Muhammad Nur Yanhaona, Tarannum Shaila Zaman, Tariqul Islam
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.09652
Pdf URL: https://arxiv.org/pdf/2504.09652
Copy Paste: [[2504.09652]] Bridging Immutability with Flexibility: A Scheme for Secure and Efficient Smart Contract Upgrades(https://arxiv.org/abs/2504.09652)
Keywords: secure, security
Abstract: The emergence of blockchain technology has revolutionized contract execution through the introduction of smart contracts. Ethereum, the leading blockchain platform, leverages smart contracts to power decentralized applications (DApps), enabling transparent and self-executing systems across various domains. While the immutability of smart contracts enhances security and trust, it also poses significant challenges for updates, defect resolution, and adaptation to changing requirements. Existing upgrade mechanisms are complex, resource-intensive, and costly in terms of gas consumption, often compromising security and limiting practical adoption. To address these challenges, we propose FlexiContracts+, a novel scheme that reimagines smart contracts by enabling secure, in-place upgrades on Ethereum while preserving historical data without relying on multiple contracts or extensive pre-deployment planning. FlexiContracts+ enhances security, simplifies development, reduces engineering overhead, and supports adaptable, expandable smart contracts. Comprehensive testing demonstrates that FlexiContracts+ achieves a practical balance between immutability and flexibility, advancing the capabilities of smart contract systems.

Title: KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation

Authors: Xingrui Wang, Jiang Liu, Ze Wang, Xiaodong Yu, Jialian Wu, Ximeng Sun, Yusheng Su, Alan Yuille, Zicheng Liu, Emad Barsoum
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09656
Pdf URL: https://arxiv.org/pdf/2504.09656
Copy Paste: [[2504.09656]] KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation(https://arxiv.org/abs/2504.09656)
Keywords: diffusion
Abstract: Generating video from various conditions, such as text, image, and audio, enables both spatial and temporal control, leading to high-quality generation results. Videos with dramatic motions often require a higher frame rate to ensure smooth motion. Currently, most audio-to-visual animation models use uniformly sampled frames from video clips. However, these uniformly sampled frames fail to capture significant key moments in dramatic motions at low frame rates and require significantly more memory when increasing the number of frames directly. In this paper, we propose KeyVID, a keyframe-aware audio-to-visual animation framework that significantly improves the generation quality for key moments in audio signals while maintaining computation efficiency. Given an image and an audio input, we first localize keyframe time steps from the audio. Then, we use a keyframe generator to generate the corresponding visual keyframes. Finally, we generate all intermediate frames using the motion interpolator. Through extensive experiments, we demonstrate that KeyVID significantly improves audio-video synchronization and video quality across multiple datasets, particularly for highly dynamic motions. The code is released in this https URL.

Title: Ordinary Least Squares as an Attention Mechanism

Authors: Philippe Goulet Coulombe
Subjects: cs.LG, econ.EM, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2504.09663
Pdf URL: https://arxiv.org/pdf/2504.09663
Copy Paste: [[2504.09663]] Ordinary Least Squares as an Attention Mechanism(https://arxiv.org/abs/2504.09663)
Keywords: transformer, large language model
Abstract: I show that ordinary least squares (OLS) predictions can be rewritten as the output of a restricted attention module, akin to those forming the backbone of large language models. This connection offers an alternative perspective on attention beyond the conventional information retrieval framework, making it more accessible to researchers and analysts with a background in traditional statistics. It falls into place when OLS is framed as a similarity-based method in a transformed regressor space, distinct from the standard view based on partial correlations. In fact, the OLS solution can be recast as the outcome of an alternative problem: minimizing squared prediction errors by optimizing the embedding space in which training and test vectors are compared via inner products. Rather than estimating coefficients directly, we equivalently learn optimal encoding and decoding operations for predictors. From this vantage point, OLS maps naturally onto the query-key-value structure of attention mechanisms. Building on this foundation, I discuss key elements of Transformer-style attention and draw connections to classic ideas from time series econometrics.

Title: Adapting to the Unknown: Robust Meta-Learning for Zero-Shot Financial Time Series Forecasting

Authors: Anxian Liu, Junying Ma, Guang Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.09664
Pdf URL: https://arxiv.org/pdf/2504.09664
Copy Paste: [[2504.09664]] Adapting to the Unknown: Robust Meta-Learning for Zero-Shot Financial Time Series Forecasting(https://arxiv.org/abs/2504.09664)
Keywords: robust
Abstract: Financial time series forecasting in the zero-shot setting is essential for risk management and investment decision-making, particularly during abrupt market regime shifts or in emerging markets with limited historical data. While Model-Agnostic Meta-Learning (MAML)-based approaches have shown promise in this domain, existing meta task construction strategies often lead to suboptimal performance, especially when dealing with highly turbulent financial time series. To address this challenge, we propose a novel task construction method that leverages learned embeddings for more effective meta-learning in the zero-shot setting. Specifically, we construct two complementary types of meta-tasks based on the learned embeddings: intra-cluster tasks and inter-cluster tasks. To capture diverse fine-grained patterns, we apply stochastic projection matrices to the learned embeddings and use clustering algorithm to form the tasks. Additionally, to improve generalization capabilities, we employ hard task mining strategies and leverage inter-cluster tasks to identify invariant patterns across different time series. Extensive experiments on the real world financial dataset demonstrate that our method significantly outperforms existing approaches, showing better generalization ability in the zero-shot scenario.

Title: CLEAR-KGQA: Clarification-Enhanced Ambiguity Resolution for Knowledge Graph Question Answering

Authors: Liqiang Wen, Guanming Xiong, Tong Mo, Bing Li, Weiping Li, Wen Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09665
Pdf URL: https://arxiv.org/pdf/2504.09665
Copy Paste: [[2504.09665]] CLEAR-KGQA: Clarification-Enhanced Ambiguity Resolution for Knowledge Graph Question Answering(https://arxiv.org/abs/2504.09665)
Keywords: large language model
Abstract: This study addresses the challenge of ambiguity in knowledge graph question answering (KGQA). While recent KGQA systems have made significant progress, particularly with the integration of large language models (LLMs), they typically assume user queries are unambiguous, which is an assumption that rarely holds in real-world applications. To address these limitations, we propose a novel framework that dynamically handles both entity ambiguity (e.g., distinguishing between entities with similar names) and intent ambiguity (e.g., clarifying different interpretations of user queries) through interactive clarification. Our approach employs a Bayesian inference mechanism to quantify query ambiguity and guide LLMs in determining when and how to request clarification from users within a multi-turn dialogue framework. We further develop a two-agent interaction framework where an LLM-based user simulator enables iterative refinement of logical forms through simulated user feedback. Experimental results on the WebQSP and CWQ dataset demonstrate that our method significantly improves performance by effectively resolving semantic ambiguities. Additionally, we contribute a refined dataset of disambiguated queries, derived from interaction histories, to facilitate future research in this direction.

Title: Can LLMs Revolutionize the Design of Explainable and Efficient TinyML Models?

Authors: Christophe El Zeinaty, Wassim Hamidouche, Glenn Herrou, Daniel Menard, Merouane Debbah
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09685
Pdf URL: https://arxiv.org/pdf/2504.09685
Copy Paste: [[2504.09685]] Can LLMs Revolutionize the Design of Explainable and Efficient TinyML Models?(https://arxiv.org/abs/2504.09685)
Keywords: explainability, transformer, large language model
Abstract: This paper introduces a novel framework for designing efficient neural network architectures specifically tailored to tiny machine learning (TinyML) platforms. By leveraging large language models (LLMs) for neural architecture search (NAS), a vision transformer (ViT)-based knowledge distillation (KD) strategy, and an explainability module, the approach strikes an optimal balance between accuracy, computational efficiency, and memory usage. The LLM-guided search explores a hierarchical search space, refining candidate architectures through Pareto optimization based on accuracy, multiply-accumulate operations (MACs), and memory metrics. The best-performing architectures are further fine-tuned using logits-based KD with a pre-trained ViT-B/16 model, which enhances generalization without increasing model size. Evaluated on the CIFAR-100 dataset and deployed on an STM32H7 microcontroller (MCU), the three proposed models, LMaNet-Elite, LMaNet-Core, and QwNet-Core, achieve accuracy scores of 74.50%, 74.20% and 73.00%, respectively. All three models surpass current state-of-the-art (SOTA) models, such as MCUNet-in3/in4 (69.62% / 72.86%) and XiNet (72.27%), while maintaining a low computational cost of less than 100 million MACs and adhering to the stringent 320 KB static random-access memory (SRAM) constraint. These results demonstrate the efficiency and performance of the proposed framework for TinyML platforms, underscoring the potential of combining LLM-driven search, Pareto optimization, KD, and explainability to develop accurate, efficient, and interpretable models. This approach opens new possibilities in NAS, enabling the design of efficient architectures specifically suited for TinyML.

Title: Computer-Aided Layout Generation for Building Design: A Review

Authors: Jiachen Liu, Yuan Xue, Haomiao Ni, Rui Yu, Zihan Zhou, Sharon X. Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09694
Pdf URL: https://arxiv.org/pdf/2504.09694
Copy Paste: [[2504.09694]] Computer-Aided Layout Generation for Building Design: A Review(https://arxiv.org/abs/2504.09694)
Keywords: generative
Abstract: Generating realistic building layouts for automatic building design has been studied in both the computer vision and architecture domains. Traditional approaches from the architecture domain, which are based on optimization techniques or heuristic design guidelines, can synthesize desirable layouts, but usually require post-processing and involve human interaction in the design pipeline, making them costly and timeconsuming. The advent of deep generative models has significantly improved the fidelity and diversity of the generated architecture layouts, reducing the workload by designers and making the process much more efficient. In this paper, we conduct a comprehensive review of three major research topics of architecture layout design and generation: floorplan layout generation, scene layout synthesis, and generation of some other formats of building layouts. For each topic, we present an overview of the leading paradigms, categorized either by research domains (architecture or machine learning) or by user input conditions or constraints. We then introduce the commonly-adopted benchmark datasets that are used to verify the effectiveness of the methods, as well as the corresponding evaluation metrics. Finally, we identify the well-solved problems and limitations of existing approaches, then propose new perspectives as promising directions for future research in this important research area. A project associated with this survey to maintain the resources is available at awesome-building-layout-generation.

Title: GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models

Authors: Jixiao Zhang, Chunsheng Zuo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09696
Pdf URL: https://arxiv.org/pdf/2504.09696
Copy Paste: [[2504.09696]] GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models(https://arxiv.org/abs/2504.09696)
Keywords: robust
Abstract: Recent advances in R1-like reasoning models leveraging Group Relative Policy Optimization (GRPO) have significantly improved the performance of language models on mathematical reasoning tasks. However, current GRPO implementations encounter critical challenges, including reward sparsity due to binary accuracy metrics, limited incentives for conciseness, and insufficient focus on complex reasoning tasks. To address these issues, we propose GRPO-LEAD, a suite of novel enhancements tailored for mathematical reasoning. Specifically, GRPO-LEAD introduces (1) a length-dependent accuracy reward to encourage concise and precise solutions, (2) an explicit penalty mechanism for incorrect answers to sharpen decision boundaries, and (3) a difficulty-aware advantage reweighting strategy that amplifies learning signals for challenging problems. Furthermore, we systematically examine the impact of model scale and supervised fine-tuning (SFT) strategies, demonstrating that larger-scale base models and carefully curated datasets significantly enhance reinforcement learning effectiveness. Extensive empirical evaluations and ablation studies confirm that GRPO-LEAD substantially mitigates previous shortcomings, resulting in language models that produce more concise, accurate, and robust reasoning across diverse mathematical tasks.

Title: ToolTipNet: A Segmentation-Driven Deep Learning Baseline for Surgical Instrument Tip Detection

Authors: Zijian Wu, Shuojue Yang, Yueming Jin, Septimiu E Salcudean
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09700
Pdf URL: https://arxiv.org/pdf/2504.09700
Copy Paste: [[2504.09700]] ToolTipNet: A Segmentation-Driven Deep Learning Baseline for Surgical Instrument Tip Detection(https://arxiv.org/abs/2504.09700)
Keywords: segmentation
Abstract: In robot-assisted laparoscopic radical prostatectomy (RALP), the location of the instrument tip is important to register the ultrasound frame with the laparoscopic camera frame. A long-standing limitation is that the instrument tip position obtained from the da Vinci API is inaccurate and requires hand-eye calibration. Thus, directly computing the position of the tool tip in the camera frame using the vision-based method becomes an attractive solution. Besides, surgical instrument tip detection is the key component of other tasks, like surgical skill assessment and surgery automation. However, this task is challenging due to the small size of the tool tip and the articulation of the surgical instrument. Surgical instrument segmentation becomes relatively easy due to the emergence of the Segmentation Foundation Model, i.e., Segment Anything. Based on this advancement, we explore the deep learning-based surgical instrument tip detection approach that takes the part-level instrument segmentation mask as input. Comparison experiments with a hand-crafted image-processing approach demonstrate the superiority of the proposed method on simulated and real datasets.

Title: Transformer-Based Representation Learning for Robust Gene Expression Modeling and Cancer Prognosis

Authors: Shuai Jiang, Saeed Hassanpour
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09704
Pdf URL: https://arxiv.org/pdf/2504.09704
Copy Paste: [[2504.09704]] Transformer-Based Representation Learning for Robust Gene Expression Modeling and Cancer Prognosis(https://arxiv.org/abs/2504.09704)
Keywords: robust, interpretability, transformer
Abstract: Transformer-based models have achieved remarkable success in natural language and vision tasks, but their application to gene expression analysis remains limited due to data sparsity, high dimensionality, and missing values. We present GexBERT, a transformer-based autoencoder framework for robust representation learning of gene expression data. GexBERT learns context-aware gene embeddings by pretraining on large-scale transcriptomic profiles with a masking and restoration objective that captures co-expression relationships among thousands of genes. We evaluate GexBERT across three critical tasks in cancer research: pan-cancer classification, cancer-specific survival prediction, and missing value imputation. GexBERT achieves state-of-the-art classification accuracy from limited gene subsets, improves survival prediction by restoring expression of prognostic anchor genes, and outperforms conventional imputation methods under high missingness. Furthermore, its attention-based interpretability reveals biologically meaningful gene patterns across cancer types. These findings demonstrate the utility of GexBERT as a scalable and effective tool for gene expression modeling, with translational potential in settings where gene coverage is limited or incomplete.

Title: DUMP: Automated Distribution-Level Curriculum Learning for RL-based LLM Post-training

Authors: Zhenting Wang, Guofeng Cui, Kun Wan, Wentian Zhao
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2504.09710
Pdf URL: https://arxiv.org/pdf/2504.09710
Copy Paste: [[2504.09710]] DUMP: Automated Distribution-Level Curriculum Learning for RL-based LLM Post-training(https://arxiv.org/abs/2504.09710)
Keywords: large language model
Abstract: Recent advances in reinforcement learning (RL)-based post-training have led to notable improvements in large language models (LLMs), particularly in enhancing their reasoning capabilities to handle complex tasks. However, most existing methods treat the training data as a unified whole, overlooking the fact that modern LLM training often involves a mixture of data from diverse distributions-varying in both source and difficulty. This heterogeneity introduces a key challenge: how to adaptively schedule training across distributions to optimize learning efficiency. In this paper, we present a principled curriculum learning framework grounded in the notion of distribution-level learnability. Our core insight is that the magnitude of policy advantages reflects how much a model can still benefit from further training on a given distribution. Based on this, we propose a distribution-level curriculum learning framework for RL-based LLM post-training, which leverages the Upper Confidence Bound (UCB) principle to dynamically adjust sampling probabilities for different distrubutions. This approach prioritizes distributions with either high average advantage (exploitation) or low sample count (exploration), yielding an adaptive and theoretically grounded training schedule. We instantiate our curriculum learning framework with GRPO as the underlying RL algorithm and demonstrate its effectiveness on logic reasoning datasets with multiple difficulties and sources. Our experiments show that our framework significantly improves convergence speed and final performance, highlighting the value of distribution-aware curriculum strategies in LLM post-training. Code: this https URL.

Title: The Structural Safety Generalization Problem

Authors: Julius Broomfield, Tom Gibbs, Ethan Kosak-Hine, George Ingebretsen, Tia Nasir, Jason Zhang, Reihaneh Iranmanesh, Sara Pieri, Reihaneh Rabbany, Kellin Pelrine
Subjects: cs.CR, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2504.09712
Pdf URL: https://arxiv.org/pdf/2504.09712
Copy Paste: [[2504.09712]] The Structural Safety Generalization Problem(https://arxiv.org/abs/2504.09712)
Keywords: defense, attack, explainability
Abstract: LLM jailbreaks are a widespread safety challenge. Given this problem has not yet been tractable, we suggest targeting a key failure mechanism: the failure of safety to generalize across semantically equivalent inputs. We further focus the target by requiring desirable tractability properties of attacks to study: explainability, transferability between models, and transferability between goals. We perform red-teaming within this framework by uncovering new vulnerabilities to multi-turn, multi-image, and translation-based attacks. These attacks are semantically equivalent by our design to their single-turn, single-image, or untranslated counterparts, enabling systematic comparisons; we show that the different structures yield different safety outcomes. We then demonstrate the potential for this framework to enable new defenses by proposing a Structure Rewriting Guardrail, which converts an input to a structure more conducive to safety assessment. This guardrail significantly improves refusal of harmful inputs, without over-refusing benign ones. Thus, by framing this intermediate challenge - more tractable than universal defenses but essential for long-term safety - we highlight a critical milestone for AI safety research.

Title: Evaluating the Quality of Benchmark Datasets for Low-Resource Languages: A Case Study on Turkish

Authors: Ayşe Aysu Cengiz, Ahmet Kaan Sever, Elif Ecem Ümütlü, Naime Şeyma Erdem, Burak Aytan, Büşra Tufan, Abdullah Topraksoy, Esra Darıcı, Cagri Toraman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09714
Pdf URL: https://arxiv.org/pdf/2504.09714
Copy Paste: [[2504.09714]] Evaluating the Quality of Benchmark Datasets for Low-Resource Languages: A Case Study on Turkish(https://arxiv.org/abs/2504.09714)
Keywords: robust
Abstract: The reliance on translated or adapted datasets from English or multilingual resources introduces challenges regarding linguistic and cultural suitability. This study addresses the need for robust and culturally appropriate benchmarks by evaluating the quality of 17 commonly used Turkish benchmark datasets. Using a comprehensive framework that assesses six criteria, both human and LLM-judge annotators provide detailed evaluations to identify dataset strengths and shortcomings. Our results reveal that 70% of the benchmark datasets fail to meet our heuristic quality standards. The correctness of the usage of technical terms is the strongest criterion, but 85% of the criteria are not satisfied in the examined datasets. Although LLM judges demonstrate potential, they are less effective than human annotators, particularly in understanding cultural common sense knowledge and interpreting fluent, unambiguous text. GPT-4o has stronger labeling capabilities for grammatical and technical tasks, while Llama3.3-70B excels at correctness and cultural knowledge evaluation. Our findings emphasize the urgent need for more rigorous quality control in creating and adapting datasets for low-resource languages.

Title: Automatic Detection of Intro and Credits in Video using CLIP and Multihead Attention

Authors: Vasilii Korolkov, Andrey Yanchenko
Subjects: cs.CV, cs.AI, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2504.09738
Pdf URL: https://arxiv.org/pdf/2504.09738
Copy Paste: [[2504.09738]] Automatic Detection of Intro and Credits in Video using CLIP and Multihead Attention(https://arxiv.org/abs/2504.09738)
Keywords: segmentation
Abstract: Detecting transitions between intro/credits and main content in videos is a crucial task for content segmentation, indexing, and recommendation systems. Manual annotation of such transitions is labor-intensive and error-prone, while heuristic-based methods often fail to generalize across diverse video styles. In this work, we introduce a deep learning-based approach that formulates the problem as a sequence-to-sequence classification task, where each second of a video is labeled as either "intro" or "film." Our method extracts frames at a fixed rate of 1 FPS, encodes them using CLIP (Contrastive Language-Image Pretraining), and processes the resulting feature representations with a multihead attention model incorporating learned positional encoding. The system achieves an F1-score of 91.0%, Precision of 89.0%, and Recall of 97.0% on the test set, and is optimized for real-time inference, achieving 11.5 FPS on CPU and 107 FPS on high-end GPUs. This approach has practical applications in automated content indexing, highlight detection, and video summarization. Future work will explore multimodal learning, incorporating audio features and subtitles to further enhance detection accuracy.

Title: Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance

Authors: Ram Mohan Rao Kadiyala, Siddartha Pullakhandam, Siddhant Gupta, Drishti Sharma, Jebish Purbey, Kanwal Mehreen, Muhammad Arham, Hamza Farooq
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09753
Pdf URL: https://arxiv.org/pdf/2504.09753
Copy Paste: [[2504.09753]] Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance(https://arxiv.org/abs/2504.09753)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown remarkable capabilities, but their development has primarily focused on English and other high-resource languages, leaving many languages underserved. We present our latest Hindi-English bi-lingual LLM \textbf{Mantra-14B} with ~3\% average improvement in benchmark scores over both languages, outperforming models twice its size. Using a curated dataset composed of English and Hindi instruction data of 485K samples, we instruction tuned models such as Qwen-2.5-14B-Instruct and Phi-4 to improve performance over both English and Hindi. Our experiments encompassing seven different LLMs of varying parameter sizes and over 140 training attempts with varying English-Hindi training data ratios demonstrated that it is possible to significantly improve multilingual performance without compromising native performance. Further, our approach avoids resource-intensive techniques like vocabulary expansion or architectural modifications, thus keeping the model size small. Our results indicate that modest fine-tuning with culturally and locally informed data can bridge performance gaps without incurring significant computational overhead. We release our training code, datasets, and models under mit and apache licenses to aid further research towards under-represented and low-resource languages.

Title: Alleviating the Fear of Losing Alignment in LLM Fine-tuning

Authors: Kang Yang, Guanhong Tao, Xun Chen, Jun Xu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.09757
Pdf URL: https://arxiv.org/pdf/2504.09757
Copy Paste: [[2504.09757]] Alleviating the Fear of Losing Alignment in LLM Fine-tuning(https://arxiv.org/abs/2504.09757)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated revolutionary capabilities in understanding complex contexts and performing a wide range of tasks. However, LLMs can also answer questions that are unethical or harmful, raising concerns about their applications. To regulate LLMs' responses to such questions, a training strategy called \textit{alignment} can help. Yet, alignment can be unexpectedly compromised when fine-tuning an LLM for downstream tasks. This paper focuses on recovering the alignment lost during fine-tuning. We observe that there are two distinct directions inherent in an aligned LLM: the \textit{aligned direction} and the \textit{harmful direction}. An LLM is inclined to answer questions in the aligned direction while refusing queries in the harmful direction. Therefore, we propose to recover the harmful direction of the fine-tuned model that has been compromised. Specifically, we restore a small subset of the fine-tuned model's weight parameters from the original aligned model using gradient descent. We also introduce a rollback mechanism to avoid aggressive recovery and maintain downstream task performance. Our evaluation on 125 fine-tuned LLMs demonstrates that our method can reduce their harmful rate (percentage of answering harmful questions) from 33.25\% to 1.74\%, without sacrificing task performance much. In contrast, the existing methods either only reduce the harmful rate to a limited extent or significantly impact the normal functionality. Our code is available at this https URL

Title: Enhancing Classifier Evaluation: A Fairer Benchmarking Strategy Based on Ability and Robustness

Authors: Lucas Cardoso, Vitor Santos, José Ribeiro, Regiane Kawasaki, Ricardo Prudêncio, Ronnie Alves
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.09759
Pdf URL: https://arxiv.org/pdf/2504.09759
Copy Paste: [[2504.09759]] Enhancing Classifier Evaluation: A Fairer Benchmarking Strategy Based on Ability and Robustness(https://arxiv.org/abs/2504.09759)
Keywords: robust, fair
Abstract: Benchmarking is a fundamental practice in machine learning (ML) for comparing the performance of classification algorithms. However, traditional evaluation methods often overlook a critical aspect: the joint consideration of dataset complexity and an algorithm's ability to generalize. Without this dual perspective, assessments may favor models that perform well on easy instances while failing to capture their true robustness. To address this limitation, this study introduces a novel evaluation methodology that combines Item Response Theory (IRT) with the Glicko-2 rating system, originally developed to measure player strength in competitive games. IRT assesses classifier ability based on performance over difficult instances, while Glicko-2 updates performance metrics - such as rating, deviation, and volatility - via simulated tournaments between classifiers. This combined approach provides a fairer and more nuanced measure of algorithm capability. A case study using the OpenML-CC18 benchmark showed that only 15% of the datasets are truly challenging and that a reduced subset with 50% of the original datasets offers comparable evaluation power. Among the algorithms tested, Random Forest achieved the highest ability score. The results highlight the importance of improving benchmark design by focusing on dataset quality and adopting evaluation strategies that reflect both difficulty and classifier proficiency.

Title: Dynamical symmetries in the fluctuation-driven regime: an application of Noether's theorem to noisy dynamical systems

Authors: John J. Vastola
Subjects: cs.LG, cond-mat.stat-mech
Abstract URL: https://arxiv.org/abs/2504.09761
Pdf URL: https://arxiv.org/pdf/2504.09761
Copy Paste: [[2504.09761]] Dynamical symmetries in the fluctuation-driven regime: an application of Noether's theorem to noisy dynamical systems(https://arxiv.org/abs/2504.09761)
Keywords: fair, diffusion, generative
Abstract: Noether's theorem provides a powerful link between continuous symmetries and conserved quantities for systems governed by some variational principle. Perhaps unfortunately, most dynamical systems of interest in neuroscience and artificial intelligence cannot be described by any such principle. On the other hand, nonequilibrium physics provides a variational principle that describes how fairly generic noisy dynamical systems are most likely to transition between two states; in this work, we exploit this principle to apply Noether's theorem, and hence learn about how the continuous symmetries of dynamical systems constrain their most likely trajectories. We identify analogues of the conservation of energy, momentum, and angular momentum, and briefly discuss examples of each in the context of models of decision-making, recurrent neural networks, and diffusion generative models.

Title: Executable Functional Abstractions: Inferring Generative Programs for Advanced Math Problems

Authors: Zaid Khan, Elias Stengel-Eskin, Archiki Prasad, Jaemin Cho, Mohit Bansal
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.09763
Pdf URL: https://arxiv.org/pdf/2504.09763
Copy Paste: [[2504.09763]] Executable Functional Abstractions: Inferring Generative Programs for Advanced Math Problems(https://arxiv.org/abs/2504.09763)
Keywords: generative
Abstract: Scientists often infer abstract procedures from specific instances of problems and use the abstractions to generate new, related instances. For example, programs encoding the formal rules and properties of a system have been useful in fields ranging from RL (procedural environments) to physics (simulation engines). These programs can be seen as functions which execute to different outputs based on their parameterizations (e.g., gridworld configuration or initial physical conditions). We introduce the term EFA (Executable Functional Abstraction) to denote such programs for math problems. EFA-like constructs have been shown to be useful for math reasoning as problem generators for stress-testing models. However, prior work has been limited to abstractions for grade-school math (whose simple rules are easy to encode in programs), while generating EFAs for advanced math has thus far required human engineering. We explore the automatic construction of EFAs for advanced math problems. We operationalize the task of automatically constructing EFAs as a program synthesis task, and develop EFAGen, which conditions an LLM on a seed math problem and its step-by-step solution to generate candidate EFA programs that are faithful to the generalized problem and solution class underlying the seed problem. Furthermore, we formalize properties any valid EFA must possess in terms of executable unit tests, and show how the tests can be used as verifiable rewards to train LLMs to become better writers of EFAs. We demonstrate that EFAs constructed by EFAGen behave rationally by remaining faithful to seed problems, produce learnable problem variations, and that EFAGen can infer EFAs across multiple diverse sources of competition-level math problems. Finally, we show downstream uses of model-written EFAs e.g. finding problem variations that are harder or easier for a learner to solve, as well as data generation.

Title: Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding

Authors: Yuyang Ji, Haohan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09764
Pdf URL: https://arxiv.org/pdf/2504.09764
Copy Paste: [[2504.09764]] Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding(https://arxiv.org/abs/2504.09764)
Keywords: robust, large language model
Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable versatility but face challenges in demonstrating true visual understanding, particularly in chart reasoning tasks. Existing benchmarks like ChartQA reveal significant reliance on text-based shortcuts and probabilistic pattern-matching rather than genuine visual reasoning. To rigorously evaluate visual reasoning, we introduce a more challenging test scenario by removing textual labels and introducing chart perturbations in the ChartQA dataset. Under these conditions, models like GPT-4o and Gemini-2.0 Pro experience up to a 30% performance drop, underscoring their limitations. To address these challenges, we propose Socratic Chart, a new framework that transforms chart images into Scalable Vector Graphics (SVG) representations, enabling MLLMs to integrate textual and visual modalities for enhanced chart understanding. Socratic Chart employs a multi-agent pipeline with specialized agent-generators to extract primitive chart attributes (e.g., bar heights, line coordinates) and an agent-critic to validate results, ensuring high-fidelity symbolic representations. Our framework surpasses state-of-the-art models in accurately capturing chart primitives and improving reasoning performance, establishing a robust pathway for advancing MLLM visual understanding.

Title: An Investigation of Large Language Models and Their Vulnerabilities in Spam Detection

Authors: Qiyao Tang, Xiangyang Li
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.09776
Pdf URL: https://arxiv.org/pdf/2504.09776
Copy Paste: [[2504.09776]] An Investigation of Large Language Models and Their Vulnerabilities in Spam Detection(https://arxiv.org/abs/2504.09776)
Keywords: security, attack, large language model
Abstract: Spam messages continue to present significant challenges to digital users, cluttering inboxes and posing security risks. Traditional spam detection methods, including rules-based, collaborative, and machine learning approaches, struggle to keep up with the rapidly evolving tactics employed by spammers. This project studies new spam detection systems that leverage Large Language Models (LLMs) fine-tuned with spam datasets. More importantly, we want to understand how LLM-based spam detection systems perform under adversarial attacks that purposefully modify spam emails and data poisoning attacks that exploit the differences between the training data and the massages in detection, to which traditional machine learning models are shown to be vulnerable. This experimentation employs two LLM models of GPT2 and BERT and three spam datasets of Enron, LingSpam, and SMSspamCollection for extensive training and testing tasks. The results show that, while they can function as effective spam filters, the LLM models are susceptible to the adversarial and data poisoning attacks. This research provides very useful insights for future applications of LLM models for information security.

Title: Reasoning without Regret

Authors: Tarun Chitra
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09777
Pdf URL: https://arxiv.org/pdf/2504.09777
Copy Paste: [[2504.09777]] Reasoning without Regret(https://arxiv.org/abs/2504.09777)
Keywords: large language model
Abstract: Chain-of-thought reasoning enables large language models to solve multi-step tasks by framing problem solving as sequential decision problems. Outcome-based rewards, which provide feedback only on final answers, show impressive success, but face challenges with credit assignment and slow convergence. In contrast, procedure-based rewards offer efficient step-level feedback, but typically require costly human supervision. We introduce \emph{Backwards Adaptive Reward Shaping} (BARS), a no-regret framework that converts sparse outcomes-based rewards into effective procedure-based signals. BARS uses sparse rewards generated from terminal-state priors and cover trees to scale rewards while preventing exploitation. With Bellman contraction and $(\Delta, \epsilon)$-gap rewards, our backward Euler solver achieves $\epsilon$-accuracy in $O\left((R_{\max}/\Delta)\log(1/\epsilon)\right)$ iterations with $O(\log T)$ dynamic regret over $T$ rounds. Our analysis, based on generic chaining, continuous scaling limits, and non-linear Feynman-Kac bounds, connects recent outcome-based methods' empirical successes with the benefits of intermediate supervision. Combined, this provides the first rigorous no-regret algorithm for outcome reward shaping, providing a theoretical foundation for the empirical success of DeepSeek's R1.

Title: Reasoning Court: Combining Reasoning, Action, and Judgment for Multi-Hop Reasoning

Authors: Jingtian Wu, Claire Cardie
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09781
Pdf URL: https://arxiv.org/pdf/2504.09781
Copy Paste: [[2504.09781]] Reasoning Court: Combining Reasoning, Action, and Judgment for Multi-Hop Reasoning(https://arxiv.org/abs/2504.09781)
Keywords: large language model
Abstract: While large language models (LLMs) have demonstrated strong capabilities in tasks like question answering and fact verification, they continue to suffer from hallucinations and reasoning errors, especially in multi-hop tasks that require integration of multiple information sources. Current methods address these issues through retrieval-based techniques (grounding reasoning in external evidence), reasoning-based approaches (enhancing coherence via improved prompting), or hybrid strategies combining both elements. One prominent hybrid method, ReAct, has outperformed purely retrieval-based or reasoning-based approaches; however, it lacks internal verification of intermediate reasoning steps, allowing potential errors to propagate through complex reasoning tasks. In this paper, we introduce Reasoning Court (RC), a novel framework that extends iterative reasoning-and-retrieval methods, such as ReAct, with a dedicated LLM judge. Unlike ReAct, RC employs this judge to independently evaluate multiple candidate answers and their associated reasoning generated by separate LLM agents. The judge is asked to select the answer that it considers the most factually grounded and logically coherent based on the presented reasoning and evidence, or synthesizes a new answer using available evidence and its pre-trained knowledge if all candidates are inadequate, flawed, or invalid. Evaluations on multi-hop benchmarks (HotpotQA, MuSiQue) and fact-verification (FEVER) demonstrate that RC consistently outperforms state-of-the-art few-shot prompting methods without task-specific fine-tuning.

Title: EquiVDM: Equivariant Video Diffusion Models with Temporally Consistent Noise

Authors: Chao Liu, Arash Vahdat
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09789
Pdf URL: https://arxiv.org/pdf/2504.09789
Copy Paste: [[2504.09789]] EquiVDM: Equivariant Video Diffusion Models with Temporally Consistent Noise(https://arxiv.org/abs/2504.09789)
Keywords: diffusion
Abstract: Temporally consistent video-to-video generation is essential for applications of video diffusion models in areas such as sim-to-real, style-transfer, video upsampling, etc. In this paper, we propose a video diffusion framework that leverages temporally consistent noise to generate coherent video frames without specialized modules or additional constraints. We show that the standard training objective of diffusion models, when applied with temporally consistent noise, encourages the model to be equivariant to spatial transformations in input video and noise. This enables our model to better follow motion patterns from the input video, producing aligned motion and high-fidelity frames. Furthermore, we extend our approach to 3D-consistent video generation by attaching noise as textures on 3D meshes, ensuring 3D consistency in sim-to-real applications. Experimental results demonstrate that our method surpasses state-of-the-art baselines in motion alignment, 3D consistency, and video quality while requiring only a few sampling steps in practice.

Title: IGL-DT: Iterative Global-Local Feature Learning with Dual-Teacher Semantic Segmentation Framework under Limited Annotation Scheme

Authors: Dinh Dai Quan Tran, Hoang-Thien Nguyen. Thanh-Huy Nguyen, Gia-Van To, Tien-Huy Nguyen, Quan Nguyen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09797
Pdf URL: https://arxiv.org/pdf/2504.09797
Copy Paste: [[2504.09797]] IGL-DT: Iterative Global-Local Feature Learning with Dual-Teacher Semantic Segmentation Framework under Limited Annotation Scheme(https://arxiv.org/abs/2504.09797)
Keywords: extraction, segmentation
Abstract: Semi-Supervised Semantic Segmentation (SSSS) aims to improve segmentation accuracy by leveraging a small set of labeled images alongside a larger pool of unlabeled data. Recent advances primarily focus on pseudo-labeling, consistency regularization, and co-training strategies. However, existing methods struggle to balance global semantic representation with fine-grained local feature extraction. To address this challenge, we propose a novel tri-branch semi-supervised segmentation framework incorporating a dual-teacher strategy, named IGL-DT. Our approach employs SwinUnet for high-level semantic guidance through Global Context Learning and ResUnet for detailed feature refinement via Local Regional Learning. Additionally, a Discrepancy Learning mechanism mitigates over-reliance on a single teacher, promoting adaptive feature learning. Extensive experiments on benchmark datasets demonstrate that our method outperforms state-of-the-art approaches, achieving superior segmentation performance across various data regimes.

Title: Multi-task Federated Learning with Encoder-Decoder Structure: Enabling Collaborative Learning Across Different Tasks

Authors: Jingxuan Zhou, Weidong Bao, Ji Wang, Dayu Zhang, Xiongtao Zhang, Yaohong Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09800
Pdf URL: https://arxiv.org/pdf/2504.09800
Copy Paste: [[2504.09800]] Multi-task Federated Learning with Encoder-Decoder Structure: Enabling Collaborative Learning Across Different Tasks(https://arxiv.org/abs/2504.09800)
Keywords: security, federate
Abstract: Federated learning has been extensively studied and applied due to its ability to ensure data security in distributed environments while building better models. However, clients participating in federated learning still face limitations, as clients with different structures or tasks cannot participate in learning together. In view of this, constructing a federated learning framework that allows collaboration between clients with different model structures and performing different tasks, enabling them to share valuable knowledge to enhance model efficiency, holds significant practical implications for the widespread application of federated learning. To achieve this goal, we propose a multi-task federated learning with encoder-decoder structure (M-Fed). Specifically, given the widespread adoption of the encoder-decoder architecture in current models, we leverage this structure to share intra-task knowledge through traditional federated learning methods and extract general knowledge from the encoder to achieve cross-task knowledge sharing. The training process is similar to traditional federated learning, and we incorporate local decoder and global decoder information into the loss function. The local decoder iteratively updates and gradually approaches the global decoder until sufficient cross-task knowledge sharing is achieved. Our method is lightweight and modular, demonstrating innovation compared to previous research. It enables clients performing different tasks to share general knowledge while maintaining the efficiency of traditional federated learning systems. We conducted experiments on two widely used benchmark datasets to verify the feasibility of M-Fed and compared it with traditional methods. The experimental results demonstrate the effectiveness of M-Fed in multi-task federated learning.

Title: Training Small Reasoning LLMs with Cognitive Preference Alignment

Authors: Wenrui Cai, Chengyu Wang, Junbing Yan, Jun Huang, Xiangzhong Fang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09802
Pdf URL: https://arxiv.org/pdf/2504.09802
Copy Paste: [[2504.09802]] Training Small Reasoning LLMs with Cognitive Preference Alignment(https://arxiv.org/abs/2504.09802)
Keywords: large language model
Abstract: The reasoning capabilities of large language models (LLMs), such as OpenAI's o1 and DeepSeek-R1, have seen substantial advancements through deep thinking. However, these enhancements come with significant resource demands, underscoring the need to explore strategies to train effective reasoning LLMs with far fewer parameters. A critical challenge is that smaller models have different capacities and cognitive trajectories than their larger counterparts. Hence, direct distillation of chain-of-thought (CoT) results from large LLMs to smaller ones can be sometimes ineffective and requires a huge amount of annotated data. In this paper, we introduce a novel framework called Critique-Rethink-Verify (CRV), designed for training smaller yet powerful reasoning LLMs. Our CRV framework consists of multiple LLM agents, each specializing in unique abilities: (i) critiquing the CoTs according to the cognitive capabilities of smaller models, (ii) rethinking and refining these CoTs based on the critiques, and (iii) verifying the correctness of the refined results. We further propose the cognitive preference optimization (CogPO) algorithm to enhance the reasoning abilities of smaller models by aligning thoughts of these models with their cognitive capacities. Comprehensive evaluations on challenging reasoning benchmarks demonstrate the efficacy of CRV and CogPO, which outperforms other training methods by a large margin.

Title: Efficient Multi-Task Modeling through Automated Fusion of Trained Models

Authors: Jingxuan Zhou, Weidong Bao, Ji Wang, Zhengyi Zhong, Dayu Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09812
Pdf URL: https://arxiv.org/pdf/2504.09812
Copy Paste: [[2504.09812]] Efficient Multi-Task Modeling through Automated Fusion of Trained Models(https://arxiv.org/abs/2504.09812)
Keywords: transformer
Abstract: Although multi-task learning is widely applied in intelligent services, traditional multi-task modeling methods often require customized designs based on specific task combinations, resulting in a cumbersome modeling process. Inspired by the rapid development and excellent performance of single-task models, this paper proposes an efficient multi-task modeling method that can automatically fuse trained single-task models with different structures and tasks to form a multi-task model. As a general framework, this method allows modelers to simply prepare trained models for the required tasks, simplifying the modeling process while fully utilizing the knowledge contained in the trained models. This eliminates the need for excessive focus on task relationships and model structure design. To achieve this goal, we consider the structural differences among various trained models and employ model decomposition techniques to hierarchically decompose them into multiple operable model components. Furthermore, we have designed an Adaptive Knowledge Fusion (AKF) module based on Transformer, which adaptively integrates intra-task and inter-task knowledge based on model components. Through the proposed method, we achieve efficient and automated construction of multi-task models, and its effectiveness is verified through extensive experiments on three datasets.

Title: DUDA: Distilled Unsupervised Domain Adaptation for Lightweight Semantic Segmentation

Authors: Beomseok Kang, Niluthpol Chowdhury Mithun, Abhinav Rajvanshi, Han-Pang Chiu, Supun Samarasekera
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09814
Pdf URL: https://arxiv.org/pdf/2504.09814
Copy Paste: [[2504.09814]] DUDA: Distilled Unsupervised Domain Adaptation for Lightweight Semantic Segmentation(https://arxiv.org/abs/2504.09814)
Keywords: segmentation
Abstract: Unsupervised Domain Adaptation (UDA) is essential for enabling semantic segmentation in new domains without requiring costly pixel-wise annotations. State-of-the-art (SOTA) UDA methods primarily use self-training with architecturally identical teacher and student networks, relying on Exponential Moving Average (EMA) updates. However, these approaches face substantial performance degradation with lightweight models due to inherent architectural inflexibility leading to low-quality pseudo-labels. To address this, we propose Distilled Unsupervised Domain Adaptation (DUDA), a novel framework that combines EMA-based self-training with knowledge distillation (KD). Our method employs an auxiliary student network to bridge the architectural gap between heavyweight and lightweight models for EMA-based updates, resulting in improved pseudo-label quality. DUDA employs a strategic fusion of UDA and KD, incorporating innovative elements such as gradual distillation from large to small networks, inconsistency loss prioritizing poorly adapted classes, and learning with multiple teachers. Extensive experiments across four UDA benchmarks demonstrate DUDA's superiority in achieving SOTA performance with lightweight models, often surpassing the performance of heavyweight models from other approaches.

Title: Transferable text data distillation by trajectory matching

Authors: Rong Yao, Hailin Hu, Yifei Fu, Hanting Chen, Wenyi Fang, Fanyi Du, Kai Han, Yunhe Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09818
Pdf URL: https://arxiv.org/pdf/2504.09818
Copy Paste: [[2504.09818]] Transferable text data distillation by trajectory matching(https://arxiv.org/abs/2504.09818)
Keywords: robust, large language model
Abstract: In the realm of large language model (LLM), as the size of large models increases, it also brings higher training costs. There is a urgent need to minimize the data size in LLM training. Compared with data selection method, the data distillation method aims to synthesize a small number of data samples to achieve the training effect of the full data set and has better flexibility. Despite its successes in computer vision, the discreteness of text data has hitherto stymied its exploration in natural language processing (NLP). In this work, we proposed a method that involves learning pseudo prompt data based on trajectory matching and finding its nearest neighbor ID to achieve cross-architecture transfer. During the distillation process, we introduce a regularization loss to improve the robustness of our distilled data. To our best knowledge, this is the first data distillation work suitable for text generation tasks such as instruction tuning. Evaluations on two benchmarks, including ARC-Easy and MMLU instruction tuning datasets, established the superiority of our distillation approach over the SOTA data selection method LESS. Furthermore, our method demonstrates a good transferability over LLM structures (i.e., OPT to Llama).

Title: Density-based Object Detection in Crowded Scenes

Authors: Chenyang Zhao, Jia Wan, Antoni B. Chan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09819
Pdf URL: https://arxiv.org/pdf/2504.09819
Copy Paste: [[2504.09819]] Density-based Object Detection in Crowded Scenes(https://arxiv.org/abs/2504.09819)
Keywords: robust
Abstract: Compared with the generic scenes, crowded scenes contain highly-overlapped instances, which result in: 1) more ambiguous anchors during training of object detectors, and 2) more predictions are likely to be mistakenly suppressed in post-processing during inference. To address these problems, we propose two new strategies, density-guided anchors (DGA) and density-guided NMS (DG-NMS), which uses object density maps to jointly compute optimal anchor assignments and reweighing, as well as an adaptive NMS. Concretely, based on an unbalanced optimal transport (UOT) problem, the density owned by each ground-truth object is transported to each anchor position at a minimal transport cost. And density on anchors comprises an instance-specific density distribution, from which DGA decodes the optimal anchor assignment and re-weighting strategy. Meanwhile, DG-NMS utilizes the predicted density map to adaptively adjust the NMS threshold to reduce mistaken suppressions. In the UOT, a novel overlap-aware transport cost is specifically designed for ambiguous anchors caused by overlapped neighboring objects. Extensive experiments on the challenging CrowdHuman dataset with Citypersons dataset demonstrate that our proposed density-guided detector is effective and robust to crowdedness. The code and pre-trained models will be made available later.

Title: StruPhantom: Evolutionary Injection Attacks on Black-Box Tabular Agents Powered by Large Language Models

Authors: Yang Feng, Xudong Pan
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09841
Pdf URL: https://arxiv.org/pdf/2504.09841
Copy Paste: [[2504.09841]] StruPhantom: Evolutionary Injection Attacks on Black-Box Tabular Agents Powered by Large Language Models(https://arxiv.org/abs/2504.09841)
Keywords: attack, large language model
Abstract: The proliferation of autonomous agents powered by large language models (LLMs) has revolutionized popular business applications dealing with tabular data, i.e., tabular agents. Although LLMs are observed to be vulnerable against prompt injection attacks from external data sources, tabular agents impose strict data formats and predefined rules on the attacker's payload, which are ineffective unless the agent navigates multiple layers of structural data to incorporate the payload. To address the challenge, we present a novel attack termed StruPhantom which specifically targets black-box LLM-powered tabular agents. Our attack designs an evolutionary optimization procedure which continually refines attack payloads via the proposed constrained Monte Carlo Tree Search augmented by an off-topic evaluator. StruPhantom helps systematically explore and exploit the weaknesses of target applications to achieve goal hijacking. Our evaluation validates the effectiveness of StruPhantom across various LLM-based agents, including those on real-world platforms, and attack scenarios. Our attack achieves over 50% higher success rates than baselines in enforcing the application's response to contain phishing links or malicious codes.

Title: Accelerating Differentially Private Federated Learning via Adaptive Extrapolation

Authors: Shokichi Takakura, Seng Pei Liew, Satoshi Hasegawa
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.09850
Pdf URL: https://arxiv.org/pdf/2504.09850
Copy Paste: [[2504.09850]] Accelerating Differentially Private Federated Learning via Adaptive Extrapolation(https://arxiv.org/abs/2504.09850)
Keywords: privacy, attack, federate
Abstract: The federated learning (FL) framework enables multiple clients to collaboratively train machine learning models without sharing their raw data, but it remains vulnerable to privacy attacks. One promising approach is to incorporate differential privacy (DP)-a formal notion of privacy-into the FL framework. DP-FedAvg is one of the most popular algorithms for DP-FL, but it is known to suffer from the slow convergence in the presence of heterogeneity among clients' data. Most of the existing methods to accelerate DP-FL require 1) additional hyperparameters or 2) additional computational cost for clients, which is not desirable since 1) hyperparameter tuning is computationally expensive and data-dependent choice of hyperparameters raises the risk of privacy leakage, and 2) clients are often resource-constrained. To address this issue, we propose DP-FedEXP, which adaptively selects the global step size based on the diversity of the local updates without requiring any additional hyperparameters or client computational cost. We show that DP-FedEXP provably accelerates the convergence of DP-FedAvg and it empirically outperforms existing methods tailored for DP-FL.

Title: GFT: Gradient Focal Transformer

Authors: Boris Kriuk, Simranjit Kaur Gill, Shoaib Aslam, Amir Fakhrutdinov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09852
Pdf URL: https://arxiv.org/pdf/2504.09852
Copy Paste: [[2504.09852]] GFT: Gradient Focal Transformer(https://arxiv.org/abs/2504.09852)
Keywords: extraction, transformer
Abstract: Fine-Grained Image Classification (FGIC) remains a complex task in computer vision, as it requires models to distinguish between categories with subtle localized visual differences. Well-studied CNN-based models, while strong in local feature extraction, often fail to capture the global context required for fine-grained recognition, while more recent ViT-backboned models address FGIC with attention-driven mechanisms but lack the ability to adaptively focus on truly discriminative regions. TransFG and other ViT-based extensions introduced part-aware token selection to enhance attention localization, yet they still struggle with computational efficiency, attention region selection flexibility, and detail-focus narrative in complex environments. This paper introduces GFT (Gradient Focal Transformer), a new ViT-derived framework created for FGIC tasks. GFT integrates the Gradient Attention Learning Alignment (GALA) mechanism to dynamically prioritize class-discriminative features by analyzing attention gradient flow. Coupled with a Progressive Patch Selection (PPS) strategy, the model progressively filters out less informative regions, reducing computational overhead while enhancing sensitivity to fine details. GFT achieves SOTA accuracy on FGVC Aircraft, Food-101, and COCO datasets with 93M parameters, outperforming ViT-based advanced FGIC models in efficiency. By bridging global context and localized detail extraction, GFT sets a new benchmark in fine-grained recognition, offering interpretable solutions for real-world deployment scenarios.

Title: RadarLLM: Empowering Large Language Models to Understand Human Motion from Millimeter-wave Point Cloud Sequence

Authors: Zengyuan Lai, Jiarui Yang, Songpengcheng Xia, Lizhou Lin, Lan Sun, Renwen Wang, Jianran Liu, Qi Wu, Ling Pei
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.09862
Pdf URL: https://arxiv.org/pdf/2504.09862
Copy Paste: [[2504.09862]] RadarLLM: Empowering Large Language Models to Understand Human Motion from Millimeter-wave Point Cloud Sequence(https://arxiv.org/abs/2504.09862)
Keywords: privacy, large language model
Abstract: Millimeter-wave radar provides a privacy-preserving solution for human motion analysis, yet its sparse point clouds pose significant challenges for semantic understanding. We present Radar-LLM, the first framework that leverages large language models (LLMs) for human motion understanding using millimeter-wave radar as the sensing modality. Our approach introduces two key innovations: (1) a motion-guided radar tokenizer based on our Aggregate VQ-VAE architecture that incorporates deformable body templates and masked trajectory modeling to encode spatiotemporal point clouds into compact semantic tokens, and (2) a radar-aware language model that establishes cross-modal alignment between radar and text in a shared embedding space. To address data scarcity, we introduce a physics-aware synthesis pipeline that generates realistic radar-text pairs from motion-text datasets. Extensive experiments demonstrate that Radar-LLM achieves state-of-the-art performance across both synthetic and real-world benchmarks, enabling accurate translation of millimeter-wave signals to natural language descriptions. This breakthrough facilitates comprehensive motion understanding in privacy-sensitive applications like healthcare and smart homes. We will release the full implementation to support further research on this https URL.

Title: HDC: Hierarchical Distillation for Multi-level Noisy Consistency in Semi-Supervised Fetal Ultrasound Segmentation

Authors: Tran Quoc Khanh Le, Nguyen Lan Vi Vu, Ha-Hieu Pham, Xuan-Loc Huynh, Tien-Huy Nguyen, Minh Huu Nhat Le, Quan Nguyen, Hien D. Nguyen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09876
Pdf URL: https://arxiv.org/pdf/2504.09876
Copy Paste: [[2504.09876]] HDC: Hierarchical Distillation for Multi-level Noisy Consistency in Semi-Supervised Fetal Ultrasound Segmentation(https://arxiv.org/abs/2504.09876)
Keywords: segmentation
Abstract: Transvaginal ultrasound is a critical imaging modality for evaluating cervical anatomy and detecting physiological changes. However, accurate segmentation of cervical structures remains challenging due to low contrast, shadow artifacts, and fuzzy boundaries. While convolutional neural networks (CNNs) have shown promising results in medical image segmentation, their performance is often limited by the need for large-scale annotated datasets - an impractical requirement in clinical ultrasound imaging. Semi-supervised learning (SSL) offers a compelling solution by leveraging unlabeled data, but existing teacher-student frameworks often suffer from confirmation bias and high computational costs. We propose HDC, a novel semi-supervised segmentation framework that integrates Hierarchical Distillation and Consistency learning within a multi-level noise mean-teacher framework. Unlike conventional approaches that rely solely on pseudo-labeling, we introduce a hierarchical distillation mechanism that guides feature-level learning via two novel objectives: (1) Correlation Guidance Loss to align feature representations between the teacher and main student branch, and (2) Mutual Information Loss to stabilize representations between the main and noisy student branches. Our framework reduces model complexity while improving generalization. Extensive experiments on two fetal ultrasound datasets, FUGC and PSFH, demonstrate that our method achieves competitive performance with significantly lower computational overhead than existing multi-teacher models.

Title: Revisiting the attacker's knowledge in inference attacks against Searchable Symmetric Encryption

Authors: Marc Damie, Jean-Benoist Leger, Florian Hahn, Andreas Peter
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.09879
Pdf URL: https://arxiv.org/pdf/2504.09879
Copy Paste: [[2504.09879]] Revisiting the attacker's knowledge in inference attacks against Searchable Symmetric Encryption(https://arxiv.org/abs/2504.09879)
Keywords: security, privacy, attack
Abstract: Encrypted search schemes have been proposed to address growing privacy concerns. However, several leakage-abuse attacks have highlighted some security vulnerabilities. Recent attacks assumed an attacker's knowledge containing data ``similar'' to the indexed data. However, this vague assumption is barely discussed in literature: how likely is it for an attacker to obtain a "similar enough" data? Our paper provides novel statistical tools usable on any attack in this setting to analyze its sensitivity to data similarity. First, we introduce a mathematical model based on statistical estimators to analytically understand the attackers' knowledge and the notion of similarity. Second, we conceive statistical tools to model the influence of the similarity on the attack accuracy. We apply our tools on three existing attacks to answer questions such as: is similarity the only factor influencing accuracy of a given attack? Third, we show that the enforcement of a maximum index size can make the ``similar-data'' assumption harder to satisfy. In particular, we propose a statistical method to estimate an appropriate maximum size for a given attack and dataset. For the best known attack on the Enron dataset, a maximum index size of 200 guarantees (with high probability) the attack accuracy to be below 5%.

Title: Focus on Local: Finding Reliable Discriminative Regions for Visual Place Recognition

Authors: Changwei Wang, Shunpeng Chen, Yukun Song, Rongtao Xu, Zherui Zhang, Jiguang Zhang, Haoran Yang, Yu Zhang, Kexue Fu, Shide Du, Zhiwei Xu, Longxiang Gao, Li Guo, Shibiao Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09881
Pdf URL: https://arxiv.org/pdf/2504.09881
Copy Paste: [[2504.09881]] Focus on Local: Finding Reliable Discriminative Regions for Visual Place Recognition(https://arxiv.org/abs/2504.09881)
Keywords: extraction
Abstract: Visual Place Recognition (VPR) is aimed at predicting the location of a query image by referencing a database of geotagged images. For VPR task, often fewer discriminative local regions in an image produce important effects while mundane background regions do not contribute or even cause perceptual aliasing because of easy overlap. However, existing methods lack precisely modeling and full exploitation of these discriminative regions. In this paper, we propose the Focus on Local (FoL) approach to stimulate the performance of image retrieval and re-ranking in VPR simultaneously by mining and exploiting reliable discriminative local regions in images and introducing pseudo-correlation supervision. First, we design two losses, Extraction-Aggregation Spatial Alignment Loss (SAL) and Foreground-Background Contrast Enhancement Loss (CEL), to explicitly model reliable discriminative local regions and use them to guide the generation of global representations and efficient re-ranking. Second, we introduce a weakly-supervised local feature training strategy based on pseudo-correspondences obtained from aggregating global features to alleviate the lack of local correspondences ground truth for the VPR task. Third, we suggest an efficient re-ranking pipeline that is efficiently and precisely based on discriminative region guidance. Finally, experimental results show that our FoL achieves the state-of-the-art on multiple VPR benchmarks in both image retrieval and re-ranking stages and also significantly outperforms existing two-stage VPR methods in terms of computational efficiency. Code and models are available at this https URL

Title: Investigating Syntactic Biases in Multilingual Transformers with RC Attachment Ambiguities in Italian and English

Authors: Michael Kamerath, Aniello De Santo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09886
Pdf URL: https://arxiv.org/pdf/2504.09886
Copy Paste: [[2504.09886]] Investigating Syntactic Biases in Multilingual Transformers with RC Attachment Ambiguities in Italian and English(https://arxiv.org/abs/2504.09886)
Keywords: transformer
Abstract: This paper leverages past sentence processing studies to investigate whether monolingual and multilingual LLMs show human-like preferences when presented with examples of relative clause attachment ambiguities in Italian and English. Furthermore, we test whether these preferences can be modulated by lexical factors (the type of verb/noun in the matrix clause) which have been shown to be tied to subtle constraints on syntactic and semantic relations. Our results overall showcase how LLM behavior varies interestingly across models, but also general failings of these models in correctly capturing human-like preferences. In light of these results, we argue that RC attachment is the ideal benchmark for cross-linguistic investigations of LLMs' linguistic knowledge and biases.

Title: Enhanced Semantic Extraction and Guidance for UGC Image Super Resolution

Authors: Yiwen Wang, Ying Liang, Yuxuan Zhang, Xinning Chai, Zhengxue Cheng, Yingsheng Qin, Yucai Yang, Rong Xie, Li Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09887
Pdf URL: https://arxiv.org/pdf/2504.09887
Copy Paste: [[2504.09887]] Enhanced Semantic Extraction and Guidance for UGC Image Super Resolution(https://arxiv.org/abs/2504.09887)
Keywords: robust, extraction, diffusion
Abstract: Due to the disparity between real-world degradations in user-generated content(UGC) images and synthetic degradations, traditional super-resolution methods struggle to generalize effectively, necessitating a more robust approach to model real-world distortions. In this paper, we propose a novel approach to UGC image super-resolution by integrating semantic guidance into a diffusion framework. Our method addresses the inconsistency between degradations in wild and synthetic datasets by separately simulating the degradation processes on the LSDIR dataset and combining them with the official paired training set. Furthermore, we enhance degradation removal and detail generation by incorporating a pretrained semantic extraction model (SAM2) and fine-tuning key hyperparameters for improved perceptual fidelity. Extensive experiments demonstrate the superiority of our approach against state-of-the-art methods. Additionally, the proposed model won second place in the CVPR NTIRE 2025 Short-form UGC Image Super-Resolution Challenge, further validating its effectiveness. The code is available at https://github.c10pom/Moonsofang/NTIRE-2025-SRlab.

Title: Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data

Authors: Shuai Zhao, Linchao Zhu, Yi Yang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.09895
Pdf URL: https://arxiv.org/pdf/2504.09895
Copy Paste: [[2504.09895]] Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data(https://arxiv.org/abs/2504.09895)
Keywords: large language model
Abstract: Large language models~(LLMs) are expected to be helpful, harmless, and honest. In various alignment scenarios, such as general human preference, safety, and confidence alignment, binary preference data collection and reward modeling are resource-intensive but necessary for human preference transferring. In this work, we explore using the similarity between sampled generations and high-quality reference answers as an alternative reward function for LLM alignment. Using similarity as a reward circumvents training reward models, and collecting a single reference answer potentially costs less time than constructing binary preference pairs when multiple candidates are available. Specifically, we develop \textit{RefAlign}, a versatile REINFORCE-style alignment algorithm, which is free of reference and reward models. Instead, RefAlign utilizes BERTScore between sampled generations and high-quality reference answers as the surrogate reward. Beyond general human preference optimization, RefAlign can be readily extended to diverse scenarios, such as safety and confidence alignment, by incorporating the similarity reward with task-related objectives. In various scenarios, {RefAlign} demonstrates comparable performance to previous alignment methods while offering high efficiency.

Title: TWSSenti: A Novel Hybrid Framework for Topic-Wise Sentiment Analysis on Social Media Using Transformer Models

Authors: Aish Albladi, Md Kaosar Uddin, Minarul Islam, Cheryl Seals
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09896
Pdf URL: https://arxiv.org/pdf/2504.09896
Copy Paste: [[2504.09896]] TWSSenti: A Novel Hybrid Framework for Topic-Wise Sentiment Analysis on Social Media Using Transformer Models(https://arxiv.org/abs/2504.09896)
Keywords: robust, extraction, transformer, generative
Abstract: Sentiment analysis is a crucial task in natural language processing (NLP) that enables the extraction of meaningful insights from textual data, particularly from dynamic platforms like Twitter and IMDB. This study explores a hybrid framework combining transformer-based models, specifically BERT, GPT-2, RoBERTa, XLNet, and DistilBERT, to improve sentiment classification accuracy and robustness. The framework addresses challenges such as noisy data, contextual ambiguity, and generalization across diverse datasets by leveraging the unique strengths of these models. BERT captures bidirectional context, GPT-2 enhances generative capabilities, RoBERTa optimizes contextual understanding with larger corpora and dynamic masking, XLNet models dependency through permutation-based learning, and DistilBERT offers efficiency with reduced computational overhead while maintaining high accuracy. We demonstrate text cleaning, tokenization, and feature extraction using Term Frequency Inverse Document Frequency (TF-IDF) and Bag of Words (BoW), ensure high-quality input data for the models. The hybrid approach was evaluated on benchmark datasets Sentiment140 and IMDB, achieving superior accuracy rates of 94\% and 95\%, respectively, outperforming standalone models. The results validate the effectiveness of combining multiple transformer models in ensemble-like setups to address the limitations of individual architectures. This research highlights its applicability to real-world tasks such as social media monitoring, customer sentiment analysis, and public opinion tracking which offers a pathway for future advancements in hybrid NLP frameworks.

Title: TAMP: Token-Adaptive Layerwise Pruning in Multimodal Large Language Models

Authors: Jaewoo Lee, Keyang Xuan, Chanakya Ekbote, Sandeep Polisetty, Yi R. (May)Fung, Paul Pu Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09897
Pdf URL: https://arxiv.org/pdf/2504.09897
Copy Paste: [[2504.09897]] TAMP: Token-Adaptive Layerwise Pruning in Multimodal Large Language Models(https://arxiv.org/abs/2504.09897)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable versatility in understanding diverse multimodal data and tasks. However, these capabilities come with an increased model scale. While post-training pruning reduces model size in unimodal models, its application to MLLMs often yields limited success. Our analysis discovers that conventional methods fail to account for the unique token attributes across layers and modalities inherent to MLLMs. Inspired by this observation, we propose TAMP, a simple yet effective pruning framework tailored for MLLMs, featuring two key components: (1) Diversity-Aware Sparsity, which adjusts sparsity ratio per layer based on diversities among multimodal output tokens, preserving more parameters in high-diversity layers; and (2) Adaptive Multimodal Input Activation, which identifies representative multimodal input tokens using attention scores to guide unstructured weight pruning. We validate our method on two state-of-the-art MLLMs: LLaVA-NeXT, designed for vision-language tasks, and VideoLLaMA2, capable of processing audio, visual, and language modalities. Empirical experiments across various multimodal evaluation benchmarks demonstrate that each component of our approach substantially outperforms existing pruning techniques.

Title: Refining Financial Consumer Complaints through Multi-Scale Model Interaction

Authors: Bo-Wei Chen, An-Zi Yen, Chung-Chi Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09903
Pdf URL: https://arxiv.org/pdf/2504.09903
Copy Paste: [[2504.09903]] Refining Financial Consumer Complaints through Multi-Scale Model Interaction(https://arxiv.org/abs/2504.09903)
Keywords: robust, large language model
Abstract: Legal writing demands clarity, formality, and domain-specific precision-qualities often lacking in documents authored by individuals without legal training. To bridge this gap, this paper explores the task of legal text refinement that transforms informal, conversational inputs into persuasive legal arguments. We introduce FinDR, a Chinese dataset of financial dispute records, annotated with official judgments on claim reasonableness. Our proposed method, Multi-Scale Model Interaction (MSMI), leverages a lightweight classifier to evaluate outputs and guide iterative refinement by Large Language Models (LLMs). Experimental results demonstrate that MSMI significantly outperforms single-pass prompting strategies. Additionally, we validate the generalizability of MSMI on several short-text benchmarks, showing improved adversarial robustness. Our findings reveal the potential of multi-model collaboration for enhancing legal document generation and broader text refinement tasks.

Title: Learning to Erase Private Knowledge from Multi-Documents for Retrieval-Augmented Large Language Models

Authors: Yujing Wang, Hainan Zhang, Liang Pang, Yongxin Tong, Binghui Guo, Hongwei Zheng, Zhiming Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09910
Pdf URL: https://arxiv.org/pdf/2504.09910
Copy Paste: [[2504.09910]] Learning to Erase Private Knowledge from Multi-Documents for Retrieval-Augmented Large Language Models(https://arxiv.org/abs/2504.09910)
Keywords: privacy, attack, generative, large language model
Abstract: Retrieval-Augmented Generation (RAG) is a promising technique for applying LLMs to proprietary domains. However, retrieved documents may contain sensitive knowledge, posing risks of privacy leakage in generative results. Thus, effectively erasing private information from retrieved documents is a key challenge for RAG. Unlike traditional text anonymization, RAG should consider: (1) the inherent multi-document reasoning may face de-anonymization attacks; (2) private knowledge varies by scenarios, so users should be allowed to customize which information to erase; (3) preserving sufficient publicly available knowledge for generation tasks. This paper introduces the privacy erasure task for RAG and proposes Eraser4RAG, a private knowledge eraser which effectively removes user-defined private knowledge from documents while preserving sufficient public knowledge for generation. Specifically, we first construct a global knowledge graph to identify potential knowledge across documents, aiming to defend against de-anonymization attacks. Then we randomly split it into private and public sub-graphs, and fine-tune Flan-T5 to rewrite the retrieved documents excluding private triples. Finally, PPO algorithm optimizes the rewriting model to minimize private triples and maximize public triples retention. Experiments on four QA datasets demonstrate that Eraser4RAG achieves superior erase performance than GPT-4o.

Title: Guiding Reasoning in Small Language Models with LLM Assistance

Authors: Yujin Kim, Euiin Yi, Minu Kim, Se-Young Yun, Taehyeon Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09923
Pdf URL: https://arxiv.org/pdf/2504.09923
Copy Paste: [[2504.09923]] Guiding Reasoning in Small Language Models with LLM Assistance(https://arxiv.org/abs/2504.09923)
Keywords: large language model
Abstract: The limited reasoning capabilities of small language models (SLMs) cast doubt on their suitability for tasks demanding deep, multi-step logical deduction. This paper introduces a framework called Small Reasons, Large Hints (SMART), which selectively augments SLM reasoning with targeted guidance from large language models (LLMs). Inspired by the concept of cognitive scaffolding, SMART employs a score-based evaluation to identify uncertain reasoning steps and injects corrective LLM-generated reasoning only when necessary. By framing structured reasoning as an optimal policy search, our approach steers the reasoning trajectory toward correct solutions without exhaustive sampling. Our experiments on mathematical reasoning datasets demonstrate that targeted external scaffolding significantly improves performance, paving the way for collaborative use of both SLM and LLM to tackle complex reasoning tasks that are currently unsolvable by SLMs alone.

Title: FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

Authors: Zheng Liu, Mengjie Liu, Jingzhou Chen, Jingwei Xu, Bin Cui, Conghui He, Wentao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09925
Pdf URL: https://arxiv.org/pdf/2504.09925
Copy Paste: [[2504.09925]] FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding(https://arxiv.org/abs/2504.09925)
Keywords: large language model
Abstract: We introduce FUSION, a family of multimodal large language models (MLLMs) with a fully vision-language alignment and integration paradigm. Unlike existing methods that primarily rely on late-stage modality interaction during LLM decoding, our approach achieves deep, dynamic integration throughout the entire processing pipeline. To this end, we propose Text-Guided Unified Vision Encoding, incorporating textual information in vision encoding to achieve pixel-level integration. We further design Context-Aware Recursive Alignment Decoding that recursively aggregates visual features conditioned on textual context during decoding, enabling fine-grained, question-level semantic integration. To guide feature mapping and mitigate modality discrepancies, we develop Dual-Supervised Semantic Mapping Loss. Additionally, we construct a Synthesized Language-Driven Question-Answer (QA) dataset through a new data synthesis method, prioritizing high-quality QA pairs to optimize text-guided feature integration. Building on these foundations, we train FUSION at two scales-3B, 8B-and demonstrate that our full-modality integration approach significantly outperforms existing methods with only 630 vision tokens. Notably, FUSION 3B surpasses Cambrian-1 8B and Florence-VL 8B on most benchmarks. FUSION 3B continues to outperform Cambrian-1 8B even when limited to 300 vision tokens. Our ablation studies show that FUSION outperforms LLaVA-NeXT on over half of the benchmarks under same configuration without dynamic resolution, highlighting the effectiveness of our approach. We release our code, model weights, and dataset. this https URL

Title: KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference

Authors: Yuxuan Tian, Zihan Wang, Yebo Peng, Aomufei Yuan, Zhiming Wang, Bairen Yi, Xin Liu, Yong Cui, Tong Yang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2504.09936
Pdf URL: https://arxiv.org/pdf/2504.09936
Copy Paste: [[2504.09936]] KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference(https://arxiv.org/abs/2504.09936)
Keywords: large language model
Abstract: Efficient inference of large language models (LLMs) is hindered by an ever-growing key-value (KV) cache, making KV cache compression a critical research direction. Traditional methods selectively evict less important KV cache entries based on attention scores or position heuristics, which leads to information loss and hallucinations. Recently, merging-based strategies have been explored to retain more information by merging KV pairs that would be discarded; however, these existing approaches inevitably introduce inconsistencies in attention distributions before and after merging, causing output perturbation and degraded generation quality. To overcome this challenge, we propose KeepKV, a novel adaptive KV cache merging method designed to eliminate output perturbation while preserving performance under strict memory constraints. KeepKV introduces the Electoral Votes mechanism that records merging history and adaptively adjusts attention scores. Moreover, it further leverages a novel Zero Inference-Perturbation Merging methods, keeping attention consistency and compensating for attention loss resulting from cache merging. KeepKV successfully retains essential context information within a significantly compressed cache. Extensive experiments on various benchmarks and LLM architectures demonstrate that KeepKV substantially reduces memory usage, enhances inference throughput by more than 2x and keeps superior generation quality even with 10% KV cache budgets.

Title: FedRecon: Missing Modality Reconstruction in Distributed Heterogeneous Environments

Authors: Junming Liu, Guosun Zeng, Ding Wang, Yanting Gao, Yufei Jin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09941
Pdf URL: https://arxiv.org/pdf/2504.09941
Copy Paste: [[2504.09941]] FedRecon: Missing Modality Reconstruction in Distributed Heterogeneous Environments(https://arxiv.org/abs/2504.09941)
Keywords: federate
Abstract: Multimodal data are often incomplete and exhibit Non-Independent and Identically Distributed (Non-IID) characteristics in real-world scenarios. These inherent limitations lead to both modality heterogeneity through partial modality absence and data heterogeneity from distribution divergence, creating fundamental challenges for effective federated learning (FL). To address these coupled challenges, we propose FedRecon, the first method targeting simultaneous missing modality reconstruction and Non-IID adaptation in multimodal FL. Our approach first employs a lightweight Multimodal Variational Autoencoder (MVAE) to reconstruct missing modalities while preserving cross-modal consistency. Distinct from conventional imputation methods, we achieve sample-level alignment through a novel distribution mapping mechanism that guarantees both data consistency and completeness. Additionally, we introduce a strategy employing global generator freezing to prevent catastrophic forgetting, which in turn mitigates Non-IID fluctuations. Extensive evaluations on multimodal datasets demonstrate FedRecon's superior performance in modality reconstruction under Non-IID conditions, surpassing state-of-the-art methods.

Title: Omni-Dish: Photorealistic and Faithful Image Generation and Editing for Arbitrary Chinese Dishes

Authors: Huijie Liu, Bingcan Wang, Jie Hu, Xiaoming Wei, Guoliang Kang
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2504.09948
Pdf URL: https://arxiv.org/pdf/2504.09948
Copy Paste: [[2504.09948]] Omni-Dish: Photorealistic and Faithful Image Generation and Editing for Arbitrary Chinese Dishes(https://arxiv.org/abs/2504.09948)
Keywords: large language model
Abstract: Dish images play a crucial role in the digital era, with the demand for culturally distinctive dish images continuously increasing due to the digitization of the food industry and e-commerce. In general cases, existing text-to-image generation models excel in producing high-quality images; however, they struggle to capture diverse characteristics and faithful details of specific domains, particularly Chinese dishes. To address this limitation, we propose Omni-Dish, the first text-to-image generation model specifically tailored for Chinese dishes. We develop a comprehensive dish curation pipeline, building the largest dish dataset to date. Additionally, we introduce a recaption strategy and employ a coarse-to-fine training scheme to help the model better learn fine-grained culinary nuances. During inference, we enhance the user's textual input using a pre-constructed high-quality caption library and a large language model, enabling more photorealistic and faithful image generation. Furthermore, to extend our model's capability for dish editing tasks, we propose Concept-Enhanced P2P. Based on this approach, we build a dish editing dataset and train a specialized editing model. Extensive experiments demonstrate the superiority of our methods.

Title: C-MTCSD: A Chinese Multi-Turn Conversational Stance Detection Dataset

Authors: Fuqiang Niu, Yi Yang, Xianghua Fu, Genan Dai, Bowen Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09958
Pdf URL: https://arxiv.org/pdf/2504.09958
Copy Paste: [[2504.09958]] C-MTCSD: A Chinese Multi-Turn Conversational Stance Detection Dataset(https://arxiv.org/abs/2504.09958)
Keywords: large language model
Abstract: Stance detection has become an essential tool for analyzing public discussions on social media. Current methods face significant challenges, particularly in Chinese language processing and multi-turn conversational analysis. To address these limitations, we introduce C-MTCSD, the largest Chinese multi-turn conversational stance detection dataset, comprising 24,264 carefully annotated instances from Sina Weibo, which is 4.2 times larger than the only prior Chinese conversational stance detection dataset. Our comprehensive evaluation using both traditional approaches and large language models reveals the complexity of C-MTCSD: even state-of-the-art models achieve only 64.07% F1 score in the challenging zero-shot setting, while performance consistently degrades with increasing conversation depth. Traditional models particularly struggle with implicit stance detection, achieving below 50% F1 score. This work establishes a challenging new benchmark for Chinese stance detection research, highlighting significant opportunities for future improvements.

Title: Dual-Path Enhancements in Event-Based Eye Tracking: Augmented Robustness and Adaptive Temporal Modeling

Authors: Hoang M. Truong, Vinh-Thuan Ly, Huy G. Tran, Thuan-Phat Nguyen, Tram T. Doan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09960
Pdf URL: https://arxiv.org/pdf/2504.09960
Copy Paste: [[2504.09960]] Dual-Path Enhancements in Event-Based Eye Tracking: Augmented Robustness and Adaptive Temporal Modeling(https://arxiv.org/abs/2504.09960)
Keywords: robust, extraction
Abstract: Event-based eye tracking has become a pivotal technology for augmented reality and human-computer interaction. Yet, existing methods struggle with real-world challenges such as abrupt eye movements and environmental noise. Building on the efficiency of the Lightweight Spatiotemporal Network-a causal architecture optimized for edge devices-we introduce two key advancements. First, a robust data augmentation pipeline incorporating temporal shift, spatial flip, and event deletion improves model resilience, reducing Euclidean distance error by 12% (1.61 vs. 1.70 baseline) on challenging samples. Second, we propose KnightPupil, a hybrid architecture combining an EfficientNet-B3 backbone for spatial feature extraction, a bidirectional GRU for contextual temporal modeling, and a Linear Time-Varying State-Space Module to adapt to sparse inputs and noise dynamically. Evaluated on the 3ET+ benchmark, our framework achieved 1.61 Euclidean distance on the private test set of the Event-based Eye Tracking Challenge at CVPR 2025, demonstrating its effectiveness for practical deployment in AR/VR systems while providing a foundation for future innovations in neuromorphic vision.

Title: Towards Unbiased Federated Graph Learning: Label and Topology Perspectives

Authors: Zhengyu Wu, Boyang Pang, Xunkai Li, Yinlin Zhu, Daohan Su, Bowen Fan, Rong-Hua Li, Guoren Wang, Chenghu Zhou
Subjects: cs.LG, cs.AI, cs.DB, cs.SI
Abstract URL: https://arxiv.org/abs/2504.09963
Pdf URL: https://arxiv.org/pdf/2504.09963
Copy Paste: [[2504.09963]] Towards Unbiased Federated Graph Learning: Label and Topology Perspectives(https://arxiv.org/abs/2504.09963)
Keywords: privacy, federate, fair
Abstract: Federated Graph Learning (FGL) enables privacy-preserving, distributed training of graph neural networks without sharing raw data. Among its approaches, subgraph-FL has become the dominant paradigm, with most work focused on improving overall node classification accuracy. However, these methods often overlook fairness due to the complexity of node features, labels, and graph structures. In particular, they perform poorly on nodes with disadvantaged properties, such as being in the minority class within subgraphs or having heterophilous connections (neighbors with dissimilar labels or misleading features). This reveals a critical issue: high accuracy can mask degraded performance on structurally or semantically marginalized nodes. To address this, we advocate for two fairness goals: (1) improving representation of minority class nodes for class-wise fairness and (2) mitigating topological bias from heterophilous connections for topology-aware fairness. We propose FairFGL, a novel framework that enhances fairness through fine-grained graph mining and collaborative learning. On the client side, the History-Preserving Module prevents overfitting to dominant local classes, while the Majority Alignment Module refines representations of heterophilous majority-class nodes. The Gradient Modification Module transfers minority-class knowledge from structurally favorable clients to improve fairness. On the server side, FairFGL uploads only the most influenced subset of parameters to reduce communication costs and better reflect local distributions. A cluster-based aggregation strategy reconciles conflicting updates and curbs global majority dominance . Extensive evaluations on eight benchmarks show FairFGL significantly improves minority-group performance , achieving up to a 22.62 percent Macro-F1 gain while enhancing convergence over state-of-the-art baselines.

Title: Enhancing Multi-task Learning Capability of Medical Generalist Foundation Model via Image-centric Multi-annotation Data

Authors: Xun Zhu, Fanbin Mo, Zheng Zhang, Jiaxi Wang, Yiming Shi, Ming Wu, Chuang Zhang, Miao Li, Ji Wu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.09967
Pdf URL: https://arxiv.org/pdf/2504.09967
Copy Paste: [[2504.09967]] Enhancing Multi-task Learning Capability of Medical Generalist Foundation Model via Image-centric Multi-annotation Data(https://arxiv.org/abs/2504.09967)
Keywords: large language model
Abstract: The emergence of medical generalist foundation models has revolutionized conventional task-specific model development paradigms, aiming to better handle multiple tasks through joint training on large-scale medical datasets. However, recent advances prioritize simple data scaling or architectural component enhancement, while neglecting to re-examine multi-task learning from a data-centric perspective. Critically, simply aggregating existing data resources leads to decentralized image-task alignment, which fails to cultivate comprehensive image understanding or align with clinical needs for multi-dimensional image interpretation. In this paper, we introduce the image-centric multi-annotation X-ray dataset (IMAX), the first attempt to enhance the multi-task learning capabilities of medical multi-modal large language models (MLLMs) from the data construction level. To be specific, IMAX is featured from the following attributes: 1) High-quality data curation. A comprehensive collection of more than 354K entries applicable to seven different medical tasks. 2) Image-centric dense annotation. Each X-ray image is associated with an average of 4.10 tasks and 7.46 training entries, ensuring multi-task representation richness per image. Compared to the general decentralized multi-annotation X-ray dataset (DMAX), IMAX consistently demonstrates significant multi-task average performance gains ranging from 3.20% to 21.05% across seven open-source state-of-the-art medical MLLMs. Moreover, we investigate differences in statistical patterns exhibited by IMAX and DMAX training processes, exploring potential correlations between optimization dynamics and multi-task performance. Finally, leveraging the core concept of IMAX data construction, we propose an optimized DMAX-based training strategy to alleviate the dilemma of obtaining high-quality IMAX data in practical scenarios.

Title: Proofs of Useful Work from Arbitrary Matrix Multiplication

Authors: Ilan Komargodski, Itamar Schen, Omri Weinstein
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.09971
Pdf URL: https://arxiv.org/pdf/2504.09971
Copy Paste: [[2504.09971]] Proofs of Useful Work from Arbitrary Matrix Multiplication(https://arxiv.org/abs/2504.09971)
Keywords: security
Abstract: We revisit the longstanding open problem of implementing Nakamoto's proof-of-work (PoW) consensus based on a real-world computational task $T(x)$ (as opposed to artificial random hashing), in a truly permissionless setting where the miner itself chooses the input $x$. The challenge in designing such a Proof-of-Useful-Work (PoUW) protocol, is using the native computation of $T(x)$ to produce a PoW certificate with prescribed hardness and with negligible computational overhead over the worst-case complexity of $T(\cdot)$ -- This ensures malicious miners cannot ``game the system" by fooling the verifier to accept with higher probability compared to honest miners (while using similar computational resources). Indeed, obtaining a PoUW with $O(1)$-factor overhead is trivial for any task $T$, but also useless. Our main result is a PoUW for the task of Matrix Multiplication $MatMul(A,B)$ of arbitrary matrices with $1+o(1)$ multiplicative overhead compared to naive $MatMul$ (even in the presence of Fast Matrix Multiplication-style algorithms, which are currently impractical). We conjecture that our protocol has optimal security in the sense that a malicious prover cannot obtain any significant advantage over an honest prover. This conjecture is based on reducing hardness of our protocol to the task of solving a batch of low-rank random linear equations which is of independent interest. Since $MatMul$s are the bottleneck of AI compute as well as countless industry-scale applications, this primitive suggests a concrete design of a new L1 base-layer protocol, which nearly eliminates the energy-waste of Bitcoin mining -- allowing GPU consumers to reduce their AI training and inference costs by ``re-using" it for blockchain consensus, in exchange for block rewards (2-for-1). This blockchain is currently under construction.

Title: EthCluster: An Unsupervised Static Analysis Method for Ethereum Smart Contract

Authors: Hong-Sheng Huang, Jen-Yi Ho, Hao-Wen Chen, Hung-Min Sun
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.09977
Pdf URL: https://arxiv.org/pdf/2504.09977
Copy Paste: [[2504.09977]] EthCluster: An Unsupervised Static Analysis Method for Ethereum Smart Contract(https://arxiv.org/abs/2504.09977)
Keywords: attack, robust, steal
Abstract: Poorly designed smart contracts are particularly vulnerable, as they may allow attackers to exploit weaknesses and steal the virtual currency they manage. In this study, we train a model using unsupervised learning to identify vulnerabilities in the Solidity source code of Ethereum smart contracts. To address the challenges associated with real-world smart contracts, our training data is derived from actual vulnerability samples obtained from datasets such as SmartBugs Curated and the SolidiFI Benchmark. These datasets enable us to develop a robust unsupervised static analysis method for detecting five specific vulnerabilities: Reentrancy, Access Control, Timestamp Dependency, this http URL, and Unchecked Low-Level Calls. We employ clustering algorithms to identify outliers, which are subsequently classified as vulnerable smart contracts.

Title: Turn-taking annotation for quantitative and qualitative analyses of conversation

Authors: Anneliese Kelterer, Barbara Schuppler
Subjects: cs.CL, cs.DB, cs.HC, eess.AS
Abstract URL: https://arxiv.org/abs/2504.09980
Pdf URL: https://arxiv.org/pdf/2504.09980
Copy Paste: [[2504.09980]] Turn-taking annotation for quantitative and qualitative analyses of conversation(https://arxiv.org/abs/2504.09980)
Keywords: segmentation
Abstract: This paper has two goals. First, we present the turn-taking annotation layers created for 95 minutes of conversational speech of the Graz Corpus of Read and Spontaneous Speech (GRASS), available to the scientific community. Second, we describe the annotation system and the annotation process in more detail, so other researchers may use it for their own conversational data. The annotation system was developed with an interdisciplinary application in mind. It should be based on sequential criteria according to Conversation Analysis, suitable for subsequent phonetic analysis, thus time-aligned annotations were made Praat, and it should be suitable for automatic classification, which required the continuous annotation of speech and a label inventory that is not too large and results in a high inter-rater agreement. Turn-taking was annotated on two layers, Inter-Pausal Units (IPU) and points of potential completion (PCOMP; similar to transition relevance places). We provide a detailed description of the annotation process and of segmentation and labelling criteria. A detailed analysis of inter-rater agreement and common confusions shows that agreement for IPU annotation is near-perfect, that agreement for PCOMP annotations is substantial, and that disagreements often are either partial or can be explained by a different analysis of a sequence which also has merit. The annotation system can be applied to a variety of conversational data for linguistic studies and technological applications, and we hope that the annotations, as well as the annotation system will contribute to a stronger cross-fertilization between these disciplines.

Title: Correlative and Discriminative Label Grouping for Multi-Label Visual Prompt Tuning

Authors: LeiLei Ma, Shuo Xu, MingKun Xie, Lei Wang, Dengdi Sun, Haifeng Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09990
Pdf URL: https://arxiv.org/pdf/2504.09990
Copy Paste: [[2504.09990]] Correlative and Discriminative Label Grouping for Multi-Label Visual Prompt Tuning(https://arxiv.org/abs/2504.09990)
Keywords: transformer
Abstract: Modeling label correlations has always played a pivotal role in multi-label image classification (MLC), attracting significant attention from researchers. However, recent studies have overemphasized co-occurrence relationships among labels, which can lead to overfitting risk on this overemphasis, resulting in suboptimal models. To tackle this problem, we advocate for balancing correlative and discriminative relationships among labels to mitigate the risk of overfitting and enhance model performance. To this end, we propose the Multi-Label Visual Prompt Tuning framework, a novel and parameter-efficient method that groups classes into multiple class subsets according to label co-occurrence and mutual exclusivity relationships, and then models them respectively to balance the two relationships. In this work, since each group contains multiple classes, multiple prompt tokens are adopted within Vision Transformer (ViT) to capture the correlation or discriminative label relationship within each group, and effectively learn correlation or discriminative representations for class subsets. On the other hand, each group contains multiple group-aware visual representations that may correspond to multiple classes, and the mixture of experts (MoE) model can cleverly assign them from the group-aware to the label-aware, adaptively obtaining label-aware representation, which is more conducive to classification. Experiments on multiple benchmark datasets show that our proposed approach achieves competitive results and outperforms SOTA methods on multiple pre-trained models.

Title: Metric-Guided Synthesis of Class Activation Mapping

Authors: Alejandro Luque-Cerpa, Elizabeth Polgreen, Ajitha Rajan, Hazem Torfah
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09998
Pdf URL: https://arxiv.org/pdf/2504.09998
Copy Paste: [[2504.09998]] Metric-Guided Synthesis of Class Activation Mapping(https://arxiv.org/abs/2504.09998)
Keywords: robust
Abstract: Class activation mapping (CAM) is a widely adopted class of saliency methods used to explain the behavior of convolutional neural networks (CNNs). These methods generate heatmaps that highlight the parts of the input most relevant to the CNN output. Various CAM methods have been proposed, each distinguished by the expressions used to derive heatmaps. In general, users look for heatmaps with specific properties that reflect different aspects of CNN functionality. These may include similarity to ground truth, robustness, equivariance, and more. Although existing CAM methods implicitly encode some of these properties in their expressions, they do not allow for variability in heatmap generation following the user's intent or domain knowledge. In this paper, we address this limitation by introducing SyCAM, a metric-based approach for synthesizing CAM expressions. Given a predefined evaluation metric for saliency maps, SyCAM automatically generates CAM expressions optimized for that metric. We specifically explore a syntax-guided synthesis instantiation of SyCAM, where CAM expressions are derived based on predefined syntactic constraints and the given metric. Using several established evaluation metrics, we demonstrate the efficacy and flexibility of our approach in generating targeted heatmaps. We compare SyCAM with other well-known CAM methods on three prominent models: ResNet50, VGG16, and VGG19.

Title: Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models?

Authors: Yanbo Wang, Jiyang Guan, Jian Liang, Ran He
Subjects: cs.CR, cs.AI, cs.CL, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10000
Pdf URL: https://arxiv.org/pdf/2504.10000
Copy Paste: [[2504.10000]] Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models?(https://arxiv.org/abs/2504.10000)
Keywords: security, defense, attack, large language model
Abstract: Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited. Typically, current open-source MLLMs rely on the alignment inherited from their language module to avoid harmful generations. However, the lack of safety measures specifically designed for multi-modal inputs creates an alignment gap, leaving MLLMs vulnerable to vision-domain attacks such as typographic manipulation. Current methods utilize a carefully designed safety dataset to enhance model defense capability, while the specific knowledge or patterns acquired from the high-quality dataset remain unclear. Through comparison experiments, we find that the alignment gap primarily arises from data distribution biases, while image content, response quality, or the contrastive behavior of the dataset makes little contribution to boosting multi-modal safety. To further investigate this and identify the key factors in improving MLLM safety, we propose finetuning MLLMs on a small set of benign instruct-following data with responses replaced by simple, clear rejection sentences. Experiments show that, without the need for labor-intensive collection of high-quality malicious data, model safety can still be significantly improved, as long as a specific fraction of rejection data exists in the finetuning set, indicating the security alignment is not lost but rather obscured during multi-modal pretraining or instruction finetuning. Simply correcting the underlying data bias could narrow the safety gap in the vision domain.

Title: GaussVideoDreamer: 3D Scene Generation with Video Diffusion and Inconsistency-Aware Gaussian Splatting

Authors: Junlin Hao, Peiheng Wang, Haoyang Wang, Xinggong Zhang, Zongming Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10001
Pdf URL: https://arxiv.org/pdf/2504.10001
Copy Paste: [[2504.10001]] GaussVideoDreamer: 3D Scene Generation with Video Diffusion and Inconsistency-Aware Gaussian Splatting(https://arxiv.org/abs/2504.10001)
Keywords: robust, diffusion, generative
Abstract: Single-image 3D scene reconstruction presents significant challenges due to its inherently ill-posed nature and limited input constraints. Recent advances have explored two promising directions: multiview generative models that train on 3D consistent datasets but struggle with out-of-distribution generalization, and 3D scene inpainting and completion frameworks that suffer from cross-view inconsistency and suboptimal error handling, as they depend exclusively on depth data or 3D smoothness, which ultimately degrades output quality and computational performance. Building upon these approaches, we present GaussVideoDreamer, which advances generative multimedia approaches by bridging the gap between image, video, and 3D generation, integrating their strengths through two key innovations: (1) A progressive video inpainting strategy that harnesses temporal coherence for improved multiview consistency and faster convergence. (2) A 3D Gaussian Splatting consistency mask to guide the video diffusion with 3D consistent multiview evidence. Our pipeline combines three core components: a geometry-aware initialization protocol, Inconsistency-Aware Gaussian Splatting, and a progressive video inpainting strategy. Experimental results demonstrate that our approach achieves 32% higher LLaVA-IQA scores and at least 2x speedup compared to existing methods while maintaining robust performance across diverse scenes.

Title: Improving Controller Generalization with Dimensionless Markov Decision Processes

Authors: Valentin Charvet, Sebastian Stein, Roderick Murray-Smith
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.10006
Pdf URL: https://arxiv.org/pdf/2504.10006
Copy Paste: [[2504.10006]] Improving Controller Generalization with Dimensionless Markov Decision Processes(https://arxiv.org/abs/2504.10006)
Keywords: robust
Abstract: Controllers trained with Reinforcement Learning tend to be very specialized and thus generalize poorly when their testing environment differs from their training one. We propose a Model-Based approach to increase generalization where both world model and policy are trained in a dimensionless state-action space. To do so, we introduce the Dimensionless Markov Decision Process ($\Pi$-MDP): an extension of Contextual-MDPs in which state and action spaces are non-dimensionalized with the Buckingham-$\Pi$ theorem. This procedure induces policies that are equivariant with respect to changes in the context of the underlying dynamics. We provide a generic framework for this approach and apply it to a model-based policy search algorithm using Gaussian Process models. We demonstrate the applicability of our method on simulated actuated pendulum and cartpole systems, where policies trained on a single environment are robust to shifts in the distribution of the context.

Title: Air Quality Prediction with A Meteorology-Guided Modality-Decoupled Spatio-Temporal Network

Authors: Hang Yin, Yan-Ming Zhang, Jian Xu, Jian-Long Chang, Yin Li, Cheng-Lin Liu
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2504.10014
Pdf URL: https://arxiv.org/pdf/2504.10014
Copy Paste: [[2504.10014]] Air Quality Prediction with A Meteorology-Guided Modality-Decoupled Spatio-Temporal Network(https://arxiv.org/abs/2504.10014)
Keywords: protect
Abstract: Air quality prediction plays a crucial role in public health and environmental protection. Accurate air quality prediction is a complex multivariate spatiotemporal problem, that involves interactions across temporal patterns, pollutant correlations, spatial station dependencies, and particularly meteorological influences that govern pollutant dispersion and chemical transformations. Existing works underestimate the critical role of atmospheric conditions in air quality prediction and neglect comprehensive meteorological data utilization, thereby impairing the modeling of dynamic interdependencies between air quality and meteorological data. To overcome this, we propose MDSTNet, an encoder-decoder framework that explicitly models air quality observations and atmospheric conditions as distinct modalities, integrating multi-pressure-level meteorological data and weather forecasts to capture atmosphere-pollution dependencies for prediction. Meantime, we construct ChinaAirNet, the first nationwide dataset combining air quality records with multi-pressure-level meteorological observations. Experimental results on ChinaAirNet demonstrate MDSTNet's superiority, substantially reducing 48-hour prediction errors by 17.54\% compared to the state-of-the-art model. The source code and dataset will be available on github.

Title: Quantifying Privacy Leakage in Split Inference via Fisher-Approximated Shannon Information Analysis

Authors: Ruijun Deng, Zhihui Lu, Qiang Duan
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.10016
Pdf URL: https://arxiv.org/pdf/2504.10016
Copy Paste: [[2504.10016]] Quantifying Privacy Leakage in Split Inference via Fisher-Approximated Shannon Information Analysis(https://arxiv.org/abs/2504.10016)
Keywords: privacy, defense, attack
Abstract: Split inference (SI) partitions deep neural networks into distributed sub-models, enabling privacy-preserving collaborative learning. Nevertheless, it remains vulnerable to Data Reconstruction Attacks (DRAs), wherein adversaries exploit exposed smashed data to reconstruct raw inputs. Despite extensive research on adversarial attack-defense games, a shortfall remains in the fundamental analysis of privacy risks. This paper establishes a theoretical framework for privacy leakage quantification using information theory, defining it as the adversary's certainty and deriving both average-case and worst-case error bounds. We introduce Fisher-approximated Shannon information (FSInfo), a novel privacy metric utilizing Fisher Information (FI) for operational privacy leakage computation. We empirically show that our privacy metric correlates well with empirical attacks and investigate some of the factors that affect privacy leakage, namely the data distribution, model size, and overfitting.

Title: The Mirage of Performance Gains: Why Contrastive Decoding Fails to Address Multimodal Hallucination

Authors: Hao Yin, Gunagzong Si, Zilei Wang
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2504.10020
Pdf URL: https://arxiv.org/pdf/2504.10020
Copy Paste: [[2504.10020]] The Mirage of Performance Gains: Why Contrastive Decoding Fails to Address Multimodal Hallucination(https://arxiv.org/abs/2504.10020)
Keywords: large language model
Abstract: Contrastive decoding strategies are widely used to reduce hallucinations in multimodal large language models (MLLMs). These methods work by constructing contrastive samples to induce hallucinations and then suppressing them in the output distribution. However, this paper demonstrates that such approaches fail to effectively mitigate the hallucination problem. The performance improvements observed on POPE Benchmark are largely driven by two misleading factors: (1) crude, unidirectional adjustments to the model's output distribution and (2) the adaptive plausibility constraint, which reduces the sampling strategy to greedy search. To further illustrate these issues, we introduce a series of spurious improvement methods and evaluate their performance against contrastive decoding techniques. Experimental results reveal that the observed performance gains in contrastive decoding are entirely unrelated to its intended goal of mitigating hallucinations. Our findings challenge common assumptions about the effectiveness of contrastive decoding strategies and pave the way for developing genuinely effective solutions to hallucinations in MLLMs.

Title: Masked Autoencoder Self Pre-Training for Defect Detection in Microelectronics

Authors: Nikolai Röhrich (XITASO GmbH and LMU Munich), Alwin Hoffmann (XITASO GmbH), Richard Nordsieck (XITASO GmbH), Emilio Zarbali (XITASO GmbH), Alireza Javanmardi (LMU Munich and Munich Center for Machine Learning)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10021
Pdf URL: https://arxiv.org/pdf/2504.10021
Copy Paste: [[2504.10021]] Masked Autoencoder Self Pre-Training for Defect Detection in Microelectronics(https://arxiv.org/abs/2504.10021)
Keywords: interpretability, transformer
Abstract: Whereas in general computer vision, transformer-based architectures have quickly become the gold standard, microelectronics defect detection still heavily relies on convolutional neural networks (CNNs). We hypothesize that this is due to the fact that a) transformers have an increased need for data and b) labelled image generation procedures for microelectronics are costly, and labelled data is therefore sparse. Whereas in other domains, pre-training on large natural image datasets can mitigate this problem, in microelectronics transfer learning is hindered due to the dissimilarity of domain data and natural images. Therefore, we evaluate self pre-training, where models are pre-trained on the target dataset, rather than another dataset. We propose a vision transformer (ViT) pre-training framework for defect detection in microelectronics based on masked autoencoders (MAE). In MAE, a large share of image patches is masked and reconstructed by the model during pre-training. We perform pre-training and defect detection using a dataset of less than 10.000 scanning acoustic microscopy (SAM) images labelled using transient thermal analysis (TTA). Our experimental results show that our approach leads to substantial performance gains compared to a) supervised ViT, b) ViT pre-trained on natural image datasets, and c) state-of-the-art CNN-based defect detection models used in the literature. Additionally, interpretability analysis reveals that our self pre-trained models, in comparison to ViT baselines, correctly focus on defect-relevant features such as cracks in the solder material. This demonstrates that our approach yields fault-specific feature representations, making our self pre-trained models viable for real-world defect detection in microelectronics.

Title: DataMosaic: Explainable and Verifiable Multi-Modal Data Analytics through Extract-Reason-Verify

Authors: Zhengxuan Zhang, Zhuowen Liang, Yin Wu, Teng Lin, Yuyu Luo, Nan Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10036
Pdf URL: https://arxiv.org/pdf/2504.10036
Copy Paste: [[2504.10036]] DataMosaic: Explainable and Verifiable Multi-Modal Data Analytics through Extract-Reason-Verify(https://arxiv.org/abs/2504.10036)
Keywords: privacy, large language model
Abstract: Large Language Models (LLMs) are transforming data analytics, but their widespread adoption is hindered by two critical limitations: they are not explainable (opaque reasoning processes) and not verifiable (prone to hallucinations and unchecked errors). While retrieval-augmented generation (RAG) improves accuracy by grounding LLMs in external data, it fails to address the core challenges of trustworthy analytics - especially when processing noisy, inconsistent, or multi-modal data (for example, text, tables, images). We propose DataMosaic, a framework designed to make LLM-powered analytics both explainable and verifiable. By dynamically extracting task-specific structures (for example, tables, graphs, trees) from raw data, DataMosaic provides transparent, step-by-step reasoning traces and enables validation of intermediate results. Built on a multi-agent framework, DataMosaic orchestrates self-adaptive agents that align with downstream task requirements, enhancing consistency, completeness, and privacy. Through this approach, DataMosaic not only tackles the limitations of current LLM-powered analytics systems but also lays the groundwork for a new paradigm of grounded, accurate, and explainable multi-modal data analytics.

Title: Investigating the Role of Bilateral Symmetry for Inpainting Brain MRI

Authors: Sergey Kuznetsov, Sanduni Pinnawala, Peter A. Wijeratne, Ivor J. A. Simpson
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10039
Pdf URL: https://arxiv.org/pdf/2504.10039
Copy Paste: [[2504.10039]] Investigating the Role of Bilateral Symmetry for Inpainting Brain MRI(https://arxiv.org/abs/2504.10039)
Keywords: diffusion
Abstract: Inpainting has recently emerged as a valuable and interesting technology to employ in the analysis of medical imaging data, in particular brain MRI. A wide variety of methodologies for inpainting MRI have been proposed and demonstrated on tasks including anomaly detection. In this work we investigate the statistical relationship between inpainted brain structures and the amount of subject-specific conditioning information, i.e. the other areas of the image that are masked. In particular, we analyse the distribution of inpainting results when masking additional regions of the image, specifically the contra-lateral structure. This allows us to elucidate where in the brain the model is drawing information from, and in particular, what is the importance of hemispherical symmetry? Our experiments interrogate a diffusion inpainting model through analysing the inpainting of subcortical brain structures based on intensity and estimated area change. We demonstrate that some structures show a strong influence of symmetry in the conditioning of the inpainting process.

Title: Multi-Object Grounding via Hierarchical Contrastive Siamese Transformers

Authors: Chengyi Du, Keyan Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10048
Pdf URL: https://arxiv.org/pdf/2504.10048
Copy Paste: [[2504.10048]] Multi-Object Grounding via Hierarchical Contrastive Siamese Transformers(https://arxiv.org/abs/2504.10048)
Keywords: robust, transformer
Abstract: Multi-object grounding in 3D scenes involves localizing multiple objects based on natural language input. While previous work has primarily focused on single-object grounding, real-world scenarios often demand the localization of several objects. To tackle this challenge, we propose Hierarchical Contrastive Siamese Transformers (H-COST), which employs a Hierarchical Processing strategy to progressively refine object localization, enhancing the understanding of complex language instructions. Additionally, we introduce a Contrastive Siamese Transformer framework, where two networks with the identical structure are used: one auxiliary network processes robust object relations from ground-truth labels to guide and enhance the second network, the reference network, which operates on segmented point-cloud data. This contrastive mechanism strengthens the model' s semantic understanding and significantly enhances its ability to process complex point-cloud data. Our approach outperforms previous state-of-the-art methods by 9.5% on challenging multi-object grounding benchmarks.

Title: Hallucination Detection in LLMs via Topological Divergence on Attention Graphs

Authors: Alexandra Bazarova, Aleksandr Yugay, Andrey Shulga, Alina Ermilova, Andrei Volodichev, Konstantin Polev, Julia Belikova, Rauf Parchiev, Dmitry Simakov, Maxim Savchenko, Andrey Savchenko, Serguei Barannikov, Alexey Zaytsev
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10063
Pdf URL: https://arxiv.org/pdf/2504.10063
Copy Paste: [[2504.10063]] Hallucination Detection in LLMs via Topological Divergence on Attention Graphs(https://arxiv.org/abs/2504.10063)
Keywords: robust, large language model
Abstract: Hallucination, i.e., generating factually incorrect content, remains a critical challenge for large language models (LLMs). We introduce TOHA, a TOpology-based HAllucination detector in the RAG setting, which leverages a topological divergence metric to quantify the structural properties of graphs induced by attention matrices. Examining the topological divergence between prompt and response subgraphs reveals consistent patterns: higher divergence values in specific attention heads correlate with hallucinated outputs, independent of the dataset. Extensive experiments, including evaluation on question answering and data-to-text tasks, show that our approach achieves state-of-the-art or competitive results on several benchmarks, two of which were annotated by us and are being publicly released to facilitate further research. Beyond its strong in-domain performance, TOHA maintains remarkable domain transferability across multiple open-source LLMs. Our findings suggest that analyzing the topological structure of attention matrices can serve as an efficient and robust indicator of factual reliability in LLMs.

Title: A Computational Cognitive Model for Processing Repetitions of Hierarchical Relations

Authors: Zeng Ren, Xinyi Guan, Martin Rohrmeier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10065
Pdf URL: https://arxiv.org/pdf/2504.10065
Copy Paste: [[2504.10065]] A Computational Cognitive Model for Processing Repetitions of Hierarchical Relations(https://arxiv.org/abs/2504.10065)
Keywords: generative
Abstract: Patterns are fundamental to human cognition, enabling the recognition of structure and regularity across diverse domains. In this work, we focus on structural repeats, patterns that arise from the repetition of hierarchical relations within sequential data, and develop a candidate computational model of how humans detect and understand such structural repeats. Based on a weighted deduction system, our model infers the minimal generative process of a given sequence in the form of a Template program, a formalism that enriches the context-free grammar with repetition combinators. Such representation efficiently encodes the repetition of sub-computations in a recursive manner. As a proof of concept, we demonstrate the expressiveness of our model on short sequences from music and action planning. The proposed model offers broader insights into the mental representations and cognitive mechanisms underlying human pattern recognition.

Title: Undermining Federated Learning Accuracy in EdgeIoT via Variational Graph Auto-Encoders

Authors: Kai Li, Shuyan Hu, Bochun Wu, Sai Zou, Wei Ni, Falko Dressler
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.10067
Pdf URL: https://arxiv.org/pdf/2504.10067
Copy Paste: [[2504.10067]] Undermining Federated Learning Accuracy in EdgeIoT via Variational Graph Auto-Encoders(https://arxiv.org/abs/2504.10067)
Keywords: privacy, attack, federate
Abstract: EdgeIoT represents an approach that brings together mobile edge computing with Internet of Things (IoT) devices, allowing for data processing close to the data source. Sending source data to a server is bandwidth-intensive and may compromise privacy. Instead, federated learning allows each device to upload a shared machine-learning model update with locally processed data. However, this technique, which depends on aggregating model updates from various IoT devices, is vulnerable to attacks from malicious entities that may inject harmful data into the learning process. This paper introduces a new attack method targeting federated learning in EdgeIoT, known as data-independent model manipulation attack. This attack does not rely on training data from the IoT devices but instead uses an adversarial variational graph auto-encoder (AV-GAE) to create malicious model updates by analyzing benign model updates intercepted during communication. AV-GAE identifies and exploits structural relationships between benign models and their training data features. By manipulating these structural correlations, the attack maximizes the training loss of the federated learning system, compromising its overall effectiveness.

Title: Mavors: Multi-granularity Video Representation for Multimodal Large Language Model

Authors: Yang Shi, Jiaheng Liu, Yushuo Guan, Zhenhua Wu, Yuanxing Zhang, Zihao Wang, Weihong Lin, Jingyun Hua, Zekun Wang, Xinlong Chen, Bohan Zeng, Wentao Zhang, Fuzheng Zhang, Wenjing Yang, Di Zhang
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2504.10068
Pdf URL: https://arxiv.org/pdf/2504.10068
Copy Paste: [[2504.10068]] Mavors: Multi-granularity Video Representation for Multimodal Large Language Model(https://arxiv.org/abs/2504.10068)
Keywords: transformer, large language model
Abstract: Long-context video understanding in multimodal large language models (MLLMs) faces a critical challenge: balancing computational efficiency with the retention of fine-grained spatio-temporal patterns. Existing approaches (e.g., sparse sampling, dense sampling with low resolution, and token compression) suffer from significant information loss in temporal dynamics, spatial details, or subtle interactions, particularly in videos with complex motion or varying resolutions. To address this, we propose $\mathbf{Mavors}$, a novel framework that introduces $\mathbf{M}$ulti-gr$\mathbf{a}$nularity $\mathbf{v}$ide$\mathbf{o}$ $\mathbf{r}$epre$\mathbf{s}$entation for holistic long-video modeling. Specifically, Mavors directly encodes raw video content into latent representations through two core components: 1) an Intra-chunk Vision Encoder (IVE) that preserves high-resolution spatial features via 3D convolutions and Vision Transformers, and 2) an Inter-chunk Feature Aggregator (IFA) that establishes temporal coherence across chunks using transformer-based dependency modeling with chunk-level rotary position encodings. Moreover, the framework unifies image and video understanding by treating images as single-frame videos via sub-image decomposition. Experiments across diverse benchmarks demonstrate Mavors' superiority in maintaining both spatial fidelity and temporal continuity, significantly outperforming existing methods in tasks requiring fine-grained spatio-temporal reasoning.

Title: Learning to Harmonize Cross-vendor X-ray Images by Non-linear Image Dynamics Correction

Authors: Yucheng Lu, Shunxin Wang, Dovile Juodelyte, Veronika Cheplygina
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2504.10080
Pdf URL: https://arxiv.org/pdf/2504.10080
Copy Paste: [[2504.10080]] Learning to Harmonize Cross-vendor X-ray Images by Non-linear Image Dynamics Correction(https://arxiv.org/abs/2504.10080)
Keywords: robust
Abstract: In this paper, we explore how conventional image enhancement can improve model robustness in medical image analysis. By applying commonly used normalization methods to images from various vendors and studying their influence on model generalization in transfer learning, we show that the nonlinear characteristics of domain-specific image dynamics cannot be addressed by simple linear transforms. To tackle this issue, we reformulate the image harmonization task as an exposure correction problem and propose a method termed Global Deep Curve Estimation (GDCE) to reduce domain-specific exposure mismatch. GDCE performs enhancement via a pre-defined polynomial function and is trained with the help of a ``domain discriminator'', aiming to improve model transparency in downstream tasks compared to existing black-box methods.

Title: CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography

Authors: I-Sheng Fang, Jun-Cheng Chen
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2504.10090
Pdf URL: https://arxiv.org/pdf/2504.10090
Copy Paste: [[2504.10090]] CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography(https://arxiv.org/abs/2504.10090)
Keywords: large language model
Abstract: Large language models (LLMs) and multimodal large language models (MLLMs) have significantly advanced artificial intelligence. However, visual reasoning, reasoning involving both visual and textual inputs, remains underexplored. Recent advancements, including the reasoning models like OpenAI o1 and Gemini 2.0 Flash Thinking, which incorporate image inputs, have opened this capability. In this ongoing work, we focus specifically on photography-related tasks because a photo is a visual snapshot of the physical world where the underlying physics (i.e., illumination, blur extent, etc.) interplay with the camera parameters. Successfully reasoning from the visual information of a photo to identify these numerical camera settings requires the MLLMs to have a deeper understanding of the underlying physics for precise visual comprehension, representing a challenging and intelligent capability essential for practical applications like photography assistant agents. We aim to evaluate MLLMs on their ability to distinguish visual differences related to numerical camera settings, extending a methodology previously proposed for vision-language models (VLMs). Our preliminary results demonstrate the importance of visual reasoning in photography-related tasks. Moreover, these results show that no single MLLM consistently dominates across all evaluation tasks, demonstrating ongoing challenges and opportunities in developing MLLMs with better visual reasoning.

Title: STaRFormer: Semi-Supervised Task-Informed Representation Learning via Dynamic Attention-Based Regional Masking for Sequential Data

Authors: Maxmilian Forstenhäusler, Daniel Külzer, Christos Anagnostopoulos, Shameem Puthiya Parambath, Natascha Weber
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.10097
Pdf URL: https://arxiv.org/pdf/2504.10097
Copy Paste: [[2504.10097]] STaRFormer: Semi-Supervised Task-Informed Representation Learning via Dynamic Attention-Based Regional Masking for Sequential Data(https://arxiv.org/abs/2504.10097)
Keywords: transformer
Abstract: Accurate predictions using sequential spatiotemporal data are crucial for various applications. Utilizing real-world data, we aim to learn the intent of a smart device user within confined areas of a vehicle's surroundings. However, in real-world scenarios, environmental factors and sensor limitations result in non-stationary and irregularly sampled data, posing significant challenges. To address these issues, we developed a Transformer-based approach, STaRFormer, which serves as a universal framework for sequential modeling. STaRFormer employs a novel, dynamic attention-based regional masking scheme combined with semi-supervised contrastive learning to enhance task-specific latent representations. Comprehensive experiments on 15 datasets varying in types (including non-stationary and irregularly sampled), domains, sequence lengths, training samples, and applications, demonstrate the efficacy and practicality of STaRFormer. We achieve notable improvements over state-of-the-art approaches. Code and data will be made available.

Title: Global and Local Mamba Network for Multi-Modality Medical Image Super-Resolution

Authors: Zexin Ji, Beiji Zou, Xiaoyan Kui, Sebastien Thureau, Su Ruan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10105
Pdf URL: https://arxiv.org/pdf/2504.10105
Copy Paste: [[2504.10105]] Global and Local Mamba Network for Multi-Modality Medical Image Super-Resolution(https://arxiv.org/abs/2504.10105)
Keywords: transformer
Abstract: Convolutional neural networks and Transformer have made significant progresses in multi-modality medical image super-resolution. However, these methods either have a fixed receptive field for local learning or significant computational burdens for global learning, limiting the super-resolution performance. To solve this problem, State Space Models, notably Mamba, is introduced to efficiently model long-range dependencies in images with linear computational complexity. Relying on the Mamba and the fact that low-resolution images rely on global information to compensate for missing details, while high-resolution reference images need to provide more local details for accurate super-resolution, we propose a global and local Mamba network (GLMamba) for multi-modality medical image super-resolution. To be specific, our GLMamba is a two-branch network equipped with a global Mamba branch and a local Mamba branch. The global Mamba branch captures long-range relationships in low-resolution inputs, and the local Mamba branch focuses more on short-range details in high-resolution reference images. We also use the deform block to adaptively extract features of both branches to enhance the representation ability. A modulator is designed to further enhance deformable features in both global and local Mamba blocks. To fully integrate the reference image for low-resolution image super-resolution, we further develop a multi-modality feature fusion block to adaptively fuse features by considering similarities, differences, and complementary aspects between modalities. In addition, a contrastive edge loss (CELoss) is developed for sufficient enhancement of edge textures and contrast in medical images.

Title: Benchmarking Practices in LLM-driven Offensive Security: Testbeds, Metrics, and Experiment Design

Authors: Andreas Happe, Jürgen Cito
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10112
Pdf URL: https://arxiv.org/pdf/2504.10112
Copy Paste: [[2504.10112]] Benchmarking Practices in LLM-driven Offensive Security: Testbeds, Metrics, and Experiment Design(https://arxiv.org/abs/2504.10112)
Keywords: security, attack, large language model
Abstract: Large Language Models (LLMs) have emerged as a powerful approach for driving offensive penetration-testing tooling. This paper analyzes the methodology and benchmarking practices used for evaluating Large Language Model (LLM)-driven attacks, focusing on offensive uses of LLMs in cybersecurity. We review 16 research papers detailing 15 prototypes and their respective testbeds. We detail our findings and provide actionable recommendations for future research, emphasizing the importance of extending existing testbeds, creating baselines, and including comprehensive metrics and qualitative analysis. We also note the distinction between security research and practice, suggesting that CTF-based challenges may not fully represent real-world penetration testing scenarios.

Title: Universally Composable Commitments with Communicating Malicious Physically Uncloneable Functions

Authors: Lourenço Abecasis, Paulo Mateus, Chrysoula Vlachou
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.10120
Pdf URL: https://arxiv.org/pdf/2504.10120
Copy Paste: [[2504.10120]] Universally Composable Commitments with Communicating Malicious Physically Uncloneable Functions(https://arxiv.org/abs/2504.10120)
Keywords: secure, security, attack
Abstract: In this work, we explore the possibility of universally composable (UC)-secure commitments using Physically Uncloneable Functions (PUFs) within a new adversarial model. We introduce the communicating malicious PUFs, i.e. malicious PUFs that can interact with their creator even when not in their possession, obtaining a stronger adversarial model. Prior work [ASIACRYPT 2013, LNCS, vol. 8270, pp. 100-119] proposed a compiler for constructing UC-secure commitments from ideal extractable commitments, and our task would be to adapt the ideal extractable commitment scheme proposed therein to our new model. However, we found an attack and identified a few other issues in that construction, and to address them, we modified the aforementioned ideal extractable commitment scheme and introduced new properties and tools that allow us to rigorously develop and present security proofs in this context. We propose a new UC-secure commitment scheme against adversaries that can only create stateless malicious PUFs which can receive, but not send, information from their creators. Our protocol is more efficient compared to previous proposals, as we have parallelized the ideal extractable commitments within it. The restriction to stateless malicious PUFs is significant, mainly since the protocol from [ASIACRYPT 2013, LNCS, vol. 8270, pp. 100-119] assumes malicious PUFs with unbounded state, thus limiting its applicability. However it is the only way we found to address the issues of the original construction. We hope that in future work this restriction can be lifted, and along the lines of our work, UC-secure commitments with fewer restrictions on both the state and communication can be constructed.

Title: M2S-RoAD: Multi-Modal Semantic Segmentation for Road Damage Using Camera and LiDAR Data

Authors: Tzu-Yun Tseng, Hongyu Lyu, Josephine Li, Julie Stephany Berrio, Mao Shan, Stewart Worrall
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10123
Pdf URL: https://arxiv.org/pdf/2504.10123
Copy Paste: [[2504.10123]] M2S-RoAD: Multi-Modal Semantic Segmentation for Road Damage Using Camera and LiDAR Data(https://arxiv.org/abs/2504.10123)
Keywords: segmentation
Abstract: Road damage can create safety and comfort challenges for both human drivers and autonomous vehicles (AVs). This damage is particularly prevalent in rural areas due to less frequent surveying and maintenance of roads. Automated detection of pavement deterioration can be used as an input to AVs and driver assistance systems to improve road safety. Current research in this field has predominantly focused on urban environments driven largely by public datasets, while rural areas have received significantly less attention. This paper introduces M2S-RoAD, a dataset for the semantic segmentation of different classes of road damage. M2S-RoAD was collected in various towns across New South Wales, Australia, and labelled for semantic segmentation to identify nine distinct types of road damage. This dataset will be released upon the acceptance of the paper.

Title: Hierarchical and Step-Layer-Wise Tuning of Attention Specialty for Multi-Instance Synthesis in Diffusion Transformers

Authors: Chunyang Zhang, Zhenhong Sun, Zhicheng Zhang, Junyan Wang, Yu Zhang, Dong Gong, Huadong Mo, Daoyi Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10148
Pdf URL: https://arxiv.org/pdf/2504.10148
Copy Paste: [[2504.10148]] Hierarchical and Step-Layer-Wise Tuning of Attention Specialty for Multi-Instance Synthesis in Diffusion Transformers(https://arxiv.org/abs/2504.10148)
Keywords: diffusion, transformer
Abstract: Text-to-image (T2I) generation models often struggle with multi-instance synthesis (MIS), where they must accurately depict multiple distinct instances in a single image based on complex prompts detailing individual features. Traditional MIS control methods for UNet architectures like SD v1.5/SDXL fail to adapt to DiT-based models like FLUX and SD v3.5, which rely on integrated attention between image and text tokens rather than text-image cross-attention. To enhance MIS in DiT, we first analyze the mixed attention mechanism in DiT. Our token-wise and layer-wise analysis of attention maps reveals a hierarchical response structure: instance tokens dominate early layers, background tokens in middle layers, and attribute tokens in later layers. Building on this observation, we propose a training-free approach for enhancing MIS in DiT-based models with hierarchical and step-layer-wise attention specialty tuning (AST). AST amplifies key regions while suppressing irrelevant areas in distinct attention maps across layers and steps, guided by the hierarchical structure. This optimizes multimodal interactions by hierarchically decoupling the complex prompts with instance-based sketches. We evaluate our approach using upgraded sketch-based layouts for the T2I-CompBench and customized complex scenes. Both quantitative and qualitative results confirm our method enhances complex layout generation, ensuring precise instance placement and attribute representation in MIS.

Title: SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users

Authors: Xinnong Zhang, Jiayu Lin, Xinyi Mou, Shiyue Yang, Xiawei Liu, Libo Sun, Hanjia Lyu, Yihang Yang, Weihong Qi, Yue Chen, Guanying Li, Ling Yan, Yao Hu, Siming Chen, Yu Wang, Jingxuan Huang, Jiebo Luo, Shiping Tang, Libo Wu, Baohua Zhou, Zhongyu Wei
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2504.10157
Pdf URL: https://arxiv.org/pdf/2504.10157
Copy Paste: [[2504.10157]] SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users(https://arxiv.org/abs/2504.10157)
Keywords: large language model
Abstract: Social simulation is transforming traditional social science research by modeling human behavior through interactions between virtual individuals and their environments. With recent advances in large language models (LLMs), this approach has shown growing potential in capturing individual differences and predicting group behaviors. However, existing methods face alignment challenges related to the environment, target users, interaction mechanisms, and behavioral patterns. To this end, we introduce SocioVerse, an LLM-agent-driven world model for social simulation. Our framework features four powerful alignment components and a user pool of 10 million real individuals. To validate its effectiveness, we conducted large-scale simulation experiments across three distinct domains: politics, news, and economics. Results demonstrate that SocioVerse can reflect large-scale population dynamics while ensuring diversity, credibility, and representativeness through standardized procedures and minimal manual adjustments.

Title: COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts

Authors: Jiansheng Li, Xingxuan Zhang, Hao Zou, Yige Guo, Renzhe Xu, Yilong Liu, Chuzhao Zhu, Yue He, Peng Cui
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10158
Pdf URL: https://arxiv.org/pdf/2504.10158
Copy Paste: [[2504.10158]] COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts(https://arxiv.org/abs/2504.10158)
Keywords: robust, large language model
Abstract: Current object detectors often suffer significant perfor-mance degradation in real-world applications when encountering distributional shifts. Consequently, the out-of-distribution (OOD) generalization capability of object detectors has garnered increasing attention from researchers. Despite this growing interest, there remains a lack of a large-scale, comprehensive dataset and evaluation benchmark with fine-grained annotations tailored to assess the OOD generalization on more intricate tasks like object detection and grounding. To address this gap, we introduce COUNTS, a large-scale OOD dataset with object-level annotations. COUNTS encompasses 14 natural distributional shifts, over 222K samples, and more than 1,196K labeled bounding boxes. Leveraging COUNTS, we introduce two novel benchmarks: O(OD)2 and OODG. O(OD)2 is designed to comprehensively evaluate the OOD generalization capabilities of object detectors by utilizing controlled distribution shifts between training and testing data. OODG, on the other hand, aims to assess the OOD generalization of grounding abilities in multimodal large language models (MLLMs). Our findings reveal that, while large models and extensive pre-training data substantially en hance performance in in-distribution (IID) scenarios, significant limitations and opportunities for improvement persist in OOD contexts for both object detectors and MLLMs. In visual grounding tasks, even the advanced GPT-4o and Gemini-1.5 only achieve 56.7% and 28.0% accuracy, respectively. We hope COUNTS facilitates advancements in the development and assessment of robust object detectors and MLLMs capable of maintaining high performance under distributional shifts.

Title: MT-R1-Zero: Advancing LLM-based Machine Translation via R1-Zero-like Reinforcement Learning

Authors: Zhaopeng Feng, Shaosheng Cao, Jiahan Ren, Jiayuan Su, Ruizhe Chen, Yan Zhang, Zhe Xu, Yao Hu, Jian Wu, Zuozhu Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10160
Pdf URL: https://arxiv.org/pdf/2504.10160
Copy Paste: [[2504.10160]] MT-R1-Zero: Advancing LLM-based Machine Translation via R1-Zero-like Reinforcement Learning(https://arxiv.org/abs/2504.10160)
Keywords: robust, large language model
Abstract: Large-scale reinforcement learning (RL) methods have proven highly effective in enhancing the reasoning abilities of large language models (LLMs), particularly for tasks with verifiable solutions such as mathematics and coding. However, applying this idea to machine translation (MT), where outputs are flexibly formatted and difficult to automatically evaluate with explicit rules, remains underexplored. In this work, we introduce MT-R1-Zero, the first open-source adaptation of the R1-Zero RL framework for MT without supervised fine-tuning or cold-start. We propose a rule-metric mixed reward mechanism to guide LLMs towards improved translation quality via emergent reasoning. On the WMT 24 English-Chinese benchmark, our MT-R1-Zero-3B-Mix achieves competitive performance, surpassing TowerInstruct-7B-v0.2 by an average of 1.26 points. Meanwhile, our MT-R1-Zero-7B-Mix attains a high average score of 62.25 across all metrics, placing it on par with advanced proprietary models such as GPT-4o and Claude-3.5-Sonnet, while the MT-R1-Zero-7B-Sem variant achieves state-of-the-art scores on semantic metrics. Moreover, our work exhibits strong generalization capabilities on out-of-distribution MT tasks, robustly supporting multilingual and low-resource settings. Extensive analysis of model behavior across different initializations and reward metrics offers pioneering insight into the critical role of reward design, LLM adaptability, training dynamics, and emergent reasoning patterns within the R1-Zero paradigm for MT. Our code is available at this https URL.

Title: WildLive: Near Real-time Visual Wildlife Tracking onboard UAVs

Authors: Nguyen Ngoc Dat, Tom Richardson, Matthew Watson, Kilian Meier, Jenna Kline, Sid Reid, Guy Maalouf, Duncan Hine, Majid Mirmehdi, Tilo Burghardt
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10165
Pdf URL: https://arxiv.org/pdf/2504.10165
Copy Paste: [[2504.10165]] WildLive: Near Real-time Visual Wildlife Tracking onboard UAVs(https://arxiv.org/abs/2504.10165)
Keywords: segmentation
Abstract: Live tracking of wildlife via high-resolution video processing directly onboard drones is widely unexplored and most existing solutions rely on streaming video to ground stations to support navigation. Yet, both autonomous animal-reactive flight control beyond visual line of sight and/or mission-specific individual and behaviour recognition tasks rely to some degree on this capability. In response, we introduce WildLive -- a near real-time animal detection and tracking framework for high-resolution imagery running directly onboard uncrewed aerial vehicles (UAVs). The system performs multi-animal detection and tracking at 17fps+ for HD and 7fps+ on 4K video streams suitable for operation during higher altitude flights to minimise animal disturbance. Our system is optimised for Jetson Orin AGX onboard hardware. It integrates the efficiency of sparse optical flow tracking and mission-specific sampling with device-optimised and proven YOLO-driven object detection and segmentation techniques. Essentially, computational resource is focused onto spatio-temporal regions of high uncertainty to significantly improve UAV processing speeds without domain-specific loss of accuracy. Alongside, we introduce our WildLive dataset, which comprises 200k+ annotated animal instances across 19k+ frames from 4K UAV videos collected at the Ol Pejeta Conservancy in Kenya. All frames contain ground truth bounding boxes, segmentation masks, as well as individual tracklets and tracking point trajectories. We compare our system against current object tracking approaches including OC-SORT, ByteTrack, and SORT. Our multi-animal tracking experiments with onboard hardware confirm that near real-time high-resolution wildlife tracking is possible on UAVs whilst maintaining high accuracy levels as needed for future navigational and mission-specific animal-centric operational autonomy.

Title: C-FAITH: A Chinese Fine-Grained Benchmark for Automated Hallucination Evaluation

Authors: Xu Zhang, Zhifei Liu, Jiahao Wang, Huixuan Zhang, Fan Xu, Junzhe Zhang, Xiaojun Wan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10167
Pdf URL: https://arxiv.org/pdf/2504.10167
Copy Paste: [[2504.10167]] C-FAITH: A Chinese Fine-Grained Benchmark for Automated Hallucination Evaluation(https://arxiv.org/abs/2504.10167)
Keywords: large language model
Abstract: Despite the rapid advancement of large language models, they remain highly susceptible to generating hallucinations, which significantly hinders their widespread application. Hallucination research requires dynamic and fine-grained evaluation. However, most existing hallucination benchmarks (especially in Chinese language) rely on human annotations, making automatical and cost-effective hallucination evaluation challenging. To address this, we introduce HaluAgent, an agentic framework that automatically constructs fine-grained QA dataset based on some knowledge documents. Our experiments demonstrate that the manually designed rules and prompt optimization can improve the quality of generated data. Using HaluAgent, we construct C-FAITH, a Chinese QA hallucination benchmark created from 1,399 knowledge documents obtained from web scraping, totaling 60,702 entries. We comprehensively evaluate 16 mainstream LLMs with our proposed C-FAITH, providing detailed experimental results and analysis.

Title: HalluSearch at SemEval-2025 Task 3: A Search-Enhanced RAG Pipeline for Hallucination Detection

Authors: Mohamed A. Abdallah, Samhaa R. El-Beltagy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10168
Pdf URL: https://arxiv.org/pdf/2504.10168
Copy Paste: [[2504.10168]] HalluSearch at SemEval-2025 Task 3: A Search-Enhanced RAG Pipeline for Hallucination Detection(https://arxiv.org/abs/2504.10168)
Keywords: robust, large language model
Abstract: In this paper, we present HalluSearch, a multilingual pipeline designed to detect fabricated text spans in Large Language Model (LLM) outputs. Developed as part of Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes, HalluSearch couples retrieval-augmented verification with fine-grained factual splitting to identify and localize hallucinations in fourteen different languages. Empirical evaluations show that HalluSearch performs competitively, placing fourth in both English (within the top ten percent) and Czech. While the system's retrieval-based strategy generally proves robust, it faces challenges in languages with limited online coverage, underscoring the need for further research to ensure consistent hallucination detection across diverse linguistic contexts.

Title: Challenges in interpretability of additive models

Authors: Xinyu Zhang, Julien Martinelli, ST John
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2504.10169
Pdf URL: https://arxiv.org/pdf/2504.10169
Copy Paste: [[2504.10169]] Challenges in interpretability of additive models(https://arxiv.org/abs/2504.10169)
Keywords: interpretability
Abstract: We review generalized additive models as a type of ``transparent'' model that has recently seen renewed interest in the deep learning community as neural additive models. We highlight multiple types of nonidentifiability in this model class and discuss challenges in interpretability, arguing for restraint when claiming ``interpretability'' or ``suitability for safety-critical applications'' of such models.

Title: LLM Unlearning Reveals a Stronger-Than-Expected Coreset Effect in Current Benchmarks

Authors: Soumyadeep Pal, Changsheng Wang, James Diffenderfer, Bhavya Kailkhura, Sijia Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10185
Pdf URL: https://arxiv.org/pdf/2504.10185
Copy Paste: [[2504.10185]] LLM Unlearning Reveals a Stronger-Than-Expected Coreset Effect in Current Benchmarks(https://arxiv.org/abs/2504.10185)
Keywords: attack, robust, large language model
Abstract: Large language model unlearning has become a critical challenge in ensuring safety and controlled model behavior by removing undesired data-model influences from the pretrained model while preserving general utility. Significant recent efforts have been dedicated to developing LLM unlearning benchmarks such as WMDP (Weapons of Mass Destruction Proxy) and MUSE (Machine Unlearning Six-way Evaluation), facilitating standardized unlearning performance assessment and method comparison. Despite their usefulness, we uncover for the first time a novel coreset effect within these benchmarks. Specifically, we find that LLM unlearning achieved with the original (full) forget set can be effectively maintained using a significantly smaller subset (functioning as a "coreset"), e.g., as little as 5% of the forget set, even when selected at random. This suggests that LLM unlearning in these benchmarks can be performed surprisingly easily, even in an extremely low-data regime. We demonstrate that this coreset effect remains strong, regardless of the LLM unlearning method used, such as NPO (Negative Preference Optimization) and RMU (Representation Misdirection Unlearning), the popular ones in these benchmarks. The surprisingly strong coreset effect is also robust across various data selection methods, ranging from random selection to more sophisticated heuristic approaches. We explain the coreset effect in LLM unlearning through a keyword-based perspective, showing that keywords extracted from the forget set alone contribute significantly to unlearning effectiveness and indicating that current unlearning is driven by a compact set of high-impact tokens rather than the entire dataset. We further justify the faithfulness of coreset-unlearned models along additional dimensions, such as mode connectivity and robustness to jailbreaking attacks. Codes are available at this https URL.

Title: Efficient Generative Model Training via Embedded Representation Warmup

Authors: Deyuan Liu, Peng Sun, Xufeng Li, Tao Lin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10188
Pdf URL: https://arxiv.org/pdf/2504.10188
Copy Paste: [[2504.10188]] Efficient Generative Model Training via Embedded Representation Warmup(https://arxiv.org/abs/2504.10188)
Keywords: diffusion, generative
Abstract: Diffusion models excel at generating high-dimensional data but fall short in training efficiency and representation quality compared to self-supervised methods. We identify a key bottleneck: the underutilization of high-quality, semantically rich representations during training notably slows down convergence. Our systematic analysis reveals a critical representation processing region -- primarily in the early layers -- where semantic and structural pattern learning takes place before generation can occur. To address this, we propose Embedded Representation Warmup (ERW), a plug-and-play framework where in the first stage we get the ERW module serves as a warmup that initializes the early layers of the diffusion model with high-quality, pretrained representations. This warmup minimizes the burden of learning representations from scratch, thereby accelerating convergence and boosting performance. Our theoretical analysis demonstrates that ERW's efficacy depends on its precise integration into specific neural network layers -- termed the representation processing region -- where the model primarily processes and transforms feature representations for later generation. We further establish that ERW not only accelerates training convergence but also enhances representation quality: empirically, our method achieves a 40$\times$ acceleration in training speed compared to REPA, the current state-of-the-art methods. Code is available at this https URL.

Title: Differentially Private 2D Human Pose Estimation

Authors: Kaushik Bhargav Sivangi, Idris Zakariyya, Paul Henderson, Fani Deligianni
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10190
Pdf URL: https://arxiv.org/pdf/2504.10190
Copy Paste: [[2504.10190]] Differentially Private 2D Human Pose Estimation(https://arxiv.org/abs/2504.10190)
Keywords: privacy, protect, transformer
Abstract: Human pose estimation (HPE) has become essential in numerous applications including healthcare, activity recognition, and human-computer interaction. However, the privacy implications of processing sensitive visual data present significant deployment barriers in critical domains. While traditional anonymization techniques offer limited protection and often compromise data utility for broader motion analysis, Differential Privacy (DP) provides formal privacy guarantees but typically degrades model performance when applied naively. In this work, we present the first differentially private 2D human pose estimation (2D-HPE) by applying Differentially Private Stochastic Gradient Descent (DP-SGD) to this task. To effectively balance privacy with performance, we adopt Projected DP-SGD (PDP-SGD), which projects the noisy gradients to a low-dimensional subspace. Additionally, we adapt TinyViT, a compact and efficient vision transformer for coordinate classification in HPE, providing a lightweight yet powerful backbone that enhances privacy-preserving deployment feasibility on resource-limited devices. Our approach is particularly valuable for multimedia interpretation tasks, enabling privacy-safe analysis and understanding of human motion across diverse visual media while preserving the semantic meaning required for downstream applications. Comprehensive experiments on the MPII Human Pose Dataset demonstrate significant performance enhancement with PDP-SGD achieving 78.48% PCKh@0.5 at a strict privacy budget ($\epsilon=0.2$), compared to 63.85% for standard DP-SGD. This work lays foundation for privacy-preserving human pose estimation in real-world, sensitive applications.

Title: Localized Cultural Knowledge is Conserved and Controllable in Large Language Models

Authors: Veniamin Veselovsky, Berke Argin, Benedikt Stroebl, Chris Wendler, Robert West, James Evans, Thomas L. Griffiths, Arvind Narayanan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10191
Pdf URL: https://arxiv.org/pdf/2504.10191
Copy Paste: [[2504.10191]] Localized Cultural Knowledge is Conserved and Controllable in Large Language Models(https://arxiv.org/abs/2504.10191)
Keywords: large language model
Abstract: Just as humans display language patterns influenced by their native tongue when speaking new languages, LLMs often default to English-centric responses even when generating in other languages. Nevertheless, we observe that local cultural information persists within the models and can be readily activated for cultural customization. We first demonstrate that explicitly providing cultural context in prompts significantly improves the models' ability to generate culturally localized responses. We term the disparity in model performance with versus without explicit cultural context the explicit-implicit localization gap, indicating that while cultural knowledge exists within LLMs, it may not naturally surface in multilingual interactions if cultural context is not explicitly provided. Despite the explicit prompting benefit, however, the answers reduce in diversity and tend toward stereotypes. Second, we identify an explicit cultural customization vector, conserved across all non-English languages we explore, which enables LLMs to be steered from the synthetic English cultural world-model toward each non-English cultural world. Steered responses retain the diversity of implicit prompting and reduce stereotypes to dramatically improve the potential for customization. We discuss the implications of explicit cultural customization for understanding the conservation of alternative cultural world models within LLMs, and their controllable utility for translation, cultural customization, and the possibility of making the explicit implicit through soft control for expanded LLM function and appeal.

Title: DioR: Adaptive Cognitive Detection and Contextual Retrieval Optimization for Dynamic Retrieval-Augmented Generation

Authors: Hanghui Guo, Jia Zhu, Shimin Di, Weijie Shi, Zhangze Chen, Jiajie Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10198
Pdf URL: https://arxiv.org/pdf/2504.10198
Copy Paste: [[2504.10198]] DioR: Adaptive Cognitive Detection and Contextual Retrieval Optimization for Dynamic Retrieval-Augmented Generation(https://arxiv.org/abs/2504.10198)
Keywords: large language model
Abstract: Dynamic Retrieval-augmented Generation (RAG) has shown great success in mitigating hallucinations in large language models (LLMs) during generation. However, existing dynamic RAG methods face significant limitations in two key aspects: 1) Lack of an effective mechanism to control retrieval triggers, and 2) Lack of effective scrutiny of retrieval content. To address these limitations, we propose an innovative dynamic RAG method, DioR (Adaptive Cognitive Detection and Contextual Retrieval Optimization), which consists of two main components: adaptive cognitive detection and contextual retrieval optimization, specifically designed to determine when retrieval is needed and what to retrieve for LLMs is useful. Experimental results demonstrate that DioR achieves superior performance on all tasks, demonstrating the effectiveness of our work.

Title: VibrantLeaves: A principled parametric image generator for training deep restoration models

Authors: Raphael Achddou, Yann Gousseau, Saïd Ladjal, Sabine Süsstrunk
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10201
Pdf URL: https://arxiv.org/pdf/2504.10201
Copy Paste: [[2504.10201]] VibrantLeaves: A principled parametric image generator for training deep restoration models(https://arxiv.org/abs/2504.10201)
Keywords: robust, explainability
Abstract: Even though Deep Neural Networks are extremely powerful for image restoration tasks, they have several limitations. They are poorly understood and suffer from strong biases inherited from the training sets. One way to address these shortcomings is to have a better control over the training sets, in particular by using synthetic sets. In this paper, we propose a synthetic image generator relying on a few simple principles. In particular, we focus on geometric modeling, textures, and a simple modeling of image acquisition. These properties, integrated in a classical Dead Leaves model, enable the creation of efficient training sets. Standard image denoising and super-resolution networks can be trained on such datasets, reaching performance almost on par with training on natural image datasets. As a first step towards explainability, we provide a careful analysis of the considered principles, identifying which image properties are necessary to obtain good performances. Besides, such training also yields better robustness to various geometric and radiometric perturbations of the test sets.

Title: Balancing Stability and Plasticity in Pretrained Detector: A Dual-Path Framework for Incremental Object Detection

Authors: Songze Li, Qixing Xu, Tonghua Su, Xu-Yao Zhang, Zhongjie Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10214
Pdf URL: https://arxiv.org/pdf/2504.10214
Copy Paste: [[2504.10214]] Balancing Stability and Plasticity in Pretrained Detector: A Dual-Path Framework for Incremental Object Detection(https://arxiv.org/abs/2504.10214)
Keywords: robust
Abstract: The balance between stability and plasticity remains a fundamental challenge in pretrained model-based incremental object detection (PTMIOD). While existing PTMIOD methods demonstrate strong performance on in-domain tasks aligned with pretraining data, their plasticity to cross-domain scenarios remains underexplored. Through systematic component-wise analysis of pretrained detectors, we reveal a fundamental discrepancy: the localization modules demonstrate inherent cross-domain stability-preserving precise bounding box estimation across distribution shifts-while the classification components require enhanced plasticity to mitigate discriminability degradation in cross-domain scenarios. Motivated by these findings, we propose a dual-path framework built upon pretrained DETR-based detectors which decouples localization stability and classification plasticity: the localization path maintains stability to preserve pretrained localization knowledge, while the classification path facilitates plasticity via parameter-efficient fine-tuning and resists forgetting with pseudo-feature replay. Extensive evaluations on both in-domain (MS COCO and PASCAL VOC) and cross-domain (TT100K) benchmarks show state-of-the-art performance, demonstrating our method's ability to effectively balance stability and plasticity in PTMIOD, achieving robust cross-domain adaptation and strong retention of anti-forgetting capabilities.

Title: Probing then Editing Response Personality of Large Language Models

Authors: Tianjie Ju, Zhenyu Shao, Bowen Wang, Yujia Chen, Zhuosheng Zhang, Hao Fei, Mong-Li Lee, Wynne Hsu, Sufeng Duan, Gongshen Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10227
Pdf URL: https://arxiv.org/pdf/2504.10227
Copy Paste: [[2504.10227]] Probing then Editing Response Personality of Large Language Models(https://arxiv.org/abs/2504.10227)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated promising capabilities to generate responses that exhibit consistent personality traits. Despite the major attempts to analyze personality expression through output-based evaluations, little is known about how such traits are internally encoded within LLM parameters. In this paper, we introduce a layer-wise probing framework to systematically investigate the layer-wise capability of LLMs in encoding personality for responding. We conduct probing experiments on 11 open-source LLMs over the PersonalityEdit benchmark and find that LLMs predominantly encode personality for responding in their middle and upper layers, with instruction-tuned models demonstrating a slightly clearer separation of personality traits. Furthermore, by interpreting the trained probing hyperplane as a layer-wise boundary for each personality category, we propose a layer-wise perturbation method to edit the personality expressed by LLMs during inference. Our results show that even when the prompt explicitly specifies a particular personality, our method can still successfully alter the response personality of LLMs. Interestingly, the difficulty of converting between certain personality traits varies substantially, which aligns with the representational distances in our probing experiments. Finally, we conduct a comprehensive MMLU benchmark evaluation and time overhead analysis, demonstrating that our proposed personality editing method incurs only minimal degradation in general capabilities while maintaining low training costs and acceptable inference latency. Our code is publicly available at this https URL.

Title: ROSFD: Robust Online Streaming Fraud Detection with Resilience to Concept Drift in Data Streams

Authors: Vivek Yelleti
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.10229
Pdf URL: https://arxiv.org/pdf/2504.10229
Copy Paste: [[2504.10229]] ROSFD: Robust Online Streaming Fraud Detection with Resilience to Concept Drift in Data Streams(https://arxiv.org/abs/2504.10229)
Keywords: robust
Abstract: Continuous generation of streaming data from diverse sources, such as online transactions and digital interactions, necessitates timely fraud detection. Traditional batch processing methods often struggle to capture the rapidly evolving patterns of fraudulent activities. This paper highlights the critical importance of processing streaming data for effective fraud detection. To address the inherent challenges of latency, scalability, and concept drift in streaming environments, we propose a robust online streaming fraud detection (ROSFD) framework. Our proposed framework comprises two key stages: (i) Stage One: Offline Model Initialization. In this initial stage, a model is built in offline settings using incremental learning principles to overcome the "cold-start" problem. (ii) Stage Two: Real-time Model Adaptation. In this dynamic stage, drift detection algorithms (viz.,, DDM, EDDM, and ADWIN) are employed to identify concept drift in the incoming data stream and incrementally train the model accordingly. This "train-only-when-required" strategy drastically reduces the number of retrains needed without significantly impacting the area under the receiver operating characteristic curve (AUC). Overall, ROSFD utilizing ADWIN as the drift detection method demonstrated the best performance among the employed methods. In terms of model efficacy, Adaptive Random Forest consistently outperformed other models, achieving the highest AUC in four out of five datasets.

Title: A Model Zoo of Vision Transformers

Authors: Damian Falk, Léo Meynent, Florence Pfammatter, Konstantin Schürholt, Damian Borth
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.10231
Pdf URL: https://arxiv.org/pdf/2504.10231
Copy Paste: [[2504.10231]] A Model Zoo of Vision Transformers(https://arxiv.org/abs/2504.10231)
Keywords: transformer, generative
Abstract: The availability of large, structured populations of neural networks - called 'model zoos' - has led to the development of a multitude of downstream tasks ranging from model analysis, to representation learning on model weights or generative modeling of neural network parameters. However, existing model zoos are limited in size and architecture and neglect the transformer, which is among the currently most successful neural network architectures. We address this gap by introducing the first model zoo of vision transformers (ViT). To better represent recent training approaches, we develop a new blueprint for model zoo generation that encompasses both pre-training and fine-tuning steps, and publish 250 unique models. They are carefully generated with a large span of generating factors, and their diversity is validated using a thorough choice of weight-space and behavioral metrics. To further motivate the utility of our proposed dataset, we suggest multiple possible applications grounded in both extensive exploratory experiments and a number of examples from the existing literature. By extending previous lines of similar work, our model zoo allows researchers to push their model population-based methods from the small model regime to state-of-the-art architectures. We make our model zoo available at this http URL.

Title: CAT: A Conditional Adaptation Tailor for Efficient and Effective Instance-Specific Pansharpening on Real-World Data

Authors: Tianyu Xin, Jin-Liang Xiao, Zeyu Xia, Shan Yin, Liang-Jian Deng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10242
Pdf URL: https://arxiv.org/pdf/2504.10242
Copy Paste: [[2504.10242]] CAT: A Conditional Adaptation Tailor for Efficient and Effective Instance-Specific Pansharpening on Real-World Data(https://arxiv.org/abs/2504.10242)
Keywords: extraction
Abstract: Pansharpening is a crucial remote sensing technique that fuses low-resolution multispectral (LRMS) images with high-resolution panchromatic (PAN) images to generate high-resolution multispectral (HRMS) imagery. Although deep learning techniques have significantly advanced pansharpening, many existing methods suffer from limited cross-sensor generalization and high computational overhead, restricting their real-time applications. To address these challenges, we propose an efficient framework that quickly adapts to a specific input instance, completing both training and inference in a short time. Our framework splits the input image into multiple patches, selects a subset for unsupervised CAT training, and then performs inference on all patches, stitching them into the final output. The CAT module, integrated between the feature extraction and channel transformation stages of a pre-trained network, tailors the fused features and fixes the parameters for efficient inference, generating improved results. Our approach offers two key advantages: (1) $\textit{Improved Generalization Ability}$: by mitigating cross-sensor degradation, our model--although pre-trained on a specific dataset--achieves superior performance on datasets captured by other sensors; (2) $\textit{Enhanced Computational Efficiency}$: the CAT-enhanced network can swiftly adapt to the test sample using the single LRMS-PAN pair input, without requiring extensive large-scale data retraining. Experiments on the real-world data from WorldView-3 and WorldView-2 datasets demonstrate that our method achieves state-of-the-art performance on cross-sensor real-world data, while achieving both training and inference of $512\times512$ image within $\textit{0.4 seconds}$ and $4000\times4000$ image within $\textit{3 seconds}$ at the fastest setting on a commonly used RTX 3090 GPU.

Title: MASSeg : 2nd Technical Report for 4th PVUW MOSE Track

Authors: Xuqiang Cao, Linnan Zhao, Jiaxuan Zhao, Fang Liu, Puhua Chen, Wenping Ma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10254
Pdf URL: https://arxiv.org/pdf/2504.10254
Copy Paste: [[2504.10254]] MASSeg : 2nd Technical Report for 4th PVUW MOSE Track(https://arxiv.org/abs/2504.10254)
Keywords: robust, segmentation
Abstract: Complex video object segmentation continues to face significant challenges in small object recognition, occlusion handling, and dynamic scene modeling. This report presents our solution, which ranked second in the MOSE track of CVPR 2025 PVUW Challenge. Based on an existing segmentation framework, we propose an improved model named MASSeg for complex video object segmentation, and construct an enhanced dataset, MOSE+, which includes typical scenarios with occlusions, cluttered backgrounds, and small target instances. During training, we incorporate a combination of inter-frame consistent and inconsistent data augmentation strategies to improve robustness and generalization. During inference, we design a mask output scaling strategy to better adapt to varying object sizes and occlusion levels. As a result, MASSeg achieves a J score of 0.8250, F score of 0.9007, and a J&F score of 0.8628 on the MOSE test set.

Title: XY-Cut++: Advanced Layout Ordering via Hierarchical Mask Mechanism on a Novel Benchmark

Authors: Shuai Liu, Youmeng Li, Jizeng Wei
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2504.10258
Pdf URL: https://arxiv.org/pdf/2504.10258
Copy Paste: [[2504.10258]] XY-Cut++: Advanced Layout Ordering via Hierarchical Mask Mechanism on a Novel Benchmark(https://arxiv.org/abs/2504.10258)
Keywords: robust, large language model, segmentation
Abstract: Document Reading Order Recovery is a fundamental task in document image understanding, playing a pivotal role in enhancing Retrieval-Augmented Generation (RAG) and serving as a critical preprocessing step for large language models (LLMs). Existing methods often struggle with complex layouts(e.g., multi-column newspapers), high-overhead interactions between cross-modal elements (visual regions and textual semantics), and a lack of robust evaluation benchmarks. We introduce XY-Cut++, an advanced layout ordering method that integrates pre-mask processing, multi-granularity segmentation, and cross-modal matching to address these challenges. Our method significantly enhances layout ordering accuracy compared to traditional XY-Cut techniques. Specifically, XY-Cut++ achieves state-of-the-art performance (98.8 BLEU overall) while maintaining simplicity and efficiency. It outperforms existing baselines by up to 24\% and demonstrates consistent accuracy across simple and complex layouts on the newly introduced DocBench-100 dataset. This advancement establishes a reliable foundation for document structure recovery, setting a new standard for layout ordering tasks and facilitating more effective RAG and LLM preprocessing.

Title: Trade-offs in Privacy-Preserving Eye Tracking through Iris Obfuscation: A Benchmarking Study

Authors: Mengdi Wang, Efe Bozkir, Enkelejda Kasneci
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10267
Pdf URL: https://arxiv.org/pdf/2504.10267
Copy Paste: [[2504.10267]] Trade-offs in Privacy-Preserving Eye Tracking through Iris Obfuscation: A Benchmarking Study(https://arxiv.org/abs/2504.10267)
Keywords: privacy, protect, attack, biometric, segmentation
Abstract: Recent developments in hardware, computer graphics, and AI may soon enable AR/VR head-mounted displays (HMDs) to become everyday devices like smartphones and tablets. Eye trackers within HMDs provide a special opportunity for such setups as it is possible to facilitate gaze-based research and interaction. However, estimating users' gaze information often requires raw eye images and videos that contain iris textures, which are considered a gold standard biometric for user authentication, and this raises privacy concerns. Previous research in the eye-tracking community focused on obfuscating iris textures while keeping utility tasks such as gaze estimation accurate. Despite these attempts, there is no comprehensive benchmark that evaluates state-of-the-art approaches. Considering all, in this paper, we benchmark blurring, noising, downsampling, rubber sheet model, and iris style transfer to obfuscate user identity, and compare their impact on image quality, privacy, utility, and risk of imposter attack on two datasets. We use eye segmentation and gaze estimation as utility tasks, and reduction in iris recognition accuracy as a measure of privacy protection, and false acceptance rate to estimate risk of attack. Our experiments show that canonical image processing methods like blurring and noising cause a marginal impact on deep learning-based tasks. While downsampling, rubber sheet model, and iris style transfer are effective in hiding user identifiers, iris style transfer, with higher computation cost, outperforms others in both utility tasks, and is more resilient against spoof attacks. Our analyses indicate that there is no universal optimal approach to balance privacy, utility, and computation burden. Therefore, we recommend practitioners consider the strengths and weaknesses of each approach, and possible combinations of those to reach an optimal privacy-utility trade-off.

Title: LMFormer: Lane based Motion Prediction Transformer

Authors: Harsh Yadav, Maximilian Schaefer, Kun Zhao, Tobias Meisen
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10275
Pdf URL: https://arxiv.org/pdf/2504.10275
Copy Paste: [[2504.10275]] LMFormer: Lane based Motion Prediction Transformer(https://arxiv.org/abs/2504.10275)
Keywords: explainability, transformer
Abstract: Motion prediction plays an important role in autonomous driving. This study presents LMFormer, a lane-aware transformer network for trajectory prediction tasks. In contrast to previous studies, our work provides a simple mechanism to dynamically prioritize the lanes and shows that such a mechanism introduces explainability into the learning behavior of the network. Additionally, LMFormer uses the lane connection information at intersections, lane merges, and lane splits, in order to learn long-range dependency in lane structure. Moreover, we also address the issue of refining the predicted trajectories and propose an efficient method for iterative refinement through stacked transformer layers. For benchmarking, we evaluate LMFormer on the nuScenes dataset and demonstrate that it achieves SOTA performance across multiple metrics. Furthermore, the Deep Scenario dataset is used to not only illustrate cross-dataset network performance but also the unification capabilities of LMFormer to train on multiple datasets and achieve better performance.

Title: DiffMOD: Progressive Diffusion Point Denoising for Moving Object Detection in Remote Sensing

Authors: Jinyue Zhang, Xiangrong Zhang, Zhongjian Huang, Tianyang Zhang, Yifei Jiang, Licheng Jiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10278
Pdf URL: https://arxiv.org/pdf/2504.10278
Copy Paste: [[2504.10278]] DiffMOD: Progressive Diffusion Point Denoising for Moving Object Detection in Remote Sensing(https://arxiv.org/abs/2504.10278)
Keywords: robust, diffusion
Abstract: Moving object detection (MOD) in remote sensing is significantly challenged by low resolution, extremely small object sizes, and complex noise interference. Current deep learning-based MOD methods rely on probability density estimation, which restricts flexible information interaction between objects and across temporal frames. To flexibly capture high-order inter-object and temporal relationships, we propose a point-based MOD in remote sensing. Inspired by diffusion models, the network optimization is formulated as a progressive denoising process that iteratively recovers moving object centers from sparse noisy points. Specifically, we sample scattered features from the backbone outputs as atomic units for subsequent processing, while global feature embeddings are aggregated to compensate for the limited coverage of sparse point features. By modeling spatial relative positions and semantic affinities, Spatial Relation Aggregation Attention is designed to enable high-order interactions among point-level features for enhanced object representation. To enhance temporal consistency, the Temporal Propagation and Global Fusion module is designed, which leverages an implicit memory reasoning mechanism for robust cross-frame feature integration. To align with the progressive denoising process, we propose a progressive MinK optimal transport assignment strategy that establishes specialized learning objectives at each denoising level. Additionally, we introduce a missing loss function to counteract the clustering tendency of denoised points around salient objects. Experiments on the RsData remote sensing MOD dataset show that our MOD method based on scattered point denoising can more effectively explore potential relationships between sparse moving objects and improve the detection capability and temporal consistency.

Title: $α$-Flow: A Unified Framework for Continuous-State Discrete Flow Matching Models

Authors: Chaoran Cheng, Jiahan Li, Jiajun Fan, Ge Liu
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2504.10283
Pdf URL: https://arxiv.org/pdf/2504.10283
Copy Paste: [[2504.10283]] $α$-Flow: A Unified Framework for Continuous-State Discrete Flow Matching Models(https://arxiv.org/abs/2504.10283)
Keywords: generative
Abstract: Recent efforts have extended the flow-matching framework to discrete generative modeling. One strand of models directly works with the continuous probabilities instead of discrete tokens, which we colloquially refer to as Continuous-State Discrete Flow Matching (CS-DFM). Existing CS-DFM models differ significantly in their representations and geometric assumptions. This work presents a unified framework for CS-DFM models, under which the existing variants can be understood as operating on different $\alpha$-representations of probabilities. Building upon the theory of information geometry, we introduce $\alpha$-Flow, a family of CS-DFM models that adheres to the canonical $\alpha$-geometry of the statistical manifold, and demonstrate its optimality in minimizing the generalized kinetic energy. Theoretically, we show that the flow matching loss for $\alpha$-flow establishes a unified variational bound for the discrete negative log-likelihood. We comprehensively evaluate different instantiations of $\alpha$-flow on various discrete generation domains to demonstrate their effectiveness in discrete generative modeling, including intermediate values whose geometries have never been explored before. $\alpha$-flow significantly outperforms its discrete-state counterpart in image and protein sequence generation and better captures the entropy in language modeling.

Title: ESCT3D: Efficient and Selectively Controllable Text-Driven 3D Content Generation with Gaussian Splatting

Authors: Huiqi Wu, Jianbo Mei, Yingjie Huang, Yining Xu, Jingjiao You, Yilong Liu, Li Yao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10316
Pdf URL: https://arxiv.org/pdf/2504.10316
Copy Paste: [[2504.10316]] ESCT3D: Efficient and Selectively Controllable Text-Driven 3D Content Generation with Gaussian Splatting(https://arxiv.org/abs/2504.10316)
Keywords: robust
Abstract: In recent years, significant advancements have been made in text-driven 3D content generation. However, several challenges remain. In practical applications, users often provide extremely simple text inputs while expecting high-quality 3D content. Generating optimal results from such minimal text is a difficult task due to the strong dependency of text-to-3D models on the quality of input prompts. Moreover, the generation process exhibits high variability, making it difficult to control. Consequently, multiple iterations are typically required to produce content that meets user expectations, reducing generation efficiency. To address this issue, we propose GPT-4V for self-optimization, which significantly enhances the efficiency of generating satisfactory content in a single attempt. Furthermore, the controllability of text-to-3D generation methods has not been fully explored. Our approach enables users to not only provide textual descriptions but also specify additional conditions, such as style, edges, scribbles, poses, or combinations of multiple conditions, allowing for more precise control over the generated 3D content. Additionally, during training, we effectively integrate multi-view information, including multi-view depth, masks, features, and images, to address the common Janus problem in 3D content generation. Extensive experiments demonstrate that our method achieves robust generalization, facilitating the efficient and controllable generation of high-quality 3D content.

Title: Analysis of Attention in Video Diffusion Transformers

Authors: Yuxin Wen, Jim Wu, Ajay Jain, Tom Goldstein, Ashwinee Panda
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10317
Pdf URL: https://arxiv.org/pdf/2504.10317
Copy Paste: [[2504.10317]] Analysis of Attention in Video Diffusion Transformers(https://arxiv.org/abs/2504.10317)
Keywords: diffusion, transformer
Abstract: We conduct an in-depth analysis of attention in video diffusion transformers (VDiTs) and report a number of novel findings. We identify three key properties of attention in VDiTs: Structure, Sparsity, and Sinks. Structure: We observe that attention patterns across different VDiTs exhibit similar structure across different prompts, and that we can make use of the similarity of attention patterns to unlock video editing via self-attention map transfer. Sparse: We study attention sparsity in VDiTs, finding that proposed sparsity methods do not work for all VDiTs, because some layers that are seemingly sparse cannot be sparsified. Sinks: We make the first study of attention sinks in VDiTs, comparing and contrasting them to attention sinks in language models. We propose a number of future directions that can make use of our insights to improve the efficiency-quality Pareto frontier for VDiTs.

Title: Shield Bash: Abusing Defensive Coherence State Retrieval to Break Timing Obfuscation

Authors: Kartik Ramkrishnan, Antonia Zhai, Stephen McCamant, Pen Chung Yew
Subjects: cs.CR, cs.AR
Abstract URL: https://arxiv.org/abs/2504.10318
Pdf URL: https://arxiv.org/pdf/2504.10318
Copy Paste: [[2504.10318]] Shield Bash: Abusing Defensive Coherence State Retrieval to Break Timing Obfuscation(https://arxiv.org/abs/2504.10318)
Keywords: security, protect, defense, attack
Abstract: Microarchitectural attacks are a significant concern, leading to many hardware-based defense proposals. However, different defenses target different classes of attacks, and their impact on each other has not been fully considered. To raise awareness of this problem, we study an interaction between two state-of-the art defenses in this paper, timing obfuscations of remote cache lines (TORC) and delaying speculative changes to remote cache lines (DSRC). TORC mitigates cache-hit based attacks and DSRC mitigates speculative coherence state change attacks. We observe that DSRC enables coherence information to be retrieved into the processor core, where it is out of the reach of timing obfuscations to protect. This creates an unforeseen consequence that redo operations can be triggered within the core to detect the presence or absence of remote cache lines, which constitutes a security vulnerability. We demonstrate that a new covert channel attack is possible using this vulnerability. We propose two ways to mitigate the attack, whose performance varies depending on an application's cache usage. One way is to never send remote exclusive coherence state (E) information to the core even if it is created. The other way is to never create a remote E state, which is responsible for triggering redos. We demonstrate the timing difference caused by this microarchitectural defense assumption violation using GEM5 simulations. Performance evaluation on SPECrate 2017 and PARSEC benchmarks of the two fixes show less than 32\% average overhead across both sets of benchmarks. The repair which prevented the creation of remote E state had less than 2.8% average overhead.

Title: SlowFastVAD: Video Anomaly Detection via Integrating Simple Detector and RAG-Enhanced Vision-Language Model

Authors: Zongcan Ding, Haodong Zhang, Peng Wu, Guansong Pang, Zhiwei Yang, Peng Wang, Yanning Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10320
Pdf URL: https://arxiv.org/pdf/2504.10320
Copy Paste: [[2504.10320]] SlowFastVAD: Video Anomaly Detection via Integrating Simple Detector and RAG-Enhanced Vision-Language Model(https://arxiv.org/abs/2504.10320)
Keywords: robust, interpretability
Abstract: Video anomaly detection (VAD) aims to identify unexpected events in videos and has wide applications in safety-critical domains. While semi-supervised methods trained on only normal samples have gained traction, they often suffer from high false alarm rates and poor interpretability. Recently, vision-language models (VLMs) have demonstrated strong multimodal reasoning capabilities, offering new opportunities for explainable anomaly detection. However, their high computational cost and lack of domain adaptation hinder real-time deployment and reliability. Inspired by dual complementary pathways in human visual perception, we propose SlowFastVAD, a hybrid framework that integrates a fast anomaly detector with a slow anomaly detector (namely a retrieval augmented generation (RAG) enhanced VLM), to address these limitations. Specifically, the fast detector first provides coarse anomaly confidence scores, and only a small subset of ambiguous segments, rather than the entire video, is further analyzed by the slower yet more interpretable VLM for elaborate detection and reasoning. Furthermore, to adapt VLMs to domain-specific VAD scenarios, we construct a knowledge base including normal patterns based on few normal samples and abnormal patterns inferred by VLMs. During inference, relevant patterns are retrieved and used to augment prompts for anomaly reasoning. Finally, we smoothly fuse the anomaly confidence of fast and slow detectors to enhance robustness of anomaly detection. Extensive experiments on four benchmarks demonstrate that SlowFastVAD effectively combines the strengths of both fast and slow detectors, and achieves remarkable detection accuracy and interpretability with significantly reduced computational overhead, making it well-suited for real-world VAD applications with high reliability requirements.

Title: LL-Gaussian: Low-Light Scene Reconstruction and Enhancement via Gaussian Splatting for Novel View Synthesis

Authors: Hao Sun, Fenggen Yu, Huiyao Xu, Tao Zhang, Changqing Zou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10331
Pdf URL: https://arxiv.org/pdf/2504.10331
Copy Paste: [[2504.10331]] LL-Gaussian: Low-Light Scene Reconstruction and Enhancement via Gaussian Splatting for Novel View Synthesis(https://arxiv.org/abs/2504.10331)
Keywords: diffusion
Abstract: Novel view synthesis (NVS) in low-light scenes remains a significant challenge due to degraded inputs characterized by severe noise, low dynamic range (LDR) and unreliable initialization. While recent NeRF-based approaches have shown promising results, most suffer from high computational costs, and some rely on carefully captured or pre-processed data--such as RAW sensor inputs or multi-exposure sequences--which severely limits their practicality. In contrast, 3D Gaussian Splatting (3DGS) enables real-time rendering with competitive visual fidelity; however, existing 3DGS-based methods struggle with low-light sRGB inputs, resulting in unstable Gaussian initialization and ineffective noise suppression. To address these challenges, we propose LL-Gaussian, a novel framework for 3D reconstruction and enhancement from low-light sRGB images, enabling pseudo normal-light novel view synthesis. Our method introduces three key innovations: 1) an end-to-end Low-Light Gaussian Initialization Module (LLGIM) that leverages dense priors from learning-based MVS approach to generate high-quality initial point clouds; 2) a dual-branch Gaussian decomposition model that disentangles intrinsic scene properties (reflectance and illumination) from transient interference, enabling stable and interpretable optimization; 3) an unsupervised optimization strategy guided by both physical constrains and diffusion prior to jointly steer decomposition and enhancement. Additionally, we contribute a challenging dataset collected in extreme low-light environments and demonstrate the effectiveness of LL-Gaussian. Compared to state-of-the-art NeRF-based methods, LL-Gaussian achieves up to 2,000 times faster inference and reduces training time to just 2%, while delivering superior reconstruction and rendering quality.

Title: MorphTok: Morphologically Grounded Tokenization for Indian Languages

Authors: Maharaj Brahma, N J Karthika, Atul Singh, Devaraj Adiga, Smruti Bhate, Ganesh Ramakrishnan, Rohit Saluja, Maunendra Sankar Desarkar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10335
Pdf URL: https://arxiv.org/pdf/2504.10335
Copy Paste: [[2504.10335]] MorphTok: Morphologically Grounded Tokenization for Indian Languages(https://arxiv.org/abs/2504.10335)
Keywords: large language model, segmentation
Abstract: Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs), impacting downstream performance, computational cost, and efficiency. Existing LLMs rely on the classical Byte-pair Encoding (BPE) algorithm for subword tokenization that greedily merges frequent character bigrams. This often leads to segmentation that does not align with linguistically meaningful units. To address this, we propose morphology-aware segmentation as a pre-tokenization step prior to applying BPE. To facilitate morphology-aware segmentation, we create a novel dataset for Hindi and Marathi, incorporating sandhi splitting to enhance the subword tokenization. Experiments on downstream tasks show that morphologically grounded tokenization improves performance for machine translation and language modeling. Additionally, to handle the ambiguity in the Unicode characters for diacritics, particularly dependent vowels in syllable-based writing systems, we introduce Constrained BPE (CBPE), an extension to the traditional BPE algorithm that incorporates script-specific constraints. Specifically, CBPE handles dependent vowels. Our results show that CBPE achieves a 1.68\% reduction in fertility scores while maintaining comparable or improved downstream performance in machine translation, offering a computationally efficient alternative to standard BPE. Moreover, to evaluate segmentation across different tokenization algorithms, we introduce a new human evaluation metric, \textit{EvalTok}, enabling more human-grounded assessment.

Title: Forecasting from Clinical Textual Time Series: Adaptations of the Encoder and Decoder Language Model Families

Authors: Shahriar Noroozizadeh, Sayantan Kumar, Jeremy C. Weiss
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10340
Pdf URL: https://arxiv.org/pdf/2504.10340
Copy Paste: [[2504.10340]] Forecasting from Clinical Textual Time Series: Adaptations of the Encoder and Decoder Language Model Families(https://arxiv.org/abs/2504.10340)
Keywords: transformer, large language model
Abstract: Clinical case reports encode rich, temporal patient trajectories that are often underexploited by traditional machine learning methods relying on structured data. In this work, we introduce the forecasting problem from textual time series, where timestamped clinical findings--extracted via an LLM-assisted annotation pipeline--serve as the primary input for prediction. We systematically evaluate a diverse suite of models, including fine-tuned decoder-based large language models and encoder-based transformers, on tasks of event occurrence prediction, temporal ordering, and survival analysis. Our experiments reveal that encoder-based models consistently achieve higher F1 scores and superior temporal concordance for short- and long-horizon event forecasting, while fine-tuned masking approaches enhance ranking performance. In contrast, instruction-tuned decoder models demonstrate a relative advantage in survival analysis, especially in early prognosis settings. Our sensitivity analyses further demonstrate the importance of time ordering, which requires clinical time series construction, as compared to text ordering, the format of the text inputs that LLMs are classically trained on. This highlights the additional benefit that can be ascertained from time-ordered corpora, with implications for temporal tasks in the era of widespread LLM use.

Title: VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge

Authors: Yueqi Song, Tianyue Ou, Yibo Kong, Zecheng Li, Graham Neubig, Xiang Yue
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10342
Pdf URL: https://arxiv.org/pdf/2504.10342
Copy Paste: [[2504.10342]] VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge(https://arxiv.org/abs/2504.10342)
Keywords: large language model
Abstract: Current multimodal benchmarks often conflate reasoning with domain-specific knowledge, making it difficult to isolate and evaluate general reasoning abilities in non-expert settings. To address this, we introduce VisualPuzzles, a benchmark that targets visual reasoning while deliberately minimizing reliance on specialized knowledge. VisualPuzzles consists of diverse questions spanning five categories: algorithmic, analogical, deductive, inductive, and spatial reasoning. One major source of our questions is manually translated logical reasoning questions from the Chinese Civil Service Examination. Experiments show that VisualPuzzles requires significantly less intensive domain-specific knowledge and more complex reasoning compared to benchmarks like MMMU, enabling us to better evaluate genuine multimodal reasoning. Evaluations show that state-of-the-art multimodal large language models consistently lag behind human performance on VisualPuzzles, and that strong performance on knowledge-intensive benchmarks does not necessarily translate to success on reasoning-focused, knowledge-light tasks. Additionally, reasoning enhancements such as scaling up inference compute (with "thinking" modes) yield inconsistent gains across models and task types, and we observe no clear correlation between model size and performance. We also found that models exhibit different reasoning and answering patterns on VisualPuzzles compared to benchmarks with heavier emphasis on knowledge. VisualPuzzles offers a clearer lens through which to evaluate reasoning capabilities beyond factual recall and domain knowledge.

Title: Benchmarking 3D Human Pose Estimation Models Under Occlusions

Authors: Filipa Lino, Carlos Santiago, Manuel Marques
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10350
Pdf URL: https://arxiv.org/pdf/2504.10350
Copy Paste: [[2504.10350]] Benchmarking 3D Human Pose Estimation Models Under Occlusions(https://arxiv.org/abs/2504.10350)
Keywords: robust
Abstract: This paper addresses critical challenges in 3D Human Pose Estimation (HPE) by analyzing the robustness and sensitivity of existing models to occlusions, camera position, and action variability. Using a novel synthetic dataset, BlendMimic3D, which includes diverse scenarios with multi-camera setups and several occlusion types, we conduct specific tests on several state-of-the-art models. Our study focuses on the discrepancy in keypoint formats between common datasets such as Human3.6M, and 2D datasets such as COCO, commonly used for 2D detection models and frequently input of 3D HPE models. Our work explores the impact of occlusions on model performance and the generality of models trained exclusively under standard conditions. The findings suggest significant sensitivity to occlusions and camera settings, revealing a need for models that better adapt to real-world variability and occlusion scenarios. This research contributed to ongoing efforts to improve the fidelity and applicability of 3D HPE systems in complex environments.

Title: Multimodal Representation Learning Techniques for Comprehensive Facial State Analysis

Authors: Kaiwen Zheng, Xuri Ge, Junchen Fu, Jun Peng, Joemon M. Jose
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10351
Pdf URL: https://arxiv.org/pdf/2504.10351
Copy Paste: [[2504.10351]] Multimodal Representation Learning Techniques for Comprehensive Facial State Analysis(https://arxiv.org/abs/2504.10351)
Keywords: robust
Abstract: Multimodal foundation models have significantly improved feature representation by integrating information from multiple modalities, making them highly suitable for a broader set of applications. However, the exploration of multimodal facial representation for understanding perception has been limited. Understanding and analyzing facial states, such as Action Units (AUs) and emotions, require a comprehensive and robust framework that bridges visual and linguistic modalities. In this paper, we present a comprehensive pipeline for multimodal facial state analysis. First, we compile a new Multimodal Face Dataset (MFA) by generating detailed multilevel language descriptions of face, incorporating Action Unit (AU) and emotion descriptions, by leveraging GPT-4o. Second, we introduce a novel Multilevel Multimodal Face Foundation model (MF^2) tailored for Action Unit (AU) and emotion recognition. Our model incorporates comprehensive visual feature modeling at both local and global levels of face image, enhancing its ability to represent detailed facial appearances. This design aligns visual representations with structured AU and emotion descriptions, ensuring effective cross-modal integration. Third, we develop a Decoupled Fine-Tuning Network (DFN) that efficiently adapts MF^2 across various tasks and datasets. This approach not only reduces computational overhead but also broadens the applicability of the foundation model to diverse scenarios. Experimentation show superior performance for AU and emotion detection tasks.

Title: DICE: A Framework for Dimensional and Contextual Evaluation of Language Models

Authors: Aryan Shrivastava, Paula Akemi Aoyagui
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2504.10359
Pdf URL: https://arxiv.org/pdf/2504.10359
Copy Paste: [[2504.10359]] DICE: A Framework for Dimensional and Contextual Evaluation of Language Models(https://arxiv.org/abs/2504.10359)
Keywords: robust
Abstract: Language models (LMs) are increasingly being integrated into a wide range of applications, yet the modern evaluation paradigm does not sufficiently reflect how they are actually being used. Current evaluations rely on benchmarks that often lack direct applicability to the real-world contexts in which LMs are being deployed. To address this gap, we propose Dimensional and Contextual Evaluation (DICE), an approach that evaluates LMs on granular, context-dependent dimensions. In this position paper, we begin by examining the insufficiency of existing LM benchmarks, highlighting their limited applicability to real-world use cases. Next, we propose a set of granular evaluation parameters that capture dimensions of LM behavior that are more meaningful to stakeholders across a variety of application domains. Specifically, we introduce the concept of context-agnostic parameters - such as robustness, coherence, and epistemic honesty - and context-specific parameters that must be tailored to the specific contextual constraints and demands of stakeholders choosing to deploy LMs into a particular setting. We then discuss potential approaches to operationalize this evaluation framework, finishing with the opportunities and challenges DICE presents to the LM evaluation landscape. Ultimately, this work serves as a practical and approachable starting point for context-specific and stakeholder-relevant evaluation of LMs.

Title: DUE: A Deep Learning Framework and Library for Modeling Unknown Equations

Authors: Junfeng Chen, Kailiang Wu, Dongbin Xiu
Subjects: cs.LG, math.DS, math.NA, stat.ML
Abstract URL: https://arxiv.org/abs/2504.10373
Pdf URL: https://arxiv.org/pdf/2504.10373
Copy Paste: [[2504.10373]] DUE: A Deep Learning Framework and Library for Modeling Unknown Equations(https://arxiv.org/abs/2504.10373)
Keywords: transformer
Abstract: Equations, particularly differential equations, are fundamental for understanding natural phenomena and predicting complex dynamics across various scientific and engineering disciplines. However, the governing equations for many complex systems remain unknown due to intricate underlying mechanisms. Recent advancements in machine learning and data science offer a new paradigm for modeling unknown equations from measurement or simulation data. This paradigm shift, known as data-driven discovery or modeling, stands at the forefront of AI for science, with significant progress made in recent years. In this paper, we introduce a systematic framework for data-driven modeling of unknown equations using deep learning. This versatile framework is capable of learning unknown ODEs, PDEs, DAEs, IDEs, SDEs, reduced or partially observed systems, and non-autonomous differential equations. Based on this framework, we have developed Deep Unknown Equations (DUE), an open-source software package designed to facilitate the data-driven modeling of unknown equations using modern deep learning techniques. DUE serves as an educational tool for classroom instruction, enabling students and newcomers to gain hands-on experience with differential equations, data-driven modeling, and contemporary deep learning approaches such as FNN, ResNet, generalized ResNet, operator semigroup networks (OSG-Net), and Transformers. Additionally, DUE is a versatile and accessible toolkit for researchers across various scientific and engineering fields. It is applicable not only for learning unknown equations from data but also for surrogate modeling of known, yet complex, equations that are costly to solve using traditional numerical methods. We provide detailed descriptions of DUE and demonstrate its capabilities through diverse examples, which serve as templates that can be easily adapted for other applications.

Title: Ctrl-Z: Controlling AI Agents via Resampling

Authors: Aryan Bhatt, Cody Rushing, Adam Kaufman, Tyler Tracy, Vasil Georgiev, David Matolcsi, Akbir Khan, Buck Shlegeris
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.10374
Pdf URL: https://arxiv.org/pdf/2504.10374
Copy Paste: [[2504.10374]] Ctrl-Z: Controlling AI Agents via Resampling(https://arxiv.org/abs/2504.10374)
Keywords: security, defense, attack
Abstract: Control evaluations measure whether monitoring and security protocols for AI systems prevent intentionally subversive AI models from causing harm. Our work presents the first control evaluation performed in an agent environment. We construct BashBench, a dataset of 257 challenging multi-step system administration tasks, and evaluate whether various safety measures can prevent an adversarially constructed AI agent from covertly downloading and executing malicious code in this environment. This multi-step setting introduces new attack and defense dynamics, which we investigate in order to design novel control protocols that prevent safety failures without hindering the ability of non-malicious agents to perform useful work. We introduce a class of control protocols called resample protocols that dynamically take additional samples of certain actions. We find these protocols significantly improve on existing techniques by selectively blocking the AI agent from executing suspicious code and incriminating the agent by generating additional examples of dangerous behavior. We measure the tradeoff between attack prevention and usefulness; our best protocol combines resampling with analysis of previous steps, reducing the success rate of attacks from 58% to 7% at a 5% cost to the performance of a non-malicious agent.

Title: Towards Low-Latency Event-based Obstacle Avoidance on a FPGA-Drone

Authors: Pietro Bonazzi, Christian Vogt, Michael Jost, Lyes Khacef, Federico Paredes-Vallés, Michele Magno
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10400
Pdf URL: https://arxiv.org/pdf/2504.10400
Copy Paste: [[2504.10400]] Towards Low-Latency Event-based Obstacle Avoidance on a FPGA-Drone(https://arxiv.org/abs/2504.10400)
Keywords: robust
Abstract: This work quantitatively evaluates the performance of event-based vision systems (EVS) against conventional RGB-based models for action prediction in collision avoidance on an FPGA accelerator. Our experiments demonstrate that the EVS model achieves a significantly higher effective frame rate (1 kHz) and lower temporal (-20 ms) and spatial prediction errors (-20 mm) compared to the RGB-based model, particularly when tested on out-of-distribution data. The EVS model also exhibits superior robustness in selecting optimal evasion maneuvers. In particular, in distinguishing between movement and stationary states, it achieves a 59 percentage point advantage in precision (78% vs. 19%) and a substantially higher F1 score (0.73 vs. 0.06), highlighting the susceptibility of the RGB model to overfitting. Further analysis in different combinations of spatial classes confirms the consistent performance of the EVS model in both test data sets. Finally, we evaluated the system end-to-end and achieved a latency of approximately 2.14 ms, with event aggregation (1 ms) and inference on the processing unit (0.94 ms) accounting for the largest components. These results underscore the advantages of event-based vision for real-time collision avoidance and demonstrate its potential for deployment in resource-constrained environments.

Title: Satellite Federated Fine-Tuning for Foundation Models in Space Computing Power Networks

Authors: Yan zhu, Jingyang zhu, Ting Wang, Yuanming Shi, Chunxiao Jiang, Khaled Ben Letaief
Subjects: cs.LG, cs.DC, cs.NI
Abstract URL: https://arxiv.org/abs/2504.10403
Pdf URL: https://arxiv.org/pdf/2504.10403
Copy Paste: [[2504.10403]] Satellite Federated Fine-Tuning for Foundation Models in Space Computing Power Networks(https://arxiv.org/abs/2504.10403)
Keywords: privacy, federate
Abstract: Advancements in artificial intelligence (AI) and low-earth orbit (LEO) satellites have promoted the application of large remote sensing foundation models for various downstream tasks. However, direct downloading of these models for fine-tuning on the ground is impeded by privacy concerns and limited bandwidth. Satellite federated learning (FL) offers a solution by enabling model fine-tuning directly on-board satellites and aggregating model updates without data downloading. Nevertheless, for large foundation models, the computational capacity of satellites is insufficient to support effective on-board fine-tuning in traditional satellite FL frameworks. To address these challenges, we propose a satellite-ground collaborative federated fine-tuning framework. The key of the framework lies in how to reasonably decompose and allocate model components to alleviate insufficient on-board computation capabilities. During fine-tuning, satellites exchange intermediate results with ground stations or other satellites for forward propagation and back propagation, which brings communication challenges due to the special communication topology of space transmission networks, such as intermittent satellite-ground communication, short duration of satellite-ground communication windows, and unstable inter-orbit inter-satellite links (ISLs). To reduce transmission delays, we further introduce tailored communication strategies that integrate both communication and computing resources. Specifically, we propose a parallel intra-orbit communication strategy, a topology-aware satellite-ground communication strategy, and a latency-minimalization inter-orbit communication strategy to reduce space communication costs. Simulation results demonstrate significant reductions in training time with improvements of approximately 33%.

Title: Performance of Large Language Models in Supporting Medical Diagnosis and Treatment

Authors: Diogo Sousa, Guilherme Barbosa, Catarina Rocha, Dulce Oliveira
Subjects: cs.CL, cs.AI, cs.ET, cs.HC
Abstract URL: https://arxiv.org/abs/2504.10405
Pdf URL: https://arxiv.org/pdf/2504.10405
Copy Paste: [[2504.10405]] Performance of Large Language Models in Supporting Medical Diagnosis and Treatment(https://arxiv.org/abs/2504.10405)
Keywords: large language model
Abstract: The integration of Large Language Models (LLMs) into healthcare holds significant potential to enhance diagnostic accuracy and support medical treatment planning. These AI-driven systems can analyze vast datasets, assisting clinicians in identifying diseases, recommending treatments, and predicting patient outcomes. This study evaluates the performance of a range of contemporary LLMs, including both open-source and closed-source models, on the 2024 Portuguese National Exam for medical specialty access (PNA), a standardized medical knowledge assessment. Our results highlight considerable variation in accuracy and cost-effectiveness, with several models demonstrating performance exceeding human benchmarks for medical students on this specific task. We identify leading models based on a combined score of accuracy and cost, discuss the implications of reasoning methodologies like Chain-of-Thought, and underscore the potential for LLMs to function as valuable complementary tools aiding medical professionals in complex clinical decision-making.

Title: LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models

Authors: Parshin Shojaee, Ngoc-Hieu Nguyen, Kazem Meidani, Amir Barati Farimani, Khoa D Doan, Chandan K Reddy
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10415
Pdf URL: https://arxiv.org/pdf/2504.10415
Copy Paste: [[2504.10415]] LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models(https://arxiv.org/abs/2504.10415)
Keywords: large language model
Abstract: Scientific equation discovery is a fundamental task in the history of scientific progress, enabling the derivation of laws governing natural phenomena. Recently, Large Language Models (LLMs) have gained interest for this task due to their potential to leverage embedded scientific knowledge for hypothesis generation. However, evaluating the true discovery capabilities of these methods remains challenging, as existing benchmarks often rely on common equations that are susceptible to memorization by LLMs, leading to inflated performance metrics that do not reflect discovery. In this paper, we introduce LLM-SRBench, a comprehensive benchmark with 239 challenging problems across four scientific domains specifically designed to evaluate LLM-based scientific equation discovery methods while preventing trivial memorization. Our benchmark comprises two main categories: LSR-Transform, which transforms common physical models into less common mathematical representations to test reasoning beyond memorized forms, and LSR-Synth, which introduces synthetic, discovery-driven problems requiring data-driven reasoning. Through extensive evaluation of several state-of-the-art methods, using both open and closed LLMs, we find that the best-performing system so far achieves only 31.5% symbolic accuracy. These findings highlight the challenges of scientific equation discovery, positioning LLM-SRBench as a valuable resource for future research.

Title: CliniChat: A Multi-Source Knowledge-Driven Framework for Clinical Interview Dialogue Reconstruction and Evaluation

Authors: Jing Chen, Zhihua Wei, Wei Zhang, Yingying Hu, Qiong Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10418
Pdf URL: https://arxiv.org/pdf/2504.10418
Copy Paste: [[2504.10418]] CliniChat: A Multi-Source Knowledge-Driven Framework for Clinical Interview Dialogue Reconstruction and Evaluation(https://arxiv.org/abs/2504.10418)
Keywords: large language model
Abstract: Large language models (LLMs) hold great promise for assisting clinical interviews due to their fluent interactive capabilities and extensive medical knowledge. However, the lack of high-quality interview dialogue data and widely accepted evaluation methods has significantly impeded this process. So we propose CliniChat, a framework that integrates multi-source knowledge to enable LLMs to simulate real-world clinical interviews. It consists of two modules: Clini-Recon and Clini-Eval, each responsible for reconstructing and evaluating interview dialogues, respectively. By incorporating three sources of knowledge, Clini-Recon transforms clinical notes into systematic, professional, and empathetic interview dialogues. Clini-Eval combines a comprehensive evaluation metric system with a two-phase automatic evaluation approach, enabling LLMs to assess interview performance like experts. We contribute MedQA-Dialog, a high-quality synthetic interview dialogue dataset, and CliniChatGLM, a model specialized for clinical interviews. Experimental results demonstrate that CliniChatGLM's interview capabilities undergo a comprehensive upgrade, particularly in history-taking, achieving state-of-the-art performance.

Title: Unchecked and Overlooked: Addressing the Checkbox Blind Spot in Large Language Models with CheckboxQA

Authors: Michał Turski, Mateusz Chiliński, Łukasz Borchmann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10419
Pdf URL: https://arxiv.org/pdf/2504.10419
Copy Paste: [[2504.10419]] Unchecked and Overlooked: Addressing the Checkbox Blind Spot in Large Language Models with CheckboxQA(https://arxiv.org/abs/2504.10419)
Keywords: extraction, large language model
Abstract: Checkboxes are critical in real-world document processing where the presence or absence of ticks directly informs data extraction and decision-making processes. Yet, despite the strong performance of Large Vision and Language Models across a wide range of tasks, they struggle with interpreting checkable content. This challenge becomes particularly pressing in industries where a single overlooked checkbox may lead to costly regulatory or contractual oversights. To address this gap, we introduce the CheckboxQA dataset, a targeted resource designed to evaluate and improve model performance on checkbox-related tasks. It reveals the limitations of current models and serves as a valuable tool for advancing document comprehension systems, with significant implications for applications in sectors such as legal tech and finance. The dataset is publicly available at: this https URL

Title: Can We Edit LLMs for Long-Tail Biomedical Knowledge?

Authors: Xinhao Yi, Jake Lever, Kevin Bryson, Zaiqiao Meng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10421
Pdf URL: https://arxiv.org/pdf/2504.10421
Copy Paste: [[2504.10421]] Can We Edit LLMs for Long-Tail Biomedical Knowledge?(https://arxiv.org/abs/2504.10421)
Keywords: large language model
Abstract: Knowledge editing has emerged as an effective approach for updating large language models (LLMs) by modifying their internal knowledge. However, their application to the biomedical domain faces unique challenges due to the long-tailed distribution of biomedical knowledge, where rare and infrequent information is prevalent. In this paper, we conduct the first comprehensive study to investigate the effectiveness of knowledge editing methods for editing long-tail biomedical knowledge. Our results indicate that, while existing editing methods can enhance LLMs' performance on long-tail biomedical knowledge, their performance on long-tail knowledge remains inferior to that on high-frequency popular knowledge, even after editing. Our further analysis reveals that long-tail biomedical knowledge contains a significant amount of one-to-many knowledge, where one subject and relation link to multiple objects. This high prevalence of one-to-many knowledge limits the effectiveness of knowledge editing in improving LLMs' understanding of long-tail biomedical knowledge, highlighting the need for tailored strategies to bridge this performance gap.

Title: LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models

Authors: Minqian Liu, Zhiyang Xu, Xinyi Zhang, Heajun An, Sarvech Qadir, Qi Zhang, Pamela J. Wisniewski, Jin-Hee Cho, Sang Won Lee, Ruoxi Jia, Lifu Huang
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2504.10430
Pdf URL: https://arxiv.org/pdf/2504.10430
Copy Paste: [[2504.10430]] LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models(https://arxiv.org/abs/2504.10430)
Keywords: large language model
Abstract: Recent advancements in Large Language Models (LLMs) have enabled them to approach human-level persuasion capabilities. However, such potential also raises concerns about the safety risks of LLM-driven persuasion, particularly their potential for unethical influence through manipulation, deception, exploitation of vulnerabilities, and many other harmful tactics. In this work, we present a systematic investigation of LLM persuasion safety through two critical aspects: (1) whether LLMs appropriately reject unethical persuasion tasks and avoid unethical strategies during execution, including cases where the initial persuasion goal appears ethically neutral, and (2) how influencing factors like personality traits and external pressures affect their behavior. To this end, we introduce PersuSafety, the first comprehensive framework for the assessment of persuasion safety which consists of three stages, i.e., persuasion scene creation, persuasive conversation simulation, and persuasion safety assessment. PersuSafety covers 6 diverse unethical persuasion topics and 15 common unethical strategies. Through extensive experiments across 8 widely used LLMs, we observe significant safety concerns in most LLMs, including failing to identify harmful persuasion tasks and leveraging various unethical persuasion strategies. Our study calls for more attention to improve safety alignment in progressive and goal-driven conversations such as persuasion.

Title: MonoDiff9D: Monocular Category-Level 9D Object Pose Estimation via Diffusion Model

Authors: Jian Liu, Wei Sun, Hui Yang, Jin Zheng, Zichen Geng, Hossein Rahmani, Ajmal Mian
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2504.10433
Pdf URL: https://arxiv.org/pdf/2504.10433
Copy Paste: [[2504.10433]] MonoDiff9D: Monocular Category-Level 9D Object Pose Estimation via Diffusion Model(https://arxiv.org/abs/2504.10433)
Keywords: diffusion, transformer
Abstract: Object pose estimation is a core means for robots to understand and interact with their environment. For this task, monocular category-level methods are attractive as they require only a single RGB camera. However, current methods rely on shape priors or CAD models of the intra-class known objects. We propose a diffusion-based monocular category-level 9D object pose generation method, MonoDiff9D. Our motivation is to leverage the probabilistic nature of diffusion models to alleviate the need for shape priors, CAD models, or depth sensors for intra-class unknown object pose estimation. We first estimate coarse depth via DINOv2 from the monocular image in a zero-shot manner and convert it into a point cloud. We then fuse the global features of the point cloud with the input image and use the fused features along with the encoded time step to condition MonoDiff9D. Finally, we design a transformer-based denoiser to recover the object pose from Gaussian noise. Extensive experiments on two popular benchmark datasets show that MonoDiff9D achieves state-of-the-art monocular category-level 9D object pose estimation accuracy without the need for shape priors or CAD models at any stage. Our code will be made public at this https URL.

Title: Anchor Token Matching: Implicit Structure Locking for Training-free AR Image Editing

Authors: Taihang Hu, Linxuan Li, Kai Wang, Yaxing Wang, Jian Yang, Ming-Ming Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10434
Pdf URL: https://arxiv.org/pdf/2504.10434
Copy Paste: [[2504.10434]] Anchor Token Matching: Implicit Structure Locking for Training-free AR Image Editing(https://arxiv.org/abs/2504.10434)
Keywords: diffusion, generative
Abstract: Text-to-image generation has seen groundbreaking advancements with diffusion models, enabling high-fidelity synthesis and precise image editing through cross-attention manipulation. Recently, autoregressive (AR) models have re-emerged as powerful alternatives, leveraging next-token generation to match diffusion models. However, existing editing techniques designed for diffusion models fail to translate directly to AR models due to fundamental differences in structural control. Specifically, AR models suffer from spatial poverty of attention maps and sequential accumulation of structural errors during image editing, which disrupt object layouts and global consistency. In this work, we introduce Implicit Structure Locking (ISLock), the first training-free editing strategy for AR visual models. Rather than relying on explicit attention manipulation or fine-tuning, ISLock preserves structural blueprints by dynamically aligning self-attention patterns with reference images through the Anchor Token Matching (ATM) protocol. By implicitly enforcing structural consistency in latent space, our method ISLock enables structure-aware editing while maintaining generative autonomy. Extensive experiments demonstrate that ISLock achieves high-quality, structure-consistent edits without additional training and is superior or comparable to conventional editing techniques. Our findings pioneer the way for efficient and flexible AR-based image editing, further bridging the performance gap between diffusion and autoregressive generative models. The code will be publicly available at this https URL

Title: Multimodal Long Video Modeling Based on Temporal Dynamic Context

Authors: Haoran Hao, Jiaming Han, Yiyuan Zhang, Xiangyu Yue
Subjects: cs.CV, cs.AI, cs.CL, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2504.10443
Pdf URL: https://arxiv.org/pdf/2504.10443
Copy Paste: [[2504.10443]] Multimodal Long Video Modeling Based on Temporal Dynamic Context(https://arxiv.org/abs/2504.10443)
Keywords: transformer, large language model
Abstract: Recent advances in Large Language Models (LLMs) have led to significant breakthroughs in video understanding. However, existing models still struggle with long video processing due to the context length constraint of LLMs and the vast amount of information within the video. Although some recent methods are designed for long video understanding, they often lose crucial information during token compression and struggle with additional modality like audio. In this work, we propose a dynamic long video encoding method utilizing the temporal relationship between frames, named Temporal Dynamic Context (TDC). Firstly, we segment the video into semantically consistent scenes based on inter-frame similarities, then encode each frame into tokens using visual-audio encoders. Secondly, we propose a novel temporal context compressor to reduce the number of tokens within each segment. Specifically, we employ a query-based Transformer to aggregate video, audio, and instruction text tokens into a limited set of temporal context tokens. Finally, we feed the static frame tokens and the temporal context tokens into the LLM for video understanding. Furthermore, to handle extremely long videos, we propose a training-free chain-of-thought strategy that progressively extracts answers from multiple video segments. These intermediate answers serve as part of the reasoning process and contribute to the final answer. We conduct extensive experiments on general video understanding and audio-video understanding benchmarks, where our method demonstrates strong performance. The code and models are available at this https URL.

Title: M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models

Authors: Junxiong Wang, Wen-Ding Li, Daniele Paliotta, Daniel Ritter, Alexander M. Rush, Tri Dao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.10449
Pdf URL: https://arxiv.org/pdf/2504.10449
Copy Paste: [[2504.10449]] M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models(https://arxiv.org/abs/2504.10449)
Keywords: transformer, large language model
Abstract: Effective reasoning is crucial to solving complex mathematical problems. Recent large language models (LLMs) have boosted performance by scaling test-time computation through long chain-of-thought reasoning. However, transformer-based models are inherently limited in extending context length due to their quadratic computational complexity and linear memory requirements. In this paper, we introduce a novel hybrid linear RNN reasoning model, M1, built on the Mamba architecture, which allows memory-efficient inference. Our approach leverages a distillation process from existing reasoning models and is further enhanced through RL training. Experimental results on the AIME and MATH benchmarks show that M1 not only outperforms previous linear RNN models but also matches the performance of state-of-the-art Deepseek R1 distilled reasoning models at a similar scale. We also compare our generation speed with a highly performant general purpose inference engine, vLLM, and observe more than a 3x speedup compared to a same size transformer. With throughput speedup, we are able to achieve higher accuracy compared to DeepSeek R1 distilled transformer reasoning models under a fixed generation time budget using self-consistency voting. Overall, we introduce a hybrid Mamba reasoning model and provide a more effective approach to scaling test-time generation using self-consistency or long chain of thought reasoning.

Title: Integrating Vision and Location with Transformers: A Multimodal Deep Learning Framework for Medical Wound Analysis

Authors: Ramin Mousa, Hadis Taherinia, Khabiba Abdiyeva, Amir Ali Bengari, Mohammadmahdi Vahediahmar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10452
Pdf URL: https://arxiv.org/pdf/2504.10452
Copy Paste: [[2504.10452]] Integrating Vision and Location with Transformers: A Multimodal Deep Learning Framework for Medical Wound Analysis(https://arxiv.org/abs/2504.10452)
Keywords: transformer
Abstract: Effective recognition of acute and difficult-to-heal wounds is a necessary step in wound diagnosis. An efficient classification model can help wound specialists classify wound types with less financial and time costs and also help in deciding on the optimal treatment method. Traditional machine learning models suffer from feature selection and are usually cumbersome models for accurate recognition. Recently, deep learning (DL) has emerged as a powerful tool in wound diagnosis. Although DL seems promising for wound type recognition, there is still a large scope for improving the efficiency and accuracy of the model. In this study, a DL-based multimodal classifier was developed using wound images and their corresponding locations to classify them into multiple classes, including diabetic, pressure, surgical, and venous ulcers. A body map was also created to provide location data, which can help wound specialists label wound locations more effectively. The model uses a Vision Transformer to extract hierarchical features from input images, a Discrete Wavelet Transform (DWT) layer to capture low and high frequency components, and a Transformer to extract spatial features. The number of neurons and weight vector optimization were performed using three swarm-based optimization techniques (Monster Gorilla Toner (MGTO), Improved Gray Wolf Optimization (IGWO), and Fox Optimization Algorithm). The evaluation results show that weight vector optimization using optimization algorithms can increase diagnostic accuracy and make it a very effective approach for wound detection. In the classification using the original body map, the proposed model was able to achieve an accuracy of 0.8123 using image data and an accuracy of 0.8007 using a combination of image data and wound location. Also, the accuracy of the model in combination with the optimization models varied from 0.7801 to 0.8342.

Title: GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

Authors: Xiaobo Xia, Run Luo
Subjects: cs.CV, cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2504.10458
Pdf URL: https://arxiv.org/pdf/2504.10458
Copy Paste: [[2504.10458]] GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents(https://arxiv.org/abs/2504.10458)
Keywords: large language model
Abstract: Existing efforts in building Graphical User Interface (GUI) agents largely rely on the training paradigm of supervised fine-tuning on Large Vision-Language Models (LVLMs). However, this approach not only demands extensive amounts of training data but also struggles to effectively understand GUI screenshots and generalize to unseen interfaces. The issue significantly limits its application in real-world scenarios, especially for high-level tasks. Inspired by Reinforcement Fine-Tuning (RFT) in large reasoning models (e.g., DeepSeek-R1), which efficiently enhances the problem-solving capabilities of large language models in real-world settings, we propose \name, the first reinforcement learning framework designed to enhance the GUI capabilities of LVLMs in high-level real-world task scenarios, through unified action space rule modeling. By leveraging a small amount of carefully curated high-quality data across multiple platforms (including Windows, Linux, MacOS, Android, and Web) and employing policy optimization algorithms such as Group Relative Policy Optimization (GRPO) to update the model, \name achieves superior performance using only 0.02\% of the data (3K vs. 13M) compared to previous state-of-the-art methods like OS-Atlas across eight benchmarks spanning three different platforms (mobile, desktop, and web). These results demonstrate the immense potential of reinforcement learning based on unified action space rule modeling in improving the execution capabilities of LVLMs for real-world GUI agent tasks.

Title: The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer

Authors: Weixian Lei, Jiacong Wang, Haochen Wang, Xiangtai Li, Jun Hao Liew, Jiashi Feng, Zilong Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10462
Pdf URL: https://arxiv.org/pdf/2504.10462
Copy Paste: [[2504.10462]] The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer(https://arxiv.org/abs/2504.10462)
Keywords: transformer, large language model, segmentation
Abstract: This paper introduces SAIL, a single transformer unified multimodal large language model (MLLM) that integrates raw pixel encoding and language decoding within a singular architecture. Unlike existing modular MLLMs, which rely on a pre-trained vision transformer (ViT), SAIL eliminates the need for a separate vision encoder, presenting a more minimalist architecture design. Instead of introducing novel architectural components, SAIL adapts mix-attention mechanisms and multimodal positional encodings to better align with the distinct characteristics of visual and textual modalities. We systematically compare SAIL's properties-including scalability, cross-modal information flow patterns, and visual representation capabilities-with those of modular MLLMs. By scaling both training data and model size, SAIL achieves performance comparable to modular MLLMs. Notably, the removal of pretrained ViT components enhances SAIL's scalability and results in significantly different cross-modal information flow patterns. Moreover, SAIL demonstrates strong visual representation capabilities, achieving results on par with ViT-22B in vision tasks such as semantic segmentation. Code and models are available at this https URL.

Title: Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding

Authors: Tao Zhang, Xiangtai Li, Zilong Huang, Yanwei Li, Weixian Lei, Xueqing Deng, Shihao Chen, Shunping Ji, Jiashi Feng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10465
Pdf URL: https://arxiv.org/pdf/2504.10465
Copy Paste: [[2504.10465]] Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding(https://arxiv.org/abs/2504.10465)
Keywords: extraction, transformer, large language model, segmentation
Abstract: Multimodal Large Language Models (MLLMs) achieve remarkable performance for fine-grained pixel-level understanding tasks. However, all the works rely heavily on extra components, such as vision encoder (CLIP), segmentation experts, leading to high system complexity and limiting model scaling. In this work, our goal is to explore a highly simplified MLLM without introducing extra components. Our work is motivated by the recent works on Single trAnsformer as a unified vIsion-Language Model (SAIL) design, where these works jointly learn vision tokens and text tokens in transformers. We present Pixel-SAIL, a single transformer for pixel-wise MLLM tasks. In particular, we present three technical improvements on the plain baseline. First, we design a learnable upsampling module to refine visual token features. Secondly, we propose a novel visual prompt injection strategy to enable the single transformer to understand visual prompt inputs and benefit from the early fusion of visual prompt embeddings and vision tokens. Thirdly, we introduce a vision expert distillation strategy to efficiently enhance the single transformer's fine-grained feature extraction capability. In addition, we have collected a comprehensive pixel understanding benchmark (PerBench), using a manual check. It includes three tasks: detailed object description, visual prompt-based question answering, and visual-text referring segmentation. Extensive experiments on four referring segmentation benchmarks, one visual prompt benchmark, and our PerBench show that our Pixel-SAIL achieves comparable or even better results with a much simpler pipeline. Code and model will be released at this https URL.

Title: Art3D: Training-Free 3D Generation from Flat-Colored Illustration

Authors: Xiaoyan Cong, Jiayi Shen, Zekun Li, Rao Fu, Tao Lu, Srinath Sridhar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10466
Pdf URL: https://arxiv.org/pdf/2504.10466
Copy Paste: [[2504.10466]] Art3D: Training-Free 3D Generation from Flat-Colored Illustration(https://arxiv.org/abs/2504.10466)
Keywords: robust, generative
Abstract: Large-scale pre-trained image-to-3D generative models have exhibited remarkable capabilities in diverse shape generations. However, most of them struggle to synthesize plausible 3D assets when the reference image is flat-colored like hand drawings due to the lack of 3D illusion, which are often the most user-friendly input modalities in art content creation. To this end, we propose Art3D, a training-free method that can lift flat-colored 2D designs into 3D. By leveraging structural and semantic features with pre- trained 2D image generation models and a VLM-based realism evaluation, Art3D successfully enhances the three-dimensional illusion in reference images, thus simplifying the process of generating 3D from 2D, and proves adaptable to a wide range of painting styles. To benchmark the generalization performance of existing image-to-3D models on flat-colored images without 3D feeling, we collect a new dataset, Flat-2D, with over 100 samples. Experimental results demonstrate the performance and robustness of Art3D, exhibiting superior generalizable capacity and promising practical applicability. Our source code and dataset will be publicly available on our project page: this https URL .

Title: MIEB: Massive Image Embedding Benchmark

Authors: Chenghao Xiao, Isaac Chung, Imene Kerboua, Jamie Stirling, Xin Zhang, Márton Kardos, Roman Solomatin, Noura Al Moubayed, Kenneth Enevoldsen, Niklas Muennighoff
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2504.10471
Pdf URL: https://arxiv.org/pdf/2504.10471
Copy Paste: [[2504.10471]] MIEB: Massive Image Embedding Benchmark(https://arxiv.org/abs/2504.10471)
Keywords: large language model
Abstract: Image representations are often evaluated through disjointed, task-specific protocols, leading to a fragmented understanding of model capabilities. For instance, it is unclear whether an image embedding model adept at clustering images is equally good at retrieving relevant images given a piece of text. We introduce the Massive Image Embedding Benchmark (MIEB) to evaluate the performance of image and image-text embedding models across the broadest spectrum to date. MIEB spans 38 languages across 130 individual tasks, which we group into 8 high-level categories. We benchmark 50 models across our benchmark, finding that no single method dominates across all task categories. We reveal hidden capabilities in advanced vision models such as their accurate visual representation of texts, and their yet limited capabilities in interleaved encodings and matching images and texts in the presence of confounders. We also show that the performance of vision encoders on MIEB correlates highly with their performance when used in multimodal large language models. Our code, dataset, and leaderboard are publicly available at this https URL.

Title: InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Authors: Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Yue Cao, Yangzhou Liu, Weiye Xu, Hao Li, Jiahao Wang, Han Lv, Dengnian Chen, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zhang, Wenqi Shao, Junjun He, Yingtong Xiong, Wenwen Qu, Peng Sun, Penglong Jiao, Lijun Wu, Kaipeng Zhang, Huipeng Deng, Jiaye Ge, Kai Chen, Limin Wang, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10479
Pdf URL: https://arxiv.org/pdf/2504.10479
Copy Paste: [[2504.10479]] InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models(https://arxiv.org/abs/2504.10479)
Keywords: large language model
Abstract: We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.

Title: REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers

Authors: Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, Liang Zheng
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10483
Pdf URL: https://arxiv.org/pdf/2504.10483
Copy Paste: [[2504.10483]] REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers(https://arxiv.org/abs/2504.10483)
Keywords: diffusion, transformer
Abstract: In this paper we tackle a fundamental question: "Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?" Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. However, for latent diffusion transformers, it is observed that end-to-end training both VAE and diffusion-model using standard diffusion-loss is ineffective, even causing a degradation in final performance. We show that while diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss -- allowing both VAE and diffusion model to be jointly tuned during the training process. Despite its simplicity, the proposed training recipe (REPA-E) shows remarkable performance; speeding up diffusion model training by over 17x and 45x over REPA and vanilla training recipes, respectively. Interestingly, we observe that end-to-end tuning with REPA-E also improves the VAE itself; leading to improved latent space structure and downstream generation performance. In terms of final performance, our approach sets a new state-of-the-art; achieving FID of 1.26 and 1.83 with and without classifier-free guidance on ImageNet 256 x 256. Code is available at this https URL.

Title: Decoupled Diffusion Sparks Adaptive Scene Generation

Authors: Yunsong Zhou, Naisheng Ye, William Ljungbergh, Tianyu Li, Jiazhi Yang, Zetong Yang, Hongzi Zhu, Christoffer Petersson, Hongyang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10485
Pdf URL: https://arxiv.org/pdf/2504.10485
Copy Paste: [[2504.10485]] Decoupled Diffusion Sparks Adaptive Scene Generation(https://arxiv.org/abs/2504.10485)
Keywords: diffusion
Abstract: Controllable scene generation could reduce the cost of diverse data collection substantially for autonomous driving. Prior works formulate the traffic layout generation as predictive progress, either by denoising entire sequences at once or by iteratively predicting the next frame. However, full sequence denoising hinders online reaction, while the latter's short-sighted next-frame prediction lacks precise goal-state guidance. Further, the learned model struggles to generate complex or challenging scenarios due to a large number of safe and ordinal driving behaviors from open datasets. To overcome these, we introduce Nexus, a decoupled scene generation framework that improves reactivity and goal conditioning by simulating both ordinal and challenging scenarios from fine-grained tokens with independent noise states. At the core of the decoupled pipeline is the integration of a partial noise-masking training strategy and a noise-aware schedule that ensures timely environmental updates throughout the denoising process. To complement challenging scenario generation, we collect a dataset consisting of complex corner cases. It covers 540 hours of simulated data, including high-risk interactions such as cut-in, sudden braking, and collision. Nexus achieves superior generation realism while preserving reactivity and goal orientation, with a 40% reduction in displacement error. We further demonstrate that Nexus improves closed-loop planning by 20% through data augmentation and showcase its capability in safety-critical data generation.

Title: FLOSS: Free Lunch in Open-vocabulary Semantic Segmentation

Authors: Yasser Benigmim, Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Raoul de Charette
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10487
Pdf URL: https://arxiv.org/pdf/2504.10487
Copy Paste: [[2504.10487]] FLOSS: Free Lunch in Open-vocabulary Semantic Segmentation(https://arxiv.org/abs/2504.10487)
Keywords: segmentation
Abstract: Recent Open-Vocabulary Semantic Segmentation (OVSS) models extend the CLIP model to segmentation while maintaining the use of multiple templates (e.g., a photo of , a sketch of a , etc.) for constructing class-wise averaged text embeddings, acting as a classifier. In this paper, we challenge this status quo and investigate the impact of templates for OVSS. Empirically, we observe that for each class, there exist single-template classifiers significantly outperforming the conventional averaged classifier. We refer to them as class-experts. Given access to unlabeled images and without any training involved, we estimate these experts by leveraging the class-wise prediction entropy of single-template classifiers, selecting as class-wise experts those which yield the lowest entropy. All experts, each specializing in a specific class, collaborate in a newly proposed fusion method to generate more accurate OVSS predictions. Our plug-and-play method, coined FLOSS, is orthogonal and complementary to existing OVSS methods, offering a ''free lunch'' to systematically improve OVSS without labels and additional training. Extensive experiments demonstrate that FLOSS consistently boosts state-of-the-art methods on various OVSS benchmarks. Moreover, the selected expert templates can generalize well from one dataset to others sharing the same semantic categories, yet exhibiting distribution shifts. Additionally, we obtain satisfactory improvements under a low-data regime, where only a few unlabeled images are available. Our code is available at this https URL .