2025-04-16

Title: GPT Meets Graphs and KAN Splines: Testing Novel Frameworks on Multitask Fine-Tuned GPT-2 with LoRA

Authors: Gabriel Bo, Marc Bernardino, Justin Gu
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2504.10490
Pdf URL: https://arxiv.org/pdf/2504.10490
Copy Paste: [[2504.10490]] GPT Meets Graphs and KAN Splines: Testing Novel Frameworks on Multitask Fine-Tuned GPT-2 with LoRA(https://arxiv.org/abs/2504.10490)
Keywords: interpretability, transformer
Abstract: We explore the potential of integrating learnable and interpretable modules--specifically Kolmogorov-Arnold Networks (KAN) and graph-based representations--within a pre-trained GPT-2 model to enhance multi-task learning accuracy. Motivated by the recent surge in using KAN and graph attention (GAT) architectures in chain-of-thought (CoT) models and debates over their benefits compared to simpler architectures like MLPs, we begin by enhancing a standard self-attention transformer using Low-Rank Adaptation (LoRA), fine-tuning hyperparameters, and incorporating L2 regularization. This approach yields significant improvements. To further boost interpretability and richer representations, we develop two variants that attempt to improve the standard KAN and GAT: Graph LoRA and Hybrid-KAN LoRA (Learnable GPT). However, systematic evaluations reveal that neither variant outperforms the optimized LoRA-enhanced transformer, which achieves 55.249% accuracy on the SST test set, 99.18% on the CFIMDB dev set, and 89.9% paraphrase detection test accuracy. On sonnet generation, we get a CHRF score of 42.097. These findings highlight that efficient parameter adaptation via LoRA remains the most effective strategy for our tasks: sentiment analysis, paraphrase detection, and sonnet generation.

Title: LayerFlow: Layer-wise Exploration of LLM Embeddings using Uncertainty-aware Interlinked Projections

Authors: Rita Sevastjanova, Robin Gerling, Thilo Spinner, Mennatallah El-Assady
Subjects: cs.CL, cs.GR
Abstract URL: https://arxiv.org/abs/2504.10504
Pdf URL: https://arxiv.org/pdf/2504.10504
Copy Paste: [[2504.10504]] LayerFlow: Layer-wise Exploration of LLM Embeddings using Uncertainty-aware Interlinked Projections(https://arxiv.org/abs/2504.10504)
Keywords: large language model
Abstract: Large language models (LLMs) represent words through contextual word embeddings encoding different language properties like semantics and syntax. Understanding these properties is crucial, especially for researchers investigating language model capabilities, employing embeddings for tasks related to text similarity, or evaluating the reasons behind token importance as measured through attribution methods. Applications for embedding exploration frequently involve dimensionality reduction techniques, which reduce high-dimensional vectors to two dimensions used as coordinates in a scatterplot. This data transformation step introduces uncertainty that can be propagated to the visual representation and influence users' interpretation of the data. To communicate such uncertainties, we present LayerFlow - a visual analytics workspace that displays embeddings in an interlinked projection design and communicates the transformation, representation, and interpretation uncertainty. In particular, to hint at potential data distortions and uncertainties, the workspace includes several visual components, such as convex hulls showing 2D and HD clusters, data point pairwise distances, cluster summaries, and projection quality metrics. We show the usability of the presented workspace through replication and expert case studies that highlight the need to communicate uncertainty through multiple visual components and different data perspectives.

Title: ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness

Authors: Yijun Liang, Ming Li, Chenrui Fan, Ziyue Li, Dang Nguyen, Kwesi Cobbina, Shweta Bhardwaj, Jiuhai Chen, Fuxiao Liu, Tianyi Zhou
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10514
Pdf URL: https://arxiv.org/pdf/2504.10514
Copy Paste: [[2504.10514]] ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness(https://arxiv.org/abs/2504.10514)
Keywords: robust
Abstract: Color plays an important role in human perception and usually provides critical clues in visual reasoning. However, it is unclear whether and how vision-language models (VLMs) can perceive, understand, and leverage color as humans. This paper introduces ColorBench, an innovative benchmark meticulously crafted to assess the capabilities of VLMs in color understanding, including color perception, reasoning, and robustness. By curating a suite of diverse test scenarios, with grounding in real applications, ColorBench evaluates how these models perceive colors, infer meanings from color-based cues, and maintain consistent performance under varying color transformations. Through an extensive evaluation of 32 VLMs with varying language models and vision encoders, our paper reveals some undiscovered findings: (i) The scaling law (larger models are better) still holds on ColorBench, while the language model plays a more important role than the vision encoder. (ii) However, the performance gaps across models are relatively small, indicating that color understanding has been largely neglected by existing VLMs. (iii) CoT reasoning improves color understanding accuracies and robustness, though they are vision-centric tasks. (iv) Color clues are indeed leveraged by VLMs on ColorBench but they can also mislead models in some tasks. These findings highlight the critical limitations of current VLMs and underscore the need to enhance color comprehension. Our ColorBenchcan serve as a foundational tool for advancing the study of human-level color understanding of multimodal AI.

Title: Federated Learning with Layer Skipping: Efficient Training of Large Language Models for Healthcare NLP

Authors: Lihong Zhang, Yue Li
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2504.10536
Pdf URL: https://arxiv.org/pdf/2504.10536
Copy Paste: [[2504.10536]] Federated Learning with Layer Skipping: Efficient Training of Large Language Models for Healthcare NLP(https://arxiv.org/abs/2504.10536)
Keywords: privacy, robust, federate, large language model
Abstract: Federated learning (FL) enables collaborative model training across organizations without sharing raw data, addressing crucial privacy concerns in healthcare natural language processing (NLP). However, training large language models (LLMs) in federated settings faces significant challenges, including communication overhead and data heterogeneity. We propose Layer-Skipping Federated Learning, where only selected layers of a pre-trained LLM are fine-tuned across clients while others remain frozen. Applied to LLaMA 3.2-1B, our approach reduces communication costs by approximately 70% while maintaining performance within 2% of centralized training. We evaluate our method on clinical NER and classification tasks using i2b2 and MIMIC-III datasets. Our experiments demonstrate that Layer-Skipping FL outperforms competitive baselines, handles non-IID clinical data distributions effectively, and shows robustness when combined with differential privacy. This approach represents a practical solution for privacy-preserving collaborative learning in healthcare NLP.

Title: MiMu: Mitigating Multiple Shortcut Learning Behavior of Transformers

Authors: Lili Zhao, Qi Liu, Wei Chen, Liyi Chen, Ruijun Sun, Min Hou, Yang Wang, Shijin Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10551
Pdf URL: https://arxiv.org/pdf/2504.10551
Copy Paste: [[2504.10551]] MiMu: Mitigating Multiple Shortcut Learning Behavior of Transformers(https://arxiv.org/abs/2504.10551)
Keywords: robust, transformer
Abstract: Empirical Risk Minimization (ERM) models often rely on spurious correlations between features and labels during the learning process, leading to shortcut learning behavior that undermines robustness generalization performance. Current research mainly targets identifying or mitigating a single shortcut; however, in real-world scenarios, cues within the data are diverse and unknown. In empirical studies, we reveal that the models rely to varying extents on different shortcuts. Compared to weak shortcuts, models depend more heavily on strong shortcuts, resulting in their poor generalization ability. To address these challenges, we propose MiMu, a novel method integrated with Transformer-based ERMs designed to Mitigate Multiple shortcut learning behavior, which incorporates self-calibration strategy and self-improvement strategy. In the source model, we preliminarily propose the self-calibration strategy to prevent the model from relying on shortcuts and make overconfident predictions. Then, we further design self-improvement strategy in target model to reduce the reliance on multiple shortcuts. The random mask strategy involves randomly masking partial attention positions to diversify the focus of target model other than concentrating on a fixed region. Meanwhile, the adaptive attention alignment module facilitates the alignment of attention weights to the calibrated source model, without the need for post-hoc attention maps or supervision. Finally, extensive experiments conducted on Natural Language Processing (NLP) and Computer Vision (CV) demonstrate the effectiveness of MiMu in improving robustness generalization abilities.

Title: LEMUR Neural Network Dataset: Towards Seamless AutoML

Authors: Arash Torabi Goodarzi, Roman Kochnev, Waleed Khalid, Furui Qin, Tolgay Atinc Uzun, Yashkumar Sanjaybhai Dhameliya, Yash Kanubhai Kathiriya, Zofia Antonina Bentyn, Dmitry Ignatov, Radu Timofte
Subjects: cs.LG, cs.AI, cs.CV, cs.DL
Abstract URL: https://arxiv.org/abs/2504.10552
Pdf URL: https://arxiv.org/pdf/2504.10552
Copy Paste: [[2504.10552]] LEMUR Neural Network Dataset: Towards Seamless AutoML(https://arxiv.org/abs/2504.10552)
Keywords: large language model, segmentation
Abstract: Neural networks are fundamental in artificial intelligence, driving progress in computer vision and natural language processing. High-quality datasets are crucial for their development, and there is growing interest in datasets composed of neural networks themselves to support benchmarking, automated machine learning (AutoML), and model analysis. We introduce LEMUR, an open source dataset of neural network models with well-structured code for diverse architectures across tasks such as object detection, image classification, segmentation, and natural language processing. LEMUR is primarily designed to enable fine-tuning of large language models (LLMs) for AutoML tasks, providing a rich source of structured model representations and associated performance data. Leveraging Python and PyTorch, LEMUR enables seamless extension to new datasets and models while maintaining consistency. It integrates an Optuna-powered framework for evaluation, hyperparameter optimization, statistical analysis, and graphical insights. LEMUR provides an extension that enables models to run efficiently on edge devices, facilitating deployment in resource-constrained environments. Providing tools for model evaluation, preprocessing, and database management, LEMUR supports researchers and practitioners in developing, testing, and analyzing neural networks. Additionally, it offers an API that delivers comprehensive information about neural network models and their complete performance statistics with a single request, which can be used in experiments with code-generating large language models. The LEMUR will be released as an open source project under the MIT license upon acceptance of the paper.

Title: Beyond the Generative Learning Trilemma: Generative Model Assessment in Data Scarcity Domains

Authors: Marco Salmè, Lorenzo Tronchin, Rosa Sicilia, Paolo Soda, Valerio Guarrasi
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2504.10555
Pdf URL: https://arxiv.org/pdf/2504.10555
Copy Paste: [[2504.10555]] Beyond the Generative Learning Trilemma: Generative Model Assessment in Data Scarcity Domains(https://arxiv.org/abs/2504.10555)
Keywords: privacy, robust, diffusion, generative
Abstract: Data scarcity remains a critical bottleneck impeding technological advancements across various domains, including but not limited to medicine and precision agriculture. To address this challenge, we explore the potential of Deep Generative Models (DGMs) in producing synthetic data that satisfies the Generative Learning Trilemma: fidelity, diversity, and sampling efficiency. However, recognizing that these criteria alone are insufficient for practical applications, we extend the trilemma to include utility, robustness, and privacy, factors crucial for ensuring the applicability of DGMs in real-world scenarios. Evaluating these metrics becomes particularly challenging in data-scarce environments, as DGMs traditionally rely on large datasets to perform optimally. This limitation is especially pronounced in domains like medicine and precision agriculture, where ensuring acceptable model performance under data constraints is vital. To address these challenges, we assess the Generative Learning Trilemma in data-scarcity settings using state-of-the-art evaluation metrics, comparing three prominent DGMs: Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Diffusion Models (DMs). Furthermore, we propose a comprehensive framework to assess utility, robustness, and privacy in synthetic data generated by DGMs. Our findings demonstrate varying strengths among DGMs, with each model exhibiting unique advantages based on the application context. This study broadens the scope of the Generative Learning Trilemma, aligning it with real-world demands and providing actionable guidance for selecting DGMs tailored to specific applications.

Title: VAE-based Feature Disentanglement for Data Augmentation and Compression in Generalized GNSS Interference Classification

Authors: Lucas Heublein, Simon Kocher, Tobias Feigl, Alexander Rügamer, Christopher Mutschler, Felix Ott
Subjects: cs.LG, cs.AI, cs.IT
Abstract URL: https://arxiv.org/abs/2504.10556
Pdf URL: https://arxiv.org/pdf/2504.10556
Copy Paste: [[2504.10556]] VAE-based Feature Disentanglement for Data Augmentation and Compression in Generalized GNSS Interference Classification(https://arxiv.org/abs/2504.10556)
Keywords: privacy, generative
Abstract: Distributed learning and Edge AI necessitate efficient data processing, low-latency communication, decentralized model training, and stringent data privacy to facilitate real-time intelligence on edge devices while reducing dependency on centralized infrastructure and ensuring high model performance. In the context of global navigation satellite system (GNSS) applications, the primary objective is to accurately monitor and classify interferences that degrade system performance in distributed environments, thereby enhancing situational awareness. To achieve this, machine learning (ML) models can be deployed on low-resource devices, ensuring minimal communication latency and preserving data privacy. The key challenge is to compress ML models while maintaining high classification accuracy. In this paper, we propose variational autoencoders (VAEs) for disentanglement to extract essential latent features that enable accurate classification of interferences. We demonstrate that the disentanglement approach can be leveraged for both data compression and data augmentation by interpolating the lower-dimensional latent representations of signal power. To validate our approach, we evaluate three VAE variants - vanilla, factorized, and conditional generative - on four distinct datasets, including two collected in controlled indoor environments and two real-world highway datasets. Additionally, we conduct extensive hyperparameter searches to optimize performance. Our proposed VAE achieves a data compression rate ranging from 512 to 8,192 and achieves an accuracy up to 99.92%.

Title: Efficient Process Reward Model Training via Active Learning

Authors: Keyu Duan, Zichen Liu, Xin Mao, Tianyu Pang, Changyu Chen, Qiguang Chen, Michael Qizhe Shieh, Longxu Dou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10559
Pdf URL: https://arxiv.org/pdf/2504.10559
Copy Paste: [[2504.10559]] Efficient Process Reward Model Training via Active Learning(https://arxiv.org/abs/2504.10559)
Keywords: large language model
Abstract: Process Reward Models (PRMs) provide step-level supervision to large language models (LLMs), but scaling up training data annotation remains challenging for both humans and LLMs. To address this limitation, we propose an active learning approach, ActPRM, which proactively selects the most uncertain samples for training, substantially reducing labeling costs. During training, we use the PRM to estimate uncertainty after the forward pass, retaining only highly uncertain data. A capable yet costly reasoning model then labels this data. Then we compute the loss with respect to the labels and update the PRM's weights. We compare ActPRM vs. vanilla fine-tuning, on a pool-based active learning setting, demonstrating that ActPRM reduces 50% annotation, but achieving the comparable or even better performance. Beyond annotation efficiency, we further advance the actively trained PRM by filtering over 1M+ math reasoning trajectories with ActPRM, retaining 60% of the data. A subsequent training on this selected dataset yields a new state-of-the-art (SOTA) PRM on ProcessBench (75.0%) and PRMBench (65.5%) compared with same sized models.

Title: Self-Controlled Dynamic Expansion Model for Continual Learning

Authors: Runqing Wu, Fei Ye, Rongyao Hu, Guoxi Huang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10561
Pdf URL: https://arxiv.org/pdf/2504.10561
Copy Paste: [[2504.10561]] Self-Controlled Dynamic Expansion Model for Continual Learning(https://arxiv.org/abs/2504.10561)
Keywords: transformer
Abstract: Continual Learning (CL) epitomizes an advanced training paradigm wherein prior data samples remain inaccessible during the acquisition of new tasks. Numerous investigations have delved into leveraging a pre-trained Vision Transformer (ViT) to enhance model efficacy in continual learning. Nonetheless, these approaches typically utilize a singular, static backbone, which inadequately adapts to novel tasks, particularly when engaging with diverse data domains, due to a substantial number of inactive parameters. This paper addresses this limitation by introducing an innovative Self-Controlled Dynamic Expansion Model (SCDEM), which orchestrates multiple distinct trainable pre-trained ViT backbones to furnish diverse and semantically enriched representations. Specifically, by employing the multi-backbone architecture as a shared module, the proposed SCDEM dynamically generates a new expert with minimal parameters to accommodate a new task. A novel Collaborative Optimization Mechanism (COM) is introduced to synergistically optimize multiple backbones by harnessing prediction signals from historical experts, thereby facilitating new task learning without erasing previously acquired knowledge. Additionally, a novel Feature Distribution Consistency (FDC) approach is proposed to align semantic similarity between previously and currently learned representations through an optimal transport distance-based mechanism, effectively mitigating negative knowledge transfer effects. Furthermore, to alleviate over-regularization challenges, this paper presents a novel Dynamic Layer-Wise Feature Attention Mechanism (DLWFAM) to autonomously determine the penalization intensity on each trainable representation layer. An extensive series of experiments have been conducted to evaluate the proposed methodology's efficacy, with empirical results corroborating that the approach attains state-of-the-art performance.

Title: Data Augmentation Through Random Style Replacement

Authors: Qikai Yang, Cheng Ji, Huaiying Luo, Panfeng Li, Zhicheng Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10563
Pdf URL: https://arxiv.org/pdf/2504.10563
Copy Paste: [[2504.10563]] Data Augmentation Through Random Style Replacement(https://arxiv.org/abs/2504.10563)
Keywords: robust
Abstract: In this paper, we introduce a novel data augmentation technique that combines the advantages of style augmentation and random erasing by selectively replacing image subregions with style-transferred patches. Our approach first applies a random style transfer to training images, then randomly substitutes selected areas of these images with patches derived from the style-transferred versions. This method is able to seamlessly accommodate a wide range of existing style transfer algorithms and can be readily integrated into diverse data augmentation pipelines. By incorporating our strategy, the training process becomes more robust and less prone to overfitting. Comparative experiments demonstrate that, relative to previous style augmentation methods, our technique achieves superior performance and faster convergence.

Title: H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models

Authors: Yushu Wu, Yanyu Li, Ivan Skorokhodov, Anil Kag, Willi Menapace, Sharath Girish, Aliaksandr Siarohin, Yanzhi Wang, Sergey Tulyakov
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2504.10567
Pdf URL: https://arxiv.org/pdf/2504.10567
Copy Paste: [[2504.10567]] H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models(https://arxiv.org/abs/2504.10567)
Keywords: diffusion
Abstract: Autoencoder (AE) is the key to the success of latent diffusion models for image and video generation, reducing the denoising resolution and improving efficiency. However, the power of AE has long been underexplored in terms of network design, compression ratio, and training strategy. In this work, we systematically examine the architecture design choices and optimize the computation distribution to obtain a series of efficient and high-compression video AEs that can decode in real time on mobile devices. We also unify the design of plain Autoencoder and image-conditioned I2V VAE, achieving multifunctionality in a single network. In addition, we find that the widely adopted discriminative losses, i.e., GAN, LPIPS, and DWT losses, provide no significant improvements when training AEs at scale. We propose a novel latent consistency loss that does not require complicated discriminator design or hyperparameter tuning, but provides stable improvements in reconstruction quality. Our AE achieves an ultra-high compression ratio and real-time decoding speed on mobile while outperforming prior art in terms of reconstruction metrics by a large margin. We finally validate our AE by training a DiT on its latent space and demonstrate fast, high-quality text-to-video generation capability.

Title: Demo: ViolentUTF as An Accessible Platform for Generative AI Red Teaming

Authors: Tam n. Nguyen
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.10603
Pdf URL: https://arxiv.org/pdf/2504.10603
Copy Paste: [[2504.10603]] Demo: ViolentUTF as An Accessible Platform for Generative AI Red Teaming(https://arxiv.org/abs/2504.10603)
Keywords: security, attack, robust, generative
Abstract: The rapid integration of Generative AI (GenAI) into various applications necessitates robust risk management strategies which includes Red Teaming (RT) - an evaluation method for simulating adversarial attacks. Unfortunately, RT for GenAI is often hindered by technical complexity, lack of user-friendly interfaces, and inadequate reporting features. This paper introduces Violent UTF - an accessible, modular, and scalable platform for GenAI red teaming. Through intuitive interfaces (Web GUI, CLI, API, MCP) powered by LLMs and for LLMs, Violent UTF aims to empower non-technical domain experts and students alongside technical experts, facilitate comprehensive security evaluation by unifying capabilities from RT frameworks like Microsoft PyRIT, Nvidia Garak and its own specialized evaluators. ViolentUTF is being used for evaluating the robustness of a flagship LLM-based product in a large US Government department. It also demonstrates effectiveness in evaluating LLMs' cross-domain reasoning capability between cybersecurity and behavioral psychology.

Title: Energy Matching: Unifying Flow Matching and Energy-Based Models for Generative Modeling

Authors: Michal Balcerak, Tamaz Amiranashvili, Suprosanna Shit, Antonio Terpin, Sebastian Kaltenbach, Petros Koumoutsakos, Bjoern Menze
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2504.10612
Pdf URL: https://arxiv.org/pdf/2504.10612
Copy Paste: [[2504.10612]] Energy Matching: Unifying Flow Matching and Energy-Based Models for Generative Modeling(https://arxiv.org/abs/2504.10612)
Keywords: generative
Abstract: Generative models often map noise to data by matching flows or scores, but these approaches become cumbersome for incorporating partial observations or additional priors. Inspired by recent advances in Wasserstein gradient flows, we propose Energy Matching, a framework that unifies flow-based approaches with the flexibility of energy-based models (EBMs). Far from the data manifold, samples move along curl-free, optimal transport paths from noise to data. As they approach the data manifold, an entropic energy term guides the system into a Boltzmann equilibrium distribution, explicitly capturing the underlying likelihood structure of the data. We parameterize this dynamic with a single time-independent scalar field, which serves as both a powerful generator and a flexible prior for effective regularization of inverse problems. Our method substantially outperforms existing EBMs on CIFAR-10 generation (FID 3.97 compared to 8.61), while retaining the simulation-free training of transport-based approaches away from the data manifold. Additionally, we exploit the flexibility of our method and introduce an interaction energy for diverse mode exploration. Our approach focuses on learning a static scalar potential energy -- without time conditioning, auxiliary generators, or additional networks -- marking a significant departure from recent EBM methods. We believe this simplified framework significantly advances EBM capabilities and paves the way for their broader adoption in generative modeling across diverse domains.

Title: Beyond Chains of Thought: Benchmarking Latent-Space Reasoning Abilities in Large Language Models

Authors: Thilo Hagendorff, Sarah Fabi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10615
Pdf URL: https://arxiv.org/pdf/2504.10615
Copy Paste: [[2504.10615]] Beyond Chains of Thought: Benchmarking Latent-Space Reasoning Abilities in Large Language Models(https://arxiv.org/abs/2504.10615)
Keywords: large language model
Abstract: Large language models (LLMs) can perform reasoning computations both internally within their latent space and externally by generating explicit token sequences like chains of thought. Significant progress in enhancing reasoning abilities has been made by scaling test-time compute. However, understanding and quantifying model-internal reasoning abilities - the inferential "leaps" models make between individual token predictions - remains crucial. This study introduces a benchmark (n = 4,000 items) designed to quantify model-internal reasoning in different domains. We achieve this by having LLMs indicate the correct solution to reasoning problems not through descriptive text, but by selecting a specific language of their initial response token that is different from English, the benchmark language. This not only requires models to reason beyond their context window, but also to overrise their default tendency to respond in the same language as the prompt, thereby posing an additional cognitive strain. We evaluate a set of 18 LLMs, showing significant performance variations, with GPT-4.5 achieving the highest accuracy (74.7%), outperforming models like Grok-2 (67.2%), and Llama 3.1 405B (65.6%). Control experiments and difficulty scaling analyses suggest that while LLMs engage in internal reasoning, we cannot rule out heuristic exploitations under certain conditions, marking an area for future investigation. Our experiments demonstrate that LLMs can "think" via latent-space computations, revealing model-internal inference strategies that need further understanding, especially regarding safety-related concerns such as covert planning, goal-seeking, or deception emerging without explicit token traces.

Title: Skeleton-Based Intake Gesture Detection With Spatial-Temporal Graph Convolutional Networks

Authors: Chunzhuo Wang, Zhewen Xue, T. Sunil Kumar, Guido Camps, Hans Hallez, Bart Vanrumste
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10635
Pdf URL: https://arxiv.org/pdf/2504.10635
Copy Paste: [[2504.10635]] Skeleton-Based Intake Gesture Detection With Spatial-Temporal Graph Convolutional Networks(https://arxiv.org/abs/2504.10635)
Keywords: privacy, robust
Abstract: Overweight and obesity have emerged as widespread societal challenges, frequently linked to unhealthy eating patterns. A promising approach to enhance dietary monitoring in everyday life involves automated detection of food intake gestures. This study introduces a skeleton based approach using a model that combines a dilated spatial-temporal graph convolutional network (ST-GCN) with a bidirectional long-short-term memory (BiLSTM) framework, as called ST-GCN-BiLSTM, to detect intake gestures. The skeleton-based method provides key benefits, including environmental robustness, reduced data dependency, and enhanced privacy preservation. Two datasets were employed for model validation. The OREBA dataset, which consists of laboratory-recorded videos, achieved segmental F1-scores of 86.18% and 74.84% for identifying eating and drinking gestures. Additionally, a self-collected dataset using smartphone recordings in more adaptable experimental conditions was evaluated with the model trained on OREBA, yielding F1-scores of 85.40% and 67.80% for detecting eating and drinking gestures. The results not only confirm the feasibility of utilizing skeleton data for intake gesture detection but also highlight the robustness of the proposed approach in cross-dataset validation.

Title: Better Estimation of the KL Divergence Between Language Models

Authors: Afra Amini, Tim Vieira, Ryan Cotterell
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10637
Pdf URL: https://arxiv.org/pdf/2504.10637
Copy Paste: [[2504.10637]] Better Estimation of the KL Divergence Between Language Models(https://arxiv.org/abs/2504.10637)
Keywords: interpretability
Abstract: Estimating the Kullback--Leibler (KL) divergence between language models has many applications, e.g., reinforcement learning from human feedback (RLHF), interpretability, and knowledge distillation. However, computing the exact KL divergence between two arbitrary language models is intractable. Thus, practitioners often resort to the use of sampling-based estimators. While it is easy to fashion a simple Monte Carlo (MC) estimator that provides an unbiased estimate of the KL divergence between language models, this estimator notoriously suffers from high variance, and can even result in a negative estimate of the KL divergence, a non-negative quantity. In this paper, we introduce a Rao--Blackwellized estimator that is also unbiased and provably has variance less than or equal to that of the standard Monte Carlo estimator. In an empirical study on sentiment-controlled fine-tuning, we show that our estimator provides more stable KL estimates and reduces variance substantially in practice. Additionally, we derive an analogous Rao--Blackwellized estimator of the gradient of the KL divergence, which leads to more stable training and produces models that more frequently appear on the Pareto frontier of reward vs. KL compared to the ones trained with the MC estimator of the gradient.

Title: Weight-of-Thought Reasoning: Exploring Neural Network Weights for Enhanced LLM Reasoning

Authors: Saif Punjwani, Larry Heck
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10646
Pdf URL: https://arxiv.org/pdf/2504.10646
Copy Paste: [[2504.10646]] Weight-of-Thought Reasoning: Exploring Neural Network Weights for Enhanced LLM Reasoning(https://arxiv.org/abs/2504.10646)
Keywords: interpretability, large language model
Abstract: Large language models (LLMs) have demonstrated remarkable reasoning capabilities when prompted with strategies such as Chain-of-Thought (CoT). However, these approaches focus on token-level output without considering internal weight dynamics. We introduce Weight-of-Thought (WoT) reasoning, a novel approach that examines neural network weights before inference to identify reasoning pathways. Unlike existing methods, WoT explores the weight space through graph-based message passing, multi-step reasoning processes, and attention mechanisms. Our implementation creates an interconnected graph of reasoning nodes. Experiments on diverse reasoning tasks (syllogistic, mathematical, algebraic, combinatorial, and geometric) demonstrate that WoT achieves superior performance compared to traditional methods, particularly for complex problems. This approach leads to both improved performance and greater interpretability of the reasoning process, offering a promising direction for enhancing LLM reasoning capabilities.

Title: Relation-Rich Visual Document Generator for Visual Information Extraction

Authors: Zi-Han Jiang, Chien-Wei Lin, Wei-Hua Li, Hsuan-Tung Liu, Yi-Ren Yeh, Chu-Song Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10659
Pdf URL: https://arxiv.org/pdf/2504.10659
Copy Paste: [[2504.10659]] Relation-Rich Visual Document Generator for Visual Information Extraction(https://arxiv.org/abs/2504.10659)
Keywords: extraction, large language model
Abstract: Despite advances in Large Language Models (LLMs) and Multimodal LLMs (MLLMs) for visual document understanding (VDU), visual information extraction (VIE) from relation-rich documents remains challenging due to the layout diversity and limited training data. While existing synthetic document generators attempt to address data scarcity, they either rely on manually designed layouts and templates, or adopt rule-based approaches that limit layout diversity. Besides, current layout generation methods focus solely on topological patterns without considering textual content, making them impractical for generating documents with complex associations between the contents and layouts. In this paper, we propose a Relation-rIch visual Document GEnerator (RIDGE) that addresses these limitations through a two-stage approach: (1) Content Generation, which leverages LLMs to generate document content using a carefully designed Hierarchical Structure Text format which captures entity categories and relationships, and (2) Content-driven Layout Generation, which learns to create diverse, plausible document layouts solely from easily available Optical Character Recognition (OCR) results, requiring no human labeling or annotations efforts. Experimental results have demonstrated that our method significantly enhances the performance of document understanding models on various VIE benchmarks. The code and model will be available at this https URL .

Title: Perturbed State Space Feature Encoders for Optical Flow with Event Cameras

Authors: Gokul Raju Govinda Raju, Nikola Zubić, Marco Cannici, Davide Scaramuzza
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10669
Pdf URL: https://arxiv.org/pdf/2504.10669
Copy Paste: [[2504.10669]] Perturbed State Space Feature Encoders for Optical Flow with Event Cameras(https://arxiv.org/abs/2504.10669)
Keywords: transformer
Abstract: With their motion-responsive nature, event-based cameras offer significant advantages over traditional cameras for optical flow estimation. While deep learning has improved upon traditional methods, current neural networks adopted for event-based optical flow still face temporal and spatial reasoning limitations. We propose Perturbed State Space Feature Encoders (P-SSE) for multi-frame optical flow with event cameras to address these challenges. P-SSE adaptively processes spatiotemporal features with a large receptive field akin to Transformer-based methods, while maintaining the linear computational complexity characteristic of SSMs. However, the key innovation that enables the state-of-the-art performance of our model lies in our perturbation technique applied to the state dynamics matrix governing the SSM system. This approach significantly improves the stability and performance of our model. We integrate P-SSE into a framework that leverages bi-directional flows and recurrent connections, expanding the temporal context of flow prediction. Evaluations on DSEC-Flow and MVSEC datasets showcase P-SSE's superiority, with 8.48% and 11.86% improvements in EPE performance, respectively.

Title: Achieving Optimal Tissue Repair Through MARL with Reward Shaping and Curriculum Learning

Authors: Muhammad Al-Zafar Khan, Jamal Al-Karaki
Subjects: cs.LG, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2504.10677
Pdf URL: https://arxiv.org/pdf/2504.10677
Copy Paste: [[2504.10677]] Achieving Optimal Tissue Repair Through MARL with Reward Shaping and Curriculum Learning(https://arxiv.org/abs/2504.10677)
Keywords: robust, diffusion
Abstract: In this paper, we present a multi-agent reinforcement learning (MARL) framework for optimizing tissue repair processes using engineered biological agents. Our approach integrates: (1) stochastic reaction-diffusion systems modeling molecular signaling, (2) neural-like electrochemical communication with Hebbian plasticity, and (3) a biologically informed reward function combining chemical gradient tracking, neural synchronization, and robust penalties. A curriculum learning scheme guides the agent through progressively complex repair scenarios. In silico experiments demonstrate emergent repair strategies, including dynamic secretion control and spatial coordination.

Title: Keyword Extraction, and Aspect Classification in Sinhala, English, and Code-Mixed Content

Authors: F.A. Rizvi, T. Navojith, A.M.N.H. Adhikari, W.P.U. Senevirathna, Dharshana Kasthurirathna, Lakmini Abeywardhana
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10679
Pdf URL: https://arxiv.org/pdf/2504.10679
Copy Paste: [[2504.10679]] Keyword Extraction, and Aspect Classification in Sinhala, English, and Code-Mixed Content(https://arxiv.org/abs/2504.10679)
Keywords: extraction, transformer
Abstract: Brand reputation in the banking sector is maintained through insightful analysis of customer opinion on code-mixed and multilingual content. Conventional NLP models misclassify or ignore code-mixed text, when mix with low resource languages such as Sinhala-English and fail to capture domain-specific knowledge. This study introduces a hybrid NLP method to improve keyword extraction, content filtering, and aspect-based classification of banking content. Keyword extraction in English is performed with a hybrid approach comprising a fine-tuned SpaCy NER model, FinBERT-based KeyBERT embeddings, YAKE, and EmbedRank, which results in a combined accuracy of 91.2%. Code-mixed and Sinhala keywords are extracted using a fine-tuned XLM-RoBERTa model integrated with a domain-specific Sinhala financial vocabulary, and it results in an accuracy of 87.4%. To ensure data quality, irrelevant comment filtering was performed using several models, with the BERT-base-uncased model achieving 85.2% for English and XLM-RoBERTa 88.1% for Sinhala, which was better than GPT-4o, SVM, and keyword-based filtering. Aspect classification followed the same pattern, with the BERT-base-uncased model achieving 87.4% for English and XLM-RoBERTa 85.9% for Sinhala, both exceeding GPT-4 and keyword-based approaches. These findings confirm that fine-tuned transformer models outperform traditional methods in multilingual financial text analysis. The present framework offers an accurate and scalable solution for brand reputation monitoring in code-mixed and low-resource banking environments.

Title: EMAFusion: A Self-Optimizing System for Seamless LLM Selection and Integration

Authors: Soham Shah, Kumar Shridhar, Surojit Chatterjee, Souvik Sen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10681
Pdf URL: https://arxiv.org/pdf/2504.10681
Copy Paste: [[2504.10681]] EMAFusion: A Self-Optimizing System for Seamless LLM Selection and Integration(https://arxiv.org/abs/2504.10681)
Keywords: robust, large language model
Abstract: While recent advances in large language models (LLMs) have significantly enhanced performance across diverse natural language tasks, the high computational and financial costs associated with their deployment remain substantial barriers. Existing routing strategies partially alleviate this challenge by assigning queries to cheaper or specialized models, but they frequently rely on extensive labeled data or fragile task-specific heuristics. Conversely, fusion techniques aggregate multiple LLM outputs to boost accuracy and robustness, yet they often exacerbate cost and may reinforce shared biases. We introduce EMAFusion, a new framework that self-optimizes for seamless LLM selection and reliable execution for a given query. Specifically, EMAFusion integrates a taxonomy-based router for familiar query types, a learned router for ambiguous inputs, and a cascading approach that progressively escalates from cheaper to more expensive models based on multi-judge confidence evaluations. Through extensive evaluations, we find EMAFusion outperforms the best individual models by over 2.6 percentage points (94.3% vs. 91.7%), while being 4X cheaper than the average cost. EMAFusion further achieves a remarkable 17.1 percentage point improvement over models like GPT-4 at less than 1/20th the cost. Our combined routing approach delivers 94.3% accuracy compared to taxonomy-based (88.1%) and learned model predictor-based (91.7%) methods alone, demonstrating the effectiveness of our unified strategy. Finally, EMAFusion supports flexible cost-accuracy trade-offs, allowing users to balance their budgetary constraints and performance needs.

Title: The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report

Authors: Bin Ren, Hang Guo, Lei Sun, Zongwei Wu, Radu Timofte, Yawei Li, Yao Zhang, Xinning Chai, Zhengxue Cheng, Yingsheng Qin, Yucai Yang, Li Song, Hongyuan Yu, Pufan Xu, Cheng Wan, Zhijuan Huang, Peng Guo, Shuyuan Cui, Chenjun Li, Xuehai Hu, Pan Pan, Xin Zhang, Heng Zhang, Qing Luo, Linyan Jiang, Haibo Lei, Qifang Gao, Yaqing Li, Weihua Luo, Tsing Li, Qing Wang, Yi Liu, Yang Wang, Hongyu An, Liou Zhang, Shijie Zhao, Lianhong Song, Long Sun, Jinshan Pan, Jiangxin Dong, Jinhui Tang, Jing Wei, Mengyang Wang, Ruilong Guo, Qian Wang, Qingliang Liu, Yang Cheng, Davinci, Enxuan Gu, Pinxin Liu, Yongsheng Yu, Hang Hua, Yunlong Tang, Shihao Wang, Yukun Yang, Zhiyu Zhang, Yukun Yang, Jiyu Wu, Jiancheng Huang, Yifan Liu, Yi Huang, Shifeng Chen, Rui Chen, Yi Feng, Mingxi Li, Cailu Wan, Xiangji Wu, Zibin Liu, Jinyang Zhong, Kihwan Yoon, Ganzorig Gankhuyag, Shengyun Zhong, Mingyang Wu, Renjie Li, Yushen Zuo, Zhengzhong Tu, Zongang Gao, Guannan Chen, Yuan Tian, Wenhui Chen, Weijun Yuan, Zhan Li, Yihang Chen, Yifan Deng, Ruting Deng, Yilin Zhang, Huan Zheng, Yanyan Wei, Wenxuan Zhao, Suiyi Zhao, Fei Wang, Kun Li, Yinggan Tang, Mengjie Su, Jae-hyeon Lee, Dong-Hyeop Son, Ui-Jin Choi, Tiancheng Shao, Yuqing Zhang, Mengcheng Ma
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2504.10686
Pdf URL: https://arxiv.org/pdf/2504.10686
Copy Paste: [[2504.10686]] The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report(https://arxiv.org/abs/2504.10686)
Keywords: robust
Abstract: This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the $\operatorname{DIV2K\_LSDIR\_valid}$ dataset and 26.99 dB on the $\operatorname{DIV2K\_LSDIR\_test}$ dataset. A robust participation saw \textbf{244} registered entrants, with \textbf{43} teams submitting valid entries. This report meticulously analyzes these methods and results, emphasizing groundbreaking advancements in state-of-the-art single-image ESR techniques. The analysis highlights innovative approaches and establishes benchmarks for future research in the field.

Title: The Jailbreak Tax: How Useful are Your Jailbreak Outputs?

Authors: Kristina Nikolić, Luze Sun, Jie Zhang, Florian Tramèr
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2504.10694
Pdf URL: https://arxiv.org/pdf/2504.10694
Copy Paste: [[2504.10694]] The Jailbreak Tax: How Useful are Your Jailbreak Outputs?(https://arxiv.org/abs/2504.10694)
Keywords: attack, large language model
Abstract: Jailbreak attacks bypass the guardrails of large language models to produce harmful outputs. In this paper, we ask whether the model outputs produced by existing jailbreaks are actually useful. For example, when jailbreaking a model to give instructions for building a bomb, does the jailbreak yield good instructions? Since the utility of most unsafe answers (e.g., bomb instructions) is hard to evaluate rigorously, we build new jailbreak evaluation sets with known ground truth answers, by aligning models to refuse questions related to benign and easy-to-evaluate topics (e.g., biology or math). Our evaluation of eight representative jailbreaks across five utility benchmarks reveals a consistent drop in model utility in jailbroken responses, which we term the jailbreak tax. For example, while all jailbreaks we tested bypass guardrails in models aligned to refuse to answer math, this comes at the expense of a drop of up to 92% in accuracy. Overall, our work proposes the jailbreak tax as a new important metric in AI safety, and introduces benchmarks to evaluate existing and future jailbreaks. We make the benchmark available at this https URL

Title: Optimising Intrusion Detection Systems in Cloud-Edge Continuum with Knowledge Distillation for Privacy-Preserving and Efficient Communication

Authors: Soad Almabdy, Amjad Ullah
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.10698
Pdf URL: https://arxiv.org/pdf/2504.10698
Copy Paste: [[2504.10698]] Optimising Intrusion Detection Systems in Cloud-Edge Continuum with Knowledge Distillation for Privacy-Preserving and Efficient Communication(https://arxiv.org/abs/2504.10698)
Keywords: secure, privacy, protect, attack, federate
Abstract: The growth of the Internet of Things has amplified the need for secure data interactions in cloud-edge ecosystems, where sensitive information is constantly processed across various system layers. Intrusion detection systems are commonly used to protect such environments from malicious attacks. Recently, Federated Learning has emerged as an effective solution for implementing intrusion detection systems, owing to its decentralised architecture that avoids sharing raw data with a central server, thereby enhancing data privacy. Despite its benefits, Federated Learning faces criticism for high communication overhead from frequent model updates, especially in large-scale Cloud-Edge infrastructures. This paper explores Knowledge Distillation to reduce communication overhead in Cloud-Edge intrusion detection while preserving accuracy and data privacy. Experiments show significant improvements over state-of-the-art methods.

Title: Can LLMs Classify CVEs? Investigating LLMs Capabilities in Computing CVSS Vectors

Authors: Francesco Marchiori, Denis Donadel, Mauro Conti
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.10713
Pdf URL: https://arxiv.org/pdf/2504.10713
Copy Paste: [[2504.10713]] Can LLMs Classify CVEs? Investigating LLMs Capabilities in Computing CVSS Vectors(https://arxiv.org/abs/2504.10713)
Keywords: security, large language model
Abstract: Common Vulnerability and Exposure (CVE) records are fundamental to cybersecurity, offering unique identifiers for publicly known software and system vulnerabilities. Each CVE is typically assigned a Common Vulnerability Scoring System (CVSS) score to support risk prioritization and remediation. However, score inconsistencies often arise due to subjective interpretations of certain metrics. As the number of new CVEs continues to grow rapidly, automation is increasingly necessary to ensure timely and consistent scoring. While prior studies have explored automated methods, the application of Large Language Models (LLMs), despite their recent popularity, remains relatively underexplored. In this work, we evaluate the effectiveness of LLMs in generating CVSS scores for newly reported vulnerabilities. We investigate various prompt engineering strategies to enhance their accuracy and compare LLM-generated scores against those from embedding-based models, which use vector representations classified via supervised learning. Our results show that while LLMs demonstrate potential in automating CVSS evaluation, embedding-based methods outperform them in scoring more subjective components, particularly confidentiality, integrity, and availability impacts. These findings underscore the complexity of CVSS scoring and suggest that combining LLMs with embedding-based methods could yield more reliable results across all scoring components.

Title: SpinMeRound: Consistent Multi-View Identity Generation Using Diffusion Models

Authors: Stathis Galanakis, Alexandros Lattas, Stylianos Moschoglou, Bernhard Kainz, Stefanos Zafeiriou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10716
Pdf URL: https://arxiv.org/pdf/2504.10716
Copy Paste: [[2504.10716]] SpinMeRound: Consistent Multi-View Identity Generation Using Diffusion Models(https://arxiv.org/abs/2504.10716)
Keywords: robust, diffusion
Abstract: Despite recent progress in diffusion models, generating realistic head portraits from novel viewpoints remains a significant challenge. Most current approaches are constrained to limited angular ranges, predominantly focusing on frontal or near-frontal views. Moreover, although the recent emerging large-scale diffusion models have been proven robust in handling 3D scenes, they underperform on facial data, given their complex structure and the uncanny valley pitfalls. In this paper, we propose SpinMeRound, a diffusion-based approach designed to generate consistent and accurate head portraits from novel viewpoints. By leveraging a number of input views alongside an identity embedding, our method effectively synthesizes diverse viewpoints of a subject whilst robustly maintaining its unique identity features. Through experimentation, we showcase our model's generation capabilities in 360 head synthesis, while beating current state-of-the-art multiview diffusion models.

Title: FuzzSense: Towards A Modular Fuzzing Framework for Autonomous Driving Software

Authors: Andrew Roberts, Lorenz Teply, Mert D. Pese, Olaf Maennel, Mohammad Hamad, Sebastian Steinhorst
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.10717
Pdf URL: https://arxiv.org/pdf/2504.10717
Copy Paste: [[2504.10717]] FuzzSense: Towards A Modular Fuzzing Framework for Autonomous Driving Software(https://arxiv.org/abs/2504.10717)
Keywords: robust
Abstract: Fuzz testing to find semantic control vulnerabilities is an essential activity to evaluate the robustness of autonomous driving (AD) software. Whilst there is a preponderance of disparate fuzzing tools that target different parts of the test environment, such as the scenario, sensors, and vehicle dynamics, there is a lack of fuzzing strategies that ensemble these fuzzers to enable concurrent fuzzing, utilizing diverse techniques and targets. This research proposes FuzzSense, a modular, black-box, mutation-based fuzzing framework that is architected to ensemble diverse AD fuzzing tools. To validate the utility of FuzzSense, a LiDAR sensor fuzzer was developed as a plug-in, and the fuzzer was implemented in the new AD simulation platform AWSIM and this http URL AD software platform. The results demonstrated that FuzzSense was able to find vulnerabilities in the new this http URL software. We contribute to FuzzSense open-source with the aim of initiating a conversation in the community on the design of AD-specific fuzzers and the establishment of a community fuzzing framework to better target the diverse technology base of autonomous vehicles.

Title: Leveraging Deep Operator Networks (DeepONet) for Acoustic Full Waveform Inversion (FWI)

Authors: Kamaljyoti Nath, Khemraj Shukla, Victor C. Tsai, Umair bin Waheed, Christian Huber, Omer Alpak, Chuen-Song Chen, Ligang Lu, Amik St-Cyr
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.10720
Pdf URL: https://arxiv.org/pdf/2504.10720
Copy Paste: [[2504.10720]] Leveraging Deep Operator Networks (DeepONet) for Acoustic Full Waveform Inversion (FWI)(https://arxiv.org/abs/2504.10720)
Keywords: robust
Abstract: Full Waveform Inversion (FWI) is an important geophysical technique considered in subsurface property prediction. It solves the inverse problem of predicting high-resolution Earth interior models from seismic data. Traditional FWI methods are computationally demanding. Inverse problems in geophysics often face challenges of non-uniqueness due to limited data, as data are often collected only on the surface. In this study, we introduce a novel methodology that leverages Deep Operator Networks (DeepONet) to attempt to improve both the efficiency and accuracy of FWI. The proposed DeepONet methodology inverts seismic waveforms for the subsurface velocity field. This approach is able to capture some key features of the subsurface velocity field. We have shown that the architecture can be applied to noisy seismic data with an accuracy that is better than some other machine learning methods. We also test our proposed method with out-of-distribution prediction for different velocity models. The proposed DeepONet shows comparable and better accuracy in some velocity models than some other machine learning methods. To improve the FWI workflow, we propose using the DeepONet output as a starting model for conventional FWI and that it may improve FWI performance. While we have only shown that DeepONet facilitates faster convergence than starting with a homogeneous velocity field, it may have some benefits compared to other approaches to constructing starting models. This integration of DeepONet into FWI may accelerate the inversion process and may also enhance its robustness and reliability.

Title: HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving

Authors: Avinash Kumar, Shashank Nag, Jason Clemons, Lizy John, Poulami Das
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10724
Pdf URL: https://arxiv.org/pdf/2504.10724
Copy Paste: [[2504.10724]] HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving(https://arxiv.org/abs/2504.10724)
Keywords: large language model
Abstract: Deploying large language models (LLMs) presents critical challenges due to the inherent trade-offs associated with key performance metrics, such as latency, accuracy, and throughput. Typically, gains in one metric is accompanied with degradation in others. Early-Exit LLMs (EE-LLMs) efficiently navigate this trade-off space by skipping some of the later model layers when it confidently finds an output token early, thus reducing latency without impacting accuracy. However, as the early exits taken depend on the task and are unknown apriori to request processing, EE-LLMs conservatively load the entire model, limiting resource savings and throughput. Also, current frameworks statically select a model for a user task, limiting our ability to adapt to changing nature of the input queries. We propose HELIOS to address these challenges. First, HELIOS shortlists a set of candidate LLMs, evaluates them using a subset of prompts, gathering telemetry data in real-time. Second, HELIOS uses the early exit data from these evaluations to greedily load the selected model only up to a limited number of layers. This approach yields memory savings which enables us to process more requests at the same time, thereby improving throughput. Third, HELIOS monitors and periodically reassesses the performance of the candidate LLMs and if needed, switches to another model that can service incoming queries more efficiently (such as using fewer layers without lowering accuracy). Our evaluations show that HELIOS achieves 1.48$\times$ throughput, 1.10$\times$ energy-efficiency, 1.39$\times$ lower response time, and 3.7$\times$ improvements in inference batch sizes compared to the baseline, when optimizing for the respective service level objectives.

Title: Foundation Models for Remote Sensing: An Analysis of MLLMs for Object Localization

Authors: Darryl Hannan, John Cooper, Dylan White, Timothy Doster, Henry Kvinge, Yijing Watkins
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10727
Pdf URL: https://arxiv.org/pdf/2504.10727
Copy Paste: [[2504.10727]] Foundation Models for Remote Sensing: An Analysis of MLLMs for Object Localization(https://arxiv.org/abs/2504.10727)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) have altered the landscape of computer vision, obtaining impressive results across a wide range of tasks, especially in zero-shot settings. Unfortunately, their strong performance does not always transfer to out-of-distribution domains, such as earth observation (EO) imagery. Prior work has demonstrated that MLLMs excel at some EO tasks, such as image captioning and scene understanding, while failing at tasks that require more fine-grained spatial reasoning, such as object localization. However, MLLMs are advancing rapidly and insights quickly become out-dated. In this work, we analyze more recent MLLMs that have been explicitly trained to include fine-grained spatial reasoning capabilities, benchmarking them on EO object localization tasks. We demonstrate that these models are performant in certain settings, making them well suited for zero-shot scenarios. Additionally, we provide a detailed discussion focused on prompt selection, ground sample distance (GSD) optimization, and analyzing failure cases. We hope that this work will prove valuable as others evaluate whether an MLLM is well suited for a given EO localization task and how to optimize it.

Title: PQ-CAN: A Framework for Simulating Post-Quantum Cryptography in Embedded Systems

Authors: Mauro Conti, Francesco Marchiori, Sebastiano Matarazzo, Marco Rubin
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.10730
Pdf URL: https://arxiv.org/pdf/2504.10730
Copy Paste: [[2504.10730]] PQ-CAN: A Framework for Simulating Post-Quantum Cryptography in Embedded Systems(https://arxiv.org/abs/2504.10730)
Keywords: security
Abstract: The rapid development of quantum computers threatens traditional cryptographic schemes, prompting the need for Post-Quantum Cryptography (PQC). Although the NIST standardization process has accelerated the development of such algorithms, their application in resource-constrained environments such as embedded systems remains a challenge. Automotive systems relying on the Controller Area Network (CAN) bus for communication are particularly vulnerable due to their limited computational capabilities, high traffic, and need for real-time response. These constraints raise concerns about the feasibility of implementing PQC in automotive environments, where legacy hardware and bit rate limitations must also be considered. In this paper, we introduce PQ-CAN, a modular framework for simulating the performance and overhead of PQC algorithms in embedded systems. We consider the automotive domain as our case study, testing a variety of PQC schemes under different scenarios. Our simulation enables the adjustment of embedded system computational capabilities and CAN bus bit rate constraints. We also provide insights into the trade-offs involved by analyzing each algorithm's security level and overhead for key encapsulation and digital signature. By evaluating the performance of these algorithms, we provide insights into their feasibility and identify the strengths and limitations of PQC in securing automotive communications in the post-quantum era.

Title: Frozen Layers: Memory-efficient Many-fidelity Hyperparameter Optimization

Authors: Timur Carstensen, Neeratyoy Mallik, Frank Hutter, Martin Rapp
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10735
Pdf URL: https://arxiv.org/pdf/2504.10735
Copy Paste: [[2504.10735]] Frozen Layers: Memory-efficient Many-fidelity Hyperparameter Optimization(https://arxiv.org/abs/2504.10735)
Keywords: transformer
Abstract: As model sizes grow, finding efficient and cost-effective hyperparameter optimization (HPO) methods becomes increasingly crucial for deep learning pipelines. While multi-fidelity HPO (MF-HPO) trades off computational resources required for DL training with lower fidelity estimations, existing fidelity sources often fail under lower compute and memory constraints. We propose a novel fidelity source: the number of layers that are trained or frozen during training. For deep networks, this approach offers significant compute and memory savings while preserving rank correlations between hyperparameters at low fidelities compared to full model training. We demonstrate this in our empirical evaluation across ResNets and Transformers and additionally analyze the utility of frozen layers as a fidelity in using GPU resources as a fidelity in HPO, and for a combined MF-HPO with other fidelity sources. This contribution opens new applications for MF-HPO with hardware resources as a fidelity and creates opportunities for improved algorithms navigating joint fidelity spaces.

Title: CleanMAP: Distilling Multimodal LLMs for Confidence-Driven Crowdsourced HD Map Updates

Authors: Ankit Kumar Shaw (1), Kun Jiang (1), Tuopu Wen (1), Chandan Kumar Sah (2), Yining Shi (1), Mengmeng Yang (1), Diange Yang (1), Xiaoli Lian (2) ((1) Tsinghua University, (2) Beihang University)
Subjects: cs.CV, cs.AI, cs.CL, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2504.10738
Pdf URL: https://arxiv.org/pdf/2504.10738
Copy Paste: [[2504.10738]] CleanMAP: Distilling Multimodal LLMs for Confidence-Driven Crowdsourced HD Map Updates(https://arxiv.org/abs/2504.10738)
Keywords: robust, large language model
Abstract: The rapid growth of intelligent connected vehicles (ICVs) and integrated vehicle-road-cloud systems has increased the demand for accurate, real-time HD map updates. However, ensuring map reliability remains challenging due to inconsistencies in crowdsourced data, which suffer from motion blur, lighting variations, adverse weather, and lane marking degradation. This paper introduces CleanMAP, a Multimodal Large Language Model (MLLM)-based distillation framework designed to filter and refine crowdsourced data for high-confidence HD map updates. CleanMAP leverages an MLLM-driven lane visibility scoring model that systematically quantifies key visual parameters, assigning confidence scores (0-10) based on their impact on lane detection. A novel dynamic piecewise confidence-scoring function adapts scores based on lane visibility, ensuring strong alignment with human evaluations while effectively filtering unreliable data. To further optimize map accuracy, a confidence-driven local map fusion strategy ranks and selects the top-k highest-scoring local maps within an optimal confidence range (best score minus 10%), striking a balance between data quality and quantity. Experimental evaluations on a real-world autonomous vehicle dataset validate CleanMAP's effectiveness, demonstrating that fusing the top three local maps achieves the lowest mean map update error of 0.28m, outperforming the baseline (0.37m) and meeting stringent accuracy thresholds (<= 0.32m). Further validation with real-vehicle data confirms 84.88% alignment with human evaluators, reinforcing the model's robustness and reliability. This work establishes CleanMAP as a scalable and deployable solution for crowdsourced HD map updates, ensuring more precise and reliable autonomous navigation. The code will be available at this https URL

Title: Encryption scheme based on Automorphism Group of Hermitian Function Field with Homomorphic Encryption

Authors: Gennady Khalimov, Yevgen Kotukh
Subjects: cs.CR, math.GR
Abstract URL: https://arxiv.org/abs/2504.10747
Pdf URL: https://arxiv.org/pdf/2504.10747
Copy Paste: [[2504.10747]] Encryption scheme based on Automorphism Group of Hermitian Function Field with Homomorphic Encryption(https://arxiv.org/abs/2504.10747)
Keywords: attack
Abstract: This article proposes a comprehensive approach to implementing encryption schemes based on the automorphism group of the Hermitian function field. We utilize a three-parameter group with logarithmic representations outside the group's center. In this work, we enhance the Hermitian function field-based encryption scheme with homomorphic encryption capabilities, which constitutes a significant advantage of our implementation. Both the attack complexity and the encrypted message size are directly correlated with the order of the group.

Title: Real-time Seafloor Segmentation and Mapping

Authors: Michele Grimaldi, Nouf Alkaabi, Francesco Ruscio, Sebastian Realpe Rua, Rafael Garcia, Nuno Gracias
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2504.10750
Pdf URL: https://arxiv.org/pdf/2504.10750
Copy Paste: [[2504.10750]] Real-time Seafloor Segmentation and Mapping(https://arxiv.org/abs/2504.10750)
Keywords: protect, segmentation
Abstract: Posidonia oceanica meadows are a species of seagrass highly dependent on rocks for their survival and conservation. In recent years, there has been a concerning global decline in this species, emphasizing the critical need for efficient monitoring and assessment tools. While deep learning-based semantic segmentation and visual automated monitoring systems have shown promise in a variety of applications, their performance in underwater environments remains challenging due to complex water conditions and limited datasets. This paper introduces a framework that combines machine learning and computer vision techniques to enable an autonomous underwater vehicle (AUV) to inspect the boundaries of Posidonia oceanica meadows autonomously. The framework incorporates an image segmentation module using an existing Mask R-CNN model and a strategy for Posidonia oceanica meadow boundary tracking. Furthermore, a new class dedicated to rocks is introduced to enhance the existing model, aiming to contribute to a comprehensive monitoring approach and provide a deeper understanding of the intricate interactions between the meadow and its surrounding environment. The image segmentation model is validated using real underwater images, while the overall inspection framework is evaluated in a realistic simulation environment, replicating actual monitoring scenarios with real underwater images. The results demonstrate that the proposed framework enables the AUV to autonomously accomplish the main tasks of underwater inspection and segmentation of rocks. Consequently, this work holds significant potential for the conservation and protection of marine environments, providing valuable insights into the status of Posidonia oceanica meadows and supporting targeted preservation efforts

Title: Minimal Sensing for Orienting a Solar Panel

Authors: Jeremy Klotz, Shree K. Nayar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10765
Pdf URL: https://arxiv.org/pdf/2504.10765
Copy Paste: [[2504.10765]] Minimal Sensing for Orienting a Solar Panel(https://arxiv.org/abs/2504.10765)
Keywords: robust
Abstract: A solar panel harvests the most energy when pointing in the direction that maximizes the total illumination (irradiance) falling on it. Given an arbitrary orientation of a panel and an arbitrary environmental illumination, we address the problem of finding the direction of maximum total irradiance. We develop a minimal sensing approach where measurements from just four photodetectors are used to iteratively vary the tilt of the panel to maximize the irradiance. Many environments produce irradiance functions with multiple local maxima. As a result, simply measuring the gradient of the irradiance function and applying gradient ascent will not work. We show that a larger, optimized tilt between the detectors and the panel is equivalent to blurring the irradiance function. This has the effect of eliminating local maxima and turning the irradiance function into a unimodal one, whose maximum can be found using gradient ascent. We show that there is a close relationship between our approach and scale space theory. We have collected a large dataset of high-dynamic range lighting environments in New York City, called \textit{UrbanSky}. We used this dataset to conduct simulations to verify the robustness of our approach. Finally, we have built a portable solar panel with four compact detectors and an actuator to conduct experiments in various real-world settings: direct sunlight, cloudy sky, urban settings with occlusions and shadows, and complex indoor lighting. In all cases, we show significant improvements in harvested energy compared to standard approaches for controlling the orientation of a solar panel.

Title: How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients

Authors: Ming Li, Yanhong Li, Ziyue Li, Tianyi Zhou
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2504.10766
Pdf URL: https://arxiv.org/pdf/2504.10766
Copy Paste: [[2504.10766]] How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients(https://arxiv.org/abs/2504.10766)
Keywords: robust, large language model
Abstract: As the post-training of large language models (LLMs) advances from instruction-following to complex reasoning tasks, understanding how different data affect finetuning dynamics remains largely unexplored. In this paper, we present a spectral analysis of layer-wise gradients induced by low/high-quality instruction and reasoning data for LLM post-training. Our analysis reveals that widely-studied metrics for data evaluation, e.g., IFD, InsTag, Difficulty, and Reward, can be explained and unified by spectral properties computed from gradients' singular value decomposition (SVD). Specifically, higher-quality data are usually associated with lower nuclear norms and higher effective ranks. Notably, effective rank exhibits better robustness and resolution than nuclear norm in capturing subtle quality differences. For example, reasoning data achieves substantially higher effective ranks than instruction data, implying richer gradient structures on more complex tasks. Our experiments also highlight that models within the same family share similar gradient patterns regardless of their sizes, whereas different model families diverge significantly. Providing a unified view on the effects of data quality across instruction and reasoning data, this work illuminates the interplay between data quality and training stability, shedding novel insights into developing better data exploration strategies for post-training.

Title: The Art of Audience Engagement: LLM-Based Thin-Slicing of Scientific Talks

Authors: Ralf Schmälzle, Sue Lim, Yuetong Du, Gary Bente
Subjects: cs.CL, cs.AI, cs.ET, cs.HC
Abstract URL: https://arxiv.org/abs/2504.10768
Pdf URL: https://arxiv.org/pdf/2504.10768
Copy Paste: [[2504.10768]] The Art of Audience Engagement: LLM-Based Thin-Slicing of Scientific Talks(https://arxiv.org/abs/2504.10768)
Keywords: robust, large language model
Abstract: This paper examines the thin-slicing approach - the ability to make accurate judgments based on minimal information - in the context of scientific presentations. Drawing on research from nonverbal communication and personality psychology, we show that brief excerpts (thin slices) reliably predict overall presentation quality. Using a novel corpus of over one hundred real-life science talks, we employ Large Language Models (LLMs) to evaluate transcripts of full presentations and their thin slices. By correlating LLM-based evaluations of short excerpts with full-talk assessments, we determine how much information is needed for accurate predictions. Our results demonstrate that LLM-based evaluations align closely with human ratings, proving their validity, reliability, and efficiency. Critically, even very short excerpts (less than 10 percent of a talk) strongly predict overall evaluations. This suggests that the first moments of a presentation convey relevant information that is used in quality evaluations and can shape lasting impressions. The findings are robust across different LLMs and prompting strategies. This work extends thin-slicing research to public speaking and connects theories of impression formation to LLMs and current research on AI communication. We discuss implications for communication and social cognition research on message reception. Lastly, we suggest an LLM-based thin-slicing framework as a scalable feedback tool to enhance human communication.

Title: Collaborative Bayesian Optimization via Wasserstein Barycenters

Authors: Donglin Zhan, Haoting Zhang, Rhonda Righter, Zeyu Zheng, James Anderson
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2504.10770
Pdf URL: https://arxiv.org/pdf/2504.10770
Copy Paste: [[2504.10770]] Collaborative Bayesian Optimization via Wasserstein Barycenters(https://arxiv.org/abs/2504.10770)
Keywords: privacy
Abstract: Motivated by the growing need for black-box optimization and data privacy, we introduce a collaborative Bayesian optimization (BO) framework that addresses both of these challenges. In this framework agents work collaboratively to optimize a function they only have oracle access to. In order to mitigate against communication and privacy constraints, agents are not allowed to share their data but can share their Gaussian process (GP) surrogate models. To enable collaboration under these constraints, we construct a central model to approximate the objective function by leveraging the concept of Wasserstein barycenters of GPs. This central model integrates the shared models without accessing the underlying data. A key aspect of our approach is a collaborative acquisition function that balances exploration and exploitation, allowing for the optimization of decision variables collaboratively in each iteration. We prove that our proposed algorithm is asymptotically consistent and that its implementation via Monte Carlo methods is numerically accurate. Through numerical experiments, we demonstrate that our approach outperforms other baseline collaborative frameworks and is competitive with centralized approaches that do not consider data privacy.

Title: AtlasD: Automatic Local Symmetry Discovery

Authors: Manu Bhat, Jonghyun Park, Jianke Yang, Nima Dehmamy, Robin Walters, Rose Yu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.10777
Pdf URL: https://arxiv.org/pdf/2504.10777
Copy Paste: [[2504.10777]] AtlasD: Automatic Local Symmetry Discovery(https://arxiv.org/abs/2504.10777)
Keywords: segmentation
Abstract: Existing symmetry discovery methods predominantly focus on global transformations across the entire system or space, but they fail to consider the symmetries in local neighborhoods. This may result in the reported symmetry group being a misrepresentation of the true symmetry. In this paper, we formalize the notion of local symmetry as atlas equivariance. Our proposed pipeline, automatic local symmetry discovery (AtlasD), recovers the local symmetries of a function by training local predictor networks and then learning a Lie group basis to which the predictors are equivariant. We demonstrate AtlasD is capable of discovering local symmetry groups with multiple connected components in top-quark tagging and partial differential equation experiments. The discovered local symmetry is shown to be a useful inductive bias that improves the performance of downstream tasks in climate segmentation and vision tasks.

Title: GUM-SAGE: A Novel Dataset and Approach for Graded Entity Salience Prediction

Authors: Jessica Lin, Amir Zeldes
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10792
Pdf URL: https://arxiv.org/pdf/2504.10792
Copy Paste: [[2504.10792]] GUM-SAGE: A Novel Dataset and Approach for Graded Entity Salience Prediction(https://arxiv.org/abs/2504.10792)
Keywords: extraction, explainability
Abstract: Determining and ranking the most salient entities in a text is critical for user-facing systems, especially as users increasingly rely on models to interpret long documents they only partially read. Graded entity salience addresses this need by assigning entities scores that reflect their relative importance in a text. Existing approaches fall into two main categories: subjective judgments of salience, which allow for gradient scoring but lack consistency, and summarization-based methods, which define salience as mention-worthiness in a summary, promoting explainability but limiting outputs to binary labels (entities are either summary-worthy or not). In this paper, we introduce a novel approach for graded entity salience that combines the strengths of both approaches. Using an English dataset spanning 12 spoken and written genres, we collect 5 summaries per document and calculate each entity's salience score based on its presence across these summaries. Our approach shows stronger correlation with scores based on human summaries and alignments, and outperforms existing techniques, including LLMs. We release our data and code at this https URL to support further research on graded salient entity extraction.

Title: Name of Thrones: Evaluating How LLMs Rank Student Names, Race, and Gender in Status Hierarchies

Authors: Annabella Sakunkoo, Jonathan Sakunkoo
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2504.10797
Pdf URL: https://arxiv.org/pdf/2504.10797
Copy Paste: [[2504.10797]] Name of Thrones: Evaluating How LLMs Rank Student Names, Race, and Gender in Status Hierarchies(https://arxiv.org/abs/2504.10797)
Keywords: fair
Abstract: Across cultures, names tell a lot about their bearers as they carry deep personal and cultural significance. Names also serve as powerful signals of gender, race, and status in the social hierarchy - a pecking order in which individual positions shape others' expectations on their perceived competence and worth. With the widespread adoption of LLMs and as names are often an input for LLMs, it is crucial to evaluate whether LLMs may sort people into status positions based on first and last names and, if so, whether it is in an unfair, biased fashion. While prior work has primarily investigated biases in first names, little attention has been paid to last names and even less to the combined effects of first and last names. In this study, we conduct a large-scale analysis of name variations across 5 ethnicities to examine how AI exhibits name biases. Our study investigates three key characteristics of inequality and finds that LLMs reflect and reinforce status hierarchies based on names that signal gender and ethnicity as they encode differential expectations of competence, leadership, and economic potential. Contrary to the common assumption that AI tends to favor Whites, we show that East and, in some contexts, South Asian names receive higher rankings. We also disaggregate Asians, a population projected to be the largest immigrant group in the U.S. by 2055. Our results challenge the monolithic Asian model minority assumption, illustrating a more complex and stratified model of bias. Gender moderates biases, with girls facing unfair disadvantages in certain racial groups. Additionally, spanning cultural categories by adopting Western first names improves AI-perceived status for East and Southeast Asian students, particularly for girls. Our findings underscore the importance of intersectional and more nuanced understandings of race, gender, and mixed identities in the evaluation of LLMs.

Title: The Sword of Damocles in ViTs: Computational Redundancy Amplifies Adversarial Transferability

Authors: Jiani Liu, Zhiyuan Wang, Zeliang Zhang, Chao Huang, Susan Liang, Yunlong Tang, Chenliang Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10804
Pdf URL: https://arxiv.org/pdf/2504.10804
Copy Paste: [[2504.10804]] The Sword of Damocles in ViTs: Computational Redundancy Amplifies Adversarial Transferability(https://arxiv.org/abs/2504.10804)
Keywords: attack, robust, transformer
Abstract: Vision Transformers (ViTs) have demonstrated impressive performance across a range of applications, including many safety-critical tasks. However, their unique architectural properties raise new challenges and opportunities in adversarial robustness. In particular, we observe that adversarial examples crafted on ViTs exhibit higher transferability compared to those crafted on CNNs, suggesting that ViTs contain structural characteristics favorable for transferable attacks. In this work, we investigate the role of computational redundancy in ViTs and its impact on adversarial transferability. Unlike prior studies that aim to reduce computation for efficiency, we propose to exploit this redundancy to improve the quality and transferability of adversarial examples. Through a detailed analysis, we identify two forms of redundancy, including the data-level and model-level, that can be harnessed to amplify attack effectiveness. Building on this insight, we design a suite of techniques, including attention sparsity manipulation, attention head permutation, clean token regularization, ghost MoE diversification, and test-time adversarial training. Extensive experiments on the ImageNet-1k dataset validate the effectiveness of our approach, showing that our methods significantly outperform existing baselines in both transferability and generality across diverse model architectures.

Title: Power-scaled Bayesian Inference with Score-based Generative mModels

Authors: Huseyin Tuna Erdinc, Yunlin Zeng, Abhinav Prakash Gahlot, Felix J. Herrmann
Subjects: cs.LG, cs.CV, physics.geo-ph
Abstract URL: https://arxiv.org/abs/2504.10807
Pdf URL: https://arxiv.org/pdf/2504.10807
Copy Paste: [[2504.10807]] Power-scaled Bayesian Inference with Score-based Generative mModels(https://arxiv.org/abs/2504.10807)
Keywords: generative
Abstract: We propose a score-based generative algorithm for sampling from power-scaled priors and likelihoods within the Bayesian inference framework. Our algorithm enables flexible control over prior-likelihood influence without requiring retraining for different power-scaling configurations. Specifically, we focus on synthesizing seismic velocity models conditioned on imaged seismic. Our method enables sensitivity analysis by sampling from intermediate power posteriors, allowing us to assess the relative influence of the prior and likelihood on samples of the posterior distribution. Through a comprehensive set of experiments, we evaluate the effects of varying the power parameter in different settings: applying it solely to the prior, to the likelihood of a Bayesian formulation, and to both simultaneously. The results show that increasing the power of the likelihood up to a certain threshold improves the fidelity of posterior samples to the conditioning data (e.g., seismic images), while decreasing the prior power promotes greater structural diversity among samples. Moreover, we find that moderate scaling of the likelihood leads to a reduced shot data residual, confirming its utility in posterior refinement.

Title: Tabular foundation model to detect empathy from visual cues

Authors: Md Rakibul Hasan, Shafin Rahman, Md Zakir Hossain, Aneesh Krishna, Tom Gedeon
Subjects: cs.CV, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10808
Pdf URL: https://arxiv.org/pdf/2504.10808
Copy Paste: [[2504.10808]] Tabular foundation model to detect empathy from visual cues(https://arxiv.org/abs/2504.10808)
Keywords: privacy, large language model
Abstract: Detecting empathy from video interactions is an emerging area of research. Video datasets, however, are often released as extracted features (i.e., tabular data) rather than raw footage due to privacy and ethical concerns. Prior research on such tabular datasets established tree-based classical machine learning approaches as the best-performing models. Motivated by the recent success of textual foundation models (i.e., large language models), we explore the use of tabular foundation models in empathy detection from tabular visual features. We experiment with two recent tabular foundation models $-$ TabPFN v2 and TabICL $-$ through in-context learning and fine-tuning setups. Our experiments on a public human-robot interaction benchmark demonstrate a significant boost in cross-subject empathy detection accuracy over several strong baselines (accuracy: $0.590 \rightarrow 0.730$; AUC: $0.564 \rightarrow 0.669$). In addition to performance improvement, we contribute novel insights and an evaluation setup to ensure generalisation on unseen subjects in this public benchmark. As the practice of releasing video features as tabular datasets is likely to persist due to privacy constraints, our findings will be widely applicable to future empathy detection video datasets as well.

Title: GaSLight: Gaussian Splats for Spatially-Varying Lighting in HDR

Authors: Christophe Bolduc, Yannick Hold-Geoffroy, Zhixin Shu, Jean-François Lalonde
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10809
Pdf URL: https://arxiv.org/pdf/2504.10809
Copy Paste: [[2504.10809]] GaSLight: Gaussian Splats for Spatially-Varying Lighting in HDR(https://arxiv.org/abs/2504.10809)
Keywords: diffusion
Abstract: We present GaSLight, a method that generates spatially-varying lighting from regular images. Our method proposes using HDR Gaussian Splats as light source representation, marking the first time regular images can serve as light sources in a 3D renderer. Our two-stage process first enhances the dynamic range of images plausibly and accurately by leveraging the priors embedded in diffusion models. Next, we employ Gaussian Splats to model 3D lighting, achieving spatially variant lighting. Our approach yields state-of-the-art results on HDR estimations and their applications in illuminating virtual objects and scenes. To facilitate the benchmarking of images as light sources, we introduce a novel dataset of calibrated and unsaturated HDR to evaluate images as light sources. We assess our method using a combination of this novel dataset and an existing dataset from the literature. The code to reproduce our method will be available upon acceptance.

Title: FlexiContracts: A Novel and Efficient Scheme for Upgrading Smart Contracts in Ethereum Blockchain

Authors: Tahrim Hossain, Sakib Hassan, Faisal Haque Bappy, Muhammad Nur Yanhaona, Sarker Ahmed Rumee, Moinul Zaber, Tariqul Islam
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.10811
Pdf URL: https://arxiv.org/pdf/2504.10811
Copy Paste: [[2504.10811]] FlexiContracts: A Novel and Efficient Scheme for Upgrading Smart Contracts in Ethereum Blockchain(https://arxiv.org/abs/2504.10811)
Keywords: secure
Abstract: Blockchain technology has revolutionized contractual processes, enhancing efficiency and trust through smart contracts. Ethereum, as a pioneer in this domain, offers a platform for decentralized applications but is challenged by the immutability of smart contracts, which makes upgrades cumbersome. Existing design patterns, while addressing upgradability, introduce complexity, increased development effort, and higher gas costs, thus limiting their effectiveness. In response, we introduce FlexiContracts, an innovative scheme that reimagines the evolution of smart contracts on Ethereum. By enabling secure, in-place upgrades without losing historical data, FlexiContracts surpasses existing approaches, introducing a previously unexplored path in smart contract evolution. Its streamlined design transcends the limitations of current design patterns by simplifying smart contract development, eliminating the need for extensive upfront planning, and significantly reducing the complexity of the design process. This advancement fosters an environment for continuous improvement and adaptation to new requirements, redefining the possibilities for dynamic, upgradable smart contracts.

Title: FHBench: Towards Efficient and Personalized Federated Learning for Multimodal Healthcare

Authors: Penghao Wang, Qian Chen, Teng Zhang, Yingwei Zhang, Wang Lu, Yiqiang Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10817
Pdf URL: https://arxiv.org/pdf/2504.10817
Copy Paste: [[2504.10817]] FHBench: Towards Efficient and Personalized Federated Learning for Multimodal Healthcare(https://arxiv.org/abs/2504.10817)
Keywords: robust, federate
Abstract: Federated Learning (FL) has emerged as an effective solution for multi-institutional collaborations without sharing patient data, offering a range of methods tailored for diverse applications. However, real-world medical datasets are often multimodal, and computational resources are limited, posing significant challenges for existing FL approaches. Recognizing these limitations, we developed the Federated Healthcare Benchmark(FHBench), a benchmark specifically designed from datasets derived from real-world healthcare applications. FHBench encompasses critical diagnostic tasks across domains such as the nervous, cardiovascular, and respiratory systems and general pathology, providing comprehensive support for multimodal healthcare evaluations and filling a significant gap in existing benchmarks. Building on FHBench, we introduced Efficient Personalized Federated Learning with Adaptive LoRA(EPFL), a personalized FL framework that demonstrates superior efficiency and effectiveness across various healthcare modalities. Our results highlight the robustness of FHBench as a benchmarking tool and the potential of EPFL as an innovative approach to advancing healthcare-focused FL, addressing key limitations of existing methods.

Title: IlluSign: Illustrating Sign Language Videos by Leveraging the Attention Mechanism

Authors: Janna Bruner, Amit Moryossef, Lior Wolf
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10822
Pdf URL: https://arxiv.org/pdf/2504.10822
Copy Paste: [[2504.10822]] IlluSign: Illustrating Sign Language Videos by Leveraging the Attention Mechanism(https://arxiv.org/abs/2504.10822)
Keywords: diffusion, generative
Abstract: Sign languages are dynamic visual languages that involve hand gestures, in combination with non manual elements such as facial expressions. While video recordings of sign language are commonly used for education and documentation, the dynamic nature of signs can make it challenging to study them in detail, especially for new learners and educators. This work aims to convert sign language video footage into static illustrations, which serve as an additional educational resource to complement video content. This process is usually done by an artist, and is therefore quite costly. We propose a method that illustrates sign language videos by leveraging generative models' ability to understand both the semantic and geometric aspects of images. Our approach focuses on transferring a sketch like illustration style to video footage of sign language, combining the start and end frames of a sign into a single illustration, and using arrows to highlight the hand's direction and motion. While many style transfer methods address domain adaptation at varying levels of abstraction, applying a sketch like style to sign languages, especially for hand gestures and facial expressions, poses a significant challenge. To tackle this, we intervene in the denoising process of a diffusion model, injecting style as keys and values into high resolution attention layers, and fusing geometric information from the image and edges as queries. For the final illustration, we use the attention mechanism to combine the attention weights from both the start and end illustrations, resulting in a soft combination. Our method offers a cost effective solution for generating sign language illustrations at inference time, addressing the lack of such resources in educational materials.

Title: CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives

Authors: Ayoung Lee, Ryan Sungmo Kwon, Peter Railton, Lu Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10823
Pdf URL: https://arxiv.org/pdf/2504.10823
Copy Paste: [[2504.10823]] CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives(https://arxiv.org/abs/2504.10823)
Keywords: large language model
Abstract: Navigating high-stakes dilemmas involving conflicting values is challenging even for humans, let alone for AI. Yet prior work in evaluating the reasoning capabilities of large language models (LLMs) in such situations has been limited to everyday scenarios. To close this gap, this work first introduces CLASH (Character perspective-based LLM Assessments in Situations with High-stakes), a meticulously curated dataset consisting of 345 high-impact dilemmas along with 3,795 individual perspectives of diverse values. In particular, we design CLASH in a way to support the study of critical aspects of value-based decision-making processes which are missing from prior work, including understanding decision ambivalence and psychological discomfort as well as capturing the temporal shifts of values in characters' perspectives. By benchmarking 10 open and closed frontier models, we uncover several key findings. (1) Even the strongest models, such as GPT-4o and Claude-Sonnet, achieve less than 50% accuracy in identifying situations where the decision should be ambivalent, while they perform significantly better in clear-cut scenarios. (2) While LLMs reasonably predict psychological discomfort as marked by human, they inadequately comprehend perspectives involving value shifts, indicating a need for LLMs to reason over complex values. (3) Our experiments also reveal a significant correlation between LLMs' value preferences and their steerability towards a given value. (4) Finally, LLMs exhibit greater steerability when engaged in value reasoning from a third-party perspective, compared to a first-person setup, though certain value pairs benefit uniquely from the first-person framing.

Title: OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding

Authors: Dianbing Xi, Jiepeng Wang, Yuanzhi Liang, Xi Qiu, Yuchi Huo, Rui Wang, Chi Zhang, Xuelong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10825
Pdf URL: https://arxiv.org/pdf/2504.10825
Copy Paste: [[2504.10825]] OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding(https://arxiv.org/abs/2504.10825)
Keywords: diffusion, segmentation
Abstract: In this paper, we propose a novel framework for controllable video diffusion, OmniVDiff, aiming to synthesize and comprehend multiple video visual content in a single diffusion model. To achieve this, OmniVDiff treats all video visual modalities in the color space to learn a joint distribution, while employing an adaptive control strategy that dynamically adjusts the role of each visual modality during the diffusion process, either as a generation modality or a conditioning modality. This allows flexible manipulation of each modality's role, enabling support for a wide range of tasks. Consequently, our model supports three key functionalities: (1) Text-conditioned video generation: multi-modal visual video sequences (i.e., rgb, depth, canny, segmentaion) are generated based on the text conditions in one diffusion process; (2) Video understanding: OmniVDiff can estimate the depth, canny map, and semantic segmentation across the input rgb frames while ensuring coherence with the rgb input; and (3) X-conditioned video generation: OmniVDiff generates videos conditioned on fine-grained attributes (e.g., depth maps or segmentation maps). By integrating these diverse tasks into a unified video diffusion framework, OmniVDiff enhances the flexibility and scalability for controllable video diffusion, making it an effective tool for a variety of downstream applications, such as video-to-video translation. Extensive experiments demonstrate the effectiveness of our approach, highlighting its potential for various video-related applications.

Title: LayoutCoT: Unleashing the Deep Reasoning Potential of Large Language Models for Layout Generation

Authors: Hengyu Shi, Junhao Su, Huansheng Ning, Xiaoming Wei, Jialin Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10829
Pdf URL: https://arxiv.org/pdf/2504.10829
Copy Paste: [[2504.10829]] LayoutCoT: Unleashing the Deep Reasoning Potential of Large Language Models for Layout Generation(https://arxiv.org/abs/2504.10829)
Keywords: generative, large language model
Abstract: Conditional layout generation aims to automatically generate visually appealing and semantically coherent layouts from user-defined constraints. While recent methods based on generative models have shown promising results, they typically require substantial amounts of training data or extensive fine-tuning, limiting their versatility and practical applicability. Alternatively, some training-free approaches leveraging in-context learning with Large Language Models (LLMs) have emerged, but they often suffer from limited reasoning capabilities and overly simplistic ranking mechanisms, which restrict their ability to generate consistently high-quality layouts. To this end, we propose LayoutCoT, a novel approach that leverages the reasoning capabilities of LLMs through a combination of Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) techniques. Specifically, LayoutCoT transforms layout representations into a standardized serialized format suitable for processing by LLMs. A Layout-aware RAG is used to facilitate effective retrieval and generate a coarse layout by LLMs. This preliminary layout, together with the selected exemplars, is then fed into a specially designed CoT reasoning module for iterative refinement, significantly enhancing both semantic coherence and visual quality. We conduct extensive experiments on five public datasets spanning three conditional layout generation tasks. Experimental results demonstrate that LayoutCoT achieves state-of-the-art performance without requiring training or fine-tuning. Notably, our CoT reasoning module enables standard LLMs, even those without explicit deep reasoning abilities, to outperform specialized deep-reasoning models such as deepseek-R1, highlighting the potential of our approach in unleashing the deep reasoning capabilities of LLMs for layout generation tasks.

Title: Towards Spatially-Aware and Optimally Faithful Concept-Based Explanations

Authors: Shubham Kumar, Dwip Dalal, Narendra Ahuja
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2504.10833
Pdf URL: https://arxiv.org/pdf/2504.10833
Copy Paste: [[2504.10833]] Towards Spatially-Aware and Optimally Faithful Concept-Based Explanations(https://arxiv.org/abs/2504.10833)
Keywords: robust
Abstract: Post-hoc, unsupervised concept-based explanation methods (U-CBEMs) are a promising tool for generating semantic explanations of the decision-making processes in deep neural networks, having applications in both model improvement and understanding. It is vital that the explanation is accurate, or faithful, to the model, yet we identify several limitations of prior faithfulness metrics that inhibit an accurate evaluation; most notably, prior metrics involve only the set of concepts present, ignoring how they may be spatially distributed. We address these limitations with Surrogate Faithfulness (SF), an evaluation method that introduces a spatially-aware surrogate and two novel faithfulness metrics. Using SF, we produce Optimally Faithful (OF) explanations, where concepts are found that maximize faithfulness. Our experiments show that (1) adding spatial-awareness to prior U-CBEMs increases faithfulness in all cases; (2) OF produces significantly more faithful explanations than prior U-CBEMs (30% or higher improvement in error); (3) OF's learned concepts generalize well to out-of-domain data and are more robust to adversarial examples, where prior U-CBEMs struggle.

Title: LightFormer: A lightweight and efficient decoder for remote sensing image segmentation

Authors: Sihang Chen, Lijun Yun, Ze Liu, JianFeng Zhu, Jie Chen, Hui Wang, Yueping Nie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10834
Pdf URL: https://arxiv.org/pdf/2504.10834
Copy Paste: [[2504.10834]] LightFormer: A lightweight and efficient decoder for remote sensing image segmentation(https://arxiv.org/abs/2504.10834)
Keywords: robust, segmentation
Abstract: Deep learning techniques have achieved remarkable success in the semantic segmentation of remote sensing images and in land-use change detection. Nevertheless, their real-time deployment on edge platforms remains constrained by decoder complexity. Herein, we introduce LightFormer, a lightweight decoder for time-critical tasks that involve unstructured targets, such as disaster assessment, unmanned aerial vehicle search-and-rescue, and cultural heritage monitoring. LightFormer employs a feature-fusion and refinement module built on channel processing and a learnable gating mechanism to aggregate multi-scale, multi-range information efficiently, which drastically curtails model complexity. Furthermore, we propose a spatial information selection module (SISM) that integrates long-range attention with a detail preservation branch to capture spatial dependencies across multiple scales, thereby substantially improving the recognition of unstructured targets in complex scenes. On the ISPRS Vaihingen benchmark, LightFormer attains 99.9% of GLFFNet's mIoU (83.9% vs. 84.0%) while requiring only 14.7% of its FLOPs and 15.9% of its parameters, thus achieving an excellent accuracy-efficiency trade-off. Consistent results on LoveDA, ISPRS Potsdam, RescueNet, and FloodNet further demonstrate its robustness and superior perception of unstructured objects. These findings highlight LightFormer as a practical solution for remote sensing applications where both computational economy and high-precision segmentation are imperative.

Title: Moving Beyond Next-Token Prediction: Transformers are Context-Sensitive Language Generators

Authors: Phill Kyu Rhee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10845
Pdf URL: https://arxiv.org/pdf/2504.10845
Copy Paste: [[2504.10845]] Moving Beyond Next-Token Prediction: Transformers are Context-Sensitive Language Generators(https://arxiv.org/abs/2504.10845)
Keywords: transformer, generative, large language model
Abstract: Large Language Models (LLMs), powered by Transformers, have demonstrated human-like intelligence capabilities, yet their underlying mechanisms remain poorly understood. This paper presents a novel framework for interpreting LLMs as probabilistic left context-sensitive languages (CSLs) generators. We hypothesize that Transformers can be effectively decomposed into three fundamental components: context windows, attention mechanisms, and autoregressive generation frameworks. This decomposition allows for the development of more flexible and interpretable computational models, moving beyond the traditional view of attention and autoregression as inseparable processes. We argue that next-token predictions can be understood as probabilistic, dynamic approximations of left CSL production rules, providing an intuitive explanation for how simple token predictions can yield human-like intelligence outputs. Given that all CSLs are left context-sensitive (Penttonen, 1974), we conclude that Transformers stochastically approximate CSLs, which are widely recognized as models of human-like intelligence. This interpretation bridges the gap between Formal Language Theory and the observed generative power of Transformers, laying a foundation for future advancements in generative AI theory and applications. Our novel perspective on Transformer architectures will foster a deeper understanding of LLMs and their future potentials.

Title: How to Enhance Downstream Adversarial Robustness (almost) without Touching the Pre-Trained Foundation Model?

Authors: Meiqi Liu, Zhuoqun Huang, Yue Xing
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2504.10850
Pdf URL: https://arxiv.org/pdf/2504.10850
Copy Paste: [[2504.10850]] How to Enhance Downstream Adversarial Robustness (almost) without Touching the Pre-Trained Foundation Model?(https://arxiv.org/abs/2504.10850)
Keywords: robust
Abstract: With the rise of powerful foundation models, a pre-training-fine-tuning paradigm becomes increasingly popular these days: A foundation model is pre-trained using a huge amount of data from various sources, and then the downstream users only need to fine-tune and adapt it to specific downstream tasks. However, due to the high computation complexity of adversarial training, it is not feasible to fine-tune the foundation model to improve its robustness on the downstream task. Observing the above challenge, we want to improve the downstream robustness without updating/accessing the weights in the foundation model. Inspired from existing literature in robustness inheritance (Kim et al., 2020), through theoretical investigation, we identify a close relationship between robust contrastive learning with the adversarial robustness of supervised learning. To further validate and utilize this theoretical insight, we design a simple-yet-effective robust auto-encoder as a data pre-processing method before feeding the data into the foundation model. The proposed approach has zero access to the foundation model when training the robust auto-encoder. Extensive experiments demonstrate the effectiveness of the proposed method in improving the robustness of downstream tasks, verifying the connection between the feature robustness (implied by small adversarial contrastive loss) and the robustness of the downstream task.

Title: ICAFS: Inter-Client-Aware Feature Selection for Vertical Federated Learning

Authors: Ruochen Jin, Boning Tong, Shu Yang, Bojian Hou, Li Shen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.10851
Pdf URL: https://arxiv.org/pdf/2504.10851
Copy Paste: [[2504.10851]] ICAFS: Inter-Client-Aware Feature Selection for Vertical Federated Learning(https://arxiv.org/abs/2504.10851)
Keywords: federate
Abstract: Vertical federated learning (VFL) enables a paradigm for vertically partitioned data across clients to collaboratively train machine learning models. Feature selection (FS) plays a crucial role in Vertical Federated Learning (VFL) due to the unique nature that data are distributed across multiple clients. In VFL, different clients possess distinct subsets of features for overlapping data samples, making the process of identifying and selecting the most relevant features a complex yet essential task. Previous FS efforts have primarily revolved around intra-client feature selection, overlooking vital feature interaction across clients, leading to subpar model outcomes. We introduce ICAFS, a novel multi-stage ensemble approach for effective FS in VFL by considering inter-client interactions. By employing conditional feature synthesis alongside multiple learnable feature selectors, ICAFS facilitates ensemble FS over these selectors using synthetic embeddings. This method bypasses the limitations of private gradient sharing and allows for model training using real data with refined embeddings. Experiments on multiple real-world datasets demonstrate that ICAFS surpasses current state-of-the-art methods in prediction accuracy.

Title: Enhancing Features in Long-tailed Data Using Large Vision Mode

Authors: Pengxiao Han, Changkun Ye, Jinguang Tong, Cuicui Jiang, Jie Hong, Li Fang, Xuesong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10852
Pdf URL: https://arxiv.org/pdf/2504.10852
Copy Paste: [[2504.10852]] Enhancing Features in Long-tailed Data Using Large Vision Mode(https://arxiv.org/abs/2504.10852)
Keywords: large language model
Abstract: Language-based foundation models, such as large language models (LLMs) or large vision-language models (LVLMs), have been widely studied in long-tailed recognition. However, the need for linguistic data is not applicable to all practical tasks. In this study, we aim to explore using large vision models (LVMs) or visual foundation models (VFMs) to enhance long-tailed data features without any language information. Specifically, we extract features from the LVM and fuse them with features in the baseline network's map and latent space to obtain the augmented features. Moreover, we design several prototype-based losses in the latent space to further exploit the potential of the augmented features. In the experimental section, we validate our approach on two benchmark datasets: ImageNet-LT and iNaturalist2018.

Title: PT-Mark: Invisible Watermarking for Text-to-image Diffusion Models via Semantic-aware Pivotal Tuning

Authors: Yaopeng Wang, Huiyu Xu, Zhibo Wang, Jiacheng Du, Zhichao Li, Yiming Li, Qiu Wang, Kui Ren
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.10853
Pdf URL: https://arxiv.org/pdf/2504.10853
Copy Paste: [[2504.10853]] PT-Mark: Invisible Watermarking for Text-to-image Diffusion Models via Semantic-aware Pivotal Tuning(https://arxiv.org/abs/2504.10853)
Keywords: protect, robust, watermark, diffusion
Abstract: Watermarking for diffusion images has drawn considerable attention due to the widespread use of text-to-image diffusion models and the increasing need for their copyright protection. Recently, advanced watermarking techniques, such as Tree Ring, integrate watermarks by embedding traceable patterns (e.g., Rings) into the latent distribution during the diffusion process. Such methods disrupt the original semantics of the generated images due to the inevitable distribution shift caused by the watermarks, thereby limiting their practicality, particularly in digital art creation. In this work, we present Semantic-aware Pivotal Tuning Watermarks (PT-Mark), a novel invisible watermarking method that preserves both the semantics of diffusion images and the traceability of the watermark. PT-Mark preserves the original semantics of the watermarked image by gradually aligning the generation trajectory with the original (pivotal) trajectory while maintaining the traceable watermarks during whole diffusion denoising process. To achieve this, we first compute the salient regions of the watermark at each diffusion denoising step as a spatial prior to identify areas that can be aligned without disrupting the watermark pattern. Guided by the region, we then introduce an additional pivotal tuning branch that optimizes the text embedding to align the semantics while preserving the watermarks. Extensive evaluations demonstrate that PT-Mark can preserve the original semantics of the diffusion images while integrating robust watermarks. It achieves a 10% improvement in the performance of semantic preservation (i.e., SSIM, PSNR, and LPIPS) compared to state-of-the-art watermarking methods, while also showing comparable robustness against real-world perturbations and four times greater efficiency.

Title: LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation

Authors: Hanning Chen, Yang Ni, Wenjun Huang, Hyunwoo Oh, Yezi Liu, Tamoghno Das, Mohsen Imani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10854
Pdf URL: https://arxiv.org/pdf/2504.10854
Copy Paste: [[2504.10854]] LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation(https://arxiv.org/abs/2504.10854)
Keywords: segmentation
Abstract: Large Vision Language Models (LVLMs) have been widely adopted to guide vision foundation models in performing reasoning segmentation tasks, achieving impressive performance. However, the substantial computational overhead associated with LVLMs presents a new challenge. The primary source of this computational cost arises from processing hundreds of image tokens. Therefore, an effective strategy to mitigate such overhead is to reduce the number of image tokens, a process known as image token pruning. Previous studies on image token pruning for LVLMs have primarily focused on high level visual understanding tasks, such as visual question answering and image captioning. In contrast, guiding vision foundation models to generate accurate visual masks based on textual queries demands precise semantic and spatial reasoning capabilities. Consequently, pruning methods must carefully control individual image tokens throughout the LVLM reasoning process. Our empirical analysis reveals that existing methods struggle to adequately balance reductions in computational overhead with the necessity to maintain high segmentation accuracy. In this work, we propose LVLM_CSP, a novel training free visual token pruning method specifically designed for LVLM based reasoning segmentation tasks. LVLM_CSP consists of three stages: clustering, scattering, and pruning. Initially, the LVLM performs coarse-grained visual reasoning using a subset of selected image tokens. Next, fine grained reasoning is conducted, and finally, most visual tokens are pruned in the last stage. Extensive experiments demonstrate that LVLM_CSP achieves a 65% reduction in image token inference FLOPs with virtually no accuracy degradation, and a 70% reduction with only a minor 1% drop in accuracy on the 7B LVLM.

Title: DAAF:Degradation-Aware Adaptive Fusion Framework for Robust Infrared and Visible Images Fusion

Authors: Tianpei Zhang, Jufeng Zhao, Yiming Zhu, Guangmang Cui, Yuxin Jing, Yuhan Lyu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10871
Pdf URL: https://arxiv.org/pdf/2504.10871
Copy Paste: [[2504.10871]] DAAF:Degradation-Aware Adaptive Fusion Framework for Robust Infrared and Visible Images Fusion(https://arxiv.org/abs/2504.10871)
Keywords: robust, extraction
Abstract: Existing infrared and visible image fusion(IVIF) algorithms often prioritize high-quality images, neglecting image degradation such as low light and noise, which limits the practical potential. This paper propose Degradation-Aware Adaptive image Fusion (DAAF), which achieves unified modeling of adaptive degradation optimization and image fusion. Specifically, DAAF comprises an auxiliary Adaptive Degradation Optimization Network (ADON) and a Feature Interactive Local-Global Fusion (FILGF) Network. Firstly, ADON includes infrared and visible-light branches. Within the infrared branch, frequency-domain feature decomposition and extraction are employed to isolate Gaussian and stripe noise. In the visible-light branch, Retinex decomposition is applied to extract illumination and reflectance components, enabling complementary enhancement of detail and illumination distribution. Subsequently, FILGF performs interactive multi-scale local-global feature fusion. Local feature fusion consists of intra-inter model feature complement, while global feature fusion is achieved through a interactive cross-model attention. Extensive experiments have shown that DAAF outperforms current IVIF algorithms in normal and complex degradation scenarios.

Title: Can Vision-Language Models Understand and Interpret Dynamic Gestures from Pedestrians? Pilot Datasets and Exploration Towards Instructive Nonverbal Commands for Cooperative Autonomous Vehicles

Authors: Tonko E. W. Bossen, Andreas Møgelmose, Ross Greer
Subjects: cs.CV, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2504.10873
Pdf URL: https://arxiv.org/pdf/2504.10873
Copy Paste: [[2504.10873]] Can Vision-Language Models Understand and Interpret Dynamic Gestures from Pedestrians? Pilot Datasets and Exploration Towards Instructive Nonverbal Commands for Cooperative Autonomous Vehicles(https://arxiv.org/abs/2504.10873)
Keywords: robust
Abstract: In autonomous driving, it is crucial to correctly interpret traffic gestures (TGs), such as those of an authority figure providing orders or instructions, or a pedestrian signaling the driver, to ensure a safe and pleasant traffic environment for all road users. This study investigates the capabilities of state-of-the-art vision-language models (VLMs) in zero-shot interpretation, focusing on their ability to caption and classify human gestures in traffic contexts. We create and publicly share two custom datasets with varying formal and informal TGs, such as 'Stop', 'Reverse', 'Hail', etc. The datasets are "Acted TG (ATG)" and "Instructive TG In-The-Wild (ITGI)". They are annotated with natural language, describing the pedestrian's body position and gesture. We evaluate models using three methods utilizing expert-generated captions as baseline and control: (1) caption similarity, (2) gesture classification, and (3) pose sequence reconstruction similarity. Results show that current VLMs struggle with gesture understanding: sentence similarity averages below 0.59, and classification F1 scores reach only 0.14-0.39, well below the expert baseline of 0.70. While pose reconstruction shows potential, it requires more data and refined metrics to be reliable. Our findings reveal that although some SOTA VLMs can interpret zero-shot human traffic gestures, none are accurate and robust enough to be trustworthy, emphasizing the need for further research in this domain.

Title: Weather-Aware Object Detection Transformer for Domain Adaptation

Authors: Soheil Gharatappeh, Salimeh Sekeh, Vikas Dhiman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10877
Pdf URL: https://arxiv.org/pdf/2504.10877
Copy Paste: [[2504.10877]] Weather-Aware Object Detection Transformer for Domain Adaptation(https://arxiv.org/abs/2504.10877)
Keywords: robust, transformer
Abstract: RT-DETRs have shown strong performance across various computer vision tasks but are known to degrade under challenging weather conditions such as fog. In this work, we investigate three novel approaches to enhance RT-DETR robustness in foggy environments: (1) Domain Adaptation via Perceptual Loss, which distills domain-invariant features from a teacher network to a student using perceptual supervision; (2) Weather Adaptive Attention, which augments the attention mechanism with fog-sensitive scaling by introducing an auxiliary foggy image stream; and (3) Weather Fusion Encoder, which integrates a dual-stream encoder architecture that fuses clear and foggy image features via multi-head self and cross-attention. Despite the architectural innovations, none of the proposed methods consistently outperform the baseline RT-DETR. We analyze the limitations and potential causes, offering insights for future research in weather-aware object detection.

Title: Large Language Model-Informed Feature Discovery Improves Prediction and Interpretation of Credibility Perceptions of Visual Content

Authors: Yilang Peng, Sijia Qian, Yingdan Lu, Cuihua Shen
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10878
Pdf URL: https://arxiv.org/pdf/2504.10878
Copy Paste: [[2504.10878]] Large Language Model-Informed Feature Discovery Improves Prediction and Interpretation of Credibility Perceptions of Visual Content(https://arxiv.org/abs/2504.10878)
Keywords: large language model
Abstract: In today's visually dominated social media landscape, predicting the perceived credibility of visual content and understanding what drives human judgment are crucial for countering misinformation. However, these tasks are challenging due to the diversity and richness of visual features. We introduce a Large Language Model (LLM)-informed feature discovery framework that leverages multimodal LLMs, such as GPT-4o, to evaluate content credibility and explain its reasoning. We extract and quantify interpretable features using targeted prompts and integrate them into machine learning models to improve credibility predictions. We tested this approach on 4,191 visual social media posts across eight topics in science, health, and politics, using credibility ratings from 5,355 crowdsourced workers. Our method outperformed zero-shot GPT-based predictions by 13 percent in R2, and revealed key features like information concreteness and image format. We discuss the implications for misinformation mitigation, visual credibility, and the role of LLMs in social science.

Title: Safe-Construct: Redefining Construction Safety Violation Recognition as 3D Multi-View Engagement Task

Authors: Aviral Chharia, Tianyu Ren, Tomotake Furuhata, Kenji Shimada
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10880
Pdf URL: https://arxiv.org/pdf/2504.10880
Copy Paste: [[2504.10880]] Safe-Construct: Redefining Construction Safety Violation Recognition as 3D Multi-View Engagement Task(https://arxiv.org/abs/2504.10880)
Keywords: robust
Abstract: Recognizing safety violations in construction environments is critical yet remains underexplored in computer vision. Existing models predominantly rely on 2D object detection, which fails to capture the complexities of real-world violations due to: (i) an oversimplified task formulation treating violation recognition merely as object detection, (ii) inadequate validation under realistic conditions, (iii) absence of standardized baselines, and (iv) limited scalability from the unavailability of synthetic dataset generators for diverse construction scenarios. To address these challenges, we introduce Safe-Construct, the first framework that reformulates violation recognition as a 3D multi-view engagement task, leveraging scene-level worker-object context and 3D spatial understanding. We also propose the Synthetic Indoor Construction Site Generator (SICSG) to create diverse, scalable training data, overcoming data limitations. Safe-Construct achieves a 7.6% improvement over state-of-the-art methods across four violation types. We rigorously evaluate our approach in near-realistic settings, incorporating four violations, four workers, 14 objects, and challenging conditions like occlusions (worker-object, worker-worker) and variable illumination (back-lighting, overexposure, sunlight). By integrating 3D multi-view spatial understanding and synthetic data generation, Safe-Construct sets a new benchmark for scalable and robust safety monitoring in high-risk industries. Project Website: this https URL

Title: Bringing together invertible UNets with invertible attention modules for memory-efficient diffusion models

Authors: Karan Jain, Mohammad Nayeem Teli
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10883
Pdf URL: https://arxiv.org/pdf/2504.10883
Copy Paste: [[2504.10883]] Bringing together invertible UNets with invertible attention modules for memory-efficient diffusion models(https://arxiv.org/abs/2504.10883)
Keywords: diffusion
Abstract: Diffusion models have recently gained state of the art performance on many image generation tasks. However, most models require significant computational resources to achieve this. This becomes apparent in the application of medical image synthesis due to the 3D nature of medical datasets like CT-scans, MRIs, electron microscope, etc. In this paper we propose a novel architecture for a single GPU memory-efficient training for diffusion models for high dimensional medical datasets. The proposed model is built by using an invertible UNet architecture with invertible attention modules. This leads to the following two contributions: 1. denoising diffusion models and thus enabling memory usage to be independent of the dimensionality of the dataset, and 2. reducing the energy usage during training. While this new model can be applied to a multitude of image generation tasks, we showcase its memory-efficiency on the 3D BraTS2020 dataset leading to up to 15\% decrease in peak memory consumption during training with comparable results to SOTA while maintaining the image quality.

Title: CDUPatch: Color-Driven Universal Adversarial Patch Attack for Dual-Modal Visible-Infrared Detectors

Authors: Jiahuan Long, Wen Yao, Tingsong Jiang, Chao Ma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10888
Pdf URL: https://arxiv.org/pdf/2504.10888
Copy Paste: [[2504.10888]] CDUPatch: Color-Driven Universal Adversarial Patch Attack for Dual-Modal Visible-Infrared Detectors(https://arxiv.org/abs/2504.10888)
Keywords: attack, robust
Abstract: Adversarial patches are widely used to evaluate the robustness of object detection systems in real-world scenarios. These patches were initially designed to deceive single-modal detectors (e.g., visible or infrared) and have recently been extended to target visible-infrared dual-modal detectors. However, existing dual-modal adversarial patch attacks have limited attack effectiveness across diverse physical scenarios. To address this, we propose CDUPatch, a universal cross-modal patch attack against visible-infrared object detectors across scales, views, and scenarios. Specifically, we observe that color variations lead to different levels of thermal absorption, resulting in temperature differences in infrared imaging. Leveraging this property, we propose an RGB-to-infrared adapter that maps RGB patches to infrared patches, enabling unified optimization of cross-modal patches. By learning an optimal color distribution on the adversarial patch, we can manipulate its thermal response and generate an adversarial infrared texture. Additionally, we introduce a multi-scale clipping strategy and construct a new visible-infrared dataset, MSDrone, which contains aerial vehicle images in varying scales and perspectives. These data augmentation strategies enhance the robustness of our patch in real-world conditions. Experiments on four benchmark datasets (e.g., DroneVehicle, LLVIP, VisDrone, MSDrone) show that our method outperforms existing patch attacks in the digital domain. Extensive physical tests further confirm strong transferability across scales, views, and scenarios.

Title: Bridging Distribution Gaps in Time Series Foundation Model Pretraining with Prototype-Guided Normalization

Authors: Peiliang Gong, Emadeldeen Eldele, Min Wu, Zhenghua Chen, Xiaoli Li, Daoqiang Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10900
Pdf URL: https://arxiv.org/pdf/2504.10900
Copy Paste: [[2504.10900]] Bridging Distribution Gaps in Time Series Foundation Model Pretraining with Prototype-Guided Normalization(https://arxiv.org/abs/2504.10900)
Keywords: robust, transformer
Abstract: Foundation models have achieved remarkable success across diverse machine-learning domains through large-scale pretraining on large, diverse datasets. However, pretraining on such datasets introduces significant challenges due to substantial mismatches in data distributions, a problem particularly pronounced with time series data. In this paper, we tackle this issue by proposing a domain-aware adaptive normalization strategy within the Transformer architecture. Specifically, we replace the traditional LayerNorm with a prototype-guided dynamic normalization mechanism (ProtoNorm), where learned prototypes encapsulate distinct data distributions, and sample-to-prototype affinity determines the appropriate normalization layer. This mechanism effectively captures the heterogeneity of time series characteristics, aligning pretrained representations with downstream tasks. Through comprehensive empirical evaluation, we demonstrate that our method significantly outperforms conventional pretraining techniques across both classification and forecasting tasks, while effectively mitigating the adverse effects of distribution shifts during pretraining. Incorporating ProtoNorm is as simple as replacing a single line of code. Extensive experiments on diverse real-world time series benchmarks validate the robustness and generalizability of our approach, advancing the development of more versatile time series foundation models.

Title: InterAnimate: Taming Region-aware Diffusion Model for Realistic Human Interaction Animation

Authors: Yukang Lin, Yan Hong, Zunnan Xu, Xindi Li, Chao Xu, Chuanbiao Song, Ronghui Li, Haoxing Chen, Jun Lan, Huijia Zhu, Weiqiang Wang, Jianfu Zhang, Xiu Li
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2504.10905
Pdf URL: https://arxiv.org/pdf/2504.10905
Copy Paste: [[2504.10905]] InterAnimate: Taming Region-aware Diffusion Model for Realistic Human Interaction Animation(https://arxiv.org/abs/2504.10905)
Keywords: security, biometric, diffusion
Abstract: Recent video generation research has focused heavily on isolated actions, leaving interactive motions-such as hand-face interactions-largely unexamined. These interactions are essential for emerging biometric authentication systems, which rely on interactive motion-based anti-spoofing approaches. From a security perspective, there is a growing need for large-scale, high-quality interactive videos to train and strengthen authentication models. In this work, we introduce a novel paradigm for animating realistic hand-face interactions. Our approach simultaneously learns spatio-temporal contact dynamics and biomechanically plausible deformation effects, enabling natural interactions where hand movements induce anatomically accurate facial deformations while maintaining collision-free contact. To facilitate this research, we present InterHF, a large-scale hand-face interaction dataset featuring 18 interaction patterns and 90,000 annotated videos. Additionally, we propose InterAnimate, a region-aware diffusion model designed specifically for interaction animation. InterAnimate leverages learnable spatial and temporal latents to effectively capture dynamic interaction priors and integrates a region-aware interaction mechanism that injects these priors into the denoising process. To the best of our knowledge, this work represents the first large-scale effort to systematically study human hand-face interactions. Qualitative and quantitative results show InterAnimate produces highly realistic animations, setting a new benchmark. Code and data will be made public to advance research.

Title: Understanding LLMs' Cross-Lingual Context Retrieval: How Good It Is And Where It Comes From

Authors: Changjiang Gao, Hankun Lin, Shujian Huang, Xin Huang, Xue Han, Junlan Feng, Chao Deng, Jiajun Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10906
Pdf URL: https://arxiv.org/pdf/2504.10906
Copy Paste: [[2504.10906]] Understanding LLMs' Cross-Lingual Context Retrieval: How Good It Is And Where It Comes From(https://arxiv.org/abs/2504.10906)
Keywords: interpretability, large language model
Abstract: The ability of cross-lingual context retrieval is a fundamental aspect of cross-lingual alignment of large language models (LLMs), where the model extracts context information in one language based on requests in another language. Despite its importance in real-life applications, this ability has not been adequately investigated for state-of-the-art models. In this paper, we evaluate the cross-lingual context retrieval ability of over 40 LLMs across 12 languages to understand the source of this ability, using cross-lingual machine reading comprehension (xMRC) as a representative scenario. Our results show that several small, post-trained open LLMs show strong cross-lingual context retrieval ability, comparable to closed-source LLMs such as GPT-4o, and their estimated oracle performances greatly improve after post-training. Our interpretability analysis shows that the cross-lingual context retrieval process can be divided into two main phases: question encoding and answer retrieval, which are formed in pre-training and post-training, respectively. The phasing stability correlates with xMRC performance, and the xMRC bottleneck lies at the last model layers in the second phase, where the effect of post-training can be evidently observed. Our results also indicate that larger-scale pretraining cannot improve the xMRC performance. Instead, larger LLMs need further multilingual post-training to fully unlock their cross-lingual context retrieval potential. Our code and is available at this https URL

Title: Towards A Universal Graph Structural Encoder

Authors: Jialin Chen, Haolan Zuo, Haoyu Peter Wang, Siqi Miao, Pan Li, Rex Ying
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10917
Pdf URL: https://arxiv.org/pdf/2504.10917
Copy Paste: [[2504.10917]] Towards A Universal Graph Structural Encoder(https://arxiv.org/abs/2504.10917)
Keywords: transformer, large language model
Abstract: Recent advancements in large-scale pre-training have shown the potential to learn generalizable representations for downstream tasks. In the graph domain, however, capturing and transferring structural information across different graph domains remains challenging, primarily due to the inherent differences in topological patterns across various contexts. Additionally, most existing models struggle to capture the complexity of rich graph structures, leading to inadequate exploration of the embedding space. To address these challenges, we propose GFSE, a universal graph structural encoder designed to capture transferable structural patterns across diverse domains such as molecular graphs, social networks, and citation networks. GFSE is the first cross-domain graph structural encoder pre-trained with multiple self-supervised learning objectives. Built on a Graph Transformer, GFSE incorporates attention mechanisms informed by graph inductive bias, enabling it to encode intricate multi-level and fine-grained topological features. The pre-trained GFSE produces generic and theoretically expressive positional and structural encoding for graphs, which can be seamlessly integrated with various downstream graph feature encoders, including graph neural networks for vectorized features and Large Language Models for text-attributed graphs. Comprehensive experiments on synthetic and real-world datasets demonstrate GFSE's capability to significantly enhance the model's performance while requiring substantially less task-specific fine-tuning. Notably, GFSE achieves state-of-the-art performance in 81.6% evaluated cases, spanning diverse graph models and datasets, highlighting its potential as a powerful and versatile encoder for graph-structured data.

Title: Fast-Powerformer: A Memory-Efficient Transformer for Accurate Mid-Term Wind Power Forecasting

Authors: Mingyi Zhu, Zhaoxin Li, Qiao Lin, Li Ding
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2504.10923
Pdf URL: https://arxiv.org/pdf/2504.10923
Copy Paste: [[2504.10923]] Fast-Powerformer: A Memory-Efficient Transformer for Accurate Mid-Term Wind Power Forecasting(https://arxiv.org/abs/2504.10923)
Keywords: security, extraction, transformer
Abstract: Wind power forecasting (WPF), as a significant research topic within renewable energy, plays a crucial role in enhancing the security, stability, and economic operation of power grids. However, due to the high stochasticity of meteorological factors (e.g., wind speed) and significant fluctuations in wind power output, mid-term wind power forecasting faces a dual challenge of maintaining high accuracy and computational efficiency. To address these issues, this paper proposes an efficient and lightweight mid-term wind power forecasting model, termed Fast-Powerformer. The proposed model is built upon the Reformer architecture, incorporating structural enhancements such as a lightweight Long Short-Term Memory (LSTM) embedding module, an input transposition mechanism, and a Frequency Enhanced Channel Attention Mechanism (FECAM). These improvements enable the model to strengthen temporal feature extraction, optimize dependency modeling across variables, significantly reduce computational complexity, and enhance sensitivity to periodic patterns and dominant frequency components. Experimental results conducted on multiple real-world wind farm datasets demonstrate that the proposed Fast-Powerformer achieves superior prediction accuracy and operational efficiency compared to mainstream forecasting approaches. Furthermore, the model exhibits fast inference speed and low memory consumption, highlighting its considerable practical value for real-world deployment scenarios.

Title: Can LLMs Leverage Observational Data? Towards Data-Driven Causal Discovery with LLMs

Authors: Yuni Susanti, Michael Färber
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10936
Pdf URL: https://arxiv.org/pdf/2504.10936
Copy Paste: [[2504.10936]] Can LLMs Leverage Observational Data? Towards Data-Driven Causal Discovery with LLMs(https://arxiv.org/abs/2504.10936)
Keywords: large language model
Abstract: Causal discovery traditionally relies on statistical methods applied to observational data, often requiring large datasets and assumptions about underlying causal structures. Recent advancements in Large Language Models (LLMs) have introduced new possibilities for causal discovery by providing domain expert knowledge. However, it remains unclear whether LLMs can effectively process observational data for causal discovery. In this work, we explore the potential of LLMs for data-driven causal discovery by integrating observational data for LLM-based reasoning. Specifically, we examine whether LLMs can effectively utilize observational data through two prompting strategies: pairwise prompting and breadth first search (BFS)-based prompting. In both approaches, we incorporate the observational data directly into the prompt to assess LLMs' ability to infer causal relationships from such data. Experiments on benchmark datasets show that incorporating observational data enhances causal discovery, boosting F1 scores by up to 0.11 point using both pairwise and BFS LLM-based prompting, while outperforming traditional statistical causal discovery baseline by up to 0.52 points. Our findings highlight the potential and limitations of LLMs for data-driven causal discovery, demonstrating their ability to move beyond textual metadata and effectively interpret and utilize observational data for more informed causal reasoning. Our studies lays the groundwork for future advancements toward fully LLM-driven causal discovery.

Title: Improved MST3 Encryption scheme based on small Ree groups

Authors: Gennady Khalimov, Yevgen Kotukh
Subjects: cs.CR, math.GR
Abstract URL: https://arxiv.org/abs/2504.10947
Pdf URL: https://arxiv.org/pdf/2504.10947
Copy Paste: [[2504.10947]] Improved MST3 Encryption scheme based on small Ree groups(https://arxiv.org/abs/2504.10947)
Keywords: security, protect, attack, robust
Abstract: This article presents an encryption scheme based on the small Ree groups. We propose utilizing the small Ree group structure to enhance the overall security parameters of the encryption scheme. By extending the logarithmic signature to encompass the entire group and modifying the encryption algorithm, we have developed robust protection against sequential key recovery attacks.

Title: When is Task Vector Provably Effective for Model Editing? A Generalization Analysis of Nonlinear Transformers

Authors: Hongkang Li, Yihua Zhang, Shuai Zhang, Meng Wang, Sijia Liu, Pin-Yu Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.10957
Pdf URL: https://arxiv.org/pdf/2504.10957
Copy Paste: [[2504.10957]] When is Task Vector Provably Effective for Model Editing? A Generalization Analysis of Nonlinear Transformers(https://arxiv.org/abs/2504.10957)
Keywords: transformer, large language model
Abstract: Task arithmetic refers to editing the pre-trained model by adding a weighted sum of task vectors, each of which is the weight update from the pre-trained model to fine-tuned models for certain tasks. This approach recently gained attention as a computationally efficient inference method for model editing, e.g., multi-task learning, forgetting, and out-of-domain generalization capabilities. However, the theoretical understanding of why task vectors can execute various conceptual operations remains limited, due to the highly non-convexity of training Transformer-based models. To the best of our knowledge, this paper provides the first theoretical characterization of the generalization guarantees of task vector methods on nonlinear Transformers. We consider a conceptual learning setting, where each task is a binary classification problem based on a discriminative pattern. We theoretically prove the effectiveness of task addition in simultaneously learning a set of irrelevant or aligned tasks, as well as the success of task negation in unlearning one task from irrelevant or contradictory tasks. Moreover, we prove the proper selection of linear coefficients for task arithmetic to achieve guaranteed generalization to out-of-domain tasks. All of our theoretical results hold for both dense-weight parameters and their low-rank approximations. Although established in a conceptual setting, our theoretical findings were validated on a practical machine unlearning task using the large language model Phi-1.5 (1.3B).

Title: An Efficient and Mixed Heterogeneous Model for Image Restoration

Authors: Yubin Gu, Yuan Meng, Kaihang Zheng, Xiaoshuai Sun, Jiayi Ji, Weijian Ruan, Liujuan Cao, Rongrong Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10967
Pdf URL: https://arxiv.org/pdf/2504.10967
Copy Paste: [[2504.10967]] An Efficient and Mixed Heterogeneous Model for Image Restoration(https://arxiv.org/abs/2504.10967)
Keywords: extraction, transformer
Abstract: Image restoration~(IR), as a fundamental multimedia data processing task, has a significant impact on downstream visual applications. In recent years, researchers have focused on developing general-purpose IR models capable of handling diverse degradation types, thereby reducing the cost and complexity of model development. Current mainstream approaches are based on three architectural paradigms: CNNs, Transformers, and Mambas. CNNs excel in efficient inference, whereas Transformers and Mamba excel at capturing long-range dependencies and modeling global contexts. While each architecture has demonstrated success in specialized, single-task settings, limited efforts have been made to effectively integrate heterogeneous architectures to jointly address diverse IR challenges. To bridge this gap, we propose RestorMixer, an efficient and general-purpose IR model based on mixed-architecture fusion. RestorMixer adopts a three-stage encoder-decoder structure, where each stage is tailored to the resolution and feature characteristics of the input. In the initial high-resolution stage, CNN-based blocks are employed to rapidly extract shallow local features. In the subsequent stages, we integrate a refined multi-directional scanning Mamba module with a multi-scale window-based self-attention mechanism. This hierarchical and adaptive design enables the model to leverage the strengths of CNNs in local feature extraction, Mamba in global context modeling, and attention mechanisms in dynamic feature refinement. Extensive experimental results demonstrate that RestorMixer achieves leading performance across multiple IR tasks while maintaining high inference efficiency. The official code can be accessed at this https URL.

Title: AFiRe: Anatomy-Driven Self-Supervised Learning for Fine-Grained Representation in Radiographic Images

Authors: Yihang Liu, Lianghua He, Ying Wen, Longzhen Yang, Hongzhou Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10972
Pdf URL: https://arxiv.org/pdf/2504.10972
Copy Paste: [[2504.10972]] AFiRe: Anatomy-Driven Self-Supervised Learning for Fine-Grained Representation in Radiographic Images(https://arxiv.org/abs/2504.10972)
Keywords: robust, transformer
Abstract: Current self-supervised methods, such as contrastive learning, predominantly focus on global discrimination, neglecting the critical fine-grained anatomical details required for accurate radiographic analysis. To address this challenge, we propose an Anatomy-driven self-supervised framework for enhancing Fine-grained Representation in radiographic image analysis (AFiRe). The core idea of AFiRe is to align the anatomical consistency with the unique token-processing characteristics of Vision Transformer. Specifically, AFiRe synergistically performs two self-supervised schemes: (i) Token-wise anatomy-guided contrastive learning, which aligns image tokens based on structural and categorical consistency, thereby enhancing fine-grained spatial-anatomical discrimination; (ii) Pixel-level anomaly-removal restoration, which particularly focuses on local anomalies, thereby refining the learned discrimination with detailed geometrical information. Additionally, we propose Synthetic Lesion Mask to enhance anatomical diversity while preserving intra-consistency, which is typically corrupted by traditional data augmentations, such as Cropping and Affine transformations. Experimental results show that AFiRe: (i) provides robust anatomical discrimination, achieving more cohesive feature clusters compared to state-of-the-art contrastive learning methods; (ii) demonstrates superior generalization, surpassing 7 radiography-specific self-supervised methods in multi-label classification tasks with limited labeling; and (iii) integrates fine-grained information, enabling precise anomaly detection using only image-level annotations.

Title: Self-Supervised Enhancement of Forward-Looking Sonar Images: Bridging Cross-Modal Degradation Gaps through Feature Space Transformation and Multi-Frame Fusion

Authors: Zhisheng Zhang, Peng Zhang, Fengxiang Wang, Liangli Ma, Fuchun Sun
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2504.10974
Pdf URL: https://arxiv.org/pdf/2504.10974
Copy Paste: [[2504.10974]] Self-Supervised Enhancement of Forward-Looking Sonar Images: Bridging Cross-Modal Degradation Gaps through Feature Space Transformation and Multi-Frame Fusion(https://arxiv.org/abs/2504.10974)
Keywords: robust
Abstract: Enhancing forward-looking sonar images is critical for accurate underwater target detection. Current deep learning methods mainly rely on supervised training with simulated data, but the difficulty in obtaining high-quality real-world paired data limits their practical use and generalization. Although self-supervised approaches from remote sensing partially alleviate data shortages, they neglect the cross-modal degradation gap between sonar and remote sensing images. Directly transferring pretrained weights often leads to overly smooth sonar images, detail loss, and insufficient brightness. To address this, we propose a feature-space transformation that maps sonar images from the pixel domain to a robust feature domain, effectively bridging the degradation gap. Additionally, our self-supervised multi-frame fusion strategy leverages complementary inter-frame information to naturally remove speckle noise and enhance target-region brightness. Experiments on three self-collected real-world forward-looking sonar datasets show that our method significantly outperforms existing approaches, effectively suppressing noise, preserving detailed edges, and substantially improving brightness, demonstrating strong potential for underwater target detection applications.

Title: Adaptive Decision Boundary for Few-Shot Class-Incremental Learning

Authors: Linhao Li, Yongzhang Tan, Siyuan Yang, Hao Cheng, Yongfeng Dong, Liang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10976
Pdf URL: https://arxiv.org/pdf/2504.10976
Copy Paste: [[2504.10976]] Adaptive Decision Boundary for Few-Shot Class-Incremental Learning(https://arxiv.org/abs/2504.10976)
Keywords: robust
Abstract: Few-Shot Class-Incremental Learning (FSCIL) aims to continuously learn new classes from a limited set of training samples without forgetting knowledge of previously learned classes. Conventional FSCIL methods typically build a robust feature extractor during the base training session with abundant training samples and subsequently freeze this extractor, only fine-tuning the classifier in subsequent incremental phases. However, current strategies primarily focus on preventing catastrophic forgetting, considering only the relationship between novel and base classes, without paying attention to the specific decision spaces of each class. To address this challenge, we propose a plug-and-play Adaptive Decision Boundary Strategy (ADBS), which is compatible with most FSCIL methods. Specifically, we assign a specific decision boundary to each class and adaptively adjust these boundaries during training to optimally refine the decision spaces for the classes in each session. Furthermore, to amplify the distinctiveness between classes, we employ a novel inter-class constraint loss that optimizes the decision boundaries and prototypes for each class. Extensive experiments on three benchmarks, namely CIFAR100, miniImageNet, and CUB200, demonstrate that incorporating our ADBS method with existing FSCIL techniques significantly improves performance, achieving overall state-of-the-art results.

Title: Exploring the Role of KG-Based RAG in Japanese Medical Question Answering with Small-Scale LLMs

Authors: Yingjian Chen, Feiyang Li, Xingyu Song, Tianxiao Li, Issey Sudeka, Irene Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10982
Pdf URL: https://arxiv.org/pdf/2504.10982
Copy Paste: [[2504.10982]] Exploring the Role of KG-Based RAG in Japanese Medical Question Answering with Small-Scale LLMs(https://arxiv.org/abs/2504.10982)
Keywords: privacy, large language model
Abstract: Large language models (LLMs) perform well in medical QA, but their effectiveness in Japanese contexts is limited due to privacy constraints that prevent the use of commercial models like GPT-4 in clinical settings. As a result, recent efforts focus on instruction-tuning open-source LLMs, though the potential of combining them with retrieval-augmented generation (RAG) remains underexplored. To bridge this gap, we are the first to explore a knowledge graph-based (KG) RAG framework for Japanese medical QA small-scale open-source LLMs. Experimental results show that KG-based RAG has only a limited impact on Japanese medical QA using small-scale open-source LLMs. Further case studies reveal that the effectiveness of the RAG is sensitive to the quality and relevance of the external retrieved content. These findings offer valuable insights into the challenges and potential of applying RAG in Japanese medical QA, while also serving as a reference for other low-resource languages.

Title: ProtFlow: Fast Protein Sequence Design via Flow Matching on Compressed Protein Language Model Embeddings

Authors: Zitai Kong, Yiheng Zhu, Yinlong Xu, Hanjing Zhou, Mingzhe Yin, Jialu Wu, Hongxia Xu, Chang-Yu Hsieh, Tingjun Hou, Jian Wu
Subjects: cs.LG, cs.AI, q-bio.BM
Abstract URL: https://arxiv.org/abs/2504.10983
Pdf URL: https://arxiv.org/pdf/2504.10983
Copy Paste: [[2504.10983]] ProtFlow: Fast Protein Sequence Design via Flow Matching on Compressed Protein Language Model Embeddings(https://arxiv.org/abs/2504.10983)
Keywords: diffusion, generative
Abstract: The design of protein sequences with desired functionalities is a fundamental task in protein engineering. Deep generative methods, such as autoregressive models and diffusion models, have greatly accelerated the discovery of novel protein sequences. However, these methods mainly focus on local or shallow residual semantics and suffer from low inference efficiency, large modeling space and high training cost. To address these challenges, we introduce ProtFlow, a fast flow matching-based protein sequence design framework that operates on embeddings derived from semantically meaningful latent space of protein language models. By compressing and smoothing the latent space, ProtFlow enhances performance while training on limited computational resources. Leveraging reflow techniques, ProtFlow enables high-quality single-step sequence generation. Additionally, we develop a joint design pipeline for the design scene of multichain proteins. We evaluate ProtFlow across diverse protein design tasks, including general peptides and long-chain proteins, antimicrobial peptides, and antibodies. Experimental results demonstrate that ProtFlow outperforms task-specific methods in these applications, underscoring its potential and broad applicability in computational protein sequence design and analysis.

Title: Seeing like a Cephalopod: Colour Vision with a Monochrome Event Camera

Authors: Sami Arja, Nimrod Kruger, Alexandre Marcireau, Nicholas Owen Ralph, Saeed Afshar, Gregory Cohen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10984
Pdf URL: https://arxiv.org/pdf/2504.10984
Copy Paste: [[2504.10984]] Seeing like a Cephalopod: Colour Vision with a Monochrome Event Camera(https://arxiv.org/abs/2504.10984)
Keywords: robust
Abstract: Cephalopods exhibit unique colour discrimination capabilities despite having one type of photoreceptor, relying instead on chromatic aberration induced by their ocular optics and pupil shapes to perceive spectral information. We took inspiration from this biological mechanism to design a spectral imaging system that combines a ball lens with an event-based camera. Our approach relies on a motorised system that shifts the focal position, mirroring the adaptive lens motion in cephalopods. This approach has enabled us to achieve wavelength-dependent focusing across the visible light and near-infrared spectrum, making the event a spectral sensor. We characterise chromatic aberration effects, using both event-based and conventional frame-based sensors, validating the effectiveness of bio-inspired spectral discrimination both in simulation and in a real setup as well as assessing the spectral discrimination performance. Our proposed approach provides a robust spectral sensing capability without conventional colour filters or computational demosaicing. This approach opens new pathways toward new spectral sensing systems inspired by nature's evolutionary solutions. Code and analysis are available at: this https URL

Title: PraNet-V2: Dual-Supervised Reverse Attention for Medical Image Segmentation

Authors: Bo-Cheng Hu, Ge-Peng Ji, Dian Shao, Deng-Ping Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10986
Pdf URL: https://arxiv.org/pdf/2504.10986
Copy Paste: [[2504.10986]] PraNet-V2: Dual-Supervised Reverse Attention for Medical Image Segmentation(https://arxiv.org/abs/2504.10986)
Keywords: segmentation
Abstract: Accurate medical image segmentation is essential for effective diagnosis and treatment. Previously, PraNet-V1 was proposed to enhance polyp segmentation by introducing a reverse attention (RA) module that utilizes background information. However, PraNet-V1 struggles with multi-class segmentation tasks. To address this limitation, we propose PraNet-V2, which, compared to PraNet-V1, effectively performs a broader range of tasks including multi-class segmentation. At the core of PraNet-V2 is the Dual-Supervised Reverse Attention (DSRA) module, which incorporates explicit background supervision, independent background modeling, and semantically enriched attention fusion. Our PraNet-V2 framework demonstrates strong performance on four polyp segmentation datasets. Additionally, by integrating DSRA to iteratively enhance foreground segmentation results in three state-of-the-art semantic segmentation models, we achieve up to a 1.36% improvement in mean Dice score. Code is available at: this https URL.

Title: Leveraging Vertical Public-Private Split for Improved Synthetic Data Generation

Authors: Samuel Maddock, Shripad Gade, Graham Cormode, Will Bullock
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2504.10987
Pdf URL: https://arxiv.org/pdf/2504.10987
Copy Paste: [[2504.10987]] Leveraging Vertical Public-Private Split for Improved Synthetic Data Generation(https://arxiv.org/abs/2504.10987)
Keywords: secure, privacy
Abstract: Differentially Private Synthetic Data Generation (DP-SDG) is a key enabler of private and secure tabular-data sharing, producing artificial data that carries through the underlying statistical properties of the input data. This typically involves adding carefully calibrated statistical noise to guarantee individual privacy, at the cost of synthetic data quality. Recent literature has explored scenarios where a small amount of public data is used to help enhance the quality of synthetic data. These methods study a horizontal public-private partitioning which assumes access to a small number of public rows that can be used for model initialization, providing a small utility gain. However, realistic datasets often naturally consist of public and private attributes, making a vertical public-private partitioning relevant for practical synthetic data deployments. We propose a novel framework that adapts horizontal public-assisted methods into the vertical setting. We compare this framework against our alternative approach that uses conditional generation, highlighting initial limitations of public-data assisted methods and proposing future research directions to address these challenges.

Title: TMCIR: Token Merge Benefits Composed Image Retrieval

Authors: Chaoyang Wang, Zeyu Zhang, Long Teng, Zijun Li, Shichao Kan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10995
Pdf URL: https://arxiv.org/pdf/2504.10995
Copy Paste: [[2504.10995]] TMCIR: Token Merge Benefits Composed Image Retrieval(https://arxiv.org/abs/2504.10995)
Keywords: diffusion
Abstract: Composed Image Retrieval (CIR) retrieves target images using a multi-modal query that combines a reference image with text describing desired modifications. The primary challenge is effectively fusing this visual and textual information. Current cross-modal feature fusion approaches for CIR exhibit an inherent bias in intention interpretation. These methods tend to disproportionately emphasize either the reference image features (visual-dominant fusion) or the textual modification intent (text-dominant fusion through image-to-text conversion). Such an imbalanced representation often fails to accurately capture and reflect the actual search intent of the user in the retrieval results. To address this challenge, we propose TMCIR, a novel framework that advances composed image retrieval through two key innovations: 1) Intent-Aware Cross-Modal Alignment. We first fine-tune CLIP encoders contrastively using intent-reflecting pseudo-target images, synthesized from reference images and textual descriptions via a diffusion model. This step enhances the encoder ability of text to capture nuanced intents in textual descriptions. 2) Adaptive Token Fusion. We further fine-tune all encoders contrastively by comparing adaptive token-fusion features with the target image. This mechanism dynamically balances visual and textual representations within the contrastive learning pipeline, optimizing the composed feature for retrieval. Extensive experiments on Fashion-IQ and CIRR datasets demonstrate that TMCIR significantly outperforms state-of-the-art methods, particularly in capturing nuanced user intent.

Title: ReZero: Enhancing LLM search ability by trying one-more-time

Authors: Alan Dao (Gia Tuan Dao), Thinh Le
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.11001
Pdf URL: https://arxiv.org/pdf/2504.11001
Copy Paste: [[2504.11001]] ReZero: Enhancing LLM search ability by trying one-more-time(https://arxiv.org/abs/2504.11001)
Keywords: robust, large language model
Abstract: Retrieval-Augmented Generation (RAG) improves Large Language Model (LLM) performance on knowledge-intensive tasks but depends heavily on initial search query quality. Current methods, often using Reinforcement Learning (RL), typically focus on query formulation or reasoning over results, without explicitly encouraging persistence after a failed search. We introduce ReZero (Retry-Zero), a novel RL framework that directly rewards the act of retrying a search query following an initial unsuccessful attempt. This incentivizes the LLM to explore alternative queries rather than prematurely halting. ReZero demonstrates significant improvement, achieving 46.88% accuracy compared to a 25% baseline. By rewarding persistence, ReZero enhances LLM robustness in complex information-seeking scenarios where initial queries may prove insufficient.

Title: Dynamic Compressing Prompts for Efficient Inference of Large Language Models

Authors: Jinwu Hu, Wei Zhang, Yufeng Wang, Yu Hu, Bin Xiao, Mingkui Tan, Qing Du
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11004
Pdf URL: https://arxiv.org/pdf/2504.11004
Copy Paste: [[2504.11004]] Dynamic Compressing Prompts for Efficient Inference of Large Language Models(https://arxiv.org/abs/2504.11004)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown outstanding performance across a variety of tasks, partly due to advanced prompting techniques. However, these techniques often require lengthy prompts, which increase computational costs and can hinder performance because of the limited context windows of LLMs. While prompt compression is a straightforward solution, existing methods confront the challenges of retaining essential information, adapting to context changes, and remaining effective across different tasks. To tackle these issues, we propose a task-agnostic method called Dynamic Compressing Prompts (LLM-DCP). Our method reduces the number of prompt tokens while aiming to preserve the performance as much as possible. We model prompt compression as a Markov Decision Process (MDP), enabling the DCP-Agent to sequentially remove redundant tokens by adapting to dynamic contexts and retaining crucial content. We develop a reward function for training the DCP-Agent that balances the compression rate, the quality of the LLM output, and the retention of key information. This allows for prompt token reduction without needing an external black-box LLM. Inspired by the progressive difficulty adjustment in curriculum learning, we introduce a Hierarchical Prompt Compression (HPC) training strategy that gradually increases the compression difficulty, enabling the DCP-Agent to learn an effective compression method that maintains information integrity. Experiments demonstrate that our method outperforms state-of-the-art techniques, especially at higher compression rates. The code for our approach will be available at this https URL.

Title: MediSee: Reasoning-based Pixel-level Perception in Medical Images

Authors: Qinyue Tong, Ziqian Lu, Jun Liu, Yangming Zheng, Zheming Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11008
Pdf URL: https://arxiv.org/pdf/2504.11008
Copy Paste: [[2504.11008]] MediSee: Reasoning-based Pixel-level Perception in Medical Images(https://arxiv.org/abs/2504.11008)
Keywords: segmentation
Abstract: Despite remarkable advancements in pixel-level medical image perception, existing methods are either limited to specific tasks or heavily rely on accurate bounding boxes or text labels as input prompts. However, the medical knowledge required for input is a huge obstacle for general public, which greatly reduces the universality of these methods. Compared with these domain-specialized auxiliary information, general users tend to rely on oral queries that require logical reasoning. In this paper, we introduce a novel medical vision task: Medical Reasoning Segmentation and Detection (MedSD), which aims to comprehend implicit queries about medical images and generate the corresponding segmentation mask and bounding box for the target object. To accomplish this task, we first introduce a Multi-perspective, Logic-driven Medical Reasoning Segmentation and Detection (MLMR-SD) dataset, which encompasses a substantial collection of medical entity targets along with their corresponding reasoning. Furthermore, we propose MediSee, an effective baseline model designed for medical reasoning segmentation and detection. The experimental results indicate that the proposed method can effectively address MedSD with implicit colloquial queries and outperform traditional medical referring segmentation methods.

Title: AnimeDL-2M: Million-Scale AI-Generated Anime Image Detection and Localization in Diffusion Era

Authors: Chenyang Zhu, Xing Zhang, Yuyang Sun, Ching-Chun Chang, Isao Echizen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11015
Pdf URL: https://arxiv.org/pdf/2504.11015
Copy Paste: [[2504.11015]] AnimeDL-2M: Million-Scale AI-Generated Anime Image Detection and Localization in Diffusion Era(https://arxiv.org/abs/2504.11015)
Keywords: diffusion
Abstract: Recent advances in image generation, particularly diffusion models, have significantly lowered the barrier for creating sophisticated forgeries, making image manipulation detection and localization (IMDL) increasingly challenging. While prior work in IMDL has focused largely on natural images, the anime domain remains underexplored-despite its growing vulnerability to AI-generated forgeries. Misrepresentations of AI-generated images as hand-drawn artwork, copyright violations, and inappropriate content modifications pose serious threats to the anime community and industry. To address this gap, we propose AnimeDL-2M, the first large-scale benchmark for anime IMDL with comprehensive annotations. It comprises over two million images including real, partially manipulated, and fully AI-generated samples. Experiments indicate that models trained on existing IMDL datasets of natural images perform poorly when applied to anime images, highlighting a clear domain gap between anime and natural images. To better handle IMDL tasks in anime domain, we further propose AniXplore, a novel model tailored to the visual characteristics of anime imagery. Extensive evaluations demonstrate that AniXplore achieves superior performance compared to existing methods. Dataset and code can be found in this https URL.

Title: DRIFT open dataset: A drone-derived intelligence for traffic analysis in urban environmen

Authors: Hyejin Lee, Seokjun Hong, Jeonghoon Song, Haechan Cho, Zhixiong Jin, Byeonghun Kim, Joobin Jin, Jaegyun Im, Byeongjoon Noh, Hwasoo Yeo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11019
Pdf URL: https://arxiv.org/pdf/2504.11019
Copy Paste: [[2504.11019]] DRIFT open dataset: A drone-derived intelligence for traffic analysis in urban environmen(https://arxiv.org/abs/2504.11019)
Keywords: extraction
Abstract: Reliable traffic data are essential for understanding urban mobility and developing effective traffic management strategies. This study introduces the DRone-derived Intelligence For Traffic analysis (DRIFT) dataset, a large-scale urban traffic dataset collected systematically from synchronized drone videos at approximately 250 meters altitude, covering nine interconnected intersections in Daejeon, South Korea. DRIFT provides high-resolution vehicle trajectories that include directional information, processed through video synchronization and orthomap alignment, resulting in a comprehensive dataset of 81,699 vehicle trajectories. Through our DRIFT dataset, researchers can simultaneously analyze traffic at multiple scales - from individual vehicle maneuvers like lane-changes and safety metrics such as time-to-collision to aggregate network flow dynamics across interconnected urban intersections. The DRIFT dataset is structured to enable immediate use without additional preprocessing, complemented by open-source models for object detection and trajectory extraction, as well as associated analytical tools. DRIFT is expected to significantly contribute to academic research and practical applications, such as traffic flow analysis and simulation studies. The dataset and related resources are publicly accessible at this https URL.

Title: Easy3D: A Simple Yet Effective Method for 3D Interactive Segmentation

Authors: Andrea Simonelli, Norman Müller, Peter Kontschieder
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11024
Pdf URL: https://arxiv.org/pdf/2504.11024
Copy Paste: [[2504.11024]] Easy3D: A Simple Yet Effective Method for 3D Interactive Segmentation(https://arxiv.org/abs/2504.11024)
Keywords: transformer, segmentation
Abstract: The increasing availability of digital 3D environments, whether through image-based 3D reconstruction, generation, or scans obtained by robots, is driving innovation across various applications. These come with a significant demand for 3D interaction, such as 3D Interactive Segmentation, which is useful for tasks like object selection and manipulation. Additionally, there is a persistent need for solutions that are efficient, precise, and performing well across diverse settings, particularly in unseen environments and with unfamiliar objects. In this work, we introduce a 3D interactive segmentation method that consistently surpasses previous state-of-the-art techniques on both in-domain and out-of-domain datasets. Our simple approach integrates a voxel-based sparse encoder with a lightweight transformer-based decoder that implements implicit click fusion, achieving superior performance and maximizing efficiency. Our method demonstrates substantial improvements on benchmark datasets, including ScanNet, ScanNet++, S3DIS, and KITTI-360, and also on unseen geometric distributions such as the ones obtained by Gaussian Splatting. The project web-page is available at this https URL.

Title: Defending Against Frequency-Based Attacks with Diffusion Models

Authors: Fatemeh Amerehi, Patrick Healy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11034
Pdf URL: https://arxiv.org/pdf/2504.11034
Copy Paste: [[2504.11034]] Defending Against Frequency-Based Attacks with Diffusion Models(https://arxiv.org/abs/2504.11034)
Keywords: attack, robust, diffusion, generative
Abstract: Adversarial training is a common strategy for enhancing model robustness against adversarial attacks. However, it is typically tailored to the specific attack types it is trained on, limiting its ability to generalize to unseen threat models. Adversarial purification offers an alternative by leveraging a generative model to remove perturbations before classification. Since the purifier is trained independently of both the classifier and the threat models, it is better equipped to handle previously unseen attack scenarios. Diffusion models have proven highly effective for noise purification, not only in countering pixel-wise adversarial perturbations but also in addressing non-adversarial data shifts. In this study, we broaden the focus beyond pixel-wise robustness to explore the extent to which purification can mitigate both spectral and spatial adversarial attacks. Our findings highlight its effectiveness in handling diverse distortion patterns across low- to high-frequency regions.

Title: QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models

Authors: Yudong Zhang, Ruobing Xie, Jiansheng Chen, Xingwu Sun, Zhanhui Kang, Yu Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11038
Pdf URL: https://arxiv.org/pdf/2504.11038
Copy Paste: [[2504.11038]] QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models(https://arxiv.org/abs/2504.11038)
Keywords: attack, robust
Abstract: In typical multimodal tasks, such as Visual Question Answering (VQA), adversarial attacks targeting a specific image and question can lead large vision-language models (LVLMs) to provide incorrect answers. However, it is common for a single image to be associated with multiple questions, and LVLMs may still answer other questions correctly even for an adversarial image attacked by a specific question. To address this, we introduce the query-agnostic visual attack (QAVA), which aims to create robust adversarial examples that generate incorrect responses to unspecified and unknown questions. Compared to traditional adversarial attacks focused on specific images and questions, QAVA significantly enhances the effectiveness and efficiency of attacks on images when the question is unknown, achieving performance comparable to attacks on known target questions. Our research broadens the scope of visual adversarial attacks on LVLMs in practical settings, uncovering previously overlooked vulnerabilities, particularly in the context of visual adversarial threats. The code is available at this https URL.

Title: LazyReview A Dataset for Uncovering Lazy Thinking in NLP Peer Reviews

Authors: Sukannya Purkayastha, Zhuang Li, Anne Lauscher, Lizhen Qu, Iryna Gurevych
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.11042
Pdf URL: https://arxiv.org/pdf/2504.11042
Copy Paste: [[2504.11042]] LazyReview A Dataset for Uncovering Lazy Thinking in NLP Peer Reviews(https://arxiv.org/abs/2504.11042)
Keywords: large language model
Abstract: Peer review is a cornerstone of quality control in scientific publishing. With the increasing workload, the unintended use of `quick' heuristics, referred to as lazy thinking, has emerged as a recurring issue compromising review quality. Automated methods to detect such heuristics can help improve the peer-reviewing process. However, there is limited NLP research on this issue, and no real-world dataset exists to support the development of detection tools. This work introduces LazyReview, a dataset of peer-review sentences annotated with fine-grained lazy thinking categories. Our analysis reveals that Large Language Models (LLMs) struggle to detect these instances in a zero-shot setting. However, instruction-based fine-tuning on our dataset significantly boosts performance by 10-20 performance points, highlighting the importance of high-quality training data. Furthermore, a controlled experiment demonstrates that reviews revised with lazy thinking feedback are more comprehensive and actionable than those written without such feedback. We will release our dataset and the enhanced guidelines that can be used to train junior reviewers in the community. (Code available here: this https URL)

Title: Leveraging LLMs and attention-mechanism for automatic annotation of historical maps

Authors: Yunshuang Yuan, Monika Sester
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11050
Pdf URL: https://arxiv.org/pdf/2504.11050
Copy Paste: [[2504.11050]] Leveraging LLMs and attention-mechanism for automatic annotation of historical maps(https://arxiv.org/abs/2504.11050)
Keywords: large language model
Abstract: Historical maps are essential resources that provide insights into the geographical landscapes of the past. They serve as valuable tools for researchers across disciplines such as history, geography, and urban studies, facilitating the reconstruction of historical environments and the analysis of spatial transformations over time. However, when constrained to analogue or scanned formats, their interpretation is limited to humans and therefore not scalable. Recent advancements in machine learning, particularly in computer vision and large language models (LLMs), have opened new avenues for automating the recognition and classification of features and objects in historical maps. In this paper, we propose a novel distillation method that leverages LLMs and attention mechanisms for the automatic annotation of historical maps. LLMs are employed to generate coarse classification labels for low-resolution historical image patches, while attention mechanisms are utilized to refine these labels to higher resolutions. Experimental results demonstrate that the refined labels achieve a high recall of more than 90%. Additionally, the intersection over union (IoU) scores--84.2% for Wood and 72.0% for Settlement--along with precision scores of 87.1% and 79.5%, respectively, indicate that most labels are well-aligned with ground-truth annotations. Notably, these results were achieved without the use of fine-grained manual labels during training, underscoring the potential of our approach for efficient and scalable historical map analysis.

Title: Crane: Context-Guided Prompt Learning and Attention Refinement for Zero-Shot Anomaly Detections

Authors: Alireza Salehi, Mohammadreza Salehi, Reshad Hosseini, Cees G. M. Snoek, Makoto Yamada, Mohammad Sabokrou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11055
Pdf URL: https://arxiv.org/pdf/2504.11055
Copy Paste: [[2504.11055]] Crane: Context-Guided Prompt Learning and Attention Refinement for Zero-Shot Anomaly Detections(https://arxiv.org/abs/2504.11055)
Keywords: extraction
Abstract: Anomaly Detection (AD) involves identifying deviations from normal data distributions and is critical in fields such as medical diagnostics and industrial defect detection. Traditional AD methods typically require the availability of normal training samples; however, this assumption is not always feasible, as collecting such data can be impractical. Additionally, these methods often struggle to generalize across different domains. Recent advancements, such as AnomalyCLIP and AdaCLIP, utilize the zero-shot generalization capabilities of CLIP but still face a performance gap between image-level and pixel-level anomaly detection. To address this gap, we propose a novel approach that conditions the prompts of the text encoder based on image context extracted from the vision encoder. Also, to capture fine-grained variations more effectively, we have modified the CLIP vision encoder and altered the extraction of dense features. These changes ensure that the features retain richer spatial and structural information for both normal and anomalous prompts. Our method achieves state-of-the-art performance, improving performance by 2% to 29% across different metrics on 14 datasets. This demonstrates its effectiveness in both image-level and pixel-level anomaly detection.

Title: UKDM: Underwater keypoint detection and matching using underwater image enhancement techniques

Authors: Pedro Diaz-Garcia, Felix Escalona, Miguel Cazorla
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11063
Pdf URL: https://arxiv.org/pdf/2504.11063
Copy Paste: [[2504.11063]] UKDM: Underwater keypoint detection and matching using underwater image enhancement techniques(https://arxiv.org/abs/2504.11063)
Keywords: robust, generative
Abstract: The purpose of this paper is to explore the use of underwater image enhancement techniques to improve keypoint detection and matching. By applying advanced deep learning models, including generative adversarial networks and convolutional neural networks, we aim to find the best method which improves the accuracy of keypoint detection and the robustness of matching algorithms. We evaluate the performance of these techniques on various underwater datasets, demonstrating significant improvements over traditional methods.

Title: Improving fingerprint presentation attack detection by an approach integrated into the personal verification stage

Authors: Marco Micheletto, Giulia Orrù, Luca Ghiani, Gian Luca Marcialis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11066
Pdf URL: https://arxiv.org/pdf/2504.11066
Copy Paste: [[2504.11066]] Improving fingerprint presentation attack detection by an approach integrated into the personal verification stage(https://arxiv.org/abs/2504.11066)
Keywords: security, attack
Abstract: Presentation Attack Detection (PAD) systems are usually designed independently of the fingerprint verification system. While this can be acceptable for use cases where specific user templates are not predetermined, it represents a missed opportunity to enhance security in scenarios where integrating PAD with the fingerprint verification system could significantly leverage users' templates, which are the real target of a potential presentation attack. This does not mean that a PAD should be specifically designed for such users; that would imply the availability of many enrolled users' PAI and, consequently, complexity, time, and cost increase. On the contrary, we propose to equip a basic PAD, designed according to the state of the art, with an innovative add-on module called the Closeness Binary Code (CC) module. The term "closeness" refers to a peculiar property of the bona fide-related features: in an Euclidean feature space, genuine fingerprints tend to cluster in a specific pattern. First, samples from the same finger are close to each other, then samples from other fingers of the same user and finally, samples from fingers of other users. This property is statistically verified in our previous publication, and further confirmed in this paper. It is independent of the user population and the feature set class, which can be handcrafted or deep network-based (embeddings). Therefore, the add-on can be designed without the need for the targeted user samples; moreover, it exploits her/his samples' "closeness" property during the verification stage. Extensive experiments on benchmark datasets and state-of-the-art PAD methods confirm the benefits of the proposed add-on, which can be easily coupled with the main PAD module integrated into the fingerprint verification system.

Title: Change State Space Models for Remote Sensing Change Detection

Authors: Elman Ghazaei, Erchan Aptoula
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11080
Pdf URL: https://arxiv.org/pdf/2504.11080
Copy Paste: [[2504.11080]] Change State Space Models for Remote Sensing Change Detection(https://arxiv.org/abs/2504.11080)
Keywords: robust, transformer
Abstract: Despite their frequent use for change detection, both ConvNets and Vision transformers (ViT) exhibit well-known limitations, namely the former struggle to model long-range dependencies while the latter are computationally inefficient, rendering them challenging to train on large-scale datasets. Vision Mamba, an architecture based on State Space Models has emerged as an alternative addressing the aforementioned deficiencies and has been already applied to remote sensing change detection, though mostly as a feature extracting backbone. In this article the Change State Space Model is introduced, that has been specifically designed for change detection by focusing on the relevant changes between bi-temporal images, effectively filtering out irrelevant information. By concentrating solely on the changed features, the number of network parameters is reduced, enhancing significantly computational efficiency while maintaining high detection performance and robustness against input degradation. The proposed model has been evaluated via three benchmark datasets, where it outperformed ConvNets, ViTs, and Mamba-based counterparts at a fraction of their computational complexity. The implementation will be made available at this https URL upon acceptance.

Title: FLSSM: A Federated Learning Storage Security Model with Homomorphic Encryption

Authors: Yang Li, Chunhe Xia, Chang Li, Xiaojian Li, Tianbo Wang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.11088
Pdf URL: https://arxiv.org/pdf/2504.11088
Copy Paste: [[2504.11088]] FLSSM: A Federated Learning Storage Security Model with Homomorphic Encryption(https://arxiv.org/abs/2504.11088)
Keywords: security, privacy, protect, attack, federate, fair
Abstract: Federated learning based on homomorphic encryption has received widespread attention due to its high security and enhanced protection of user data privacy. However, the characteristics of encrypted computation lead to three challenging problems: ``computation-efficiency", ``attack-tracing" and ``contribution-assessment". The first refers to the efficiency of encrypted computation during model aggregation, the second refers to tracing malicious attacks in an encrypted state, and the third refers to the fairness of contribution assessment for local models after encryption. This paper proposes a federated learning storage security model with homomorphic encryption (FLSSM) to protect federated learning model privacy and address the three issues mentioned above. First, we utilize different nodes to aggregate local models in parallel, thereby improving encrypted models' aggregation efficiency. Second, we introduce trusted supervise nodes to examine local models when the global model is attacked, enabling the tracing of malicious attacks under homomorphic encryption. Finally, we fairly reward local training nodes with encrypted local models based on trusted training time. Experiments on multiple real-world datasets show that our model significantly outperforms baseline models in terms of both efficiency and security metrics.

Title: Vivid4D: Improving 4D Reconstruction from Monocular Video by Video Inpainting

Authors: Jiaxin Huang, Sheng Miao, BangBnag Yang, Yuewen Ma, Yiyi Liao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11092
Pdf URL: https://arxiv.org/pdf/2504.11092
Copy Paste: [[2504.11092]] Vivid4D: Improving 4D Reconstruction from Monocular Video by Video Inpainting(https://arxiv.org/abs/2504.11092)
Keywords: robust, generative
Abstract: Reconstructing 4D dynamic scenes from casually captured monocular videos is valuable but highly challenging, as each timestamp is observed from a single viewpoint. We introduce Vivid4D, a novel approach that enhances 4D monocular video synthesis by augmenting observation views - synthesizing multi-view videos from a monocular input. Unlike existing methods that either solely leverage geometric priors for supervision or use generative priors while overlooking geometry, we integrate both. This reformulates view augmentation as a video inpainting task, where observed views are warped into new viewpoints based on monocular depth priors. To achieve this, we train a video inpainting model on unposed web videos with synthetically generated masks that mimic warping occlusions, ensuring spatially and temporally consistent completion of missing regions. To further mitigate inaccuracies in monocular depth priors, we introduce an iterative view augmentation strategy and a robust reconstruction loss. Experiments demonstrate that our method effectively improves monocular 4D scene reconstruction and completion.

Title: Using LLMs as prompt modifier to avoid biases in AI image generators

Authors: René Peinl
Subjects: cs.CL, cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2504.11104
Pdf URL: https://arxiv.org/pdf/2504.11104
Copy Paste: [[2504.11104]] Using LLMs as prompt modifier to avoid biases in AI image generators(https://arxiv.org/abs/2504.11104)
Keywords: fair, diffusion, large language model
Abstract: This study examines how Large Language Models (LLMs) can reduce biases in text-to-image generation systems by modifying user prompts. We define bias as a model's unfair deviation from population statistics given neutral prompts. Our experiments with Stable Diffusion XL, 3.5 and Flux demonstrate that LLM-modified prompts significantly increase image diversity and reduce bias without the need to change the image generators themselves. While occasionally producing results that diverge from original user intent for elaborate prompts, this approach generally provides more varied interpretations of underspecified requests rather than superficial variations. The method works particularly well for less advanced image generators, though limitations persist for certain contexts like disability representation. All prompts and generated images are available at this https URL

Title: Token-Level Constraint Boundary Search for Jailbreaking Text-to-Image Models

Authors: Jiangtao Liu, Zhaoxin Wang, Handing Wang, Cong Tian, Yaochu Jin
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2504.11106
Pdf URL: https://arxiv.org/pdf/2504.11106
Copy Paste: [[2504.11106]] Token-Level Constraint Boundary Search for Jailbreaking Text-to-Image Models(https://arxiv.org/abs/2504.11106)
Keywords: secure, defense, attack, generative
Abstract: Recent advancements in Text-to-Image (T2I) generation have significantly enhanced the realism and creativity of generated images. However, such powerful generative capabilities pose risks related to the production of inappropriate or harmful content. Existing defense mechanisms, including prompt checkers and post-hoc image checkers, are vulnerable to sophisticated adversarial attacks. In this work, we propose TCBS-Attack, a novel query-based black-box jailbreak attack that searches for tokens located near the decision boundaries defined by text and image checkers. By iteratively optimizing tokens near these boundaries, TCBS-Attack generates semantically coherent adversarial prompts capable of bypassing multiple defensive layers in T2I models. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art jailbreak attacks across various T2I models, including securely trained open-source models and commercial online services like DALL-E 3. TCBS-Attack achieves an ASR-4 of 45\% and an ASR-1 of 21\% on jailbreaking full-chain T2I models, significantly surpassing baseline methods.

Title: KubeFence: Security Hardening of the Kubernetes Attack Surface

Authors: Carmine Cesarano, Roberto Natella
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.11126
Pdf URL: https://arxiv.org/pdf/2504.11126
Copy Paste: [[2504.11126]] KubeFence: Security Hardening of the Kubernetes Attack Surface(https://arxiv.org/abs/2504.11126)
Keywords: security, protect, attack
Abstract: Kubernetes (K8s) is widely used to orchestrate containerized applications, including critical services in domains such as finance, healthcare, and government. However, its extensive and feature-rich API interface exposes a broad attack surface, making K8s vulnerable to exploits of software vulnerabilities and misconfigurations. Even if K8s adopts role-based access control (RBAC) to manage access to K8s APIs, this approach lacks the granularity needed to protect specification attributes within API requests. This paper proposes a novel solution, KubeFence, which implements finer-grain API filtering tailored to specific client workloads. KubeFence analyzes Kubernetes Operators from trusted repositories and leverages their configuration files to restrict unnecessary features of the K8s API, to mitigate misconfigurations and vulnerabilities exploitable through the K8s API. The experimental results show that KubeFence can significantly reduce the attack surface and prevent attacks compared to RBAC.

Title: Taming Consistency Distillation for Accelerated Human Image Animation

Authors: Xiang Wang, Shiwei Zhang, Hangjie Yuan, Yujie Wei, Yingya Zhang, Changxin Gao, Yuehuan Wang, Nong Sang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11143
Pdf URL: https://arxiv.org/pdf/2504.11143
Copy Paste: [[2504.11143]] Taming Consistency Distillation for Accelerated Human Image Animation(https://arxiv.org/abs/2504.11143)
Keywords: diffusion
Abstract: Recent advancements in human image animation have been propelled by video diffusion models, yet their reliance on numerous iterative denoising steps results in high inference costs and slow speeds. An intuitive solution involves adopting consistency models, which serve as an effective acceleration paradigm through consistency distillation. However, simply employing this strategy in human image animation often leads to quality decline, including visual blurring, motion degradation, and facial distortion, particularly in dynamic regions. In this paper, we propose the DanceLCM approach complemented by several enhancements to improve visual quality and motion continuity at low-step regime: (1) segmented consistency distillation with an auxiliary light-weight head to incorporate supervision from real video latents, mitigating cumulative errors resulting from single full-trajectory generation; (2) a motion-focused loss to centre on motion regions, and explicit injection of facial fidelity features to improve face authenticity. Extensive qualitative and quantitative experiments demonstrate that DanceLCM achieves results comparable to state-of-the-art video diffusion models with a mere 2-4 inference steps, significantly reducing the inference burden without compromising video quality. The code and models will be made publicly available.

Title: GC-GAT: Multimodal Vehicular Trajectory Prediction using Graph Goal Conditioning and Cross-context Attention

Authors: Mahir Gulzar, Yar Muhammad, Naveed Muhammad
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.11150
Pdf URL: https://arxiv.org/pdf/2504.11150
Copy Paste: [[2504.11150]] GC-GAT: Multimodal Vehicular Trajectory Prediction using Graph Goal Conditioning and Cross-context Attention(https://arxiv.org/abs/2504.11150)
Keywords: robust
Abstract: Predicting future trajectories of surrounding vehicles heavily relies on what contextual information is given to a motion prediction model. The context itself can be static (lanes, regulatory elements, etc) or dynamic (traffic participants). This paper presents a lane graph-based motion prediction model that first predicts graph-based goal proposals and later fuses them with cross attention over multiple contextual elements. We follow the famous encoder-interactor-decoder architecture where the encoder encodes scene context using lightweight Gated Recurrent Units, the interactor applies cross-context attention over encoded scene features and graph goal proposals, and the decoder regresses multimodal trajectories via Laplacian Mixture Density Network from the aggregated encodings. Using cross-attention over graph-based goal proposals gives robust trajectory estimates since the model learns to attend to future goal-relevant scene elements for the intended agent. We evaluate our work on nuScenes motion prediction dataset, achieving state-of-the-art results.

Title: SAR-to-RGB Translation with Latent Diffusion for Earth Observation

Authors: Kaan Aydin, Joelle Hanna, Damian Borth
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2504.11154
Pdf URL: https://arxiv.org/pdf/2504.11154
Copy Paste: [[2504.11154]] SAR-to-RGB Translation with Latent Diffusion for Earth Observation(https://arxiv.org/abs/2504.11154)
Keywords: diffusion
Abstract: Earth observation satellites like Sentinel-1 (S1) and Sentinel-2 (S2) provide complementary remote sensing (RS) data, but S2 images are often unavailable due to cloud cover or data gaps. To address this, we propose a diffusion model (DM)-based approach for SAR-to-RGB translation, generating synthetic optical images from SAR inputs. We explore three different setups: two using Standard Diffusion, which reconstruct S2 images by adding and removing noise (one without and one with class conditioning), and one using Cold Diffusion, which blends S2 with S1 before removing the SAR signal. We evaluate the generated images in downstream tasks, including land cover classification and cloud removal. While generated images may not perfectly replicate real S2 data, they still provide valuable information. Our results show that class conditioning improves classification accuracy, while cloud removal performance remains competitive despite our approach not being optimized for it. Interestingly, despite exhibiting lower perceptual quality, the Cold Diffusion setup performs well in land cover classification, suggesting that traditional quantitative evaluation metrics may not fully reflect the practical utility of generated images. Our findings highlight the potential of DMs for SAR-to-RGB translation in RS applications where RGB images are missing.

Title: TSAL: Few-shot Text Segmentation Based on Attribute Learning

Authors: Chenming Li, Chengxu Liu, Yuanting Fan, Xiao Jin, Xingsong Hou, Xueming Qian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11164
Pdf URL: https://arxiv.org/pdf/2504.11164
Copy Paste: [[2504.11164]] TSAL: Few-shot Text Segmentation Based on Attribute Learning(https://arxiv.org/abs/2504.11164)
Keywords: segmentation
Abstract: Recently supervised learning rapidly develops in scene text segmentation. However, the lack of high-quality datasets and the high cost of pixel annotation greatly limit the development of them. Considering the well-performed few-shot learning methods for downstream tasks, we investigate the application of the few-shot learning method to scene text segmentation. We propose TSAL, which leverages CLIP's prior knowledge to learn text attributes for segmentation. To fully utilize the semantic and texture information in the image, a visual-guided branch is proposed to separately extract text and background features. To reduce data dependency and improve text detection accuracy, the adaptive prompt-guided branch employs effective adaptive prompt templates to capture various text attributes. To enable adaptive prompts capture distinctive text features and complex background distribution, we propose Adaptive Feature Alignment module(AFA). By aligning learnable tokens of different attributes with visual features and prompt prototypes, AFA enables adaptive prompts to capture both general and distinctive attribute information. TSAL can capture the unique attributes of text and achieve precise segmentation using only few images. Experiments demonstrate that our method achieves SOTA performance on multiple text segmentation datasets under few-shot settings and show great potential in text-related domains.

Title: Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails

Authors: William Hackett, Lewis Birch, Stefan Trawicki, Neeraj Suri, Peter Garraghan
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.11168
Pdf URL: https://arxiv.org/pdf/2504.11168
Copy Paste: [[2504.11168]] Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails(https://arxiv.org/abs/2504.11168)
Keywords: protect, attack, robust, large language model
Abstract: Large Language Models (LLMs) guardrail systems are designed to protect against prompt injection and jailbreak attacks. However, they remain vulnerable to evasion techniques. We demonstrate two approaches for bypassing LLM prompt injection and jailbreak detection systems via traditional character injection methods and algorithmic Adversarial Machine Learning (AML) evasion techniques. Through testing against six prominent protection systems, including Microsoft's Azure Prompt Shield and Meta's Prompt Guard, we show that both methods can be used to evade detection while maintaining adversarial utility achieving in some instances up to 100% evasion success. Furthermore, we demonstrate that adversaries can enhance Attack Success Rates (ASR) against black-box targets by leveraging word importance ranking computed by offline white-box models. Our findings reveal vulnerabilities within current LLM protection mechanisms and highlight the need for more robust guardrail systems.

Title: MuSeD: A Multimodal Spanish Dataset for Sexism Detection in Social Media Videos

Authors: Laura De Grazia, Pol Pastells, Mauro Vázquez Chas, Desmond Elliott, Danae Sánchez Villegas, Mireia Farrús, Mariona Taulé
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11169
Pdf URL: https://arxiv.org/pdf/2504.11169
Copy Paste: [[2504.11169]] MuSeD: A Multimodal Spanish Dataset for Sexism Detection in Social Media Videos(https://arxiv.org/abs/2504.11169)
Keywords: large language model
Abstract: Sexism is generally defined as prejudice and discrimination based on sex or gender, affecting every sector of society, from social institutions to relationships and individual behavior. Social media platforms amplify the impact of sexism by conveying discriminatory content not only through text but also across multiple modalities, highlighting the critical need for a multimodal approach to the analysis of sexism online. With the rise of social media platforms where users share short videos, sexism is increasingly spreading through video content. Automatically detecting sexism in videos is a challenging task, as it requires analyzing the combination of verbal, audio, and visual elements to identify sexist content. In this study, (1) we introduce MuSeD, a new Multimodal Spanish dataset for Sexism Detection consisting of $\approx$ 11 hours of videos extracted from TikTok and BitChute; (2) we propose an innovative annotation framework for analyzing the contribution of textual and multimodal labels in the classification of sexist and non-sexist content; and (3) we evaluate a range of large language models (LLMs) and multimodal LLMs on the task of sexism detection. We find that visual information plays a key role in labeling sexist content for both humans and models. Models effectively detect explicit sexism; however, they struggle with implicit cases, such as stereotypes, instances where annotators also show low agreement. This highlights the inherent difficulty of the task, as identifying implicit sexism depends on the social and cultural context.

Title: TerraMind: Large-Scale Generative Multimodality for Earth Observation

Authors: Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, Rahul Ramachandran, Paolo Fraccaro, Thomas Brunschwiler, Gabriele Cavallaro, Juan Bernabe-Moreno, Nicolas Longépé
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11171
Pdf URL: https://arxiv.org/pdf/2504.11171
Copy Paste: [[2504.11171]] TerraMind: Large-Scale Generative Multimodality for Earth Observation(https://arxiv.org/abs/2504.11171)
Keywords: generative
Abstract: We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM) -- the capability of generating additional artificial data during finetuning and inference to improve the model output -- and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code is open-sourced under a permissive license.

Title: TerraMesh: A Planetary Mosaic of Multimodal Earth Observation Data

Authors: Benedikt Blumenstiel, Paolo Fraccaro, Valerio Marsocci, Johannes Jakubik, Stefano Maurogiovanni, Mikolaj Czerkawski, Rocco Sedona, Gabriele Cavallaro, Thomas Brunschwiler, Juan Bernabe-Moreno, Nicolas Longépé
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11172
Pdf URL: https://arxiv.org/pdf/2504.11172
Copy Paste: [[2504.11172]] TerraMesh: A Planetary Mosaic of Multimodal Earth Observation Data(https://arxiv.org/abs/2504.11172)
Keywords: robust
Abstract: Large-scale foundation models in Earth Observation can learn versatile, label-efficient representations by leveraging massive amounts of unlabeled data. However, existing public datasets are often limited in scale, geographic coverage, or sensor variety. We introduce TerraMesh, a new globally diverse, multimodal dataset combining optical, synthetic aperture radar, elevation, and land-cover modalities in an Analysis-Ready Data format. TerraMesh includes over 9 million samples with eight spatiotemporal aligned modalities, enabling large-scale pre-training and fostering robust cross-modal correlation learning. We provide detailed data processing steps, comprehensive statistics, and empirical evidence demonstrating improved model performance when pre-trained on TerraMesh. The dataset will be made publicly available with a permissive license.

Title: Exploring Backdoor Attack and Defense for LLM-empowered Recommendations

Authors: Liangbo Ning, Wenqi Fan, Qing Li
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11182
Pdf URL: https://arxiv.org/pdf/2504.11182
Copy Paste: [[2504.11182]] Exploring Backdoor Attack and Defense for LLM-empowered Recommendations(https://arxiv.org/abs/2504.11182)
Keywords: security, defense, attack, large language model
Abstract: The fusion of Large Language Models (LLMs) with recommender systems (RecSys) has dramatically advanced personalized recommendations and drawn extensive attention. Despite the impressive progress, the safety of LLM-based RecSys against backdoor attacks remains largely under-explored. In this paper, we raise a new problem: Can a backdoor with a specific trigger be injected into LLM-based Recsys, leading to the manipulation of the recommendation responses when the backdoor trigger is appended to an item's title? To investigate the vulnerabilities of LLM-based RecSys under backdoor attacks, we propose a new attack framework termed Backdoor Injection Poisoning for RecSys (BadRec). BadRec perturbs the items' titles with triggers and employs several fake users to interact with these items, effectively poisoning the training set and injecting backdoors into LLM-based RecSys. Comprehensive experiments reveal that poisoning just 1% of the training data with adversarial examples is sufficient to successfully implant backdoors, enabling manipulation of recommendations. To further mitigate such a security threat, we propose a universal defense strategy called Poison Scanner (P-Scanner). Specifically, we introduce an LLM-based poison scanner to detect the poisoned items by leveraging the powerful language understanding and rich knowledge of LLMs. A trigger augmentation agent is employed to generate diverse synthetic triggers to guide the poison scanner in learning domain-specific knowledge of the poisoned item detection task. Extensive experiments on three real-world datasets validate the effectiveness of the proposed P-Scanner.

Title: Bias Beyond English: Evaluating Social Bias and Debiasing Methods in a Low-Resource Setting

Authors: Ej Zhou, Weiming Lu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.11183
Pdf URL: https://arxiv.org/pdf/2504.11183
Copy Paste: [[2504.11183]] Bias Beyond English: Evaluating Social Bias and Debiasing Methods in a Low-Resource Setting(https://arxiv.org/abs/2504.11183)
Keywords: fair
Abstract: Social bias in language models can potentially exacerbate social inequalities. Despite it having garnered wide attention, most research focuses on English data. In a low-resource scenario, the models often perform worse due to insufficient training data. This study aims to leverage high-resource language corpora to evaluate bias and experiment with debiasing methods in low-resource languages. We evaluated the performance of recent multilingual models in five languages: English (\textsc{eng}), Chinese (\textsc{zho}), Russian (\textsc{rus}), Indonesian (\textsc{ind}) and Thai (\textsc{tha}), and analyzed four bias dimensions: \textit{gender}, \textit{religion}, \textit{nationality}, and \textit{race-color}. By constructing multilingual bias evaluation datasets, this study allows fair comparisons between models across languages. We have further investigated three debiasing methods-\texttt{CDA}, \texttt{Dropout}, \texttt{SenDeb}-and demonstrated that debiasing methods from high-resource languages can be effectively transferred to low-resource ones, providing actionable insights for fairness research in multilingual NLP.

Title: Benchmarking Next-Generation Reasoning-Focused Large Language Models in Ophthalmology: A Head-to-Head Evaluation on 5,888 Items

Authors: Minjie Zou, Sahana Srinivasan, Thaddaeus Wai Soon Lo, Ke Zou, Gabriel Dawei Yang, Xuguang Ai, Hyunjae Kim, Maxwell Singer, Fares Antaki, Kelvin Li, Robert Chang, Marcus Tan, David Ziyou Chen, Dianbo Liu, Qingyu Chen, Yih Chung Tham
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11186
Pdf URL: https://arxiv.org/pdf/2504.11186
Copy Paste: [[2504.11186]] Benchmarking Next-Generation Reasoning-Focused Large Language Models in Ophthalmology: A Head-to-Head Evaluation on 5,888 Items(https://arxiv.org/abs/2504.11186)
Keywords: large language model
Abstract: Recent advances in reasoning-focused large language models (LLMs) mark a shift from general LLMs toward models designed for complex decision-making, a crucial aspect in medicine. However, their performance in specialized domains like ophthalmology remains underexplored. This study comprehensively evaluated and compared the accuracy and reasoning capabilities of four newly developed reasoning-focused LLMs, namely DeepSeek-R1, OpenAI o1, o3-mini, and Gemini 2.0 Flash-Thinking. Each model was assessed using 5,888 multiple-choice ophthalmology exam questions from the MedMCQA dataset in zero-shot setting. Quantitative evaluation included accuracy, Macro-F1, and five text-generation metrics (ROUGE-L, METEOR, BERTScore, BARTScore, and AlignScore), computed against ground-truth reasonings. Average inference time was recorded for a subset of 100 randomly selected questions. Additionally, two board-certified ophthalmologists qualitatively assessed clarity, completeness, and reasoning structure of responses to differential diagnosis questions.O1 (0.902) and DeepSeek-R1 (0.888) achieved the highest accuracy, with o1 also leading in Macro-F1 (0.900). The performance of models across the text-generation metrics varied: O3-mini excelled in ROUGE-L (0.151), o1 in METEOR (0.232), DeepSeek-R1 and o3-mini tied for BERTScore (0.673), DeepSeek-R1 (-4.105) and Gemini 2.0 Flash-Thinking (-4.127) performed best in BARTScore, while o3-mini (0.181) and o1 (0.176) led AlignScore. Inference time across the models varied, with DeepSeek-R1 being slowest (40.4 seconds) and Gemini 2.0 Flash-Thinking fastest (6.7 seconds). Qualitative evaluation revealed that DeepSeek-R1 and Gemini 2.0 Flash-Thinking tended to provide detailed and comprehensive intermediate reasoning, whereas o1 and o3-mini displayed concise and summarized justifications.

Title: R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning

Authors: Lijun Sheng, Jian Liang, Zilei Wang, Ran He
Subjects: cs.LG, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2504.11195
Pdf URL: https://arxiv.org/pdf/2504.11195
Copy Paste: [[2504.11195]] R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning(https://arxiv.org/abs/2504.11195)
Keywords: defense, attack, robust
Abstract: Vision-language models (VLMs), such as CLIP, have gained significant popularity as foundation models, with numerous fine-tuning methods developed to enhance performance on downstream tasks. However, due to their inherent vulnerability and the common practice of selecting from a limited set of open-source models, VLMs suffer from a higher risk of adversarial attacks than traditional vision models. Existing defense techniques typically rely on adversarial fine-tuning during training, which requires labeled data and lacks of flexibility for downstream tasks. To address these limitations, we propose robust test-time prompt tuning (R-TPT), which mitigates the impact of adversarial attacks during the inference stage. We first reformulate the classic marginal entropy objective by eliminating the term that introduces conflicts under adversarial conditions, retaining only the pointwise entropy minimization. Furthermore, we introduce a plug-and-play reliability-based weighted ensembling strategy, which aggregates useful information from reliable augmented views to strengthen the defense. R-TPT enhances defense against adversarial attacks without requiring labeled training data while offering high flexibility for inference tasks. Extensive experiments on widely used benchmarks with various attacks demonstrate the effectiveness of R-TPT. The code is available in this https URL.

Title: Efficient Distributed Retrieval-Augmented Generation for Enhancing Language Model Performance

Authors: Shangyu Liu, Zhenzhe Zheng, Xiaoyao Huang, Fan Wu, Jie Wu
Subjects: cs.LG, cs.AI, cs.DC, cs.IR
Abstract URL: https://arxiv.org/abs/2504.11197
Pdf URL: https://arxiv.org/pdf/2504.11197
Copy Paste: [[2504.11197]] Efficient Distributed Retrieval-Augmented Generation for Enhancing Language Model Performance(https://arxiv.org/abs/2504.11197)
Keywords: privacy
Abstract: Small language models (SLMs) support efficient deployments on resource-constrained edge devices, but their limited capacity compromises inference performance. Retrieval-augmented generation (RAG) is a promising solution to enhance model performance by integrating external databases, without requiring intensive on-device model retraining. However, large-scale public databases and user-specific private contextual documents are typically located on the cloud and the device separately, while existing RAG implementations are primarily centralized. To bridge this gap, we propose DRAGON, a distributed RAG framework to enhance on-device SLMs through both general and personal knowledge without the risk of leaking document privacy. Specifically, DRAGON decomposes multi-document RAG into multiple parallel token generation processes performed independently and locally on the cloud and the device, and employs a newly designed Speculative Aggregation, a dual-side speculative algorithm to avoid frequent output synchronization between the cloud and device. A new scheduling algorithm is further introduced to identify the optimal aggregation side based on real-time network conditions. Evaluations on real-world hardware testbed demonstrate a significant performance improvement of DRAGON-up to 1.9x greater gains over standalone SLM compared to the centralized RAG, substantial reduction in per-token latency, and negligible Time to First Token (TTFT) overhead.

Title: Video Summarization with Large Language Models

Authors: Min Jung Lee, Dayoung Gong, Minsu Cho
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11199
Pdf URL: https://arxiv.org/pdf/2504.11199
Copy Paste: [[2504.11199]] Video Summarization with Large Language Models(https://arxiv.org/abs/2504.11199)
Keywords: large language model
Abstract: The exponential increase in video content poses significant challenges in terms of efficient navigation, search, and retrieval, thus requiring advanced video summarization techniques. Existing video summarization methods, which heavily rely on visual features and temporal dynamics, often fail to capture the semantics of video content, resulting in incomplete or incoherent summaries. To tackle the challenge, we propose a new video summarization framework that leverages the capabilities of recent Large Language Models (LLMs), expecting that the knowledge learned from massive data enables LLMs to evaluate video frames in a manner that better aligns with diverse semantics and human judgments, effectively addressing the inherent subjectivity in defining keyframes. Our method, dubbed LLM-based Video Summarization (LLMVS), translates video frames into a sequence of captions using a Muti-modal Large Language Model (M-LLM) and then assesses the importance of each frame using an LLM, based on the captions in its local context. These local importance scores are refined through a global attention mechanism in the entire context of video captions, ensuring that our summaries effectively reflect both the details and the overarching narrative. Our experimental results demonstrate the superiority of the proposed method over existing ones in standard benchmarks, highlighting the potential of LLMs in the processing of multimedia content.

Title: Slice+Slice Baby: Generating Last-Level Cache Eviction Sets in the Blink of an Eye

Authors: Bradley Morgan, Gal Horowitz, Sioli O'Connell, Stephan van Schaik, Chitchanok Chuengsatiansup, Daniel Genkin, Olaf Maennel, Paul Montague, Eyal Ronen, Yuval Yarom
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.11208
Pdf URL: https://arxiv.org/pdf/2504.11208
Copy Paste: [[2504.11208]] Slice+Slice Baby: Generating Last-Level Cache Eviction Sets in the Blink of an Eye(https://arxiv.org/abs/2504.11208)
Keywords: attack
Abstract: An essential step for mounting cache attacks is finding eviction sets, collections of memory locations that contend on cache space. On Intel processors, one of the main challenges for identifying contending addresses is the sliced cache design, where the processor hashes the physical address to determine where in the cache a memory location is stored. While past works have demonstrated that the hash function can be reversed, they also showed that it depends on physical address bits that the adversary does not know. In this work, we make three main contributions to the art of finding eviction sets. We first exploit microarchitectural races to compare memory access times and identify the cache slice to which an address maps. We then use the known hash function to both reduce the error rate in our slice identification method and to reduce the work by extrapolating slice mappings to untested memory addresses. Finally, we show how to propagate information on eviction sets across different page offsets for the hitherto unexplored case of non-linear hash functions. Our contributions allow for entire LLC eviction set generation in 0.7 seconds on the Intel i7-9850H and 1.6 seconds on the i9-10900K, both using non-linear functions. This represents a significant improvement compared to state-of-the-art techniques taking 9x and 10x longer, respectively.

Title: Diversity-Driven Learning: Tackling Spurious Correlations and Data Heterogeneity in Federated Models

Authors: Gergely D. Németh, Eros Fanì, Yeat Jeng Ng, Barbara Caputo, Miguel Ángel Lozano, Nuria Oliver, Novi Quadrianto
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11216
Pdf URL: https://arxiv.org/pdf/2504.11216
Copy Paste: [[2504.11216]] Diversity-Driven Learning: Tackling Spurious Correlations and Data Heterogeneity in Federated Models(https://arxiv.org/abs/2504.11216)
Keywords: privacy, robust, federate
Abstract: Federated Learning (FL) enables decentralized training of machine learning models on distributed data while preserving privacy. However, in real-world FL settings, client data is often non-identically distributed and imbalanced, resulting in statistical data heterogeneity which impacts the generalization capabilities of the server's model across clients, slows convergence and reduces performance. In this paper, we address this challenge by first proposing a characterization of statistical data heterogeneity by means of 6 metrics of global and client attribute imbalance, class imbalance, and spurious correlations. Next, we create and share 7 computer vision datasets for binary and multiclass image classification tasks in Federated Learning that cover a broad range of statistical data heterogeneity and hence simulate real-world situations. Finally, we propose FedDiverse, a novel client selection algorithm in FL which is designed to manage and leverage data heterogeneity across clients by promoting collaboration between clients with complementary data distributions. Experiments on the seven proposed FL datasets demonstrate FedDiverse's effectiveness in enhancing the performance and robustness of a variety of FL methods while having low communication and computational overhead.

Title: 3DAffordSplat: Efficient Affordance Reasoning with 3D Gaussians

Authors: Zeming wei, Junyi Lin, Yang Liu, Weixing Chen, Jingzhou Luo, Guanbin Li, Liang Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11218
Pdf URL: https://arxiv.org/pdf/2504.11218
Copy Paste: [[2504.11218]] 3DAffordSplat: Efficient Affordance Reasoning with 3D Gaussians(https://arxiv.org/abs/2504.11218)
Keywords: robust
Abstract: 3D affordance reasoning is essential in associating human instructions with the functional regions of 3D objects, facilitating precise, task-oriented manipulations in embodied AI. However, current methods, which predominantly depend on sparse 3D point clouds, exhibit limited generalizability and robustness due to their sensitivity to coordinate variations and the inherent sparsity of the data. By contrast, 3D Gaussian Splatting (3DGS) delivers high-fidelity, real-time rendering with minimal computational overhead by representing scenes as dense, continuous distributions. This positions 3DGS as a highly effective approach for capturing fine-grained affordance details and improving recognition accuracy. Nevertheless, its full potential remains largely untapped due to the absence of large-scale, 3DGS-specific affordance datasets. To overcome these limitations, we present 3DAffordSplat, the first large-scale, multi-modal dataset tailored for 3DGS-based affordance reasoning. This dataset includes 23,677 Gaussian instances, 8,354 point cloud instances, and 6,631 manually annotated affordance labels, encompassing 21 object categories and 18 affordance types. Building upon this dataset, we introduce AffordSplatNet, a novel model specifically designed for affordance reasoning using 3DGS representations. AffordSplatNet features an innovative cross-modal structure alignment module that exploits structural consistency priors to align 3D point cloud and 3DGS representations, resulting in enhanced affordance recognition accuracy. Extensive experiments demonstrate that the 3DAffordSplat dataset significantly advances affordance learning within the 3DGS domain, while AffordSplatNet consistently outperforms existing methods across both seen and unseen settings, highlighting its robust generalization capabilities.

Title: CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image

Authors: Jingshun Huang, Haitao Lin, Tianyu Wang, Yanwei Fu, Xiangyang Xue, Yi Zhu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2504.11230
Pdf URL: https://arxiv.org/pdf/2504.11230
Copy Paste: [[2504.11230]] CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image(https://arxiv.org/abs/2504.11230)
Keywords: robust, segmentation
Abstract: This paper tackles category-level pose estimation of articulated objects in robotic manipulation tasks and introduces a new benchmark dataset. While recent methods estimate part poses and sizes at the category level, they often rely on geometric cues and complex multi-stage pipelines that first segment parts from the point cloud, followed by Normalized Part Coordinate Space (NPCS) estimation for 6D poses. These approaches overlook dense semantic cues from RGB images, leading to suboptimal accuracy, particularly for objects with small parts. To address these limitations, we propose a single-stage Network, CAP-Net, for estimating the 6D poses and sizes of Categorical Articulated Parts. This method combines RGB-D features to generate instance segmentation and NPCS representations for each part in an end-to-end manner. CAP-Net uses a unified network to simultaneously predict point-wise class labels, centroid offsets, and NPCS maps. A clustering algorithm then groups points of the same predicted class based on their estimated centroid distances to isolate each part. Finally, the NPCS region of each part is aligned with the point cloud to recover its final pose and size. To bridge the sim-to-real domain gap, we introduce the RGBD-Art dataset, the largest RGB-D articulated dataset to date, featuring photorealistic RGB images and depth noise simulated from real sensors. Experimental evaluations on the RGBD-Art dataset demonstrate that our method significantly outperforms the state-of-the-art approach. Real-world deployments of our model in robotic tasks underscore its robustness and exceptional sim-to-real transfer capabilities, confirming its substantial practical utility. Our dataset, code and pre-trained models are available on the project page.

Title: Leveraging multimodal explanatory annotations for video interpretation with Modality Specific Dataset

Authors: Elisa Ancarani, Julie Tores, Lucile Sassatelli, Rémy Sun, Hui-Yin Wu, Frédéric Precioso
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2504.11232
Pdf URL: https://arxiv.org/pdf/2504.11232
Copy Paste: [[2504.11232]] Leveraging multimodal explanatory annotations for video interpretation with Modality Specific Dataset(https://arxiv.org/abs/2504.11232)
Keywords: robust
Abstract: We examine the impact of concept-informed supervision on multimodal video interpretation models using MOByGaze, a dataset containing human-annotated explanatory concepts. We introduce Concept Modality Specific Datasets (CMSDs), which consist of data subsets categorized by the modality (visual, textual, or audio) of annotated concepts. Models trained on CMSDs outperform those using traditional legacy training in both early and late fusion approaches. Notably, this approach enables late fusion models to achieve performance close to that of early fusion models. These findings underscore the importance of modality-specific annotations in developing robust, self-explainable video models and contribute to advancing interpretable multimodal learning in complex video analysis.

Title: Reconstructing Fine-Grained Network Data using Autoencoder Architectures with Domain Knowledge Penalties

Authors: Mark Cheung, Sridhar Venkatesan
Subjects: cs.LG, cs.NI
Abstract URL: https://arxiv.org/abs/2504.11255
Pdf URL: https://arxiv.org/pdf/2504.11255
Copy Paste: [[2504.11255]] Reconstructing Fine-Grained Network Data using Autoencoder Architectures with Domain Knowledge Penalties(https://arxiv.org/abs/2504.11255)
Keywords: security, privacy, attack
Abstract: The ability to reconstruct fine-grained network session data, including individual packets, from coarse-grained feature vectors is crucial for improving network security models. However, the large-scale collection and storage of raw network traffic pose significant challenges, particularly for capturing rare cyberattack samples. These challenges hinder the ability to retain comprehensive datasets for model training and future threat detection. To address this, we propose a machine learning approach guided by formal methods to encode and reconstruct network data. Our method employs autoencoder models with domain-informed penalties to impute PCAP session headers from structured feature representations. Experimental results demonstrate that incorporating domain knowledge through constraint-based loss terms significantly improves reconstruction accuracy, particularly for categorical features with session-level encodings. By enabling efficient reconstruction of detailed network sessions, our approach facilitates data-efficient model training while preserving privacy and storage efficiency.

Title: Enhanced Small Target Detection via Multi-Modal Fusion and Attention Mechanisms: A YOLOv5 Approach

Authors: Xiaoxiao Ma, Junxiong Tong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11262
Pdf URL: https://arxiv.org/pdf/2504.11262
Copy Paste: [[2504.11262]] Enhanced Small Target Detection via Multi-Modal Fusion and Attention Mechanisms: A YOLOv5 Approach(https://arxiv.org/abs/2504.11262)
Keywords: robust
Abstract: With the rapid development of information technology, modern warfare increasingly relies on intelligence, making small target detection critical in military applications. The growing demand for efficient, real-time detection has created challenges in identifying small targets in complex environments due to interference. To address this, we propose a small target detection method based on multi-modal image fusion and attention mechanisms. This method leverages YOLOv5, integrating infrared and visible light data along with a convolutional attention module to enhance detection performance. The process begins with multi-modal dataset registration using feature point matching, ensuring accurate network training. By combining infrared and visible light features with attention mechanisms, the model improves detection accuracy and robustness. Experimental results on anti-UAV and Visdrone datasets demonstrate the effectiveness and practicality of our approach, achieving superior detection results for small and dim targets.

Title: DeepSelective: Feature Gating and Representation Matching for Interpretable Clinical Prediction

Authors: Ruochi Zhang, Qian Yang, Xiaoyang Wang, Haoran Wu, Qiong Zhou, Yu Wang, Kewei Li, Yueying Wang, Yusi Fan, Jiale Zhang, Lan Huang, Chang Liu, Fengfeng Zhou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11264
Pdf URL: https://arxiv.org/pdf/2504.11264
Copy Paste: [[2504.11264]] DeepSelective: Feature Gating and Representation Matching for Interpretable Clinical Prediction(https://arxiv.org/abs/2504.11264)
Keywords: robust, interpretability
Abstract: The rapid accumulation of Electronic Health Records (EHRs) has transformed healthcare by providing valuable data that enhance clinical predictions and diagnoses. While conventional machine learning models have proven effective, they often lack robust representation learning and depend heavily on expert-crafted features. Although deep learning offers powerful solutions, it is often criticized for its lack of interpretability. To address these challenges, we propose DeepSelective, a novel end to end deep learning framework for predicting patient prognosis using EHR data, with a strong emphasis on enhancing model interpretability. DeepSelective combines data compression techniques with an innovative feature selection approach, integrating custom-designed modules that work together to improve both accuracy and interpretability. Our experiments demonstrate that DeepSelective not only enhances predictive accuracy but also significantly improves interpretability, making it a valuable tool for clinical decision-making. The source code is freely available at this http URL .

Title: Distillation-Supervised Convolutional Low-Rank Adaptation for Efficient Image Super-Resolution

Authors: Xinning Chai, Yao Zhang, Yuxuan Zhang, Zhengxue Cheng, Yingsheng Qin, Yucai Yang, Li Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11271
Pdf URL: https://arxiv.org/pdf/2504.11271
Copy Paste: [[2504.11271]] Distillation-Supervised Convolutional Low-Rank Adaptation for Efficient Image Super-Resolution(https://arxiv.org/abs/2504.11271)
Keywords: large language model
Abstract: Convolutional neural networks (CNNs) have been widely used in efficient image super-resolution. However, for CNN-based methods, performance gains often require deeper networks and larger feature maps, which increase complexity and inference costs. Inspired by LoRA's success in fine-tuning large language models, we explore its application to lightweight models and propose Distillation-Supervised Convolutional Low-Rank Adaptation (DSCLoRA), which improves model performance without increasing architectural complexity or inference costs. Specifically, we integrate ConvLoRA into the efficient SR network SPAN by replacing the SPAB module with the proposed SConvLB module and incorporating ConvLoRA layers into both the pixel shuffle block and its preceding convolutional layer. DSCLoRA leverages low-rank decomposition for parameter updates and employs a spatial feature affinity-based knowledge distillation strategy to transfer second-order statistical information from teacher models (pre-trained SPAN) to student models (ours). This method preserves the core knowledge of lightweight models and facilitates optimal solution discovery under certain conditions. Experiments on benchmark datasets show that DSCLoRA improves PSNR and SSIM over SPAN while maintaining its efficiency and competitive image quality. Notably, DSCLoRA ranked first in the Overall Performance Track of the NTIRE 2025 Efficient Super-Resolution Challenge. Our code and models are made publicly available at this https URL.

Title: From Misleading Queries to Accurate Answers: A Three-Stage Fine-Tuning Method for LLMs

Authors: Guocong Li, Weize Liu, Yihang Wu, Ping Wang, Shuaihan Huang, Hongxia Xu, Jian Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.11277
Pdf URL: https://arxiv.org/pdf/2504.11277
Copy Paste: [[2504.11277]] From Misleading Queries to Accurate Answers: A Three-Stage Fine-Tuning Method for LLMs(https://arxiv.org/abs/2504.11277)
Keywords: large language model
Abstract: Large language models (LLMs) exhibit excellent performance in natural language processing (NLP), but remain highly sensitive to the quality of input queries, especially when these queries contain misleading or inaccurate information. Existing methods focus on correcting the output, but they often overlook the potential of improving the ability of LLMs to detect and correct misleading content in the input itself. In this paper, we propose a novel three-stage fine-tuning method that enhances the ability of LLMs to detect and correct misleading information in the input, further improving response accuracy and reducing hallucinations. Specifically, the three stages include (1) training LLMs to identify misleading information, (2) training LLMs to correct the misleading information using built-in or external knowledge, and (3) training LLMs to generate accurate answers based on the corrected queries. To evaluate our method, we conducted experiments on three datasets for the hallucination detection task and the question answering (QA) task, as well as two datasets containing misleading information that we constructed. The experimental results demonstrate that our method significantly improves the accuracy and factuality of LLM responses, while also enhancing the ability to detect hallucinations and reducing the generation of hallucinations in the output, particularly when the query contains misleading information. We will publicly release our code upon acceptance.

Title: UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer

Authors: Xiang Wang, Shiwei Zhang, Longxiang Tang, Yingya Zhang, Changxin Gao, Yuehuan Wang, Nong Sang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11289
Pdf URL: https://arxiv.org/pdf/2504.11289
Copy Paste: [[2504.11289]] UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer(https://arxiv.org/abs/2504.11289)
Keywords: robust, diffusion, transformer, generative
Abstract: This report presents UniAnimate-DiT, an advanced project that leverages the cutting-edge and powerful capabilities of the open-source Wan2.1 model for consistent human image animation. Specifically, to preserve the robust generative capabilities of the original Wan2.1 model, we implement Low-Rank Adaptation (LoRA) technique to fine-tune a minimal set of parameters, significantly reducing training memory overhead. A lightweight pose encoder consisting of multiple stacked 3D convolutional layers is designed to encode motion information of driving poses. Furthermore, we adopt a simple concatenation operation to integrate the reference appearance into the model and incorporate the pose information of the reference image for enhanced pose alignment. Experimental results show that our approach achieves visually appearing and temporally consistent high-fidelity animations. Trained on 480p (832x480) videos, UniAnimate-DiT demonstrates strong generalization capabilities to seamlessly upscale to 720P (1280x720) during inference. The training and inference code is publicly available at this https URL.

Title: Automated Python Translation

Authors: Joshua Otten, Antonios Anastasopoulos, Kevin Moran
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.11290
Pdf URL: https://arxiv.org/pdf/2504.11290
Copy Paste: [[2504.11290]] Automated Python Translation(https://arxiv.org/abs/2504.11290)
Keywords: large language model
Abstract: Python is one of the most commonly used programming languages in industry and education. Its English keywords and built-in functions/modules allow it to come close to pseudo-code in terms of its readability and ease of writing. However, those who do not speak English may not experience these advantages. In fact, they may even be hindered in their ability to understand Python code, as the English nature of its terms creates an additional layer of overhead. To that end, we introduce the task of automatically translating Python's natural modality (keywords, error types, identifiers, etc.) into other human languages. This presents a unique challenge, considering the abbreviated nature of these forms, as well as potential untranslatability of advanced mathematical/programming concepts across languages. We therefore create an automated pipeline to translate Python into other human languages, comparing strategies using machine translation and large language models. We then use this pipeline to acquire translations from five common Python libraries (pytorch, pandas, tensorflow, numpy, and random) in seven languages, and do a quality test on a subset of these terms in French, Greek, and Bengali. We hope this will provide a clearer path forward towards creating a universal Python, accessible to anyone regardless of nationality or language background.

Title: Autoregressive Distillation of Diffusion Transformers

Authors: Yeongmin Kim, Sotiris Anagnostidis, Yuming Du, Edgar Schönfeld, Jonas Kohler, Markos Georgopoulos, Albert Pumarola, Ali Thabet, Artsiom Sanakoyeu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11295
Pdf URL: https://arxiv.org/pdf/2504.11295
Copy Paste: [[2504.11295]] Autoregressive Distillation of Diffusion Transformers(https://arxiv.org/abs/2504.11295)
Keywords: diffusion, transformer
Abstract: Diffusion models with transformer architectures have demonstrated promising capabilities in generating high-fidelity images and scalability for high resolution. However, iterative sampling process required for synthesis is very resource-intensive. A line of work has focused on distilling solutions to probability flow ODEs into few-step student models. Nevertheless, existing methods have been limited by their reliance on the most recent denoised samples as input, rendering them susceptible to exposure bias. To address this limitation, we propose AutoRegressive Distillation (ARD), a novel approach that leverages the historical trajectory of the ODE to predict future steps. ARD offers two key benefits: 1) it mitigates exposure bias by utilizing a predicted historical trajectory that is less susceptible to accumulated errors, and 2) it leverages the previous history of the ODE trajectory as a more effective source of coarse-grained information. ARD modifies the teacher transformer architecture by adding token-wise time embedding to mark each input from the trajectory history and employs a block-wise causal attention mask for training. Furthermore, incorporating historical inputs only in lower transformer layers enhances performance and efficiency. We validate the effectiveness of ARD in a class-conditioned generation on ImageNet and T2I synthesis. Our model achieves a $5\times$ reduction in FID degradation compared to the baseline methods while requiring only 1.1\% extra FLOPs on ImageNet-256. Moreover, ARD reaches FID of 1.84 on ImageNet-256 in merely 4 steps and outperforms the publicly available 1024p text-to-image distilled models in prompt adherence score with a minimal drop in FID compared to the teacher. Project page: this https URL.

Title: Context-Aware Palmprint Recognition via a Relative Similarity Metric

Authors: Trinnhallen Brisley, Aryan Gandhi, Joseph Magen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11306
Pdf URL: https://arxiv.org/pdf/2504.11306
Copy Paste: [[2504.11306]] Context-Aware Palmprint Recognition via a Relative Similarity Metric(https://arxiv.org/abs/2504.11306)
Keywords: robust
Abstract: We propose a new approach to matching mechanism for palmprint recognition by introducing a Relative Similarity Metric (RSM) that enhances the robustness and discriminability of existing matching frameworks. While conventional systems rely on direct pairwise similarity measures, such as cosine or Euclidean distances, these metrics fail to capture how a pairwise similarity compares within the context of the entire dataset. Our method addresses this by evaluating the relative consistency of similarity scores across up to all identities, allowing for better suppression of false positives and negatives. Applied atop the CCNet architecture, our method achieves a new state-of-the-art 0.000036% Equal Error Rate (EER) on the Tongji dataset, outperforming previous methods and demonstrating the efficacy of incorporating relational structure into the palmprint matching process.

Title: Big Brother is Watching: Proactive Deepfake Detection via Learnable Hidden Face

Authors: Hongbo Li, Shangchao Yang, Ruiyang Xia, Lin Yuan, Xinbo Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11309
Pdf URL: https://arxiv.org/pdf/2504.11309
Copy Paste: [[2504.11309]] Big Brother is Watching: Proactive Deepfake Detection via Learnable Hidden Face(https://arxiv.org/abs/2504.11309)
Keywords: protect, defense, robust, watermark
Abstract: As deepfake technologies continue to advance, passive detection methods struggle to generalize with various forgery manipulations and datasets. Proactive defense techniques have been actively studied with the primary aim of preventing deepfake operation effectively working. In this paper, we aim to bridge the gap between passive detection and proactive defense, and seek to solve the detection problem utilizing a proactive methodology. Inspired by several watermarking-based forensic methods, we explore a novel detection framework based on the concept of ``hiding a learnable face within a face''. Specifically, relying on a semi-fragile invertible steganography network, a secret template image is embedded into a host image imperceptibly, acting as an indicator monitoring for any malicious image forgery when being restored by the inverse steganography process. Instead of being manually specified, the secret template is optimized during training to resemble a neutral facial appearance, just like a ``big brother'' hidden in the image to be protected. By incorporating a self-blending mechanism and robustness learning strategy with a simulative transmission channel, a robust detector is built to accurately distinguish if the steganographic image is maliciously tampered or benignly processed. Finally, extensive experiments conducted on multiple datasets demonstrate the superiority of the proposed approach over competing passive and proactive detection methods.

Title: Intelligent driving vehicle front multi-target tracking and detection based on YOLOv5 and point cloud 3D projection

Authors: Dayong Liu, Qingrui Zhang, Zeyang Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11310
Pdf URL: https://arxiv.org/pdf/2504.11310
Copy Paste: [[2504.11310]] Intelligent driving vehicle front multi-target tracking and detection based on YOLOv5 and point cloud 3D projection(https://arxiv.org/abs/2504.11310)
Keywords: extraction
Abstract: In multi-target tracking and detection tasks, it is necessary to continuously track multiple targets, such as vehicles, pedestrians, etc. To achieve this goal, the system must be able to continuously acquire and process image frames containing these targets. These consecutive frame images enable the algorithm to update the position and state of the target in real-time in each frame of the image. How to accurately associate the detected target with the target in the previous or next frame to form a stable trajectory is a complex problem. Therefore, a multi object tracking and detection method for intelligent driving vehicles based on YOLOv5 and point cloud 3D projection is proposed. Using Retinex algorithm to enhance the image of the environment in front of the vehicle, remove light interference in the image, and build an intelligent detection model based on YOLOv5 network structure. The enhanced image is input into the model, and multiple targets in front of the vehicle are identified through feature extraction and target localization. By combining point cloud 3D projection technology, the correlation between the position changes of adjacent frame images in the projection coordinate system can be inferred. By sequentially projecting the multi-target recognition results of multiple consecutive frame images into the 3D laser point cloud environment, effective tracking of the motion trajectories of all targets in front of the vehicle can be achieved. The experimental results show that the application of this method for intelligent driving vehicle front multi-target tracking and detection yields a MOTA (Tracking Accuracy) value greater than 30, demonstrating its superior tracking and detection performance.

Title: Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints

Authors: Ruicheng Ao, Gan Luo, David Simchi-Levi, Xinshang Wang
Subjects: cs.LG, cs.AI, cs.DC, math.OC, stat.ML
Abstract URL: https://arxiv.org/abs/2504.11320
Pdf URL: https://arxiv.org/pdf/2504.11320
Copy Paste: [[2504.11320]] Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints(https://arxiv.org/abs/2504.11320)
Keywords: large language model
Abstract: Large Language Models (LLMs) are indispensable in today's applications, but their inference procedure -- generating responses by processing text in segments and using a memory-heavy Key-Value (KV) cache -- demands significant computational resources, particularly under memory constraints. This paper formulates LLM inference optimization as a multi-stage online scheduling problem where sequential prompt arrivals and KV cache growth render conventional scheduling ineffective. We develop a fluid dynamics approximation to provide a tractable benchmark that guides algorithm design. Building on this, we propose the Waiting for Accumulated Inference Threshold (WAIT) algorithm, which uses multiple thresholds to schedule incoming prompts optimally when output lengths are known, and extend it to Nested WAIT for cases with unknown output lengths. Theoretical analysis shows that both algorithms achieve near-optimal performance against the fluid benchmark in heavy traffic conditions, balancing throughput, latency, and Time to First Token (TTFT). Experiments with the Llama-7B model on an A100 GPU using both synthetic and real-world datasets demonstrate improved throughput and reduced latency relative to established baselines like vLLM and Sarathi. This work bridges operations research and machine learning, offering a rigorous framework for the efficient deployment of LLMs under memory constraints.

Title: PVUW 2025 Challenge Report: Advances in Pixel-level Understanding of Complex Videos in the Wild

Authors: Henghui Ding, Chang Liu, Nikhila Ravi, Shuting He, Yunchao Wei, Song Bai, Philip Torr, Kehuan Song, Xinglin Xie, Kexin Zhang, Licheng Jiao, Lingling Li, Shuyuan Yang, Xuqiang Cao, Linnan Zhao, Jiaxuan Zhao, Fang Liu, Mengjiao Wang, Junpei Zhang, Xu Liu, Yuting Yang, Mengru Ma, Hao Fang, Runmin Cong, Xiankai Lu, Zhiyang Che, Wei Zhan, Tianming Liang, Haichao Jiang, Wei-Shi Zheng, Jian-Fang Hu, Haobo Yuan, Xiangtai Li, Tao Zhang, Lu Qi, Ming-Hsuan Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11326
Pdf URL: https://arxiv.org/pdf/2504.11326
Copy Paste: [[2504.11326]] PVUW 2025 Challenge Report: Advances in Pixel-level Understanding of Complex Videos in the Wild(https://arxiv.org/abs/2504.11326)
Keywords: segmentation
Abstract: This report provides a comprehensive overview of the 4th Pixel-level Video Understanding in the Wild (PVUW) Challenge, held in conjunction with CVPR 2025. It summarizes the challenge outcomes, participating methodologies, and future research directions. The challenge features two tracks: MOSE, which focuses on complex scene video object segmentation, and MeViS, which targets motion-guided, language-based video segmentation. Both tracks introduce new, more challenging datasets designed to better reflect real-world scenarios. Through detailed evaluation and analysis, the challenge offers valuable insights into the current state-of-the-art and emerging trends in complex video segmentation. More information can be found on the workshop website: this https URL.

Title: A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce

Authors: Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong, Hanze Dong
Subjects: cs.LG, cs.AI, cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2504.11343
Pdf URL: https://arxiv.org/pdf/2504.11343
Copy Paste: [[2504.11343]] A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce(https://arxiv.org/abs/2504.11343)
Keywords: robust, large language model
Abstract: Reinforcement learning (RL) has become a prevailing approach for fine-tuning large language models (LLMs) on complex reasoning tasks. Among recent methods, GRPO stands out for its empirical success in training models such as DeepSeek-R1, yet the sources of its effectiveness remain poorly understood. In this work, we revisit GRPO from a reinforce-like algorithm perspective and analyze its core components. Surprisingly, we find that a simple rejection sampling baseline, RAFT, which trains only on positively rewarded samples, yields competitive performance than GRPO and PPO. Our ablation studies reveal that GRPO's main advantage arises from discarding prompts with entirely incorrect responses, rather than from its reward normalization. Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples. Reinforce-Rej improves KL efficiency and stability, serving as a lightweight yet effective alternative to more complex RL algorithms. We advocate RAFT as a robust and interpretable baseline, and suggest that future advances should focus on more principled designs for incorporating negative samples, rather than relying on them indiscriminately. Our findings provide guidance for future work in reward-based LLM post-training.

Title: Interpretable Hybrid-Rule Temporal Point Processes

Authors: Yunyang Cao, Juekai Lin, Hongye Wang, Wenhao Li, Bo Jin
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2504.11344
Pdf URL: https://arxiv.org/pdf/2504.11344
Copy Paste: [[2504.11344]] Interpretable Hybrid-Rule Temporal Point Processes(https://arxiv.org/abs/2504.11344)
Keywords: interpretability
Abstract: Temporal Point Processes (TPPs) are widely used for modeling event sequences in various medical domains, such as disease onset prediction, progression analysis, and clinical decision support. Although TPPs effectively capture temporal dynamics, their lack of interpretability remains a critical challenge. Recent advancements have introduced interpretable TPPs. However, these methods fail to incorporate numerical features, thereby limiting their ability to generate precise predictions. To address this issue, we propose Hybrid-Rule Temporal Point Processes (HRTPP), a novel framework that integrates temporal logic rules with numerical features, improving both interpretability and predictive accuracy in event modeling. HRTPP comprises three key components: basic intensity for intrinsic event likelihood, rule-based intensity for structured temporal dependencies, and numerical feature intensity for dynamic probability modulation. To effectively discover valid rules, we introduce a two-phase rule mining strategy with Bayesian optimization. To evaluate our method, we establish a multi-criteria assessment framework, incorporating rule validity, model fitting, and temporal predictive accuracy. Experimental results on real-world medical datasets demonstrate that HRTPP outperforms state-of-the-art interpretable TPPs in terms of predictive performance and clinical interpretability. In case studies, the rules extracted by HRTPP explain the disease progression, offering valuable contributions to medical diagnosis.

Title: DeepWheel: Generating a 3D Synthetic Wheel Dataset for Design and Performance Evaluation

Authors: Soyoung Yoo, Namwoo Kang
Subjects: cs.CV, physics.app-ph
Abstract URL: https://arxiv.org/abs/2504.11347
Pdf URL: https://arxiv.org/pdf/2504.11347
Copy Paste: [[2504.11347]] DeepWheel: Generating a 3D Synthetic Wheel Dataset for Design and Performance Evaluation(https://arxiv.org/abs/2504.11347)
Keywords: diffusion, generative
Abstract: Data-driven design is emerging as a powerful strategy to accelerate engineering innovation. However, its application to vehicle wheel design remains limited due to the lack of large-scale, high-quality datasets that include 3D geometry and physical performance metrics. To address this gap, this study proposes a synthetic design-performance dataset generation framework using generative AI. The proposed framework first generates 2D rendered images using Stable Diffusion, and then reconstructs the 3D geometry through 2.5D depth estimation. Structural simulations are subsequently performed to extract engineering performance data. To further expand the design and performance space, topology optimization is applied, enabling the generation of a more diverse set of wheel designs. The final dataset, named DeepWheel, consists of over 6,000 photo-realistic images and 900 structurally analyzed 3D models. This multi-modal dataset serves as a valuable resource for surrogate model training, data-driven inverse design, and design space exploration. The proposed methodology is also applicable to other complex design domains. The dataset is released under the Creative Commons Attribution-NonCommercial 4.0 International(CC BY-NC 4.0) and is available on the this https URL

Title: DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks

Authors: Yupei Liu, Yuqi Jia, Jinyuan Jia, Dawn Song, Neil Zhenqiang Gong
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11358
Pdf URL: https://arxiv.org/pdf/2504.11358
Copy Paste: [[2504.11358]] DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks(https://arxiv.org/abs/2504.11358)
Keywords: attack
Abstract: LLM-integrated applications and agents are vulnerable to prompt injection attacks, where an attacker injects prompts into their inputs to induce attacker-desired outputs. A detection method aims to determine whether a given input is contaminated by an injected prompt. However, existing detection methods have limited effectiveness against state-of-the-art attacks, let alone adaptive ones. In this work, we propose DataSentinel, a game-theoretic method to detect prompt injection attacks. Specifically, DataSentinel fine-tunes an LLM to detect inputs contaminated with injected prompts that are strategically adapted to evade detection. We formulate this as a minimax optimization problem, with the objective of fine-tuning the LLM to detect strong adaptive attacks. Furthermore, we propose a gradient-based method to solve the minimax optimization problem by alternating between the inner max and outer min problems. Our evaluation results on multiple benchmark datasets and LLMs show that DataSentinel effectively detects both existing and adaptive prompt injection attacks.

Title: Teaching Large Language Models to Reason through Learning and Forgetting

Authors: Tianwei Ni, Allen Nie, Sapana Chaudhary, Yao Liu, Huzefa Rangwala, Rasool Fakoor
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2504.11364
Pdf URL: https://arxiv.org/pdf/2504.11364
Copy Paste: [[2504.11364]] Teaching Large Language Models to Reason through Learning and Forgetting(https://arxiv.org/abs/2504.11364)
Keywords: large language model
Abstract: Leveraging inference-time search in large language models has proven effective in further enhancing a trained model's capability to solve complex mathematical and reasoning problems. However, this approach significantly increases computational costs and inference time, as the model must generate and evaluate multiple candidate solutions to identify a viable reasoning path. To address this, we propose an effective approach that integrates search capabilities directly into the model by fine-tuning it using both successful (learning) and failed reasoning paths (forgetting) derived from diverse search methods. While fine-tuning the model with these data might seem straightforward, we identify a critical issue: the model's search capability tends to degrade rapidly if fine-tuning is performed naively. We show that this degradation can be substantially mitigated by employing a smaller learning rate. Extensive experiments on the challenging Game-of-24 and Countdown mathematical reasoning benchmarks show that our approach not only outperforms both standard fine-tuning and inference-time search baselines but also significantly reduces inference time by 180$\times$.

Title: A Decade of Wheat Mapping for Lebanon

Authors: Hasan Wehbi, Hasan Nasrallah, Mohamad Hasan Zahweh, Zeinab Takach, Veera Ganesh Yalla, Ali J. Ghandour
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11366
Pdf URL: https://arxiv.org/pdf/2504.11366
Copy Paste: [[2504.11366]] A Decade of Wheat Mapping for Lebanon(https://arxiv.org/abs/2504.11366)
Keywords: security, extraction, transformer, segmentation
Abstract: Wheat accounts for approximately 20% of the world's caloric intake, making it a vital component of global food security. Given this importance, mapping wheat fields plays a crucial role in enabling various stakeholders, including policy makers, researchers, and agricultural organizations, to make informed decisions regarding food security, supply chain management, and resource allocation. In this paper, we tackle the problem of accurately mapping wheat fields out of satellite images by introducing an improved pipeline for winter wheat segmentation, as well as presenting a case study on a decade-long analysis of wheat mapping in Lebanon. We integrate a Temporal Spatial Vision Transformer (TSViT) with Parameter-Efficient Fine Tuning (PEFT) and a novel post-processing pipeline based on the Fields of The World (FTW) framework. Our proposed pipeline addresses key challenges encountered in existing approaches, such as the clustering of small agricultural parcels in a single large field. By merging wheat segmentation with precise field boundary extraction, our method produces geometrically coherent and semantically rich maps that enable us to perform in-depth analysis such as tracking crop rotation pattern over years. Extensive evaluations demonstrate improved boundary delineation and field-level precision, establishing the potential of the proposed framework in operational agricultural monitoring and historical trend analysis. By allowing for accurate mapping of wheat fields, this work lays the foundation for a range of critical studies and future advances, including crop monitoring and yield estimation.

Title: From Gaze to Insight: Bridging Human Visual Attention and Vision Language Model Explanation for Weakly-Supervised Medical Image Segmentation

Authors: Jingkun Chen, Haoran Duan, Xiao Zhang, Boyan Gao, Tao Tan, Vicente Grau, Jungong Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11368
Pdf URL: https://arxiv.org/pdf/2504.11368
Copy Paste: [[2504.11368]] From Gaze to Insight: Bridging Human Visual Attention and Vision Language Model Explanation for Weakly-Supervised Medical Image Segmentation(https://arxiv.org/abs/2504.11368)
Keywords: interpretability, segmentation
Abstract: Medical image segmentation remains challenging due to the high cost of pixel-level annotations for training. In the context of weak supervision, clinician gaze data captures regions of diagnostic interest; however, its sparsity limits its use for segmentation. In contrast, vision-language models (VLMs) provide semantic context through textual descriptions but lack the explanation precision required. Recognizing that neither source alone suffices, we propose a teacher-student framework that integrates both gaze and language supervision, leveraging their complementary strengths. Our key insight is that gaze data indicates where clinicians focus during diagnosis, while VLMs explain why those regions are significant. To implement this, the teacher model first learns from gaze points enhanced by VLM-generated descriptions of lesion morphology, establishing a foundation for guiding the student model. The teacher then directs the student through three strategies: (1) Multi-scale feature alignment to fuse visual cues with textual semantics; (2) Confidence-weighted consistency constraints to focus on reliable predictions; (3) Adaptive masking to limit error propagation in uncertain areas. Experiments on the Kvasir-SEG, NCI-ISBI, and ISIC datasets show that our method achieves Dice scores of 80.78%, 80.53%, and 84.22%, respectively-improving 3-5% over gaze baselines without increasing the annotation burden. By preserving correlations among predictions, gaze data, and lesion descriptions, our framework also maintains clinical interpretability. This work illustrates how integrating human visual attention with AI-generated semantic context can effectively overcome the limitations of individual weak supervision signals, thereby advancing the development of deployable, annotation-efficient medical AI systems. Code is available at: this https URL.

Title: OpenTuringBench: An Open-Model-based Benchmark and Framework for Machine-Generated Text Detection and Attribution

Authors: Lucio La Cava, Andrea Tagarelli
Subjects: cs.CL, cs.AI, cs.CY, cs.HC, physics.soc-ph
Abstract URL: https://arxiv.org/abs/2504.11369
Pdf URL: https://arxiv.org/pdf/2504.11369
Copy Paste: [[2504.11369]] OpenTuringBench: An Open-Model-based Benchmark and Framework for Machine-Generated Text Detection and Attribution(https://arxiv.org/abs/2504.11369)
Keywords: generative, large language model
Abstract: Open Large Language Models (OLLMs) are increasingly leveraged in generative AI applications, posing new challenges for detecting their outputs. We propose OpenTuringBench, a new benchmark based on OLLMs, designed to train and evaluate machine-generated text detectors on the Turing Test and Authorship Attribution problems. OpenTuringBench focuses on a representative set of OLLMs, and features a number of challenging evaluation tasks, including human/machine-manipulated texts, out-of-domain texts, and texts from previously unseen models. We also provide OTBDetector, a contrastive learning framework to detect and attribute OLLM-based machine-generated texts. Results highlight the relevance and varying degrees of difficulty of the OpenTuringBench tasks, with our detector achieving remarkable capabilities across the various tasks and outperforming most existing detectors. Resources are available on the OpenTuringBench Hugging Face repository at this https URL

Title: Cancer-Myth: Evaluating AI Chatbot on Patient Questions with False Presuppositions

Authors: Wang Bill Zhu, Tianqi Chen, Ching Ying Lin, Jade Law, Mazen Jizzini, Jorge J. Nieva, Ruishan Liu, Robin Jia
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2504.11373
Pdf URL: https://arxiv.org/pdf/2504.11373
Copy Paste: [[2504.11373]] Cancer-Myth: Evaluating AI Chatbot on Patient Questions with False Presuppositions(https://arxiv.org/abs/2504.11373)
Keywords: robust, large language model
Abstract: Cancer patients are increasingly turning to large language models (LLMs) as a new form of internet search for medical information, making it critical to assess how well these models handle complex, personalized questions. However, current medical benchmarks focus on medical exams or consumer-searched questions and do not evaluate LLMs on real patient questions with detailed clinical contexts. In this paper, we first evaluate LLMs on cancer-related questions drawn from real patients, reviewed by three hematology oncology physicians. While responses are generally accurate, with GPT-4-Turbo scoring 4.13 out of 5, the models frequently fail to recognize or address false presuppositions in the questions-posing risks to safe medical decision-making. To study this limitation systematically, we introduce Cancer-Myth, an expert-verified adversarial dataset of 585 cancer-related questions with false presuppositions. On this benchmark, no frontier LLM -- including GPT-4o, this http URL, and Claude-3.5-Sonnet -- corrects these false presuppositions more than 30% of the time. Even advanced medical agentic methods do not prevent LLMs from ignoring false presuppositions. These findings expose a critical gap in the clinical reliability of LLMs and underscore the need for more robust safeguards in medical AI systems.

Title: RankAlign: A Ranking View of the Generator-Validator Gap in Large Language Models

Authors: Juan Diego Rodriguez, Wenxuan Ding, Katrin Erk, Greg Durrett
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.11381
Pdf URL: https://arxiv.org/pdf/2504.11381
Copy Paste: [[2504.11381]] RankAlign: A Ranking View of the Generator-Validator Gap in Large Language Models(https://arxiv.org/abs/2504.11381)
Keywords: large language model
Abstract: Although large language models (LLMs) have become generally more capable and accurate across many tasks, some fundamental sources of unreliability remain in their behavior. One key limitation is their inconsistency at reporting the the same information when prompts are changed. In this paper, we consider the discrepancy between a model's generated answer and their own verification of that answer, the generator-validator gap. We define this gap in a more stringent way than prior work: we expect correlation of scores from a generator and a validator over the entire set of candidate answers. We show that according to this measure, a large gap exists in various settings, including question answering, lexical semantics tasks, and next-word prediction. We then propose RankAlign, a ranking-based training method, and show that it significantly closes the gap by 31.8% on average, surpassing all baseline methods. Moreover, this approach generalizes well to out-of-domain tasks and lexical items.

Title: DataDecide: How to Predict Best Pretraining Data with Small Experiments

Authors: Ian Magnusson, Nguyen Tai, Ben Bogin, David Heineman, Jena D. Hwang, Luca Soldaini, Akshita Bhagia, Jiacheng Liu, Dirk Groeneveld, Oyvind Tafjord, Noah A. Smith, Pang Wei Koh, Jesse Dodge
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2504.11393
Pdf URL: https://arxiv.org/pdf/2504.11393
Copy Paste: [[2504.11393]] DataDecide: How to Predict Best Pretraining Data with Small Experiments(https://arxiv.org/abs/2504.11393)
Keywords: large language model
Abstract: Because large language models are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs. Which benchmarks and methods of making decisions from observed performance at small scale most accurately predict the datasets that yield the best large models? To empower open exploration of this question, we release models, data, and evaluations in DataDecide -- the most extensive open suite of models over differences in data and scale. We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds. We find that the ranking of models at a single, small size (e.g., 150M parameters) is a strong baseline for predicting best models at our larger target scale (1B) (~80% of com parisons correct). No scaling law methods among 8 baselines exceed the compute-decision frontier of single-scale predictions, but DataDecide can measure improvement in future scaling laws. We also identify that using continuous likelihood metrics as proxies in small experiments makes benchmarks including MMLU, ARC, HellaSwag, MBPP, and HumanEval >80% predictable at the target 1B scale with just 0.01% of the compute.

Title: Robustness and sex differences in skin cancer detection: logistic regression vs CNNs

Authors: Nikolette Pedersen, Regitze Sydendal, Andreas Wulff, Ralf Raumanns, Eike Petersen, Veronika Cheplygina
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.11415
Pdf URL: https://arxiv.org/pdf/2504.11415
Copy Paste: [[2504.11415]] Robustness and sex differences in skin cancer detection: logistic regression vs CNNs(https://arxiv.org/abs/2504.11415)
Keywords: robust
Abstract: Deep learning has been reported to achieve high performances in the detection of skin cancer, yet many challenges regarding the reproducibility of results and biases remain. This study is a replication (different data, same analysis) of a study on Alzheimer's disease [28] which studied robustness of logistic regression (LR) and convolutional neural networks (CNN) across patient sexes. We explore sex bias in skin cancer detection, using the PAD-UFES-20 dataset with LR trained on handcrafted features reflecting dermatological guidelines (ABCDE and the 7-point checklist), and a pre-trained ResNet-50 model. We evaluate these models in alignment with [28]: across multiple training datasets with varied sex composition to determine their robustness. Our results show that both the LR and the CNN were robust to the sex distributions, but the results also revealed that the CNN had a significantly higher accuracy (ACC) and area under the receiver operating characteristics (AUROC) for male patients than for female patients. We hope these findings to contribute to the growing field of investigating potential bias in popular medical machine learning methods. The data and relevant scripts to reproduce our results can be found in our Github.

Title: Deep Learning-based Bathymetry Retrieval without In-situ Depths using Remote Sensing Imagery and SfM-MVS DSMs with Data Gaps

Authors: Panagiotis Agrafiotis, Begüm Demir
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2504.11416
Pdf URL: https://arxiv.org/pdf/2504.11416
Copy Paste: [[2504.11416]] Deep Learning-based Bathymetry Retrieval without In-situ Depths using Remote Sensing Imagery and SfM-MVS DSMs with Data Gaps(https://arxiv.org/abs/2504.11416)
Keywords: transformer
Abstract: Accurate, detailed, and high-frequent bathymetry is crucial for shallow seabed areas facing intense climatological and anthropogenic pressures. Current methods utilizing airborne or satellite optical imagery to derive bathymetry primarily rely on either SfM-MVS with refraction correction or Spectrally Derived Bathymetry (SDB). However, SDB methods often require extensive manual fieldwork or costly reference data, while SfM-MVS approaches face challenges even after refraction correction. These include depth data gaps and noise in environments with homogeneous visual textures, which hinder the creation of accurate and complete Digital Surface Models (DSMs) of the seabed. To address these challenges, this work introduces a methodology that combines the high-fidelity 3D reconstruction capabilities of the SfM-MVS methods with state-of-the-art refraction correction techniques, along with the spectral analysis capabilities of a new deep learning-based method for bathymetry prediction. This integration enables a synergistic approach where SfM-MVS derived DSMs with data gaps are used as training data to generate complete bathymetric maps. In this context, we propose Swin-BathyUNet that combines U-Net with Swin Transformer self-attention layers and a cross-attention mechanism, specifically tailored for SDB. Swin-BathyUNet is designed to improve bathymetric accuracy by capturing long-range spatial relationships and can also function as a standalone solution for standard SDB with various training depth data, independent of the SfM-MVS output. Experimental results in two completely different test sites in the Mediterranean and Baltic Seas demonstrate the effectiveness of the proposed approach through extensive experiments that demonstrate improvements in bathymetric accuracy, detail, coverage, and noise reduction in the predicted DSM. The code is available at this https URL.

Title: Leveraging Point Transformers for Detecting Anatomical Landmarks in Digital Dentistry

Authors: Tibor Kubík, Oldřich Kodym, Petr Šilling, Kateřina Trávníčková, Tomáš Mojžiš, Jan Matula
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11418
Pdf URL: https://arxiv.org/pdf/2504.11418
Copy Paste: [[2504.11418]] Leveraging Point Transformers for Detecting Anatomical Landmarks in Digital Dentistry(https://arxiv.org/abs/2504.11418)
Keywords: interpretability, transformer
Abstract: The increasing availability of intraoral scanning devices has heightened their importance in modern clinical orthodontics. Clinicians utilize advanced Computer-Aided Design techniques to create patient-specific treatment plans that include laboriously identifying crucial landmarks such as cusps, mesial-distal locations, facial axis points, and tooth-gingiva boundaries. Detecting such landmarks automatically presents challenges, including limited dataset sizes, significant anatomical variability among subjects, and the geometric nature of the data. We present our experiments from the 3DTeethLand Grand Challenge at MICCAI 2024. Our method leverages recent advancements in point cloud learning through transformer architectures. We designed a Point Transformer v3 inspired module to capture meaningful geometric and anatomical features, which are processed by a lightweight decoder to predict per-point distances, further processed by graph-based non-minima suppression. We report promising results and discuss insights on learned feature interpretability.

Title: Reinforcing Compositional Retrieval: Retrieving Step-by-Step for Composing Informative Contexts

Authors: Quanyu Long, Jianda Chen, Zhengyuan Liu, Nancy F. Chen, Wenya Wang, Sinno Jialin Pan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.11420
Pdf URL: https://arxiv.org/pdf/2504.11420
Copy Paste: [[2504.11420]] Reinforcing Compositional Retrieval: Retrieving Step-by-Step for Composing Informative Contexts(https://arxiv.org/abs/2504.11420)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous tasks, yet they often rely on external context to handle complex tasks. While retrieval-augmented frameworks traditionally focus on selecting top-ranked documents in a single pass, many real-world scenarios demand compositional retrieval, where multiple sources must be combined in a coordinated manner. In this work, we propose a tri-encoder sequential retriever that models this process as a Markov Decision Process (MDP), decomposing the probability of retrieving a set of elements into a sequence of conditional probabilities and allowing each retrieval step to be conditioned on previously selected examples. We train the retriever in two stages: first, we efficiently construct supervised sequential data for initial policy training; we then refine the policy to align with the LLM's preferences using a reward grounded in the structural correspondence of generated programs. Experimental results show that our method consistently and significantly outperforms baselines, underscoring the importance of explicitly modeling inter-example dependencies. These findings highlight the potential of compositional retrieval for tasks requiring multiple pieces of evidence or examples.

Title: ADT: Tuning Diffusion Models with Adversarial Supervision

Authors: Dazhong Shen, Guanglu Song, Yi Zhang, Bingqi Ma, Lujundong Li, Dongzhi Jiang, Zhuofan Zong, Yu Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11423
Pdf URL: https://arxiv.org/pdf/2504.11423
Copy Paste: [[2504.11423]] ADT: Tuning Diffusion Models with Adversarial Supervision(https://arxiv.org/abs/2504.11423)
Keywords: robust, diffusion
Abstract: Diffusion models have achieved outstanding image generation by reversing a forward noising process to approximate true data distributions. During training, these models predict diffusion scores from noised versions of true samples in a single forward pass, while inference requires iterative denoising starting from white noise. This training-inference divergences hinder the alignment between inference and training data distributions, due to potential prediction biases and cumulative error accumulation. To address this problem, we propose an intuitive but effective fine-tuning framework, called Adversarial Diffusion Tuning (ADT), by stimulating the inference process during optimization and aligning the final outputs with training data by adversarial supervision. Specifically, to achieve robust adversarial training, ADT features a siamese-network discriminator with a fixed pre-trained backbone and lightweight trainable parameters, incorporates an image-to-image sampling strategy to smooth discriminative difficulties, and preserves the original diffusion loss to prevent discriminator hacking. In addition, we carefully constrain the backward-flowing path for back-propagating gradients along the inference path without incurring memory overload or gradient explosion. Finally, extensive experiments on Stable Diffusion models (v1.5, XL, and v3), demonstrate that ADT significantly improves both distribution alignment and image quality.

Title: A Dual-Space Framework for General Knowledge Distillation of Large Language Models

Authors: Xue Zhang, Songming Zhang, Yunlong Liang, Fandong Meng, Yufeng Chen, Jinan Xu, Jie Zhou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.11426
Pdf URL: https://arxiv.org/pdf/2504.11426
Copy Paste: [[2504.11426]] A Dual-Space Framework for General Knowledge Distillation of Large Language Models(https://arxiv.org/abs/2504.11426)
Keywords: large language model
Abstract: Knowledge distillation (KD) is a promising solution to compress large language models (LLMs) by transferring their knowledge to smaller models. During this process, white-box KD methods usually minimize the distance between the output distributions of the teacher model and the student model to transfer more information. However, we reveal that the current white-box KD framework exhibits two limitations: a) bridging probability distributions from different output spaces will limit the similarity between the teacher model and the student model; b) this framework cannot be applied to LLMs with different vocabularies. One of the root causes for these limitations is that the distributions from the teacher and the student for KD are output by different prediction heads, which yield distributions in different output spaces and dimensions. Therefore, in this paper, we propose a dual-space knowledge distillation (DSKD) framework that unifies the prediction heads of the teacher and the student models for KD. Specifically, we first introduce two projectors with ideal initialization to project the teacher/student hidden states into the student/teacher representation spaces. After this, the hidden states from different models can share the same head and unify the output spaces of the distributions. Furthermore, we develop an exact token alignment (ETA) algorithm to align the same tokens in two differently-tokenized sequences. Based on the above, our DSKD framework is a general KD framework that supports both off-policy and on-policy KD, and KD between any two LLMs regardless of their vocabularies. Extensive experiments on instruction-following, mathematical reasoning, and code generation benchmarks show that DSKD significantly outperforms existing methods based on the current white-box KD framework and surpasses other cross-tokenizer KD methods for LLMs with different vocabularies.

Title: NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors

Authors: Yanrui Bin, Wenbo Hu, Haoyuan Wang, Xinya Chen, Bing Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11427
Pdf URL: https://arxiv.org/pdf/2504.11427
Copy Paste: [[2504.11427]] NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors(https://arxiv.org/abs/2504.11427)
Keywords: secure, diffusion
Abstract: Surface normal estimation serves as a cornerstone for a spectrum of computer vision applications. While numerous efforts have been devoted to static image scenarios, ensuring temporal coherence in video-based normal estimation remains a formidable challenge. Instead of merely augmenting existing methods with temporal components, we present NormalCrafter to leverage the inherent temporal priors of video diffusion models. To secure high-fidelity normal estimation across sequences, we propose Semantic Feature Regularization (SFR), which aligns diffusion features with semantic cues, encouraging the model to concentrate on the intrinsic semantics of the scene. Moreover, we introduce a two-stage training protocol that leverages both latent and pixel space learning to preserve spatial accuracy while maintaining long temporal context. Extensive evaluations demonstrate the efficacy of our method, showcasing a superior performance in generating temporally consistent normal sequences with intricate details from diverse videos.

Title: Improving Statistical Privacy by Subsampling

Authors: Dennis Breutigam, Rüdiger Reischuk
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.11429
Pdf URL: https://arxiv.org/pdf/2504.11429
Copy Paste: [[2504.11429]] Improving Statistical Privacy by Subsampling(https://arxiv.org/abs/2504.11429)
Keywords: privacy
Abstract: Differential privacy (DP) considers a scenario, where an adversary has almost complete information about the entries of a database This worst-case assumption is likely to overestimate the privacy thread for an individual in real life. Statistical privacy (SP) denotes a setting where only the distribution of the database entries is known to an adversary, but not their exact values. In this case one has to analyze the interaction between noiseless privacy based on the entropy of distributions and privacy mechanisms that distort the answers of queries, which can be quite complex. A privacy mechanism often used is to take samples of the data for answering a query. This paper proves precise bounds how much different methods of sampling increase privacy in the statistical setting with respect to database size and sampling rate. They allow us to deduce when and how much sampling provides an improvement and how far this depends on the privacy parameter {\epsilon}. To perform these investigations we develop a framework to model sampling techniques. For the DP setting tradeoff functions have been proposed as a finer measure for privacy compared to ({\epsilon},{\delta})-pairs. We apply these tools to statistical privacy with subsampling to get a comparable characterization

Title: Masculine Defaults via Gendered Discourse in Podcasts and Large Language Models

Authors: Maria Teleki, Xiangjue Dong, Haoran Liu, James Caverlee
Subjects: cs.CL, cs.AI, cs.CY, cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2504.11431
Pdf URL: https://arxiv.org/pdf/2504.11431
Copy Paste: [[2504.11431]] Masculine Defaults via Gendered Discourse in Podcasts and Large Language Models(https://arxiv.org/abs/2504.11431)
Keywords: robust, large language model
Abstract: Masculine defaults are widely recognized as a significant type of gender bias, but they are often unseen as they are under-researched. Masculine defaults involve three key parts: (i) the cultural context, (ii) the masculine characteristics or behaviors, and (iii) the reward for, or simply acceptance of, those masculine characteristics or behaviors. In this work, we study discourse-based masculine defaults, and propose a twofold framework for (i) the large-scale discovery and analysis of gendered discourse words in spoken content via our Gendered Discourse Correlation Framework (GDCF); and (ii) the measurement of the gender bias associated with these gendered discourse words in LLMs via our Discourse Word-Embedding Association Test (D-WEAT). We focus our study on podcasts, a popular and growing form of social media, analyzing 15,117 podcast episodes. We analyze correlations between gender and discourse words -- discovered via LDA and BERTopic -- to automatically form gendered discourse word lists. We then study the prevalence of these gendered discourse words in domain-specific contexts, and find that gendered discourse-based masculine defaults exist in the domains of business, technology/politics, and video games. Next, we study the representation of these gendered discourse words from a state-of-the-art LLM embedding model from OpenAI, and find that the masculine discourse words have a more stable and robust representation than the feminine discourse words, which may result in better system performance on downstream tasks for men. Hence, men are rewarded for their discourse patterns with better system performance by one of the state-of-the-art language models -- and this embedding disparity is a representational harm and a masculine default.

Title: Enhancing Out-of-Distribution Detection with Extended Logit Normalization

Authors: Yifan Ding, Xixi Liu, Jonas Unger, Gabriel Eilertsen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11434
Pdf URL: https://arxiv.org/pdf/2504.11434
Copy Paste: [[2504.11434]] Enhancing Out-of-Distribution Detection with Extended Logit Normalization(https://arxiv.org/abs/2504.11434)
Keywords: robust
Abstract: Out-of-distribution (OOD) detection is essential for the safe deployment of machine learning models. Recent advances have explored improved classification losses and representation learning strategies to enhance OOD detection. However, these methods are often tailored to specific post-hoc detection techniques, limiting their generalizability. In this work, we identify a critical issue in Logit Normalization (LogitNorm), which inhibits its effectiveness in improving certain post-hoc OOD detection methods. To address this, we propose Extended Logit Normalization ($\textbf{ELogitNorm}$), a novel hyperparameter-free formulation that significantly benefits a wide range of post-hoc detection methods. By incorporating feature distance-awareness to LogitNorm, $\textbf{ELogitNorm}$ shows more robust OOD separability and in-distribution (ID) confidence calibration than its predecessor. Extensive experiments across standard benchmarks demonstrate that our approach outperforms state-of-the-art training-time methods in OOD detection while maintaining strong ID classification accuracy.

Title: Mamba-Based Ensemble learning for White Blood Cell Classification

Authors: Lewis Clifton, Xin Tian, Duangdao Palasuwan, Phandee Watanaboonyongcharoen, Ponlapat Rojnuckarin, Nantheera Anantrasirichai
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2504.11438
Pdf URL: https://arxiv.org/pdf/2504.11438
Copy Paste: [[2504.11438]] Mamba-Based Ensemble learning for White Blood Cell Classification(https://arxiv.org/abs/2504.11438)
Keywords: transformer
Abstract: White blood cell (WBC) classification assists in assessing immune health and diagnosing various diseases, yet manual classification is labor-intensive and prone to inconsistencies. Recent advancements in deep learning have shown promise over traditional methods; however, challenges such as data imbalance and the computational demands of modern technologies, such as Transformer-based models which do not scale well with input size, limit their practical application. This paper introduces a novel framework that leverages Mamba models integrated with ensemble learning to improve WBC classification. Mamba models, known for their linear complexity, provide a scalable alternative to Transformer-based approaches, making them suitable for deployment in resource-constrained environments. Additionally, we introduce a new WBC dataset, Chula-WBC-8, for benchmarking. Our approach not only validates the effectiveness of Mamba models in this domain but also demonstrates their potential to significantly enhance classification efficiency without compromising accuracy. The source code can be found at this https URL.

Title: TextArena

Authors: Leon Guertler, Bobby Cheng, Simon Yu, Bo Liu, Leshem Choshen, Cheston Tan
Subjects: cs.CL, cs.AI, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2504.11442
Pdf URL: https://arxiv.org/pdf/2504.11442
Copy Paste: [[2504.11442]] TextArena(https://arxiv.org/abs/2504.11442)
Keywords: large language model
Abstract: TextArena is an open-source collection of competitive text-based games for training and evaluation of agentic behavior in Large Language Models (LLMs). It spans 57+ unique environments (including single-player, two-player, and multi-player setups) and allows for easy evaluation of model capabilities via an online-play system (against humans and other submitted models) with real-time TrueSkill scores. Traditional benchmarks rarely assess dynamic social skills such as negotiation, theory of mind, and deception, creating a gap that TextArena addresses. Designed with research, community and extensibility in mind, TextArena emphasizes ease of adding new games, adapting the framework, testing models, playing against the models, and training models. Detailed documentation of environments, games, leaderboard, and examples are available on this https URL and this https URL.

Title: Diffusion Distillation With Direct Preference Optimization For Efficient 3D LiDAR Scene Completion

Authors: An Zhaol, Shengyuan Zhang, Ling Yang, Zejian Li, Jiale Wu, Haoran Xu, AnYang Wei, Perry Pengyun GU Lingyun Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11447
Pdf URL: https://arxiv.org/pdf/2504.11447
Copy Paste: [[2504.11447]] Diffusion Distillation With Direct Preference Optimization For Efficient 3D LiDAR Scene Completion(https://arxiv.org/abs/2504.11447)
Keywords: diffusion
Abstract: The application of diffusion models in 3D LiDAR scene completion is limited due to diffusion's slow sampling speed. Score distillation accelerates diffusion sampling but with performance degradation, while post-training with direct policy optimization (DPO) boosts performance using preference data. This paper proposes Distillation-DPO, a novel diffusion distillation framework for LiDAR scene completion with preference aligment. First, the student model generates paired completion scenes with different initial noises. Second, using LiDAR scene evaluation metrics as preference, we construct winning and losing sample pairs. Such construction is reasonable, since most LiDAR scene metrics are informative but non-differentiable to be optimized directly. Third, Distillation-DPO optimizes the student model by exploiting the difference in score functions between the teacher and student models on the paired completion scenes. Such procedure is repeated until convergence. Extensive experiments demonstrate that, compared to state-of-the-art LiDAR scene completion diffusion models, Distillation-DPO achieves higher-quality scene completion while accelerating the completion speed by more than 5-fold. Our method is the first to explore adopting preference learning in distillation to the best of our knowledge and provide insights into preference-aligned distillation. Our code is public available on this https URL.

Title: PARTFIELD: Learning 3D Feature Fields for Part Segmentation and Beyond

Authors: Minghua Liu, Mikaela Angelina Uy, Donglai Xiang, Hao Su, Sanja Fidler, Nicholas Sharp, Jun Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11451
Pdf URL: https://arxiv.org/pdf/2504.11451
Copy Paste: [[2504.11451]] PARTFIELD: Learning 3D Feature Fields for Part Segmentation and Beyond(https://arxiv.org/abs/2504.11451)
Keywords: robust, segmentation
Abstract: We propose PartField, a feedforward approach for learning part-based 3D features, which captures the general concept of parts and their hierarchy without relying on predefined templates or text-based names, and can be applied to open-world 3D shapes across various modalities. PartField requires only a 3D feedforward pass at inference time, significantly improving runtime and robustness compared to prior approaches. Our model is trained by distilling 2D and 3D part proposals from a mix of labeled datasets and image segmentations on large unsupervised datasets, via a contrastive learning formulation. It produces a continuous feature field which can be clustered to yield a hierarchical part decomposition. Comparisons show that PartField is up to 20% more accurate and often orders of magnitude faster than other recent class-agnostic part-segmentation methods. Beyond single-shape part decomposition, consistency in the learned field emerges across shapes, enabling tasks such as co-segmentation and correspondence, which we demonstrate in several applications of these general-purpose, hierarchical, and consistent 3D feature fields. Check our Webpage! this https URL

Title: A Clean Slate for Offline Reinforcement Learning

Authors: Matthew Thomas Jackson, Uljad Berdica, Jarek Liesen, Shimon Whiteson, Jakob Nicolaus Foerster
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2504.11453
Pdf URL: https://arxiv.org/pdf/2504.11453
Copy Paste: [[2504.11453]] A Clean Slate for Offline Reinforcement Learning(https://arxiv.org/abs/2504.11453)
Keywords: fair
Abstract: Progress in offline reinforcement learning (RL) has been impeded by ambiguous problem definitions and entangled algorithmic designs, resulting in inconsistent implementations, insufficient ablations, and unfair evaluations. Although offline RL explicitly avoids environment interaction, prior methods frequently employ extensive, undocumented online evaluation for hyperparameter tuning, complicating method comparisons. Moreover, existing reference implementations differ significantly in boilerplate code, obscuring their core algorithmic contributions. We address these challenges by first introducing a rigorous taxonomy and a transparent evaluation protocol that explicitly quantifies online tuning budgets. To resolve opaque algorithmic design, we provide clean, minimalistic, single-file implementations of various model-free and model-based offline RL methods, significantly enhancing clarity and achieving substantial speed-ups. Leveraging these streamlined implementations, we propose Unifloral, a unified algorithm that encapsulates diverse prior approaches within a single, comprehensive hyperparameter space, enabling algorithm development in a shared hyperparameter space. Using Unifloral with our rigorous evaluation protocol, we develop two novel algorithms - TD3-AWR (model-free) and MoBRAC (model-based) - which substantially outperform established baselines. Our implementation is publicly available at this https URL.

Title: Elucidating the Design Space of Multimodal Protein Language Models

Authors: Cheng-Yen (Wesley)Hsieh, Xinyou Wang, Daiheng Zhang, Dongyu Xue, Fei Ye, Shujian Huang, Zaixiang Zheng, Quanquan Gu
Subjects: cs.LG, cs.AI, q-bio.QM
Abstract URL: https://arxiv.org/abs/2504.11454
Pdf URL: https://arxiv.org/pdf/2504.11454
Copy Paste: [[2504.11454]] Elucidating the Design Space of Multimodal Protein Language Models(https://arxiv.org/abs/2504.11454)
Keywords: robust, generative
Abstract: Multimodal protein language models (PLMs) integrate sequence and token-based structural information, serving as a powerful foundation for protein modeling, generation, and design. However, the reliance on tokenizing 3D structures into discrete tokens causes substantial loss of fidelity about fine-grained structural details and correlations. In this paper, we systematically elucidate the design space of multimodal PLMs to overcome their limitations. We identify tokenization loss and inaccurate structure token predictions by the PLMs as major bottlenecks. To address these, our proposed design space covers improved generative modeling, structure-aware architectures and representation learning, and data exploration. Our advancements approach finer-grained supervision, demonstrating that token-based multimodal PLMs can achieve robust structural modeling. The effective design methods dramatically improve the structure generation diversity, and notably, folding abilities of our 650M model by reducing the RMSD from 5.52 to 2.36 on PDB testset, even outperforming 3B baselines and on par with the specialized folding models.

Title: Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual Perception

Authors: Ziqi Pang, Xin Xu, Yu-Xiong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11457
Pdf URL: https://arxiv.org/pdf/2504.11457
Copy Paste: [[2504.11457]] Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual Perception(https://arxiv.org/abs/2504.11457)
Keywords: diffusion, generative, segmentation
Abstract: With the success of image generation, generative diffusion models are increasingly adopted for discriminative tasks, as pixel generation provides a unified perception interface. However, directly repurposing the generative denoising process for discriminative objectives reveals critical gaps rarely addressed previously. Generative models tolerate intermediate sampling errors if the final distribution remains plausible, but discriminative tasks require rigorous accuracy throughout, as evidenced in challenging multi-modal tasks like referring image segmentation. Motivated by this gap, we analyze and enhance alignment between generative diffusion processes and perception tasks, focusing on how perception quality evolves during denoising. We find: (1) earlier denoising steps contribute disproportionately to perception quality, prompting us to propose tailored learning objectives reflecting varying timestep contributions; (2) later denoising steps show unexpected perception degradation, highlighting sensitivity to training-denoising distribution shifts, addressed by our diffusion-tailored data augmentation; and (3) generative processes uniquely enable interactivity, serving as controllable user interfaces adaptable to correctional prompts in multi-round interactions. Our insights significantly improve diffusion-based perception models without architectural changes, achieving state-of-the-art performance on depth estimation, referring image segmentation, and generalist perception tasks. Code available at this https URL.