2025-04-22

Title: Enhancing Ultra-Low-Bit Quantization of Large Language Models Through Saliency-Aware Partial Retraining

Authors: Deyu Cao, Samin Aref
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2504.13932
Pdf URL: https://arxiv.org/pdf/2504.13932
Copy Paste: [[2504.13932]] Enhancing Ultra-Low-Bit Quantization of Large Language Models Through Saliency-Aware Partial Retraining(https://arxiv.org/abs/2504.13932)
Keywords: large language model
Abstract: Large language models offer remarkable capabilities, but their size and computational demands pose practical challenges. Quantization methods compress their size through replacing their high-precision parameters by quantized values of lower precision. Post-training quantization reduces model size efficiently at the cost of decreased accuracy, while quantization-aware training better preserves accuracy but is resource-intensive. Among existing post-training quantization algorithms, the ApiQ method achieves superior accuracy preservation at minimal memory and time overhead. We investigate two ideas to extend performance in ultra-low-bit quantization beyond ApiQ's level. First, we look into combining existing quantization-aware training techniques with ApiQ's partial training. We show that this does not outperform the baseline ApiQ method with limited training data and frozen weights. This leads to two key insights: (1) The substantial representational capacity that is gained through full retraining may not be feasible through partial training. (2) This gain seems to depend on using a large and diverse dataset in quantization-aware training. Second, through a novel approach informed by the two insights, we propose an ultra-low-bit quantization method that builds upon ApiQ and extends its performance without the need for full retraining. It relies on a saliency-aware regularization term that prioritizes preserving the most impactful parameters during quantization. Our experiments on benchmark language models from the LLaMA family show that our proposed approach boosts accuracy and tightens the gap between the quantized model and the full-precision model, with minimal overhead. Our method will be made publicly available to facilitate future developments in ultra-low-bit quantization of large language models.

Title: NEMOTRON-CROSSTHINK: Scaling Self-Learning beyond Math Reasoning

Authors: Syeda Nahida Akter, Shrimai Prabhumoye, Matvei Novikov, Seungju Han, Ying Lin, Evelina Bakhturi, Eric Nyberg, Yejin Choi, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.13941
Pdf URL: https://arxiv.org/pdf/2504.13941
Copy Paste: [[2504.13941]] NEMOTRON-CROSSTHINK: Scaling Self-Learning beyond Math Reasoning(https://arxiv.org/abs/2504.13941)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown strong reasoning capabilities, particularly when enhanced through Reinforcement Learning (RL). While prior work has successfully applied RL to mathematical reasoning -- where rules and correctness are well-defined -- generalizing these methods to broader reasoning domains remains challenging due to limited data, the lack of verifiable reward structures, and diverse task requirements. In this work, we propose NEMOTRON-CROSSTHINK, a framework that systematically incorporates multi-domain corpora, including both synthetic and real-world question-answer pairs, into RL training to improve generalization across diverse reasoning tasks. NEMOTRON-CROSSTHINK addresses key challenges by (1) incorporating data from varied sources spanning STEM, humanities, social sciences, etc.; (2) applying structured templates (e.g., multiple-choice and open-ended) to control answer-space complexity; (3) filtering for verifiable answers; and (4) optimizing data blending strategies that utilizes data from multiple sources effectively. Our approach enables scalable and verifiable reward modeling beyond mathematics and demonstrates improved accuracies on both math (MATH-500: +30.1%, AMC23:+27.5%) and non-math reasoning benchmarks (MMLU-PRO: +12.8%, GPQA-DIAMOND: +11.3%, AGIEVAL: +15.1%, SUPERGPQA: +3.8%). Moreover, NEMOTRON-CROSSTHINK exhibits significantly improved response efficiency -- using 28% fewer tokens for correct answers -- highlighting more focused and effective reasoning. Through NEMOTRON-CROSSTHINK, we demonstrate that integrating multi-domain, multi-format data in RL leads to more accurate, efficient, and generalizable LLMs.

Title: Open-Medical-R1: How to Choose Data for RLVR Training at Medicine Domain

Authors: Zhongxi Qiu, Zhang Zhang, Yan Hu, Heng Li, Jiang Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.13950
Pdf URL: https://arxiv.org/pdf/2504.13950
Copy Paste: [[2504.13950]] Open-Medical-R1: How to Choose Data for RLVR Training at Medicine Domain(https://arxiv.org/abs/2504.13950)
Keywords: robust, large language model
Abstract: This paper explores optimal data selection strategies for Reinforcement Learning with Verified Rewards (RLVR) training in the medical domain. While RLVR has shown exceptional potential for enhancing reasoning capabilities in large language models, most prior implementations have focused on mathematics and logical puzzles, with limited exploration of domain-specific applications like medicine. We investigate four distinct data sampling strategies from MedQA-USMLE: random sampling (baseline), and filtering using Phi-4, Gemma-3-27b-it, and Gemma-3-12b-it models. Using Gemma-3-12b-it as our base model and implementing Group Relative Policy Optimization (GRPO), we evaluate performance across multiple benchmarks including MMLU, GSM8K, MMLU-Pro, and CMMLU. Our findings demonstrate that models trained on filtered data generally outperform those trained on randomly selected samples. Notably, training on self-filtered samples (using Gemma-3-12b-it for filtering) achieved superior performance in medical domains but showed reduced robustness across different benchmarks, while filtering with larger models from the same series yielded better overall robustness. These results provide valuable insights into effective data organization strategies for RLVR in specialized domains and highlight the importance of thoughtful data selection in achieving optimal performance. You can access our repository (this https URL) to get the codes.

Title: Generative System Dynamics in Recurrent Neural Networks

Authors: Michele Casoni, Tommaso Guidi, Alessandro Betti, Stefano Melacci, Marco Gori
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.13951
Pdf URL: https://arxiv.org/pdf/2504.13951
Copy Paste: [[2504.13951]] Generative System Dynamics in Recurrent Neural Networks(https://arxiv.org/abs/2504.13951)
Keywords: generative
Abstract: In this study, we investigate the continuous time dynamics of Recurrent Neural Networks (RNNs), focusing on systems with nonlinear activation functions. The objective of this work is to identify conditions under which RNNs exhibit perpetual oscillatory behavior, without converging to static fixed points. We establish that skew-symmetric weight matrices are fundamental to enable stable limit cycles in both linear and nonlinear configurations. We further demonstrate that hyperbolic tangent-like activation functions (odd, bounded, and continuous) preserve these oscillatory dynamics by ensuring motion invariants in state space. Numerical simulations showcase how nonlinear activation functions not only maintain limit cycles, but also enhance the numerical stability of the system integration process, mitigating those instabilities that are commonly associated with the forward Euler method. The experimental results of this analysis highlight practical considerations for designing neural architectures capable of capturing complex temporal dependencies, i.e., strategies for enhancing memorization skills in recurrent models.

Title: Prognosis Of Lithium-Ion Battery Health with Hybrid EKF-CNN+LSTM Model Using Differential Capacity

Authors: Md Azizul Hoque, Babul Salam, Mohd Khair Hassan, Abdulkabir Aliyu, Abedalmuhdi Almomany, Muhammed Sutcu
Subjects: cs.LG, eess.SY, stat.AP
Abstract URL: https://arxiv.org/abs/2504.13956
Pdf URL: https://arxiv.org/pdf/2504.13956
Copy Paste: [[2504.13956]] Prognosis Of Lithium-Ion Battery Health with Hybrid EKF-CNN+LSTM Model Using Differential Capacity(https://arxiv.org/abs/2504.13956)
Keywords: robust
Abstract: Battery degradation is a major challenge in electric vehicles (EV) and energy storage systems (ESS). However, most degradation investigations focus mainly on estimating the state of charge (SOC), which fails to accurately interpret the cells' internal degradation mechanisms. Differential capacity analysis (DCA) focuses on the rate of change of cell voltage about the change in cell capacity, under various charge/discharge rates. This paper developed a battery cell degradation testing model that used two types of lithium-ions (Li-ion) battery cells, namely lithium nickel cobalt aluminium oxides (LiNiCoAlO2) and lithium iron phosphate (LiFePO4), to evaluate internal degradation during loading conditions. The proposed battery degradation model contains distinct charge rates (DCR) of 0.2C, 0.5C, 1C, and 1.5C, as well as discharge rates (DDR) of 0.5C, 0.9C, 1.3C, and 1.6C to analyze the internal health and performance of battery cells during slow, moderate, and fast loading conditions. Besides, this research proposed a model that incorporates the Extended Kalman Filter (EKF), Convolutional Neural Network (CNN), and Long Short-Term Memory (LSTM) networks to validate experimental data. The proposed model yields excellent modelling results based on mean squared error (MSE), and root mean squared error (RMSE), with errors of less than 0.001% at DCR and DDR. The peak identification technique (PIM) has been utilized to investigate battery health based on the number of peaks, peak position, peak height, peak area, and peak width. At last, the PIM method has discovered that the cell aged gradually under normal loading rates but deteriorated rapidly under fast loading conditions. Overall, LiFePO4 batteries perform more robustly and consistently than (LiNiCoAlO2) cells under varying loading conditions.

Title: ToolRL: Reward is All Tool Learning Needs

Authors: Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, Heng Ji
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2504.13958
Pdf URL: https://arxiv.org/pdf/2504.13958
Copy Paste: [[2504.13958]] ToolRL: Reward is All Tool Learning Needs(https://arxiv.org/abs/2504.13958)
Keywords: robust, large language model
Abstract: Current Large Language Models (LLMs) often undergo supervised fine-tuning (SFT) to acquire tool use capabilities. However, SFT struggles to generalize to unfamiliar or complex tool use scenarios. Recent advancements in reinforcement learning (RL), particularly with R1-like models, have demonstrated promising reasoning and generalization abilities. Yet, reward design for tool use presents unique challenges: multiple tools may be invoked with diverse parameters, and coarse-grained reward signals, such as answer matching, fail to offer the finegrained feedback required for effective learning. In this work, we present the first comprehensive study on reward design for tool selection and application tasks within the RL paradigm. We systematically explore a wide range of reward strategies, analyzing their types, scales, granularity, and temporal dynamics. Building on these insights, we propose a principled reward design tailored for tool use tasks and apply it to train LLMs using Group Relative Policy Optimization (GRPO). Empirical evaluations across diverse benchmarks demonstrate that our approach yields robust, scalable, and stable training, achieving a 17% improvement over base models and a 15% gain over SFT models. These results highlight the critical role of thoughtful reward design in enhancing the tool use capabilities and generalization performance of LLMs. All the codes are released to facilitate future research.

Title: CONTINA: Confidence Interval for Traffic Demand Prediction with Coverage Guarantee

Authors: Chao Yang, Xiannan Huang, Shuhan Qiu, Yan Cheng
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2504.13961
Pdf URL: https://arxiv.org/pdf/2504.13961
Copy Paste: [[2504.13961]] CONTINA: Confidence Interval for Traffic Demand Prediction with Coverage Guarantee(https://arxiv.org/abs/2504.13961)
Keywords: robust
Abstract: Accurate short-term traffic demand prediction is critical for the operation of traffic systems. Besides point estimation, the confidence interval of the prediction is also of great importance. Many models for traffic operations, such as shared bike rebalancing and taxi dispatching, take into account the uncertainty of future demand and require confidence intervals as the input. However, existing methods for confidence interval modeling rely on strict assumptions, such as unchanging traffic patterns and correct model specifications, to guarantee enough coverage. Therefore, the confidence intervals provided could be invalid, especially in a changing traffic environment. To fill this gap, we propose an efficient method, CONTINA (Conformal Traffic Intervals with Adaptation) to provide interval predictions that can adapt to external changes. By collecting the errors of interval during deployment, the method can adjust the interval in the next step by widening it if the errors are too large or shortening it otherwise. Furthermore, we theoretically prove that the coverage of the confidence intervals provided by our method converges to the target coverage level. Experiments across four real-world datasets and prediction models demonstrate that the proposed method can provide valid confidence intervals with shorter lengths. Our method can help traffic management personnel develop a more reasonable and robust operation plan in practice. And we release the code, model and dataset in \href{ this https URL}{ Github}.

Title: Adversarial Resilience against Clean-Label Attacks in Realizable and Noisy Settings

Authors: Carolin Heinzler
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2504.13966
Pdf URL: https://arxiv.org/pdf/2504.13966
Copy Paste: [[2504.13966]] Adversarial Resilience against Clean-Label Attacks in Realizable and Noisy Settings(https://arxiv.org/abs/2504.13966)
Keywords: attack
Abstract: We investigate the challenge of establishing stochastic-like guarantees when sequentially learning from a stream of i.i.d. data that includes an unknown quantity of clean-label adversarial samples. We permit the learner to abstain from making predictions when uncertain. The regret of the learner is measured in terms of misclassification and abstention error, where we allow the learner to abstain for free on adversarial injected samples. This approach is based on the work of Goel, Hanneke, Moran, and Shetty from arXiv:2306.13119. We explore the methods they present and manage to correct inaccuracies in their argumentation. However, this approach is limited to the realizable setting, where labels are assigned according to some function $f^*$ from the hypothesis space $\mathcal{F}$. Based on similar arguments, we explore methods to make adaptations for the agnostic setting where labels are random. Introducing the notion of a clean-label adversary in the agnostic context, we are the first to give a theoretical analysis of a disagreement-based learner for thresholds, subject to a clean-label adversary with noise.

Title: Multiscale Tensor Summation Factorization as a New Neural Network Layer (MTS Layer) for Multidimensional Data Processing

Authors: Mehmet Yamaç, Muhammad Numan Yousaf, Serkan Kiranyaz, Moncef Gabbouj
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2504.13975
Pdf URL: https://arxiv.org/pdf/2504.13975
Copy Paste: [[2504.13975]] Multiscale Tensor Summation Factorization as a New Neural Network Layer (MTS Layer) for Multidimensional Data Processing(https://arxiv.org/abs/2504.13975)
Keywords: transformer
Abstract: Multilayer perceptrons (MLP), or fully connected artificial neural networks, are known for performing vector-matrix multiplications using learnable weight matrices; however, their practical application in many machine learning tasks, especially in computer vision, can be limited due to the high dimensionality of input-output pairs at each layer. To improve efficiency, convolutional operators have been utilized to facilitate weight sharing and local connections, yet they are constrained by limited receptive fields. In this paper, we introduce Multiscale Tensor Summation (MTS) Factorization, a novel neural network operator that implements tensor summation at multiple scales, where each tensor to be summed is obtained through Tucker-decomposition-like mode products. Unlike other tensor decomposition methods in the literature, MTS is not introduced as a network compression tool; instead, as a new backbone neural layer. MTS not only reduces the number of parameters required while enhancing the efficiency of weight optimization compared to traditional dense layers (i.e., unfactorized weight matrices in MLP layers), but it also demonstrates clear advantages over convolutional layers. The proof-of-concept experimental comparison of the proposed MTS networks with MLPs and Convolutional Neural Networks (CNNs) demonstrates their effectiveness across various tasks, such as classification, compression, and signal restoration. Additionally, when integrated with modern non-linear units such as the multi-head gate (MHG), also introduced in this study, the corresponding neural network, MTSNet, demonstrates a more favorable complexity-performance tradeoff compared to state-of-the-art transformers in various computer vision applications. The software implementation of the MTS layer and the corresponding MTS-based networks, MTSNets, is shared at this https URL.

Title: CacheFormer: High Attention-Based Segment Caching

Authors: Sushant Singh, Ausif Mahmood
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.13981
Pdf URL: https://arxiv.org/pdf/2504.13981
Copy Paste: [[2504.13981]] CacheFormer: High Attention-Based Segment Caching(https://arxiv.org/abs/2504.13981)
Keywords: transformer
Abstract: Efficiently handling long contexts in transformer-based language models with low perplexity is an active area of research. Numerous recent approaches like Linformer, Longformer, Performer, and Structured state space models (SSMs)., have not fully resolved this problem. All these models strive to reduce the quadratic time complexity of the attention mechanism while minimizing the loss in quality due to the effective compression of the long context. Inspired by the cache and virtual memory principle in computers, where in case of a cache miss, not only the needed data is retrieved from the memory, but the adjacent data is also obtained, we apply this concept to handling long contexts by dividing it into small segments. In our design, we retrieve the nearby segments in an uncompressed form when high segment-level attention occurs at the compressed level. Our en-hancements for handling long context include aggregating four attention mechanisms consisting of short sliding window attention, long compressed segmented attention, dynamically retrieving top k high attention uncompressed segments, and overlapping segments in long segment attention to avoid segment fragmentation. These enhancements result in an architecture that outperforms ex-isting SOTA architectures with an average perplexity improvement of 8.5% over similar model sizes.

Title: QuatE-D: A Distance-Based Quaternion Model for Knowledge Graph Embedding

Authors: Hamideh-Sadat Fazael-Ardakani, Hamid Soltanian-Zadeh
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.13983
Pdf URL: https://arxiv.org/pdf/2504.13983
Copy Paste: [[2504.13983]] QuatE-D: A Distance-Based Quaternion Model for Knowledge Graph Embedding(https://arxiv.org/abs/2504.13983)
Keywords: interpretability
Abstract: Knowledge graph embedding (KGE) methods aim to represent entities and relations in a continuous space while preserving their structural and semantic properties. Quaternion-based KGEs have demonstrated strong potential in capturing complex relational patterns. In this work, we propose QuatE-D, a novel quaternion-based model that employs a distance-based scoring function instead of traditional inner-product approaches. By leveraging Euclidean distance, QuatE-D enhances interpretability and provides a more flexible representation of relational structures. Experimental results demonstrate that QuatE-D achieves competitive performance while maintaining an efficient parameterization, particularly excelling in Mean Rank reduction. These findings highlight the effectiveness of distance-based scoring in quaternion embeddings, offering a promising direction for knowledge graph completion.

Title: One Jump Is All You Need: Short-Cutting Transformers for Early Exit Prediction with One Jump to Fit All Exit Levels

Authors: Amrit Diggavi Seshadri
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2504.13984
Pdf URL: https://arxiv.org/pdf/2504.13984
Copy Paste: [[2504.13984]] One Jump Is All You Need: Short-Cutting Transformers for Early Exit Prediction with One Jump to Fit All Exit Levels(https://arxiv.org/abs/2504.13984)
Keywords: transformer, large language model
Abstract: To reduce the time and computational costs of inference of large language models, there has been interest in parameter-efficient low-rank early-exit casting of transformer hidden-representations to final-representations. Such low-rank short-cutting has been shown to outperform identity shortcuts at early model stages while offering parameter-efficiency in shortcut jumps. However, current low-rank methods maintain a separate early-exit shortcut jump to final-representations for each transformer intermediate block-level during inference. In this work, we propose selection of a single One-Jump-Fits-All (OJFA) low-rank shortcut that offers over a 30x reduction in shortcut parameter costs during inference. We show that despite this extreme reduction, our OJFA choice largely matches the performance of maintaining multiple shortcut jumps during inference and offers stable precision from all transformer block-levels for GPT2-XL, Phi3-Mini and Llama2-7B transformer models.

Title: Entropy Rectifying Guidance for Diffusion and Flow Models

Authors: Tariq Berrada Ifriqi, Adriana Romero-Soriano, Michal Drozdzal, Jakob Verbeek, Karteek Alahari
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.13987
Pdf URL: https://arxiv.org/pdf/2504.13987
Copy Paste: [[2504.13987]] Entropy Rectifying Guidance for Diffusion and Flow Models(https://arxiv.org/abs/2504.13987)
Keywords: diffusion, transformer, generative
Abstract: Guidance techniques are commonly used in diffusion and flow models to improve image quality and consistency for conditional generative tasks such as class-conditional and text-to-image generation. In particular, classifier-free guidance (CFG) -- the most widely adopted guidance technique -- contrasts conditional and unconditional predictions to improve the generated images. This results, however, in trade-offs across quality, diversity and consistency, improving some at the expense of others. While recent work has shown that it is possible to disentangle these factors to some extent, such methods come with an overhead of requiring an additional (weaker) model, or require more forward passes per sampling step. In this paper, we propose Entropy Rectifying Guidance (ERG), a simple and effective guidance mechanism based on inference-time changes in the attention mechanism of state-of-the-art diffusion transformer architectures, which allows for simultaneous improvements over image quality, diversity and prompt consistency. ERG is more general than CFG and similar guidance techniques, as it extends to unconditional sampling. ERG results in significant improvements in various generation tasks such as text-to-image, class-conditional and unconditional image generation. We also show that ERG can be seamlessly combined with other recent guidance methods such as CADS and APG, further boosting generation performance.

Title: Gradual Binary Search and Dimension Expansion : A general method for activation quantization in LLMs

Authors: Lucas Maisonnave, Cyril Moineau, Olivier Bichler, Fabrice Rastello
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2504.13989
Pdf URL: https://arxiv.org/pdf/2504.13989
Copy Paste: [[2504.13989]] Gradual Binary Search and Dimension Expansion : A general method for activation quantization in LLMs(https://arxiv.org/abs/2504.13989)
Keywords: large language model
Abstract: Large language models (LLMs) have become pivotal in artificial intelligence, demonstrating strong capabilities in reasoning, understanding, and generating data. However, their deployment on edge devices is hindered by their substantial size, often reaching several billion parameters. Quantization is a widely used method to reduce memory usage and inference time, however LLMs present unique challenges due to the prevalence of outliers in their activations. In this work, we leverage the theoretical advantages of Hadamard matrices over random rotation matrices to push the boundaries of quantization in LLMs. We demonstrate that Hadamard matrices are more effective in reducing outliers, which are a significant obstacle in achieving low-bit quantization. Our method based on a gradual binary search enables 3-bit quantization for weights, activations, and key-value (KV) caches, resulting in a 40\% increase in accuracy on common benchmarks compared to SoTA methods. We extend the use of rotation matrices to support non-power-of-2 embedding dimensions, similar to the Qwen architecture, by employing the Paley algorithm. We theoretically demonstrates the superiority of Hadamard matrices in reducing this http URL achieved 3-bit quantization for weights, activations, and KV cache, significantly enhancing model performance. Our experimental results on multiple models family like Mistral, LLaMA, and Qwen demonstrate the effectiveness of our approach, outperforming existing methods and enabling practical 3-bit quantization.

Title: PC-DeepNet: A GNSS Positioning Error Minimization Framework Using Permutation-Invariant Deep Neural Network

Authors: M. Humayun Kabir, Md. Ali Hasan, Md. Shafiqul Islam, Kyeongjun Ko, Wonjae Shin
Subjects: cs.LG, cs.AI, eess.SY
Abstract URL: https://arxiv.org/abs/2504.13990
Pdf URL: https://arxiv.org/pdf/2504.13990
Copy Paste: [[2504.13990]] PC-DeepNet: A GNSS Positioning Error Minimization Framework Using Permutation-Invariant Deep Neural Network(https://arxiv.org/abs/2504.13990)
Keywords: robust
Abstract: Global navigation satellite systems (GNSS) face significant challenges in urban and sub-urban areas due to non-line-of-sight (NLOS) propagation, multipath effects, and low received power levels, resulting in highly non-linear and non-Gaussian measurement error distributions. In light of this, conventional model-based positioning approaches, which rely on Gaussian error approximations, struggle to achieve precise localization under these conditions. To overcome these challenges, we put forth a novel learning-based framework, PC-DeepNet, that employs a permutation-invariant (PI) deep neural network (DNN) to estimate position corrections (PC). This approach is designed to ensure robustness against changes in the number and/or order of visible satellite measurements, a common issue in GNSS systems, while leveraging NLOS and multipath indicators as features to enhance positioning accuracy in challenging urban and sub-urban environments. To validate the performance of the proposed framework, we compare the positioning error with state-of-the-art model-based and learning-based positioning methods using two publicly available datasets. The results confirm that proposed PC-DeepNet achieves superior accuracy than existing model-based and learning-based methods while exhibiting lower computational complexity compared to previous learning-based approaches.

Title: Scaling LLaNA: Advancing NeRF-Language Understanding Through Large-Scale Training

Authors: Andrea Amaduzzi, Pierluigi Zama Ramirez, Giuseppe Lisanti, Samuele Salti, Luigi Di Stefano
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.13995
Pdf URL: https://arxiv.org/pdf/2504.13995
Copy Paste: [[2504.13995]] Scaling LLaNA: Advancing NeRF-Language Understanding Through Large-Scale Training(https://arxiv.org/abs/2504.13995)
Keywords: large language model
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have shown remarkable capabilities in understanding both images and 3D data, yet these modalities face inherent limitations in comprehensively representing object geometry and appearance. Neural Radiance Fields (NeRFs) have emerged as a promising alternative, encoding both geometric and photorealistic properties within the weights of a simple Multi-Layer Perceptron (MLP). This work investigates the feasibility and effectiveness of ingesting NeRFs into an MLLM. We introduce LLaNA, the first MLLM able to perform new tasks such as NeRF captioning and Q\&A, by directly processing the weights of a NeRF's MLP. Notably, LLaNA is able to extract information about the represented objects without the need to render images or materialize 3D data structures. In addition, we build the first large-scale NeRF-language dataset, composed by more than 300K NeRFs trained on ShapeNet and Objaverse, with paired textual annotations that enable various NeRF-language tasks. Based on this dataset, we develop a benchmark to evaluate the NeRF understanding capability of our method. Results show that directly processing NeRF weights leads to better performance on NeRF-Language tasks compared to approaches that rely on either 2D or 3D representations derived from NeRFs.

Title: Fashion-RAG: Multimodal Fashion Image Editing via Retrieval-Augmented Generation

Authors: Fulvio Sanguigni, Davide Morelli, Marcella Cornia, Rita Cucchiara
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2504.14011
Pdf URL: https://arxiv.org/pdf/2504.14011
Copy Paste: [[2504.14011]] Fashion-RAG: Multimodal Fashion Image Editing via Retrieval-Augmented Generation(https://arxiv.org/abs/2504.14011)
Keywords: diffusion, generative
Abstract: In recent years, the fashion industry has increasingly adopted AI technologies to enhance customer experience, driven by the proliferation of e-commerce platforms and virtual applications. Among the various tasks, virtual try-on and multimodal fashion image editing -- which utilizes diverse input modalities such as text, garment sketches, and body poses -- have become a key area of research. Diffusion models have emerged as a leading approach for such generative tasks, offering superior image quality and diversity. However, most existing virtual try-on methods rely on having a specific garment input, which is often impractical in real-world scenarios where users may only provide textual specifications. To address this limitation, in this work we introduce Fashion Retrieval-Augmented Generation (Fashion-RAG), a novel method that enables the customization of fashion items based on user preferences provided in textual form. Our approach retrieves multiple garments that match the input specifications and generates a personalized image by incorporating attributes from the retrieved items. To achieve this, we employ textual inversion techniques, where retrieved garment images are projected into the textual embedding space of the Stable Diffusion text encoder, allowing seamless integration of retrieved elements into the generative process. Experimental results on the Dress Code dataset demonstrate that Fashion-RAG outperforms existing methods both qualitatively and quantitatively, effectively capturing fine-grained visual details from retrieved garments. To the best of our knowledge, this is the first work to introduce a retrieval-augmented generation approach specifically tailored for multimodal fashion image editing.

Title: Post Quantum Cryptography (PQC) Signatures Without Trapdoors

Authors: William J Buchanan
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.14016
Pdf URL: https://arxiv.org/pdf/2504.14016
Copy Paste: [[2504.14016]] Post Quantum Cryptography (PQC) Signatures Without Trapdoors(https://arxiv.org/abs/2504.14016)
Keywords: security
Abstract: Some of our current public key methods use a trap door to implement digital signature methods. This includes the RSA method, which uses Fermat's little theorem to support the creation and verification of a digital signature. The problem with a back-door is that the actual trap-door method could, in the end, be discovered. With the rise of PQC (Post Quantum Cryptography), we will see a range of methods that will not use trap doors and provide stronger proof of security. In this case, we use hash-based signatures (as used with SPHINCS+) and Fiat Shamir signatures using Zero Knowledge Proofs (as used with Dilithium).

Title: Large Language Bayes

Authors: Justin Domke
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.14025
Pdf URL: https://arxiv.org/pdf/2504.14025
Copy Paste: [[2504.14025]] Large Language Bayes(https://arxiv.org/abs/2504.14025)
Keywords: large language model
Abstract: Many domain experts do not have the time or training to write formal Bayesian models. This paper takes an informal problem description as input, and combines a large language model and a probabilistic programming language to create a joint distribution over formal models, latent variables, and data. A posterior over latent variables follows by conditioning on observed data and integrating over formal models. This presents a challenging inference problem. We suggest an inference recipe that amounts to generating many formal models from the large language model, performing approximate inference on each, and then doing a weighted average. This is justified an analyzed as a combination of self-normalized importance sampling, MCMC, and variational inference. We show that this produces sensible predictions without the need to specify a formal model.

Title: LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models

Authors: Haiwen Huang, Anpei Chen, Volodymyr Havrylov, Andreas Geiger, Dan Zhang
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2504.14032
Pdf URL: https://arxiv.org/pdf/2504.14032
Copy Paste: [[2504.14032]] LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models(https://arxiv.org/abs/2504.14032)
Keywords: transformer
Abstract: Vision foundation models (VFMs) such as DINOv2 and CLIP have achieved impressive results on various downstream tasks, but their limited feature resolution hampers performance in applications requiring pixel-level understanding. Feature upsampling offers a promising direction to address this challenge. In this work, we identify two critical factors for enhancing feature upsampling: the upsampler architecture and the training objective. For the upsampler architecture, we introduce a coordinate-based cross-attention transformer that integrates the high-resolution images with coordinates and low-resolution VFM features to generate sharp, high-quality features. For the training objective, we propose constructing high-resolution pseudo-groundtruth features by leveraging class-agnostic masks and self-distillation. Our approach effectively captures fine-grained details and adapts flexibly to various input and feature resolutions. Through experiments, we demonstrate that our approach significantly outperforms existing feature upsampling techniques across various downstream tasks. Our code is released at this https URL.

Title: MEQA: A Meta-Evaluation Framework for Question & Answer LLM Benchmarks

Authors: Jaime Raldua Veuthey, Zainab Ali Majid, Suhas Hariharan, Jacob Haimes
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14039
Pdf URL: https://arxiv.org/pdf/2504.14039
Copy Paste: [[2504.14039]] MEQA: A Meta-Evaluation Framework for Question & Answer LLM Benchmarks(https://arxiv.org/abs/2504.14039)
Keywords: security, large language model
Abstract: As Large Language Models (LLMs) advance, their potential for widespread societal impact grows simultaneously. Hence, rigorous LLM evaluations are both a technical necessity and social imperative. While numerous evaluation benchmarks have been developed, there remains a critical gap in meta-evaluation: effectively assessing benchmarks' quality. We propose MEQA, a framework for the meta-evaluation of question and answer (QA) benchmarks, to provide standardized assessments, quantifiable scores, and enable meaningful intra-benchmark comparisons. We demonstrate this approach on cybersecurity benchmarks, using human and LLM evaluators, highlighting the benchmarks' strengths and weaknesses. We motivate our choice of test domain by AI models' dual nature as powerful defensive tools and security threats.

Title: A synthetic dataset of French electric load curves with temperature conditioning

Authors: Tahar Nabil, Ghislain Agoua, Pierre Cauchois, Anne De Moliner, Benoît Grossin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14046
Pdf URL: https://arxiv.org/pdf/2504.14046
Copy Paste: [[2504.14046]] A synthetic dataset of French electric load curves with temperature conditioning(https://arxiv.org/abs/2504.14046)
Keywords: privacy, diffusion
Abstract: The undergoing energy transition is causing behavioral changes in electricity use, e.g. with self-consumption of local generation, or flexibility services for demand control. To better understand these changes and the challenges they induce, accessing individual smart meter data is crucial. Yet this is personal data under the European GDPR. A widespread use of such data requires thus to create synthetic realistic and privacy-preserving samples. This paper introduces a new synthetic load curve dataset generated by conditional latent diffusion. We also provide the contracted power, time-of-use plan and local temperature used for generation. Fidelity, utility and privacy of the dataset are thoroughly evaluated, demonstrating its good quality and thereby supporting its interest for energy modeling applications.

Title: CAOTE: KV Caching through Attention Output Error based Token Eviction

Authors: Raghavv Goel, Junyoung Park, Mukul Gagrani, Dalton Jones, Matthew Morse, Harper Langston, Mingu Lee, Chris Lott
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.14051
Pdf URL: https://arxiv.org/pdf/2504.14051
Copy Paste: [[2504.14051]] CAOTE: KV Caching through Attention Output Error based Token Eviction(https://arxiv.org/abs/2504.14051)
Keywords: large language model
Abstract: While long context support of large language models has extended their abilities, it also incurs challenges in memory and compute which becomes crucial bottlenecks in resource-restricted devices. Token eviction, a widely adopted post-training methodology designed to alleviate the bottlenecks by evicting less important tokens from the cache, typically uses attention scores as proxy metrics for token importance. However, one major limitation of attention score as a token-wise importance metrics is that it lacks the information about contribution of tokens to the attention output. In this paper, we propose a simple eviction criterion based on the contribution of cached tokens to attention outputs. Our method, CAOTE, optimizes for eviction error due to token eviction, by seamlessly integrating attention scores and value vectors. This is the first method which uses value vector information on top of attention-based eviction scores. Additionally, CAOTE can act as a meta-heuristic method with flexible usage with any token eviction method. We show that CAOTE, when combined with the state-of-the-art attention score-based methods, always improves accuracies on the downstream task, indicating the importance of leveraging information from values during token eviction process.

Title: Occlusion-Ordered Semantic Instance Segmentation

Authors: Soroosh Baselizadeh, Cheuk-To Yu, Olga Veksler, Yuri Boykov
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.14054
Pdf URL: https://arxiv.org/pdf/2504.14054
Copy Paste: [[2504.14054]] Occlusion-Ordered Semantic Instance Segmentation(https://arxiv.org/abs/2504.14054)
Keywords: segmentation
Abstract: Standard semantic instance segmentation provides useful, but inherently 2D information from a single image. To enable 3D analysis, one usually integrates absolute monocular depth estimation with instance segmentation. However, monocular depth is a difficult task. Instead, we leverage a simpler single-image task, occlusion-based relative depth ordering, providing coarser but useful 3D information. We show that relative depth ordering works more reliably from occlusions than from absolute depth. We propose to solve the joint task of relative depth ordering and segmentation of instances based on occlusions. We call this task Occlusion-Ordered Semantic Instance Segmentation (OOSIS). We develop an approach to OOSIS that extracts instances and their occlusion order simultaneously from oriented occlusion boundaries and semantic segmentation. Unlike popular detect-and-segment framework for instance segmentation, combining occlusion ordering with instance segmentation allows a simple and clean formulation of OOSIS as a labeling problem. As a part of our solution for OOSIS, we develop a novel oriented occlusion boundaries approach that significantly outperforms prior work. We also develop a new joint OOSIS metric based both on instance mask accuracy and correctness of their occlusion order. We achieve better performance than strong baselines on KINS and COCOA datasets.

Title: Benchmarking Differentially Private Tabular Data Synthesis

Authors: Kai Chen, Xiaochen Li, Chen Gong, Ryan McKenna, Tianhao Wang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.14061
Pdf URL: https://arxiv.org/pdf/2504.14061
Copy Paste: [[2504.14061]] Benchmarking Differentially Private Tabular Data Synthesis(https://arxiv.org/abs/2504.14061)
Keywords: privacy, fair
Abstract: Differentially private (DP) tabular data synthesis generates artificial data that preserves the statistical properties of private data while safeguarding individual privacy. The emergence of diverse algorithms in recent years has introduced challenges in practical applications, such as inconsistent data processing methods, lack of in-depth algorithm analysis, and incomplete comparisons due to overlapping development timelines. These factors create significant obstacles to selecting appropriate algorithms. In this paper, we address these challenges by proposing a benchmark for evaluating tabular data synthesis methods. We present a unified evaluation framework that integrates data preprocessing, feature selection, and synthesis modules, facilitating fair and comprehensive comparisons. Our evaluation reveals that a significant utility-efficiency trade-off exists among current state-of-the-art methods. Some statistical methods are superior in synthesis utility, but their efficiency is not as good as most machine learning-based methods. Furthermore, we conduct an in-depth analysis of each module with experimental validation, offering theoretical insights into the strengths and limitations of different strategies.

Title: DoomArena: A framework for Testing AI Agents Against Evolving Security Threats

Authors: Leo Boisvert, Mihir Bansal, Chandra Kiran Reddy Evuru, Gabriel Huang, Abhay Puri, Avinandan Bose, Maryam Fazel, Quentin Cappart, Jason Stanley, Alexandre Lacoste, Alexandre Drouin, Krishnamurthy Dvijotham
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.14064
Pdf URL: https://arxiv.org/pdf/2504.14064
Copy Paste: [[2504.14064]] DoomArena: A framework for Testing AI Agents Against Evolving Security Threats(https://arxiv.org/abs/2504.14064)
Keywords: security, defense, attack
Abstract: We present DoomArena, a security evaluation framework for AI agents. DoomArena is designed on three principles: 1) It is a plug-in framework and integrates easily into realistic agentic frameworks like BrowserGym (for web agents) and $\tau$-bench (for tool calling agents); 2) It is configurable and allows for detailed threat modeling, allowing configuration of specific components of the agentic framework being attackable, and specifying targets for the attacker; and 3) It is modular and decouples the development of attacks from details of the environment in which the agent is deployed, allowing for the same attacks to be applied across multiple environments. We illustrate several advantages of our framework, including the ability to adapt to new threat models and environments easily, the ability to easily combine several previously published attacks to enable comprehensive and fine-grained security testing, and the ability to analyze trade-offs between various vulnerabilities and performance. We apply DoomArena to state-of-the-art (SOTA) web and tool-calling agents and find a number of surprising results: 1) SOTA agents have varying levels of vulnerability to different threat models (malicious user vs malicious environment), and there is no Pareto dominant agent across all threat models; 2) When multiple attacks are applied to an agent, they often combine constructively; 3) Guardrail model-based defenses seem to fail, while defenses based on powerful SOTA LLMs work better. DoomArena is available at this https URL.

Title: Contextual Embedding-based Clustering to Identify Topics for Healthcare Service Improvement

Authors: K M Sajjadul Islam, Ravi Teja Karri, Srujan Vegesna, Jiawei Wu, Praveen Madiraju
Subjects: cs.LG, cs.HC
Abstract URL: https://arxiv.org/abs/2504.14068
Pdf URL: https://arxiv.org/pdf/2504.14068
Copy Paste: [[2504.14068]] Contextual Embedding-based Clustering to Identify Topics for Healthcare Service Improvement(https://arxiv.org/abs/2504.14068)
Keywords: interpretability
Abstract: Understanding patient feedback is crucial for improving healthcare services, yet analyzing unlabeled short-text feedback presents significant challenges due to limited data and domain-specific nuances. Traditional supervised learning approaches require extensive labeled datasets, making unsupervised methods more viable for uncovering meaningful insights from patient feedback. This study explores unsupervised methods to extract meaningful topics from 439 survey responses collected from a healthcare system in Wisconsin, USA. A keyword-based filtering approach was applied to isolate complaint-related feedback using a domain-specific lexicon. To delve deeper and analyze dominant topics in feedback, we explored traditional topic modeling methods, including Latent Dirichlet Allocation (LDA) and Gibbs Sampling Dirichlet Multinomial Mixture (GSDMM), alongside BERTopic, an advanced neural embedding-based clustering approach. To improve coherence and interpretability where data are scarce and consist of short-texts, we propose kBERT, an integration of BERT embeddings with k-means clustering. Model performance was assessed using coherence scores (Cv ) for topic interpretability and average Inverted Rank-Biased Overlap (IRBOavg) for topic diversity. Results indicate that kBERT achieves the highest coherence (Cv = 0.53) and distinct topic separation (IRBOavg = 1.00), outperforming all other models in short-text healthcare feedback analysis. Our findings emphasize the importance of embedding-based techniques for topic identification and highlight the need for context-aware models in healthcare analytics.

Title: Towards Scale-Aware Low-Light Enhancement via Structure-Guided Transformer Design

Authors: Wei Dong, Yan Min, Han Zhou, Jun Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14075
Pdf URL: https://arxiv.org/pdf/2504.14075
Copy Paste: [[2504.14075]] Towards Scale-Aware Low-Light Enhancement via Structure-Guided Transformer Design(https://arxiv.org/abs/2504.14075)
Keywords: robust, extraction, transformer
Abstract: Current Low-light Image Enhancement (LLIE) techniques predominantly rely on either direct Low-Light (LL) to Normal-Light (NL) mappings or guidance from semantic features or illumination maps. Nonetheless, the intrinsic ill-posedness of LLIE and the difficulty in retrieving robust semantics from heavily corrupted images hinder their effectiveness in extremely low-light environments. To tackle this challenge, we present SG-LLIE, a new multi-scale CNN-Transformer hybrid framework guided by structure priors. Different from employing pre-trained models for the extraction of semantics or illumination maps, we choose to extract robust structure priors based on illumination-invariant edge detectors. Moreover, we develop a CNN-Transformer Hybrid Structure-Guided Feature Extractor (HSGFE) module at each scale with in the UNet encoder-decoder architecture. Besides the CNN blocks which excels in multi-scale feature extraction and fusion, we introduce a Structure-Guided Transformer Block (SGTB) in each HSGFE that incorporates structural priors to modulate the enhancement process. Extensive experiments show that our method achieves state-of-the-art performance on several LLIE benchmarks in both quantitative metrics and visual quality. Our solution ranks second in the NTIRE 2025 Low-Light Enhancement Challenge. Code is released at this https URL.

Title: LogicTree: Structured Proof Exploration for Coherent and Rigorous Logical Reasoning with Large Language Models

Authors: Kang He, Kaushik Roy
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.14089
Pdf URL: https://arxiv.org/pdf/2504.14089
Copy Paste: [[2504.14089]] LogicTree: Structured Proof Exploration for Coherent and Rigorous Logical Reasoning with Large Language Models(https://arxiv.org/abs/2504.14089)
Keywords: large language model
Abstract: Large language models (LLMs) have achieved remarkable multi-step reasoning capabilities across various domains. However, LLMs still face distinct challenges in complex logical reasoning, as (1) proof-finding requires systematic exploration and the maintenance of logical coherence and (2) searching the right combination of premises at each reasoning step is inherently challenging in tasks with large premise space. To address this, we propose LogicTree, an inference-time modular framework employing algorithm-guided search to automate structured proof exploration and ensure logical coherence. Advancing beyond tree-of-thought (ToT), we incorporate caching mechanism into LogicTree to enable effective utilization of historical knowledge, preventing reasoning stagnation and minimizing redundancy. Furthermore, we address the combinatorial complexity of premise search by decomposing it into a linear process. The refined premise selection restricts subsequent inference to at most one derivation per step, enhancing reasoning granularity and enforcing strict step-by-step reasoning. Additionally, we introduce two LLM-free heuristics for premise prioritization, enabling strategic proof search. Experimental results on five datasets demonstrate that LogicTree optimally scales inference-time computation to achieve higher proof accuracy, surpassing chain-of-thought (CoT) and ToT with average gains of 23.6% and 12.5%, respectively, on GPT-4o. Moreover, within LogicTree, GPT-4o outperforms o3-mini by 7.6% on average.

Title: Retinex-guided Histogram Transformer for Mask-free Shadow Removal

Authors: Wei Dong, Han Zhou, Seyed Amirreza Mousavi, Jun Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14092
Pdf URL: https://arxiv.org/pdf/2504.14092
Copy Paste: [[2504.14092]] Retinex-guided Histogram Transformer for Mask-free Shadow Removal(https://arxiv.org/abs/2504.14092)
Keywords: transformer
Abstract: While deep learning methods have achieved notable progress in shadow removal, many existing approaches rely on shadow masks that are difficult to obtain, limiting their generalization to real-world scenes. In this work, we propose ReHiT, an efficient mask-free shadow removal framework based on a hybrid CNN-Transformer architecture guided by Retinex theory. We first introduce a dual-branch pipeline to separately model reflectance and illumination components, and each is restored by our developed Illumination-Guided Hybrid CNN-Transformer (IG-HCT) module. Second, besides the CNN-based blocks that are capable of learning residual dense features and performing multi-scale semantic fusion, multi-scale semantic fusion, we develop the Illumination-Guided Histogram Transformer Block (IGHB) to effectively handle non-uniform illumination and spatially complex shadows. Extensive experiments on several benchmark datasets validate the effectiveness of our approach over existing mask-free methods. Trained solely on the NTIRE 2025 Shadow Removal Challenge dataset, our solution delivers competitive results with one of the smallest parameter sizes and fastest inference speeds among top-ranked entries, highlighting its applicability for real-world applications with limited computational resources. The code is available at this https URL.

Title: Leakage and Interpretability in Concept-Based Models

Authors: Enrico Parisini, Tapabrata Chakraborti, Chris Harbron, Ben D. MacArthur, Christopher R. S. Banerji
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2504.14094
Pdf URL: https://arxiv.org/pdf/2504.14094
Copy Paste: [[2504.14094]] Leakage and Interpretability in Concept-Based Models(https://arxiv.org/abs/2504.14094)
Keywords: robust, interpretability
Abstract: Concept Bottleneck Models aim to improve interpretability by predicting high-level intermediate concepts, representing a promising approach for deployment in high-risk scenarios. However, they are known to suffer from information leakage, whereby models exploit unintended information encoded within the learned concepts. We introduce an information-theoretic framework to rigorously characterise and quantify leakage, and define two complementary measures: the concepts-task leakage (CTL) and interconcept leakage (ICL) scores. We show that these measures are strongly predictive of model behaviour under interventions and outperform existing alternatives in robustness and reliability. Using this framework, we identify the primary causes of leakage and provide strong evidence that Concept Embedding Models exhibit substantial leakage regardless of the hyperparameters choice. Finally, we propose practical guidelines for designing concept-based models to reduce leakage and ensure interpretability.

Title: VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment

Authors: Yogesh Kulkarni, Pooyan Fazli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14096
Pdf URL: https://arxiv.org/pdf/2504.14096
Copy Paste: [[2504.14096]] VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment(https://arxiv.org/abs/2504.14096)
Keywords: robust
Abstract: Video-language models (Video-LLMs) excel at understanding video content but struggle with spatial relationships, temporal ordering, and cross-frame continuity. To address these limitations, we introduce VideoPASTA (Preference Alignment with Spatio-Temporal-Cross Frame Adversaries), a framework that enhances Video-LLMs through targeted preference optimization. VideoPASTA trains models to distinguish accurate video representations from carefully generated adversarial examples that deliberately violate spatial, temporal, or cross-frame relations. By applying Direct Preference Optimization to just 7,020 preference pairs, VideoPASTA learns robust representations that capture fine-grained spatial relationships and long-range temporal dynamics. Experiments on standard video benchmarks show significant relative performance gains of 3.05% on VideoMME, 1.97% on NeXTQA, and 1.31% on LongVideoBench, over the baseline Qwen2.5-VL model. These results demonstrate that targeted alignment, rather than massive pretraining or architectural modifications, effectively addresses core video-language challenges. Notably, VideoPASTA achieves these improvements without human annotation or captioning, relying on just 32-frame sampling, compared to the 96-frame, multi-GPU setups of prior work. This efficiency makes our approach a scalable, plug-and-play solution that seamlessly integrates with existing models while preserving their capabilities.

Title: Point-Driven Interactive Text and Image Layer Editing Using Diffusion Models

Authors: Zhenyu Yu, Mohd Yamani Idna Idris, Pei Wang, Yuelong Xia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14108
Pdf URL: https://arxiv.org/pdf/2504.14108
Copy Paste: [[2504.14108]] Point-Driven Interactive Text and Image Layer Editing Using Diffusion Models(https://arxiv.org/abs/2504.14108)
Keywords: diffusion, generative
Abstract: We present DanceText, a training-free framework for multilingual text editing in images, designed to support complex geometric transformations and achieve seamless foreground-background integration. While diffusion-based generative models have shown promise in text-guided image synthesis, they often lack controllability and fail to preserve layout consistency under non-trivial manipulations such as rotation, translation, scaling, and warping. To address these limitations, DanceText introduces a layered editing strategy that separates text from the background, allowing geometric transformations to be performed in a modular and controllable manner. A depth-aware module is further proposed to align appearance and perspective between the transformed text and the reconstructed background, enhancing photorealism and spatial consistency. Importantly, DanceText adopts a fully training-free design by integrating pretrained modules, allowing flexible deployment without task-specific fine-tuning. Extensive experiments on the AnyWord-3M benchmark demonstrate that our method achieves superior performance in visual quality, especially under large-scale and complex transformation scenarios.

Title: Lightweight Road Environment Segmentation using Vector Quantization

Authors: Jiyong Kwag, Alper Yilmaz, Charles Toth
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14113
Pdf URL: https://arxiv.org/pdf/2504.14113
Copy Paste: [[2504.14113]] Lightweight Road Environment Segmentation using Vector Quantization(https://arxiv.org/abs/2504.14113)
Keywords: transformer, segmentation
Abstract: Road environment segmentation plays a significant role in autonomous driving. Numerous works based on Fully Convolutional Networks (FCNs) and Transformer architectures have been proposed to leverage local and global contextual learning for efficient and accurate semantic segmentation. In both architectures, the encoder often relies heavily on extracting continuous representations from the image, which limits the ability to represent meaningful discrete information. To address this limitation, we propose segmentation of the autonomous driving environment using vector quantization. Vector quantization offers three primary advantages for road environment segmentation. (1) Each continuous feature from the encoder is mapped to a discrete vector from the codebook, helping the model discover distinct features more easily than with complex continuous features. (2) Since a discrete feature acts as compressed versions of the encoder's continuous features, they also compress noise or outliers, enhancing the image segmentation task. (3) Vector quantization encourages the latent space to form coarse clusters of continuous features, forcing the model to group similar features, making the learned representations more structured for the decoding process. In this work, we combined vector quantization with the lightweight image segmentation model MobileUNETR and used it as a baseline model for comparison to demonstrate its efficiency. Through experiments, we achieved 77.0 % mIoU on Cityscapes, outperforming the baseline by 2.9 % without increasing the model's initial size or complexity.

Title: PEFT A2Z: Parameter-Efficient Fine-Tuning Survey for Large Language and Vision Models

Authors: Nusrat Jahan Prottasha, Upama Roy Chowdhury, Shetu Mohanto, Tasfia Nuzhat, Abdullah As Sami, Md Shamol Ali, Md Shohanur Islam Sobuj, Hafijur Raman, Md Kowsher, Ozlem Ozmen Garibay
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2504.14117
Pdf URL: https://arxiv.org/pdf/2504.14117
Copy Paste: [[2504.14117]] PEFT A2Z: Parameter-Efficient Fine-Tuning Survey for Large Language and Vision Models(https://arxiv.org/abs/2504.14117)
Keywords: robust, federate, interpretability, generative, large language model
Abstract: Large models such as Large Language Models (LLMs) and Vision Language Models (VLMs) have transformed artificial intelligence, powering applications in natural language processing, computer vision, and multimodal learning. However, fully fine-tuning these models remains expensive, requiring extensive computational resources, memory, and task-specific data. Parameter-Efficient Fine-Tuning (PEFT) has emerged as a promising solution that allows adapting large models to downstream tasks by updating only a small portion of parameters. This survey presents a comprehensive overview of PEFT techniques, focusing on their motivations, design principles, and effectiveness. We begin by analyzing the resource and accessibility challenges posed by traditional fine-tuning and highlight key issues, such as overfitting, catastrophic forgetting, and parameter inefficiency. We then introduce a structured taxonomy of PEFT methods -- grouped into additive, selective, reparameterized, hybrid, and unified frameworks -- and systematically compare their mechanisms and trade-offs. Beyond taxonomy, we explore the impact of PEFT across diverse domains, including language, vision, and generative modeling, showing how these techniques offer strong performance with lower resource costs. We also discuss important open challenges in scalability, interpretability, and robustness, and suggest future directions such as federated learning, domain adaptation, and theoretical grounding. Our goal is to provide a unified understanding of PEFT and its growing role in enabling practical, efficient, and sustainable use of large models.

Title: Detecting Zero-Day Web Attacks with an Ensemble of LSTM, GRU, and Stacked Autoencoders

Authors: Vahid Babaey, Hamid Reza Faragardi
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.14122
Pdf URL: https://arxiv.org/pdf/2504.14122
Copy Paste: [[2504.14122]] Detecting Zero-Day Web Attacks with an Ensemble of LSTM, GRU, and Stacked Autoencoders(https://arxiv.org/abs/2504.14122)
Keywords: security, attack
Abstract: The rapid growth in web-based services has significantly increased security risks related to user information, as web-based attacks become increasingly sophisticated and prevalent. Traditional security methods frequently struggle to detect previously unknown (zero-day) web attacks, putting sensitive user data at significant risk. Additionally, reducing human intervention in web security tasks can minimize errors and enhance reliability. This paper introduces an intelligent system designed to detect zero-day web attacks using a novel one-class ensemble method consisting of three distinct autoencoder architectures: LSTM autoencoder, GRU autoencoder, and stacked autoencoder. Our approach employs a novel tokenization strategy to convert normal web requests into structured numeric sequences, enabling the ensemble model to effectively identify anomalous activities by uniquely concatenating and compressing the latent representations from each autoencoder. The proposed method efficiently detects unknown web attacks while effectively addressing common limitations of previous methods, such as high memory consumption and excessive false positive rates. Extensive experimental evaluations demonstrate the superiority of our proposed ensemble, achieving remarkable detection metrics: 97.58% accuracy, 97.52% recall, 99.76% specificity, and 99.99% precision, with an exceptionally low false positive rate of 0.2%. These results underscore our method's significant potential in enhancing real-world web security through accurate and reliable detection of web-based attacks.

Title: BMRL: Bi-Modal Guided Multi-Perspective Representation Learning for Zero-Shot Deepfake Attribution

Authors: Yaning Zhang, Jiahe Zhang, Chunjie Ma, Weili Guan, Tian Gan, Zan Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14129
Pdf URL: https://arxiv.org/pdf/2504.14129
Copy Paste: [[2504.14129]] BMRL: Bi-Modal Guided Multi-Perspective Representation Learning for Zero-Shot Deepfake Attribution(https://arxiv.org/abs/2504.14129)
Keywords: generative
Abstract: The challenge of tracing the source attribution of forged faces has gained significant attention due to the rapid advancement of generative models. However, existing deepfake attribution (DFA) works primarily focus on the interaction among various domains in vision modality, and other modalities such as texts and face parsing are not fully explored. Besides, they tend to fail to assess the generalization performance of deepfake attributors to unseen generators in a fine-grained manner. In this paper, we propose a novel bi-modal guided multi-perspective representation learning (BMRL) framework for zero-shot deepfake attribution (ZS-DFA), which facilitates effective traceability to unseen generators. Specifically, we design a multi-perspective visual encoder (MPVE) to explore general deepfake attribution visual characteristics across three views (i.e., image, noise, and edge). We devise a novel parsing encoder to focus on global face attribute embeddings, enabling parsing-guided DFA representation learning via vision-parsing matching. A language encoder is proposed to capture fine-grained language embeddings, facilitating language-guided general visual forgery representation learning through vision-language alignment. Additionally, we present a novel deepfake attribution contrastive center (DFACC) loss, to pull relevant generators closer and push irrelevant ones away, which can be introduced into DFA models to enhance traceability. Experimental results demonstrate that our method outperforms the state-of-the-art on the ZS-DFA task through various protocols evaluation.

Title: HFBRI-MAE: Handcrafted Feature Based Rotation-Invariant Masked Autoencoder for 3D Point Cloud Analysis

Authors: Xuanhua Yin, Dingxin Zhang, Jianhui Yu, Weidong Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14132
Pdf URL: https://arxiv.org/pdf/2504.14132
Copy Paste: [[2504.14132]] HFBRI-MAE: Handcrafted Feature Based Rotation-Invariant Masked Autoencoder for 3D Point Cloud Analysis(https://arxiv.org/abs/2504.14132)
Keywords: robust, segmentation
Abstract: Self-supervised learning (SSL) has demonstrated remarkable success in 3D point cloud analysis, particularly through masked autoencoders (MAEs). However, existing MAE-based methods lack rotation invariance, leading to significant performance degradation when processing arbitrarily rotated point clouds in real-world scenarios. To address this limitation, we introduce Handcrafted Feature-Based Rotation-Invariant Masked Autoencoder (HFBRI-MAE), a novel framework that refines the MAE design with rotation-invariant handcrafted features to ensure stable feature learning across different orientations. By leveraging both rotation-invariant local and global features for token embedding and position embedding, HFBRI-MAE effectively eliminates rotational dependencies while preserving rich geometric structures. Additionally, we redefine the reconstruction target to a canonically aligned version of the input, mitigating rotational ambiguities. Extensive experiments on ModelNet40, ScanObjectNN, and ShapeNetPart demonstrate that HFBRI-MAE consistently outperforms existing methods in object classification, segmentation, and few-shot learning, highlighting its robustness and strong generalization ability in real-world 3D applications.

Title: Rethinking Target Label Conditioning in Adversarial Attacks: A 2D Tensor-Guided Generative Approach

Authors: Hangyu Liu, Bo Peng, Pengxiang Ding, Donglin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14137
Pdf URL: https://arxiv.org/pdf/2504.14137
Copy Paste: [[2504.14137]] Rethinking Target Label Conditioning in Adversarial Attacks: A 2D Tensor-Guided Generative Approach(https://arxiv.org/abs/2504.14137)
Keywords: defense, attack, diffusion, generative
Abstract: Compared to single-target adversarial attacks, multi-target attacks have garnered significant attention due to their ability to generate adversarial images for multiple target classes simultaneously. Existing generative approaches for multi-target attacks mainly analyze the effect of the use of target labels on noise generation from a theoretical perspective, lacking practical validation and comprehensive summarization. To address this gap, we first identify and validate that the semantic feature quality and quantity are critical factors affecting the transferability of targeted attacks: 1) Feature quality refers to the structural and detailed completeness of the implanted target features, as deficiencies may result in the loss of key discriminative information; 2) Feature quantity refers to the spatial sufficiency of the implanted target features, as inadequacy limits the victim model's attention to this feature. Based on these findings, we propose the 2D Tensor-Guided Adversarial Fusion (2D-TGAF) framework, which leverages the powerful generative capabilities of diffusion models to encode target labels into two-dimensional semantic tensors for guiding adversarial noise generation. Additionally, we design a novel masking strategy tailored for the training process, ensuring that parts of the generated noise retain complete semantic information about the target class. Extensive experiments on the standard ImageNet dataset demonstrate that 2D-TGAF consistently surpasses state-of-the-art methods in attack success rates, both on normally trained models and across various defense mechanisms.

Title: Segment Any Crack: Deep Semantic Segmentation Adaptation for Crack Detection

Authors: Ghodsiyeh Rostami, Po-Han Chen, Mahdi S. Hosseini
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14138
Pdf URL: https://arxiv.org/pdf/2504.14138
Copy Paste: [[2504.14138]] Segment Any Crack: Deep Semantic Segmentation Adaptation for Crack Detection(https://arxiv.org/abs/2504.14138)
Keywords: segmentation
Abstract: Image-based crack detection algorithms are increasingly in demand in infrastructure monitoring, as early detection of cracks is of paramount importance for timely maintenance planning. While deep learning has significantly advanced crack detection algorithms, existing models often require extensive labeled datasets and high computational costs for fine-tuning, limiting their adaptability across diverse conditions. This study introduces an efficient selective fine-tuning strategy, focusing on tuning normalization components, to enhance the adaptability of segmentation models for crack detection. The proposed method is applied to the Segment Anything Model (SAM) and five well-established segmentation models. Experimental results demonstrate that selective fine-tuning of only normalization parameters outperforms full fine-tuning and other common fine-tuning techniques in both performance and computational efficiency, while improving generalization. The proposed approach yields a SAM-based model, Segment Any Crack (SAC), achieving a 61.22\% F1-score and 44.13\% IoU on the OmniCrack30k benchmark dataset, along with the highest performance across three zero-shot datasets and the lowest standard deviation. The results highlight the effectiveness of the adaptation approach in improving segmentation accuracy while significantly reducing computational overhead.

Title: ThyroidEffi 1.0: A Cost-Effective System for High-Performance Multi-Class Thyroid Carcinoma Classification

Authors: Hai Pham-Ngoc, De Nguyen-Van, Dung Vu-Tien, Phuong Le-Hong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14139
Pdf URL: https://arxiv.org/pdf/2504.14139
Copy Paste: [[2504.14139]] ThyroidEffi 1.0: A Cost-Effective System for High-Performance Multi-Class Thyroid Carcinoma Classification(https://arxiv.org/abs/2504.14139)
Keywords: extraction, interpretability, transformer
Abstract: Background: Automated classification of thyroid fine needle aspiration biopsy (FNAB) images faces challenges in limited data, inter-observer variability, and computational cost. Efficient, interpretable models are crucial for clinical support. Objective: To develop and externally validate a deep learning system for the multi-class classification of thyroid FNAB images into three key categories that directly guide post-biopsy treatment decisions in Vietnam: benign (B2), suspicious for malignancy (B5), and malignant (B6), while achieving high diagnostic accuracy with low computational overhead. Methods: Our framework features: (1) YOLOv10-based cell cluster detection for informative sub-region extraction and noise reduction; (2) a curriculum learning-inspired protocol sequencing localized crops to full images for multi-scale feature capture; (3) adaptive lightweight EfficientNetB0 (4 millions parameters) selection balancing performance and efficiency; and (4) a Transformer-inspired module for multi-scale, multi-region analysis. External validation used 1,015 independent FNAB images. Results: ThyroidEffi Basic achieved a macro F1 of 89.19\% and AUCs of 0.98 (B2), 0.95 (B5), and 0.96 (B6) on the internal test set. External validation yielded AUCs of 0.9495 (B2), 0.7436 (B5), and 0.8396 (B6). ThyroidEffi Premium improved macro F1 to 89.77\%. Grad-CAM highlighted key diagnostic regions, confirming interpretability. The system processed 1000 cases in 30 seconds, demonstrating feasibility on widely accessible hardware like a 12-core CPU. Conclusions: This work demonstrates that high-accuracy, interpretable thyroid FNAB image classification is achievable with minimal computational demands.

Title: Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations

Authors: Katie Matton, Robert Osazuwa Ness, John Guttag, Emre Kıcıman
Subjects: cs.CL, cs.AI, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2504.14150
Pdf URL: https://arxiv.org/pdf/2504.14150
Copy Paste: [[2504.14150]] Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations(https://arxiv.org/abs/2504.14150)
Keywords: large language model
Abstract: Large language models (LLMs) are capable of generating plausible explanations of how they arrived at an answer to a question. However, these explanations can misrepresent the model's "reasoning" process, i.e., they can be unfaithful. This, in turn, can lead to over-trust and misuse. We introduce a new approach for measuring the faithfulness of LLM explanations. First, we provide a rigorous definition of faithfulness. Since LLM explanations mimic human explanations, they often reference high-level concepts in the input question that purportedly influenced the model. We define faithfulness in terms of the difference between the set of concepts that LLM explanations imply are influential and the set that truly are. Second, we present a novel method for estimating faithfulness that is based on: (1) using an auxiliary LLM to modify the values of concepts within model inputs to create realistic counterfactuals, and (2) using a Bayesian hierarchical model to quantify the causal effects of concepts at both the example- and dataset-level. Our experiments show that our method can be used to quantify and discover interpretable patterns of unfaithfulness. On a social bias task, we uncover cases where LLM explanations hide the influence of social bias. On a medical question answering task, we uncover cases where LLM explanations provide misleading claims about which pieces of evidence influenced the model's decisions.

Title: Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D

Authors: Sergio Arnaud, Paul McVay, Ada Martin, Arjun Majumdar, Krishna Murthy Jatavallabhula, Phillip Thomas, Ruslan Partsey, Daniel Dugas, Abha Gejji, Alexander Sax, Vincent-Pierre Berges, Mikael Henaff, Ayush Jain, Ang Cao, Ishita Prasad, Mrinal Kalakrishnan, Michael Rabbat, Nicolas Ballas, Mido Assran, Oleksandr Maksymets, Aravind Rajeswaran, Franziska Meier
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2504.14151
Pdf URL: https://arxiv.org/pdf/2504.14151
Copy Paste: [[2504.14151]] Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D(https://arxiv.org/abs/2504.14151)
Keywords: robust
Abstract: We present LOCATE 3D, a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp." LOCATE 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities. Notably, LOCATE 3D operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices. Key to our approach is 3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds. It takes as input a 3D pointcloud featurized using 2D foundation models (CLIP, DINO). Subsequently, masked prediction in latent space is employed as a pretext task to aid the self-supervised learning of contextualized pointcloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes. Additionally, we introduce LOCATE 3D DATASET, a new dataset for 3D referential grounding, spanning multiple capture setups with over 130K annotations. This enables a systematic study of generalization capabilities as well as a stronger model.

Title: SConU: Selective Conformal Uncertainty in Large Language Models

Authors: Zhiyuan Wang, Qingni Wang, Yue Zhang, Tianlong Chen, Xiaofeng Zhu, Xiaoshuang Shi, Kaidi Xu
Subjects: cs.CL, cs.AI, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2504.14154
Pdf URL: https://arxiv.org/pdf/2504.14154
Copy Paste: [[2504.14154]] SConU: Selective Conformal Uncertainty in Large Language Models(https://arxiv.org/abs/2504.14154)
Keywords: large language model
Abstract: As large language models are increasingly utilized in real-world applications, guarantees of task-specific metrics are essential for their reliable deployment. Previous studies have introduced various criteria of conformal uncertainty grounded in split conformal prediction, which offer user-specified correctness coverage. However, existing frameworks often fail to identify uncertainty data outliers that violate the exchangeability assumption, leading to unbounded miscoverage rates and unactionable prediction sets. In this paper, we propose a novel approach termed Selective Conformal Uncertainty (SConU), which, for the first time, implements significance tests, by developing two conformal p-values that are instrumental in determining whether a given sample deviates from the uncertainty distribution of the calibration set at a specific manageable risk level. Our approach not only facilitates rigorous management of miscoverage rates across both single-domain and interdisciplinary contexts, but also enhances the efficiency of predictions. Furthermore, we comprehensively analyze the components of the conformal procedures, aiming to approximate conditional coverage, particularly in high-stakes question-answering tasks.

Title: ROFBS$α$: Real Time Backup System Decoupled from ML Based Ransomware Detection

Authors: Kosuke Higuchi, Ryotaro Kobayashi
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.14162
Pdf URL: https://arxiv.org/pdf/2504.14162
Copy Paste: [[2504.14162]] ROFBS$α$: Real Time Backup System Decoupled from ML Based Ransomware Detection(https://arxiv.org/abs/2504.14162)
Keywords: protect, defense
Abstract: This study introduces ROFBS$\alpha$, a new defense architecture that addresses delays in detection in ransomware detectors based on machine learning. It builds on our earlier Real Time Open File Backup System, ROFBS, by adopting an asynchronous design that separates backup operations from detection tasks. By using eBPF to monitor file open events and running the backup process independently, the system avoids performance limitations when detection and protection contend for resources. We evaluated ROFBS$\alpha$ against three ransomware strains, AvosLocker, Conti, and IceFire. The evaluation measured the number of files encrypted, the number of files successfully backed up, the ratio of backups to encrypted files, and the overall detection latency. The results show that ROFBS$\alpha$ achieves high backup success rates and faster detection while adding minimal extra load to the system. However, defending against ransomware that encrypts files extremely quickly remains an open challenge that will require further enhancements.

Title: Self-Correction Makes LLMs Better Parsers

Authors: Ziyan Zhang, Yang Hou, Chen Gong, Zhenghua Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14165
Pdf URL: https://arxiv.org/pdf/2504.14165
Copy Paste: [[2504.14165]] Self-Correction Makes LLMs Better Parsers(https://arxiv.org/abs/2504.14165)
Keywords: large language model
Abstract: Large language models (LLMs) have achieved remarkable success across various natural language processing (NLP) tasks. However, recent studies suggest that they still face challenges in performing fundamental NLP tasks essential for deep language understanding, particularly syntactic parsing. In this paper, we conduct an in-depth analysis of LLM parsing capabilities, delving into the specific shortcomings of their parsing results. We find that LLMs may stem from limitations to fully leverage grammar rules in existing treebanks, which restricts their capability to generate valid syntactic structures. To help LLMs acquire knowledge without additional training, we propose a self-correction method that leverages grammar rules from existing treebanks to guide LLMs in correcting previous errors. Specifically, we automatically detect potential errors and dynamically search for relevant rules, offering hints and examples to guide LLMs in making corrections themselves. Experimental results on three datasets with various LLMs, demonstrate that our method significantly improves performance in both in-domain and cross-domain settings on the English and Chinese datasets.

Title: A Physics-guided Multimodal Transformer Path to Weather and Climate Sciences

Authors: Jing Han, Hanting Chen, Kai Han, Xiaomeng Huang, Yongyun Hu, Wenjun Xu, Dacheng Tao, Ping Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14174
Pdf URL: https://arxiv.org/pdf/2504.14174
Copy Paste: [[2504.14174]] A Physics-guided Multimodal Transformer Path to Weather and Climate Sciences(https://arxiv.org/abs/2504.14174)
Keywords: interpretability, transformer
Abstract: With the rapid development of machine learning in recent years, many problems in meteorology can now be addressed using AI models. In particular, data-driven algorithms have significantly improved accuracy compared to traditional methods. Meteorological data is often transformed into 2D images or 3D videos, which are then fed into AI models for learning. Additionally, these models often incorporate physical signals, such as temperature, pressure, and wind speed, to further enhance accuracy and interpretability. In this paper, we review several representative AI + Weather/Climate algorithms and propose a new paradigm where observational data from different perspectives, each with distinct physical meanings, are treated as multimodal data and integrated via transformers. Furthermore, key weather and climate knowledge can be incorporated through regularization techniques to further strengthen the model's capabilities. This new paradigm is versatile and can address a variety of tasks, offering strong generalizability. We also discuss future directions for improving model accuracy and interpretability.

Title: Hypothetical Documents or Knowledge Leakage? Rethinking LLM-based Query Expansion

Authors: Yejun Yoon, Jaeyoon Jung, Seunghyun Yoon, Kunwoo Park
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2504.14175
Pdf URL: https://arxiv.org/pdf/2504.14175
Copy Paste: [[2504.14175]] Hypothetical Documents or Knowledge Leakage? Rethinking LLM-based Query Expansion(https://arxiv.org/abs/2504.14175)
Keywords: large language model
Abstract: Query expansion methods powered by large language models (LLMs) have demonstrated effectiveness in zero-shot retrieval tasks. These methods assume that LLMs can generate hypothetical documents that, when incorporated into a query vector, enhance the retrieval of real evidence. However, we challenge this assumption by investigating whether knowledge leakage in benchmarks contributes to the observed performance gains. Using fact verification as a testbed, we analyzed whether the generated documents contained information entailed by ground truth evidence and assessed their impact on performance. Our findings indicate that performance improvements occurred consistently only for claims whose generated documents included sentences entailed by ground truth evidence. This suggests that knowledge leakage may be present in these benchmarks, inflating the perceived performance of LLM-based query expansion methods, particularly in real-world scenarios that require retrieving niche or novel knowledge.

Title: Segregation and Context Aggregation Network for Real-time Cloud Segmentation

Authors: Yijie Li, Hewei Wang, Jiayi Zhang, Jinjiang You, Jinfeng Xu, Puzhen Wu, Yunzhong Xiao, Soumyabrata Dev
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14178
Pdf URL: https://arxiv.org/pdf/2504.14178
Copy Paste: [[2504.14178]] Segregation and Context Aggregation Network for Real-time Cloud Segmentation(https://arxiv.org/abs/2504.14178)
Keywords: segmentation
Abstract: Cloud segmentation from intensity images is a pivotal task in atmospheric science and computer vision, aiding weather forecasting and climate analysis. Ground-based sky/cloud segmentation extracts clouds from images for further feature analysis. Existing methods struggle to balance segmentation accuracy and computational efficiency, limiting real-world deployment on edge devices, so we introduce SCANet, a novel lightweight cloud segmentation model featuring Segregation and Context Aggregation Module (SCAM), which refines rough segmentation maps into weighted sky and cloud features processed separately. SCANet achieves state-of-the-art performance while drastically reducing computational complexity. SCANet-large (4.29M) achieves comparable accuracy to state-of-the-art methods with 70.9% fewer parameters. Meanwhile, SCANet-lite (90K) delivers 1390 fps in FP16, surpassing real-time standards. Additionally, we propose an efficient pre-training strategy that enhances performance even without ImageNet pre-training.

Title: FedC4: Graph Condensation Meets Client-Client Collaboration for Efficient and Private Federated Graph Learning

Authors: Zekai Chen, Xunkai Li, Yinlin Zhu, Rong-Hua Li, Guoren Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.14188
Pdf URL: https://arxiv.org/pdf/2504.14188
Copy Paste: [[2504.14188]] FedC4: Graph Condensation Meets Client-Client Collaboration for Efficient and Private Federated Graph Learning(https://arxiv.org/abs/2504.14188)
Keywords: privacy, federate
Abstract: Federated Graph Learning (FGL) is an emerging distributed learning paradigm that enables collaborative model training over decentralized graph-structured data while preserving local privacy. Existing FGL methods can be categorized into two optimization architectures: (1) the Server-Client (S-C) paradigm, where clients upload local models for server-side aggregation; and (2) the Client-Client (C-C) paradigm, which allows direct information exchange among clients to support personalized training. Compared to S-C, the C-C architecture better captures global graph knowledge and enables fine-grained optimization through customized peer-to-peer communication. However, current C-C methods often broadcast identical and redundant node embeddings, incurring high communication costs and privacy risks. To address this, we propose FedC4, a novel framework that combines graph Condensation with Client-Client Collaboration. Instead of transmitting raw node-level features, FedC4 distills each client's private graph into a compact set of synthetic node embeddings, reducing communication overhead and enhancing privacy. In addition, FedC4 introduces three modules that allow source clients to send distinct node representations tailored to target clients'graph structures, enabling personalized optimization with global guidance. Extensive experiments on eight real-world datasets show that FedC4 outperforms state-of-the-art baselines in both performance and communication efficiency.

Title: Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models

Authors: Xinlin Zhuang, Jiahui Peng, Ren Ma, Yinfan Wang, Tianyi Bai, Xingjian Wei, Jiantao Qiu, Chi Zhang, Ying Qian, Conghui He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14194
Pdf URL: https://arxiv.org/pdf/2504.14194
Copy Paste: [[2504.14194]] Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models(https://arxiv.org/abs/2504.14194)
Keywords: large language model
Abstract: The composition of pre-training datasets for large language models (LLMs) remains largely undisclosed, hindering transparency and efforts to optimize data quality, a critical driver of model performance. Current data selection methods, such as natural language quality assessments, diversity-based filters, and classifier-based approaches, are limited by single-dimensional evaluation or redundancy-focused strategies. To address these gaps, we propose PRRC to evaluate data quality across Professionalism, Readability, Reasoning, and Cleanliness. We further introduce Meta-rater, a multi-dimensional data selection method that integrates these dimensions with existing quality metrics through learned optimal weightings. Meta-rater employs proxy models to train a regression model that predicts validation loss, enabling the identification of optimal combinations of quality scores. Experiments demonstrate that Meta-rater doubles convergence speed for 1.3B parameter models and improves downstream task performance by 3.23, with scalable benefits observed in 3.3B models trained on 100B tokens. Additionally, we release the annotated SlimPajama-627B dataset, labeled across 25 quality metrics (including PRRC), to advance research in data-centric LLM development. Our work establishes that holistic, multi-dimensional quality integration significantly outperforms conventional single-dimension approaches, offering a scalable paradigm for enhancing pre-training efficiency and model capability.

Title: Learning Joint ID-Textual Representation for ID-Preserving Image Synthesis

Authors: Zichuan Liu, Liming Jiang, Qing Yan, Yumin Jia, Hao Kang, Xin Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14202
Pdf URL: https://arxiv.org/pdf/2504.14202
Copy Paste: [[2504.14202]] Learning Joint ID-Textual Representation for ID-Preserving Image Synthesis(https://arxiv.org/abs/2504.14202)
Keywords: diffusion
Abstract: We propose a novel framework for ID-preserving generation using a multi-modal encoding strategy rather than injecting identity features via adapters into pre-trained models. Our method treats identity and text as a unified conditioning input. To achieve this, we introduce FaceCLIP, a multi-modal encoder that learns a joint embedding space for both identity and textual semantics. Given a reference face and a text prompt, FaceCLIP produces a unified representation that encodes both identity and text, which conditions a base diffusion model to generate images that are identity-consistent and text-aligned. We also present a multi-modal alignment algorithm to train FaceCLIP, using a loss that aligns its joint representation with face, text, and image embedding spaces. We then build FaceCLIP-SDXL, an ID-preserving image synthesis pipeline by integrating FaceCLIP with Stable Diffusion XL (SDXL). Compared to prior methods, FaceCLIP-SDXL enables photorealistic portrait generation with better identity preservation and textual relevance. Extensive experiments demonstrate its quantitative and qualitative superiority.

Title: DConAD: A Differencing-based Contrastive Representation Learning Framework for Time Series Anomaly Detection

Authors: Wenxin Zhang, Xiaojian Lin, Wenjun Yu, Guangzhen Yao, jingxiang Zhong, Yu Li, Renda Han, Songcheng Xu, Hao Shi, Cuicui Luo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14204
Pdf URL: https://arxiv.org/pdf/2504.14204
Copy Paste: [[2504.14204]] DConAD: A Differencing-based Contrastive Representation Learning Framework for Time Series Anomaly Detection(https://arxiv.org/abs/2504.14204)
Keywords: robust, transformer
Abstract: Time series anomaly detection holds notable importance for risk identification and fault detection across diverse application domains. Unsupervised learning methods have become popular because they have no requirement for labels. However, due to the challenges posed by the multiplicity of abnormal patterns, the sparsity of anomalies, and the growth of data scale and complexity, these methods often fail to capture robust and representative dependencies within the time series for identifying anomalies. To enhance the ability of models to capture normal patterns of time series and avoid the retrogression of modeling ability triggered by the dependencies on high-quality prior knowledge, we propose a differencing-based contrastive representation learning framework for time series anomaly detection (DConAD). Specifically, DConAD generates differential data to provide additional information about time series and utilizes transformer-based architecture to capture spatiotemporal dependencies, which enhances the robustness of unbiased representation learning ability. Furthermore, DConAD implements a novel KL divergence-based contrastive learning paradigm that only uses positive samples to avoid deviation from reconstruction and deploys the stop-gradient strategy to compel convergence. Extensive experiments on five public datasets show the superiority and effectiveness of DConAD compared with nine baselines. The code is available at this https URL.

Title: Decomposition-based multi-scale transformer framework for time series anomaly detection

Authors: Wenxin Zhang, Cuicui Luo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14206
Pdf URL: https://arxiv.org/pdf/2504.14206
Copy Paste: [[2504.14206]] Decomposition-based multi-scale transformer framework for time series anomaly detection(https://arxiv.org/abs/2504.14206)
Keywords: transformer
Abstract: Time series anomaly detection is crucial for maintaining stable systems. Existing methods face two main challenges. First, it is difficult to directly model the dependencies of diverse and complex patterns within the sequences. Second, many methods that optimize parameters using mean squared error struggle with noise in the time series, leading to performance deterioration. To address these challenges, we propose a transformer-based framework built on decomposition (TransDe) for multivariate time series anomaly detection. The key idea is to combine the strengths of time series decomposition and transformers to effectively learn the complex patterns in normal time series data. A multi-scale patch-based transformer architecture is proposed to exploit the representative dependencies of each decomposed component of the time series. Furthermore, a contrastive learn paradigm based on patch operation is proposed, which leverages KL divergence to align the positive pairs, namely the pure representations of normal patterns between different patch-level views. A novel asynchronous loss function with a stop-gradient strategy is further introduced to enhance the performance of TransDe effectively. It can avoid time-consuming and labor-intensive computation costs in the optimization process. Extensive experiments on five public datasets are conducted and TransDe shows superiority compared with twelve baselines in terms of F1 score. Our code is available at this https URL.

Title: Bias Analysis and Mitigation through Protected Attribute Detection and Regard Classification

Authors: Takuma Udagawa, Yang Zhao, Hiroshi Kanayama, Bishwaranjan Bhattacharjee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14212
Pdf URL: https://arxiv.org/pdf/2504.14212
Copy Paste: [[2504.14212]] Bias Analysis and Mitigation through Protected Attribute Detection and Regard Classification(https://arxiv.org/abs/2504.14212)
Keywords: protect, large language model
Abstract: Large language models (LLMs) acquire general linguistic knowledge from massive-scale pretraining. However, pretraining data mainly comprised of web-crawled texts contain undesirable social biases which can be perpetuated or even amplified by LLMs. In this study, we propose an efficient yet effective annotation pipeline to investigate social biases in the pretraining corpora. Our pipeline consists of protected attribute detection to identify diverse demographics, followed by regard classification to analyze the language polarity towards each attribute. Through our experiments, we demonstrate the effect of our bias analysis and mitigation measures, focusing on Common Crawl as the most representative pretraining corpus.

Title: Understanding the Repeat Curse in Large Language Models from a Feature Perspective

Authors: Junchi Yao, Shu Yang, Jianhua Xu, Lijie Hu, Mengdi Li, Di Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14218
Pdf URL: https://arxiv.org/pdf/2504.14218
Copy Paste: [[2504.14218]] Understanding the Repeat Curse in Large Language Models from a Feature Perspective(https://arxiv.org/abs/2504.14218)
Keywords: extraction, interpretability, large language model
Abstract: Large language models (LLMs) have made remarkable progress in various domains, yet they often suffer from repetitive text generation, a phenomenon we refer to as the "Repeat Curse". While previous studies have proposed decoding strategies to mitigate repetition, the underlying mechanism behind this issue remains insufficiently explored. In this work, we investigate the root causes of repetition in LLMs through the lens of mechanistic interpretability. Inspired by recent advances in Sparse Autoencoders (SAEs), which enable monosemantic feature extraction, we propose a novel approach, "Duplicatus Charm", to induce and analyze the Repeat Curse. Our method systematically identifies "Repetition Features" -the key model activations responsible for generating repetitive outputs. First, we locate the layers most involved in repetition through logit analysis. Next, we extract and stimulate relevant features using SAE-based activation manipulation. To validate our approach, we construct a repetition dataset covering token and paragraph level repetitions and introduce an evaluation pipeline to quantify the influence of identified repetition features. Furthermore, by deactivating these features, we have effectively mitigated the Repeat Curse.

Title: From Cyber Security Incident Management to Cyber Security Crisis Management in the European Union

Authors: Jukka Ruohonen, Kalle Rindell, Simone Busetti
Subjects: cs.CR, cs.CY
Abstract URL: https://arxiv.org/abs/2504.14220
Pdf URL: https://arxiv.org/pdf/2504.14220
Copy Paste: [[2504.14220]] From Cyber Security Incident Management to Cyber Security Crisis Management in the European Union(https://arxiv.org/abs/2504.14220)
Keywords: security
Abstract: Incident management is a classical topic in cyber security. Recently, the European Union (EU) has started to consider also the relation between cyber security incidents and cyber security crises. These considerations and preparations, including those specified in the EU's new cyber security laws, constitute the paper's topic. According to an analysis of the laws and associated policy documents, (i) cyber security crises are equated in the EU to large-scale cyber security incidents that either exceed a handling capacity of a single member state or affect at least two member states. For this and other purposes, (ii) the new laws substantially increase mandatory reporting about cyber security incidents, including but not limited to the large-scale incidents. Despite the laws and new governance bodies established by them, however, (iii) the working of actual cyber security crisis management remains unclear particularly at the EU-level. With these policy research results, the paper advances the domain of cyber security incident management research by elaborating how European law perceives cyber security crises and their relation to cyber security incidents, paving the way for many relevant further research topics with practical relevance, whether theoretical, conceptual, or empirical.

Title: Real-IAD D3: A Real-World 2D/Pseudo-3D/3D Dataset for Industrial Anomaly Detection

Authors: Wenbing Zhu, Lidong Wang, Ziqing Zhou, Chengjie Wang, Yurui Pan, Ruoyi Zhang, Zhuhao Chen, Linjie Cheng, Bin-Bin Gao, Jiangning Zhang, Zhenye Gan, Yuxie Wang, Yulong Chen, Shuguang Qian, Mingmin Chi, Bo Peng, Lizhuang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14221
Pdf URL: https://arxiv.org/pdf/2504.14221
Copy Paste: [[2504.14221]] Real-IAD D3: A Real-World 2D/Pseudo-3D/3D Dataset for Industrial Anomaly Detection(https://arxiv.org/abs/2504.14221)
Keywords: robust
Abstract: The increasing complexity of industrial anomaly detection (IAD) has positioned multimodal detection methods as a focal area of machine vision research. However, dedicated multimodal datasets specifically tailored for IAD remain limited. Pioneering datasets like MVTec 3D have laid essential groundwork in multimodal IAD by incorporating RGB+3D data, but still face challenges in bridging the gap with real industrial environments due to limitations in scale and resolution. To address these challenges, we introduce Real-IAD D3, a high-precision multimodal dataset that uniquely incorporates an additional pseudo3D modality generated through photometric stereo, alongside high-resolution RGB images and micrometer-level 3D point clouds. Real-IAD D3 features finer defects, diverse anomalies, and greater scale across 20 categories, providing a challenging benchmark for multimodal IAD Additionally, we introduce an effective approach that integrates RGB, point cloud, and pseudo-3D depth information to leverage the complementary strengths of each modality, enhancing detection performance. Our experiments highlight the importance of these modalities in boosting detection robustness and overall IAD performance. The dataset and code are publicly accessible for research purposes at this https URL D3

Title: SimplifyMyText: An LLM-Based System for Inclusive Plain Language Text Simplification

Authors: Michael Färber, Parisa Aghdam, Kyuri Im, Mario Tawfelis, Hardik Ghoshal
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2504.14223
Pdf URL: https://arxiv.org/pdf/2504.14223
Copy Paste: [[2504.14223]] SimplifyMyText: An LLM-Based System for Inclusive Plain Language Text Simplification(https://arxiv.org/abs/2504.14223)
Keywords: large language model
Abstract: Text simplification is essential for making complex content accessible to diverse audiences who face comprehension challenges. Yet, the limited availability of simplified materials creates significant barriers to personal and professional growth and hinders social inclusion. Although researchers have explored various methods for automatic text simplification, none fully leverage large language models (LLMs) to offer tailored customization for different target groups and varying levels of simplicity. Moreover, despite its proven benefits for both consumers and organizations, the well-established practice of plain language remains underutilized. In this paper, we this https URL, the first system designed to produce plain language content from multiple input formats, including typed text and file uploads, with flexible customization options for diverse audiences. We employ GPT-4 and Llama-3 and evaluate outputs across multiple metrics. Overall, our work contributes to research on automatic text simplification and highlights the importance of tailored communication in promoting inclusivity.

Title: Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale

Authors: Bowen Jiang, Zhuoqun Hao, Young-Min Cho, Bryan Li, Yuan Yuan, Sihao Chen, Lyle Ungar, Camillo J. Taylor, Dan Roth
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14225
Pdf URL: https://arxiv.org/pdf/2504.14225
Copy Paste: [[2504.14225]] Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale(https://arxiv.org/abs/2504.14225)
Keywords: large language model
Abstract: Large Language Models (LLMs) have emerged as personalized assistants for users across a wide range of tasks -- from offering writing support to delivering tailored recommendations or consultations. Over time, the interaction history between a user and an LLM can provide extensive information about an individual's traits and preferences. However, open questions remain on how well LLMs today can effectively leverage such history to (1) internalize the user's inherent traits and preferences, (2) track how the user profiling and preferences evolve over time, and (3) generate personalized responses accordingly in new scenarios. In this work, we introduce the PERSONAMEM benchmark. PERSONAMEM features curated user profiles with over 180 simulated user-LLM interaction histories, each containing up to 60 sessions of multi-turn conversations across 15 real-world tasks that require personalization. Given an in-situ user query, i.e. query issued by the user from the first-person perspective, we evaluate LLM chatbots' ability to identify the most suitable response according to the current state of the user's profile. We observe that current LLMs still struggle to recognize the dynamic evolution in users' profiles over time through direct prompting approaches. As a consequence, LLMs often fail to deliver responses that align with users' current situations and preferences, with frontier models such as GPT-4.1, o4-mini, GPT-4.5, o1, or Gemini-2.0 achieving only around 50% overall accuracy, suggesting room for improvement. We hope that PERSONAMEM, along with the user profile and conversation simulation pipeline, can facilitate future research in the development of truly user-aware chatbots. Code and data are available at this http URL.

Title: Exploring Modality Guidance to Enhance VFM-based Feature Fusion for UDA in 3D Semantic Segmentation

Authors: Johannes Spoecklberger, Wei Lin, Pedro Hermosilla, Sivan Doveh, Horst Possegger, M. Jehanzeb Mirza
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14231
Pdf URL: https://arxiv.org/pdf/2504.14231
Copy Paste: [[2504.14231]] Exploring Modality Guidance to Enhance VFM-based Feature Fusion for UDA in 3D Semantic Segmentation(https://arxiv.org/abs/2504.14231)
Keywords: robust, segmentation
Abstract: Vision Foundation Models (VFMs) have become a de facto choice for many downstream vision tasks, like image classification, image segmentation, and object localization. However, they can also provide significant utility for downstream 3D tasks that can leverage the cross-modal information (e.g., from paired image data). In our work, we further explore the utility of VFMs for adapting from a labeled source to unlabeled target data for the task of LiDAR-based 3D semantic segmentation. Our method consumes paired 2D-3D (image and point cloud) data and relies on the robust (cross-domain) features from a VFM to train a 3D backbone on a mix of labeled source and unlabeled target data. At the heart of our method lies a fusion network that is guided by both the image and point cloud streams, with their relative contributions adjusted based on the target domain. We extensively compare our proposed methodology with different state-of-the-art methods in several settings and achieve strong performance gains. For example, achieving an average improvement of 6.5 mIoU (over all tasks), when compared with the previous state-of-the-art.

Title: A Novel Frequency-Spatial Domain Aware Network for Fast Thermal Prediction in 2.5D ICs

Authors: Dekang Zhang, Dan Niu, Zhou Jin, Yichao Dong, Jingweijia Tan, Changyin Sun
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.14237
Pdf URL: https://arxiv.org/pdf/2504.14237
Copy Paste: [[2504.14237]] A Novel Frequency-Spatial Domain Aware Network for Fast Thermal Prediction in 2.5D ICs(https://arxiv.org/abs/2504.14237)
Keywords: robust, extraction
Abstract: In the post-Moore era, 2.5D chiplet-based ICs present significant challenges in thermal management due to increased power density and thermal hotspots. Neural network-based thermal prediction models can perform real-time predictions for many unseen new designs. However, existing CNN-based and GCN-based methods cannot effectively capture the global thermal features, especially for high-frequency components, hindering prediction accuracy enhancement. In this paper, we propose a novel frequency-spatial dual domain aware prediction network (FSA-Heat) for fast and high-accuracy thermal prediction in 2.5D ICs. It integrates high-to-low frequency and spatial domain encoder (FSTE) module with frequency domain cross-scale interaction module (FCIFormer) to achieve high-to-low frequency and global-to-local thermal dissipation feature extraction. Additionally, a frequency-spatial hybrid loss (FSL) is designed to effectively attenuate high-frequency thermal gradient noise and spatial misalignments. The experimental results show that the performance enhancements offered by our proposed method are substantial, outperforming the newly-proposed 2.5D method, GCN+PNA, by considerable margins (over 99% RMSE reduction, 4.23X inference time speedup). Moreover, extensive experiments demonstrate that FSA-Heat also exhibits robust generalization capabilities.

Title: Single Document Image Highlight Removal via A Large-Scale Real-World Dataset and A Location-Aware Network

Authors: Lu Pan, Yu-Hsuan Huang, Hongxia Xie, Cheng Zhang, Hongwei Zhao, Hong-Han Shuai, Wen-Huang Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14238
Pdf URL: https://arxiv.org/pdf/2504.14238
Copy Paste: [[2504.14238]] Single Document Image Highlight Removal via A Large-Scale Real-World Dataset and A Location-Aware Network(https://arxiv.org/abs/2504.14238)
Keywords: diffusion
Abstract: Reflective documents often suffer from specular highlights under ambient lighting, severely hindering text readability and degrading overall visual quality. Although recent deep learning methods show promise in highlight removal, they remain suboptimal for document images, primarily due to the lack of dedicated datasets and tailored architectural designs. To tackle these challenges, we present DocHR14K, a large-scale real-world dataset comprising 14,902 high-resolution image pairs across six document categories and various lighting conditions. To the best of our knowledge, this is the first high-resolution dataset for document highlight removal that captures a wide range of real-world lighting conditions. Additionally, motivated by the observation that the residual map between highlighted and clean images naturally reveals the spatial structure of highlight regions, we propose a simple yet effective Highlight Location Prior (HLP) to estimate highlight masks without human annotations. Building on this prior, we present the Location-Aware Laplacian Pyramid Highlight Removal Network (L2HRNet), which effectively removes highlights by leveraging estimated priors and incorporates diffusion module to restore details. Extensive experiments demonstrate that DocHR14K improves highlight removal under diverse lighting conditions. Our L2HRNet achieves state-of-the-art performance across three benchmark datasets, including a 5.01\% increase in PSNR and a 13.17\% reduction in RMSE on DocHR14K.

Title: Towards Explainable Fake Image Detection with Multi-Modal Large Language Models

Authors: Yikun Ji, Yan Hong, Jiahui Zhan, Haoxing Chen, jun lan, Huijia Zhu, Weiqiang Wang, Liqing Zhang, Jianfu Zhang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2504.14245
Pdf URL: https://arxiv.org/pdf/2504.14245
Copy Paste: [[2504.14245]] Towards Explainable Fake Image Detection with Multi-Modal Large Language Models(https://arxiv.org/abs/2504.14245)
Keywords: security, robust, large language model
Abstract: Progress in image generation raises significant public security concerns. We argue that fake image detection should not operate as a "black box". Instead, an ideal approach must ensure both strong generalization and transparency. Recent progress in Multi-modal Large Language Models (MLLMs) offers new opportunities for reasoning-based AI-generated image detection. In this work, we evaluate the capabilities of MLLMs in comparison to traditional detection methods and human evaluators, highlighting their strengths and limitations. Furthermore, we design six distinct prompts and propose a framework that integrates these prompts to develop a more robust, explainable, and reasoning-driven detection system. The code is available at this https URL.

Title: Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation

Authors: Bin Ren, Eduard Zamfir, Zongwei Wu, Yawei Li, Yidi Li, Danda Pani Paudel, Radu Timofte, Ming-Hsuan Yang, Luc Van Gool, Nicu Sebe
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14249
Pdf URL: https://arxiv.org/pdf/2504.14249
Copy Paste: [[2504.14249]] Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation(https://arxiv.org/abs/2504.14249)
Keywords: large language model
Abstract: Restoring any degraded image efficiently via just one model has become increasingly significant and impactful, especially with the proliferation of mobile devices. Traditional solutions typically involve training dedicated models per degradation, resulting in inefficiency and redundancy. More recent approaches either introduce additional modules to learn visual prompts, significantly increasing model size, or incorporate cross-modal transfer from large language models trained on vast datasets, adding complexity to the system architecture. In contrast, our approach, termed AnyIR, takes a unified path that leverages inherent similarity across various degradations to enable both efficient and comprehensive restoration through a joint embedding mechanism, without scaling up the model or relying on large language this http URL, we examine the sub-latent space of each input, identifying key components and reweighting them first in a gated manner. To fuse the intrinsic degradation awareness and the contextualized attention, a spatial-frequency parallel fusion strategy is proposed for enhancing spatial-aware local-global interactions and enriching the restoration details from the frequency perspective. Extensive benchmarking in the all-in-one restoration setting confirms AnyIR's SOTA performance, reducing model complexity by around 82\% in parameters and 85\% in FLOPs. Our code will be available at our Project page (this https URL)

Title: ColorVein: Colorful Cancelable Vein Biometrics

Authors: Yifan Wang, Jie Gui, Xinli Shi, Linqing Gui, Yuan Yan Tang, James Tin-Yau Kwok
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14253
Pdf URL: https://arxiv.org/pdf/2504.14253
Copy Paste: [[2504.14253]] ColorVein: Colorful Cancelable Vein Biometrics(https://arxiv.org/abs/2504.14253)
Keywords: secure, security, privacy, protect, biometric, extraction
Abstract: Vein recognition technologies have become one of the primary solutions for high-security identification systems. However, the issue of biometric information leakage can still pose a serious threat to user privacy and anonymity. Currently, there is no cancelable biometric template generation scheme specifically designed for vein biometrics. Therefore, this paper proposes an innovative cancelable vein biometric generation scheme: ColorVein. Unlike previous cancelable template generation schemes, ColorVein does not destroy the original biometric features and introduces additional color information to grayscale vein images. This method significantly enhances the information density of vein images by transforming static grayscale information into dynamically controllable color representations through interactive colorization. ColorVein allows users/administrators to define a controllable pseudo-random color space for grayscale vein images by editing the position, number, and color of hint points, thereby generating protected cancelable templates. Additionally, we propose a new secure center loss to optimize the training process of the protected feature extraction model, effectively increasing the feature distance between enrolled users and any potential impostors. Finally, we evaluate ColorVein's performance on all types of vein biometrics, including recognition performance, unlinkability, irreversibility, and revocability, and conduct security and privacy analyses. ColorVein achieves competitive performance compared with state-of-the-art methods.

Title: Visual Consensus Prompting for Co-Salient Object Detection

Authors: Jie Wang, Nana Yu, Zihao Zhang, Yahong Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14254
Pdf URL: https://arxiv.org/pdf/2504.14254
Copy Paste: [[2504.14254]] Visual Consensus Prompting for Co-Salient Object Detection(https://arxiv.org/abs/2504.14254)
Keywords: extraction
Abstract: Existing co-salient object detection (CoSOD) methods generally employ a three-stage architecture (i.e., encoding, consensus extraction & dispersion, and prediction) along with a typical full fine-tuning paradigm. Although they yield certain benefits, they exhibit two notable limitations: 1) This architecture relies on encoded features to facilitate consensus extraction, but the meticulously extracted consensus does not provide timely guidance to the encoding stage. 2) This paradigm involves globally updating all parameters of the model, which is parameter-inefficient and hinders the effective representation of knowledge within the foundation model for this task. Therefore, in this paper, we propose an interaction-effective and parameter-efficient concise architecture for the CoSOD task, addressing two key limitations. It introduces, for the first time, a parameter-efficient prompt tuning paradigm and seamlessly embeds consensus into the prompts to formulate task-specific Visual Consensus Prompts (VCP). Our VCP aims to induce the frozen foundation model to perform better on CoSOD tasks by formulating task-specific visual consensus prompts with minimized tunable parameters. Concretely, the primary insight of the purposeful Consensus Prompt Generator (CPG) is to enforce limited tunable parameters to focus on co-salient representations and generate consensus prompts. The formulated Consensus Prompt Disperser (CPD) leverages consensus prompts to form task-specific visual consensus prompts, thereby arousing the powerful potential of pre-trained models in addressing CoSOD tasks. Extensive experiments demonstrate that our concise VCP outperforms 13 cutting-edge full fine-tuning models, achieving the new state of the art (with 6.8% improvement in F_m metrics on the most challenging CoCA dataset). Source code has been available at this https URL.

Title: Cross-attention for State-based model RWKV-7

Authors: Liu Xiao, Li Zhiyuan, Lin Yueyu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2504.14260
Pdf URL: https://arxiv.org/pdf/2504.14260
Copy Paste: [[2504.14260]] Cross-attention for State-based model RWKV-7(https://arxiv.org/abs/2504.14260)
Keywords: robust, diffusion, transformer
Abstract: We introduce CrossWKV, a novel cross-attention mechanism for the state-based RWKV-7 model, designed to enhance the expressive power of text-to-image generation. Leveraging RWKV-7's linear-complexity Weighted Key-Value (WKV) architecture, CrossWKV integrates text and image modalities in a single pass, utilizing a generalized delta rule with vector-valued gating and low-rank adaptations (LoRA) to achieve superior cross-modal alignment. Unlike Transformer-based models, CrossWKV's non-diagonal, input-dependent transition matrix enables it to represent complex functions beyond the $\mathrm{TC}^0$ complexity class, including all regular languages, as demonstrated by its ability to perform state-tracking tasks like $S_5$ permutation modeling. Evaluated within the Diffusion in RWKV-7 (DIR-7) on datasets such as LAION-5B and ImageNet, CrossWKV achieves a Frechet Inception Distance (FID) of 2.88 and a CLIP score of 0.33 on ImageNet 256x256, matching state-of-the-art performance while offering robust generalization across diverse prompts. The model's enhanced expressivity, combined with constant memory usage and linear scaling, positions it as a powerful solution for advanced cross-modal tasks, with potential applications in high-resolution generation and dynamic state this http URL at this https URL

Title: Generative emulation of chaotic dynamics with coherent prior

Authors: Juan Nathaniel, Pierre Gentine
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2504.14264
Pdf URL: https://arxiv.org/pdf/2504.14264
Copy Paste: [[2504.14264]] Generative emulation of chaotic dynamics with coherent prior(https://arxiv.org/abs/2504.14264)
Keywords: diffusion, generative
Abstract: Data-driven emulation of nonlinear dynamics is challenging due to long-range skill decay that often produces physically unrealistic outputs. Recent advances in generative modeling aim to address these issues by providing uncertainty quantification and correction. However, the quality of generated simulation remains heavily dependent on the choice of conditioning priors. In this work, we present an efficient generative framework for dynamics emulation, unifying principles of turbulence with diffusion-based modeling: Cohesion. Specifically, our method estimates large-scale coherent structure of the underlying dynamics as guidance during the denoising process, where small-scale fluctuation in the flow is then resolved. These coherent priors are efficiently approximated using reduced-order models, such as deep Koopman operators, that allow for rapid generation of long prior sequences while maintaining stability over extended forecasting horizon. With this gain, we can reframe forecasting as trajectory planning, a common task in reinforcement learning, where conditional denoising is performed once over entire sequences, minimizing the computational cost of autoregressive-based generative methods. Empirical evaluations on chaotic systems of increasing complexity, including Kolmogorov flow, shallow water equations, and subseasonal-to-seasonal climate dynamics, demonstrate Cohesion superior long-range forecasting skill that can efficiently generate physically-consistent simulations, even in the presence of partially-observed guidance.

Title: Text-Audio-Visual-conditioned Diffusion Model for Video Saliency Prediction

Authors: Li Yu, Xuanzhe Sun, Wei Zhou, Moncef Gabbouj
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14267
Pdf URL: https://arxiv.org/pdf/2504.14267
Copy Paste: [[2504.14267]] Text-Audio-Visual-conditioned Diffusion Model for Video Saliency Prediction(https://arxiv.org/abs/2504.14267)
Keywords: diffusion, transformer
Abstract: Video saliency prediction is crucial for downstream applications, such as video compression and human-computer interaction. With the flourishing of multimodal learning, researchers started to explore multimodal video saliency prediction, including audio-visual and text-visual approaches. Auditory cues guide the gaze of viewers to sound sources, while textual cues provide semantic guidance for understanding video content. Integrating these complementary cues can improve the accuracy of saliency prediction. Therefore, we attempt to simultaneously analyze visual, auditory, and textual modalities in this paper, and propose TAVDiff, a Text-Audio-Visual-conditioned Diffusion Model for video saliency prediction. TAVDiff treats video saliency prediction as an image generation task conditioned on textual, audio, and visual inputs, and predicts saliency maps through stepwise denoising. To effectively utilize text, a large multimodal model is used to generate textual descriptions for video frames and introduce a saliency-oriented image-text response (SITR) mechanism to generate image-text response maps. It is used as conditional information to guide the model to localize the visual regions that are semantically related to the textual description. Regarding the auditory modality, it is used as another conditional information for directing the model to focus on salient regions indicated by sounds. At the same time, since the diffusion transformer (DiT) directly concatenates the conditional information with the timestep, which may affect the estimation of the noise level. To achieve effective conditional guidance, we propose Saliency-DiT, which decouples the conditional information from the timestep. Experimental results show that TAVDiff outperforms existing methods, improving 1.03\%, 2.35\%, 2.71\% and 0.33\% on SIM, CC, NSS and AUC-J metrics, respectively.

Title: Mixed-Precision Conjugate Gradient Solvers with RL-Driven Precision Tuning

Authors: Xinye Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.14268
Pdf URL: https://arxiv.org/pdf/2504.14268
Copy Paste: [[2504.14268]] Mixed-Precision Conjugate Gradient Solvers with RL-Driven Precision Tuning(https://arxiv.org/abs/2504.14268)
Keywords: robust
Abstract: This paper presents a novel reinforcement learning (RL) framework for dynamically optimizing numerical precision in the preconditioned conjugate gradient (CG) method. By modeling precision selection as a Markov Decision Process (MDP), we employ Q-learning to adaptively assign precision levels to key operations, striking an optimal balance between computational efficiency and numerical accuracy, while ensuring stability through double-precision scalar computations and residual computing. In practice, the algorithm is trained on a set of data and subsequently performs inference for precision selection on out-of-sample data, without requiring re-analysis or retraining for new datasets. This enables the method to adapt seamlessly to new problem instances without the computational overhead of recalibration. Our results demonstrate the effectiveness of RL in enhancing solver's performance, marking the first application of RL to mixed-precision numerical methods. The findings highlight the approach's practical advantages, robustness, and scalability, providing valuable insights into its integration with iterative solvers and paving the way for AI-driven advancements in scientific computing.

Title: RAMCT: Novel Region-adaptive Multi-channel Tracker with Iterative Tikhonov Regularization for Thermal Infrared Tracking

Authors: Shang Zhang, Yuke Hou, Guoqiang Gong, Ruoyan Xiong, Yue Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14278
Pdf URL: https://arxiv.org/pdf/2504.14278
Copy Paste: [[2504.14278]] RAMCT: Novel Region-adaptive Multi-channel Tracker with Iterative Tikhonov Regularization for Thermal Infrared Tracking(https://arxiv.org/abs/2504.14278)
Keywords: robust
Abstract: Correlation filter (CF)-based trackers have gained significant attention for their computational efficiency in thermal infrared (TIR) target tracking. However, ex-isting methods struggle with challenges such as low-resolution imagery, occlu-sion, background clutter, and target deformation, which severely impact tracking performance. To overcome these limitations, we propose RAMCT, a region-adaptive sparse correlation filter tracker that integrates multi-channel feature opti-mization with an adaptive regularization strategy. Firstly, we refine the CF learn-ing process by introducing a spatially adaptive binary mask, which enforces spar-sity in the target region while dynamically suppressing background interference. Secondly, we introduce generalized singular value decomposition (GSVD) and propose a novel GSVD-based region-adaptive iterative Tikhonov regularization method. This enables flexible and robust optimization across multiple feature channels, improving resilience to occlusion and background variations. Thirdly, we propose an online optimization strategy with dynamic discrepancy-based pa-rameter adjustment. This mechanism facilitates real time adaptation to target and background variations, thereby improving tracking accuracy and robustness. Ex-tensive experiments on LSOTB-TIR, PTB-TIR, VOT-TIR2015, and VOT-TIR2017 benchmarks demonstrate that RAMCT outperforms other state-of-the-art trackers in terms of accuracy and robustness.

Title: CLIP-Powered Domain Generalization and Domain Adaptation: A Comprehensive Survey

Authors: Jindong Li, Yongguang Li, Yali Fu, Jiahong Liu, Yixin Liu, Menglin Yang, Irwin King
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.14280
Pdf URL: https://arxiv.org/pdf/2504.14280
Copy Paste: [[2504.14280]] CLIP-Powered Domain Generalization and Domain Adaptation: A Comprehensive Survey(https://arxiv.org/abs/2504.14280)
Keywords: robust, extraction
Abstract: As machine learning evolves, domain generalization (DG) and domain adaptation (DA) have become crucial for enhancing model robustness across diverse environments. Contrastive Language-Image Pretraining (CLIP) plays a significant role in these tasks, offering powerful zero-shot capabilities that allow models to perform effectively in unseen domains. However, there remains a significant gap in the literature, as no comprehensive survey currently exists that systematically explores the applications of CLIP in DG and DA, highlighting the necessity for this review. This survey presents a comprehensive review of CLIP's applications in DG and DA. In DG, we categorize methods into optimizing prompt learning for task alignment and leveraging CLIP as a backbone for effective feature extraction, both enhancing model adaptability. For DA, we examine both source-available methods utilizing labeled source data and source-free approaches primarily based on target domain data, emphasizing knowledge transfer mechanisms and strategies for improved performance across diverse contexts. Key challenges, including overfitting, domain diversity, and computational efficiency, are addressed, alongside future research opportunities to advance robustness and efficiency in practical applications. By synthesizing existing literature and pinpointing critical gaps, this survey provides valuable insights for researchers and practitioners, proposing directions for effectively leveraging CLIP to enhance methodologies in domain generalization and adaptation. Ultimately, this work aims to foster innovation and collaboration in the quest for more resilient machine learning models that can perform reliably across diverse real-world scenarios. A more up-to-date version of the papers is maintained at: this https URL.

Title: SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM

Authors: Xiaojiang Zhang, Jinghui Wang, Zifei Cheng, Wenhao Zhuang, Zheng Lin, Minglei Zhang, Shaojie Wang, Yinghan Cui, Chao Wang, Junyi Peng, Shimiao Jiang, Shiqi Kuang, Shouyu Yin, Chaohang Wen, Haotian Zhang, Bin Chen, Bing Yu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.14286
Pdf URL: https://arxiv.org/pdf/2504.14286
Copy Paste: [[2504.14286]] SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM(https://arxiv.org/abs/2504.14286)
Keywords: large language model
Abstract: Recent advances of reasoning models, exemplified by OpenAI's o1 and DeepSeek's R1, highlight the significant potential of Reinforcement Learning (RL) to enhance the reasoning capabilities of Large Language Models (LLMs). However, replicating these advancements across diverse domains remains challenging due to limited methodological transparency. In this work, we present two-Staged history-Resampling Policy Optimization (SRPO), which successfully surpasses the performance of DeepSeek-R1-Zero-32B on the AIME24 and LiveCodeBench benchmarks. SRPO achieves this using the same base model as DeepSeek (i.e. Qwen2.5-32B) and relies solely on RL, without prior Supervised Fine-Tuning (SFT). Building upon Group Relative Policy Optimization (GRPO), we introduce two key methodological innovations: (1) a two-stage cross-domain training paradigm designed to balance the development of mathematical reasoning and coding proficiency, and (2) History Resampling (HR), a technique to address ineffective samples. Our comprehensive experiments validate the effectiveness of our approach, dedicating to offer valuable insights into scaling LLM reasoning capabilities across diverse tasks.

Title: Probing the Subtle Ideological Manipulation of Large Language Models

Authors: Demetris Paschalides, George Pallis, Marios D. Dikaiakos
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2504.14287
Pdf URL: https://arxiv.org/pdf/2504.14287
Copy Paste: [[2504.14287]] Probing the Subtle Ideological Manipulation of Large Language Models(https://arxiv.org/abs/2504.14287)
Keywords: robust, large language model
Abstract: Large Language Models (LLMs) have transformed natural language processing, but concerns have emerged about their susceptibility to ideological manipulation, particularly in politically sensitive areas. Prior work has focused on binary Left-Right LLM biases, using explicit prompts and fine-tuning on political QA datasets. In this work, we move beyond this binary approach to explore the extent to which LLMs can be influenced across a spectrum of political ideologies, from Progressive-Left to Conservative-Right. We introduce a novel multi-task dataset designed to reflect diverse ideological positions through tasks such as ideological QA, statement ranking, manifesto cloze completion, and Congress bill comprehension. By fine-tuning three LLMs-Phi-2, Mistral, and Llama-3-on this dataset, we evaluate their capacity to adopt and express these nuanced ideologies. Our findings indicate that fine-tuning significantly enhances nuanced ideological alignment, while explicit prompts provide only minor refinements. This highlights the models' susceptibility to subtle ideological manipulation, suggesting a need for more robust safeguards to mitigate these risks.

Title: From Missing Pieces to Masterpieces: Image Completion with Context-Adaptive Diffusion

Authors: Pourya Shamsolmoali, Masoumeh Zareapoor, Huiyu Zhou, Michael Felsberg, Dacheng Tao, Xuelong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14294
Pdf URL: https://arxiv.org/pdf/2504.14294
Copy Paste: [[2504.14294]] From Missing Pieces to Masterpieces: Image Completion with Context-Adaptive Diffusion(https://arxiv.org/abs/2504.14294)
Keywords: diffusion
Abstract: Image completion is a challenging task, particularly when ensuring that generated content seamlessly integrates with existing parts of an image. While recent diffusion models have shown promise, they often struggle with maintaining coherence between known and unknown (missing) regions. This issue arises from the lack of explicit spatial and semantic alignment during the diffusion process, resulting in content that does not smoothly integrate with the original image. Additionally, diffusion models typically rely on global learned distributions rather than localized features, leading to inconsistencies between the generated and existing image parts. In this work, we propose ConFill, a novel framework that introduces a Context-Adaptive Discrepancy (CAD) model to ensure that intermediate distributions of known and unknown regions are closely aligned throughout the diffusion process. By incorporating CAD, our model progressively reduces discrepancies between generated and original images at each diffusion step, leading to contextually aligned completion. Moreover, ConFill uses a new Dynamic Sampling mechanism that adaptively increases the sampling rate in regions with high reconstruction complexity. This approach enables precise adjustments, enhancing detail and integration in restored areas. Extensive experiments demonstrate that ConFill outperforms current methods, setting a new benchmark in image completion.

Title: Learning and Generating Diverse Residential Load Patterns Using GAN with Weakly-Supervised Training and Weight Selection

Authors: Xinyu Liang, Hao Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14300
Pdf URL: https://arxiv.org/pdf/2504.14300
Copy Paste: [[2504.14300]] Learning and Generating Diverse Residential Load Patterns Using GAN with Weakly-Supervised Training and Weight Selection(https://arxiv.org/abs/2504.14300)
Keywords: generative
Abstract: The scarcity of high-quality residential load data can pose obstacles for decarbonizing the residential sector as well as effective grid planning and operation. The above challenges have motivated research into generating synthetic load data, but existing methods faced limitations in terms of scalability, diversity, and similarity. This paper proposes a Generative Adversarial Network-based Synthetic Residential Load Pattern (RLP-GAN) generation model, a novel weakly-supervised GAN framework, leveraging an over-complete autoencoder to capture dependencies within complex and diverse load patterns and learn household-level data distribution at scale. We incorporate a model weight selection method to address the mode collapse problem and generate load patterns with high diversity. We develop a holistic evaluation method to validate the effectiveness of RLP-GAN using real-world data of 417 households. The results demonstrate that RLP-GAN outperforms state-of-the-art models in capturing temporal dependencies and generating load patterns with higher similarity to real data. Furthermore, we have publicly released the RLP-GAN generated synthetic dataset, which comprises one million synthetic residential load pattern profiles.

Title: Balancing Privacy and Action Performance: A Penalty-Driven Approach to Image Anonymization

Authors: Nazia Aslam, Kamal Nasrollahi
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2504.14301
Pdf URL: https://arxiv.org/pdf/2504.14301
Copy Paste: [[2504.14301]] Balancing Privacy and Action Performance: A Penalty-Driven Approach to Image Anonymization(https://arxiv.org/abs/2504.14301)
Keywords: privacy, protect
Abstract: The rapid development of video surveillance systems for object detection, tracking, activity recognition, and anomaly detection has revolutionized our day-to-day lives while setting alarms for privacy concerns. It isn't easy to strike a balance between visual privacy and action recognition performance in most computer vision models. Is it possible to safeguard privacy without sacrificing performance? It poses a formidable challenge, as even minor privacy enhancements can lead to substantial performance degradation. To address this challenge, we propose a privacy-preserving image anonymization technique that optimizes the anonymizer using penalties from the utility branch, ensuring improved action recognition performance while minimally affecting privacy leakage. This approach addresses the trade-off between minimizing privacy leakage and maintaining high action performance. The proposed approach is primarily designed to align with the regulatory standards of the EU AI Act and GDPR, ensuring the protection of personally identifiable information while maintaining action performance. To the best of our knowledge, we are the first to introduce a feature-based penalty scheme that exclusively controls the action features, allowing freedom to anonymize private attributes. Extensive experiments were conducted to validate the effectiveness of the proposed method. The results demonstrate that applying a penalty to anonymizer from utility branch enhances action performance while maintaining nearly consistent privacy leakage across different penalty settings.

Title: FGSGT: Saliency-Guided Siamese Network Tracker Based on Key Fine-Grained Feature Information for Thermal Infrared Target Tracking

Authors: Ruoyan Xiong, Huanbin Zhang, Shentao Wang, Hui He, Yuke Hou, Yue Zhang, Yujie Cui, Huipan Guan, Shang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14309
Pdf URL: https://arxiv.org/pdf/2504.14309
Copy Paste: [[2504.14309]] FGSGT: Saliency-Guided Siamese Network Tracker Based on Key Fine-Grained Feature Information for Thermal Infrared Target Tracking(https://arxiv.org/abs/2504.14309)
Keywords: extraction
Abstract: Thermal infrared (TIR) images typically lack detailed features and have low contrast, making it challenging for conventional feature extraction models to capture discriminative target characteristics. As a result, trackers are often affected by interference from visually similar objects and are susceptible to tracking drift. To address these challenges, we propose a novel saliency-guided Siamese network tracker based on key fine-grained feature infor-mation. First, we introduce a fine-grained feature parallel learning convolu-tional block with a dual-stream architecture and convolutional kernels of varying sizes. This design captures essential global features from shallow layers, enhances feature diversity, and minimizes the loss of fine-grained in-formation typically encountered in residual connections. In addition, we propose a multi-layer fine-grained feature fusion module that uses bilinear matrix multiplication to effectively integrate features across both deep and shallow layers. Next, we introduce a Siamese residual refinement block that corrects saliency map prediction errors using residual learning. Combined with deep supervision, this mechanism progressively refines predictions, ap-plying supervision at each recursive step to ensure consistent improvements in accuracy. Finally, we present a saliency loss function to constrain the sali-ency predictions, directing the network to focus on highly discriminative fi-ne-grained features. Extensive experiment results demonstrate that the pro-posed tracker achieves the highest precision and success rates on the PTB-TIR and LSOTB-TIR benchmarks. It also achieves a top accuracy of 0.78 on the VOT-TIR 2015 benchmark and 0.75 on the VOT-TIR 2017 benchmark.

Title: DCFG: Diverse Cross-Channel Fine-Grained Feature Learning and Progressive Fusion Siamese Tracker for Thermal Infrared Target Tracking

Authors: Ruoyan Xiong, Yuke Hou, Princess Retor Torboh, Hui He, Huanbin Zhang, Yue Zhang, Yanpin Wang, Huipan Guan, Shang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14311
Pdf URL: https://arxiv.org/pdf/2504.14311
Copy Paste: [[2504.14311]] DCFG: Diverse Cross-Channel Fine-Grained Feature Learning and Progressive Fusion Siamese Tracker for Thermal Infrared Target Tracking(https://arxiv.org/abs/2504.14311)
Keywords: extraction
Abstract: To address the challenge of capturing highly discriminative features in ther-mal infrared (TIR) tracking, we propose a novel Siamese tracker based on cross-channel fine-grained feature learning and progressive fusion. First, we introduce a cross-channel fine-grained feature learning network that employs masks and suppression coefficients to suppress dominant target features, en-abling the tracker to capture more detailed and subtle information. The net-work employs a channel rearrangement mechanism to enhance efficient in-formation flow, coupled with channel equalization to reduce parameter count. Additionally, we incorporate layer-by-layer combination units for ef-fective feature extraction and fusion, thereby minimizing parameter redun-dancy and computational complexity. The network further employs feature redirection and channel shuffling strategies to better integrate fine-grained details. Second, we propose a specialized cross-channel fine-grained loss function designed to guide feature groups toward distinct discriminative re-gions of the target, thus improving overall target representation. This loss function includes an inter-channel loss term that promotes orthogonality be-tween channels, maximizing feature diversity and facilitating finer detail capture. Extensive experiments demonstrate that our proposed tracker achieves the highest accuracy, scoring 0.81 on the VOT-TIR 2015 and 0.78 on the VOT-TIR 2017 benchmark, while also outperforming other methods across all evaluation metrics on the LSOTB-TIR and PTB-TIR benchmarks.

Title: ScaloWork: Useful Proof-of-Work with Distributed Pool Mining

Authors: Diptendu Chatterjee, Avishek Majumder, Subhra Mazumdar
Subjects: cs.CR, cs.ET
Abstract URL: https://arxiv.org/abs/2504.14328
Pdf URL: https://arxiv.org/pdf/2504.14328
Copy Paste: [[2504.14328]] ScaloWork: Useful Proof-of-Work with Distributed Pool Mining(https://arxiv.org/abs/2504.14328)
Keywords: secure, security, fair
Abstract: Bitcoin blockchain uses hash-based Proof-of-Work (PoW) that prevents unwanted participants from hogging the network resources. Anyone entering the mining game has to prove that they have expended a specific amount of computational power. However, the most popular Bitcoin blockchain consumes 175.87 TWh of electrical energy annually, and most of this energy is wasted on hash calculations, which serve no additional purpose. Several studies have explored re-purposing the wasted energy by replacing the hash function with meaningful computational problems that have practical applications. Minimum Dominating Set (MDS) in networks has numerous real-life applications. Building on this concept, Chrisimos [TrustCom '23] was proposed to replace hash-based PoW with the computation of a dominating set on real-life graph instances. However, Chrisimos has several drawbacks regarding efficiency and solution quality. This work presents a new framework for Useful PoW, ScaloWork, that decides the block proposer for the Bitcoin blockchain based on the solution for the dominating set problem. ScaloWork relies on the property of graph isomorphism and guarantees solution extractability. We also propose a distributed approach for calculating the dominating set, allowing miners to collaborate in a pool. This enables ScaloWork to handle larger graphs relevant to real-life applications, thereby enhancing scalability. Our framework also eliminates the problem of free-riders, ensuring fairness in the distribution of block rewards. We perform a detailed security analysis of our framework and prove our scheme as secure as hash-based PoW. We implement a prototype of our framework, and the results show that our system outperforms Chrisimos in all aspects.

Title: Visual Prompting for One-shot Controllable Video Editing without Inversion

Authors: Zhengbo Zhang, Yuxi Zhou, Duo Peng, Joo-Hwee Lim, Zhigang Tu, De Wen Soh, Lin Geng Foo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14335
Pdf URL: https://arxiv.org/pdf/2504.14335
Copy Paste: [[2504.14335]] Visual Prompting for One-shot Controllable Video Editing without Inversion(https://arxiv.org/abs/2504.14335)
Keywords: diffusion
Abstract: One-shot controllable video editing (OCVE) is an important yet challenging task, aiming to propagate user edits that are made -- using any image editing tool -- on the first frame of a video to all subsequent frames, while ensuring content consistency between edited frames and source frames. To achieve this, prior methods employ DDIM inversion to transform source frames into latent noise, which is then fed into a pre-trained diffusion model, conditioned on the user-edited first frame, to generate the edited video. However, the DDIM inversion process accumulates errors, which hinder the latent noise from accurately reconstructing the source frames, ultimately compromising content consistency in the generated edited frames. To overcome it, our method eliminates the need for DDIM inversion by performing OCVE through a novel perspective based on visual prompting. Furthermore, inspired by consistency models that can perform multi-step consistency sampling to generate a sequence of content-consistent images, we propose a content consistency sampling (CCS) to ensure content consistency between the generated edited frames and the source frames. Moreover, we introduce a temporal-content consistency sampling (TCS) based on Stein Variational Gradient Descent to ensure temporal consistency across the edited frames. Extensive experiments validate the effectiveness of our approach.

Title: Multispectral airborne laser scanning for tree species classification: a benchmark of machine learning and deep learning algorithms

Authors: Josef Taher, Eric Hyyppä, Matti Hyyppä, Klaara Salolahti, Xiaowei Yu, Leena Matikainen, Antero Kukko, Matti Lehtomäki, Harri Kaartinen, Sopitta Thurachen, Paula Litkey, Ville Luoma, Markus Holopainen, Gefei Kong, Hongchao Fan, Petri Rönnholm, Antti Polvivaara, Samuli Junttila, Mikko Vastaranta, Stefano Puliti, Rasmus Astrup, Joel Kostensalo, Mari Myllymäki, Maksymilian Kulicki, Krzysztof Stereńczak, Raul de Paula Pires, Ruben Valbuena, Juan Pedro Carbonell-Rivera, Jesús Torralba, Yi-Chen Chen, Lukas Winiwarter, Markus Hollaus, Gottfried Mandlburger, Narges Takhtkeshha, Fabio Remondino, Maciej Lisiewicz, Bartłomiej Kraszewski, Xinlian Liang, Jianchang Chen, Eero Ahokas, Kirsi Karila, Eugeniu Vezeteu, Petri Manninen, Roope Näsi, Heikki Hyyti, Siiri Pyykkönen, Peilun Hu, Juha Hyyppä
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14337
Pdf URL: https://arxiv.org/pdf/2504.14337
Copy Paste: [[2504.14337]] Multispectral airborne laser scanning for tree species classification: a benchmark of machine learning and deep learning algorithms(https://arxiv.org/abs/2504.14337)
Keywords: transformer, segmentation
Abstract: Climate-smart and biodiversity-preserving forestry demands precise information on forest resources, extending to the individual tree level. Multispectral airborne laser scanning (ALS) has shown promise in automated point cloud processing and tree segmentation, but challenges remain in identifying rare tree species and leveraging deep learning techniques. This study addresses these gaps by conducting a comprehensive benchmark of machine learning and deep learning methods for tree species classification. For the study, we collected high-density multispectral ALS data (>1000 pts/m$^2$) at three wavelengths using the FGI-developed HeliALS system, complemented by existing Optech Titan data (35 pts/m$^2$), to evaluate the species classification accuracy of various algorithms in a test site located in Southern Finland. Based on 5261 test segments, our findings demonstrate that point-based deep learning methods, particularly a point transformer model, outperformed traditional machine learning and image-based deep learning approaches on high-density multispectral point clouds. For the high-density ALS dataset, a point transformer model provided the best performance reaching an overall (macro-average) accuracy of 87.9% (74.5%) with a training set of 1065 segments and 92.0% (85.1%) with 5000 training segments. The best image-based deep learning method, DetailView, reached an overall (macro-average) accuracy of 84.3% (63.9%), whereas a random forest (RF) classifier achieved an overall (macro-average) accuracy of 83.2% (61.3%). Importantly, the overall classification accuracy of the point transformer model on the HeliALS data increased from 73.0% with no spectral information to 84.7% with single-channel reflectance, and to 87.9% with spectral information of all the three channels.

Title: Manipulating Multimodal Agents via Cross-Modal Prompt Injection

Authors: Le Wang, Zonghao Ying, Tianyuan Zhang, Siyuan Liang, Shengshan Hu, Mingchuan Zhang, Aishan Liu, Xianglong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14348
Pdf URL: https://arxiv.org/pdf/2504.14348
Copy Paste: [[2504.14348]] Manipulating Multimodal Agents via Cross-Modal Prompt Injection(https://arxiv.org/abs/2504.14348)
Keywords: security, attack, generative, large language model
Abstract: The emergence of multimodal large language models has redefined the agent paradigm by integrating language and vision modalities with external data sources, enabling agents to better interpret human instructions and execute increasingly complex tasks. However, in this work, we identify a critical yet previously overlooked security vulnerability in multimodal agents: cross-modal prompt injection attacks. To exploit this vulnerability, we propose CrossInject, a novel attack framework in which attackers embed adversarial perturbations across multiple modalities to align with target malicious content, allowing external instructions to hijack the agent's decision-making process and execute unauthorized tasks. Our approach consists of two key components. First, we introduce Visual Latent Alignment, where we optimize adversarial features to the malicious instructions in the visual embedding space based on a text-to-image generative model, ensuring that adversarial images subtly encode cues for malicious task execution. Subsequently, we present Textual Guidance Enhancement, where a large language model is leveraged to infer the black-box defensive system prompt through adversarial meta prompting and generate an malicious textual command that steers the agent's output toward better compliance with attackers' requests. Extensive experiments demonstrate that our method outperforms existing injection attacks, achieving at least a +26.4% increase in attack success rates across diverse tasks. Furthermore, we validate our attack's effectiveness in real-world multimodal autonomous agents, highlighting its potential implications for safety-critical applications.

Title: Improving RL Exploration for LLM Reasoning through Retrospective Replay

Authors: Shihan Dou, Muling Wu, Jingwen Xu, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2504.14363
Pdf URL: https://arxiv.org/pdf/2504.14363
Copy Paste: [[2504.14363]] Improving RL Exploration for LLM Reasoning through Retrospective Replay(https://arxiv.org/abs/2504.14363)
Keywords: large language model
Abstract: Reinforcement learning (RL) has increasingly become a pivotal technique in the post-training of large language models (LLMs). The effective exploration of the output space is essential for the success of RL. We observe that for complex problems, during the early stages of training, the model exhibits strong exploratory capabilities and can identify promising solution ideas. However, its limited capability at this stage prevents it from successfully solving these problems. The early suppression of these potentially valuable solution ideas by the policy gradient hinders the model's ability to revisit and re-explore these ideas later. Consequently, although the LLM's capabilities improve in the later stages of training, it still struggles to effectively address these complex problems. To address this exploration issue, we propose a novel algorithm named Retrospective Replay-based Reinforcement Learning (RRL), which introduces a dynamic replay mechanism throughout the training process. RRL enables the model to revisit promising states identified in the early stages, thereby improving its efficiency and effectiveness in exploration. To evaluate the effectiveness of RRL, we conduct extensive experiments on complex reasoning tasks, including mathematical reasoning and code generation, and general dialogue tasks. The results indicate that RRL maintains high exploration efficiency throughout the training period, significantly enhancing the effectiveness of RL in optimizing LLMs for complicated reasoning tasks. Moreover, it also improves the performance of RLHF, making the model both safer and more helpful.

Title: Accelerating LLM Inference with Flexible N:M Sparsity via A Fully Digital Compute-in-Memory Accelerator

Authors: Akshat Ramachandran, Souvik Kundu, Arnab Raha, Shamik Kundu, Deepak K. Mathaikutty, Tushar Krishna
Subjects: cs.LG, cs.AI, cs.AR
Abstract URL: https://arxiv.org/abs/2504.14365
Pdf URL: https://arxiv.org/pdf/2504.14365
Copy Paste: [[2504.14365]] Accelerating LLM Inference with Flexible N:M Sparsity via A Fully Digital Compute-in-Memory Accelerator(https://arxiv.org/abs/2504.14365)
Keywords: transformer, large language model
Abstract: Large language model (LLM) pruning with fixed N:M structured sparsity significantly limits the expressivity of the sparse model, yielding sub-optimal performance. In contrast, supporting multiple N:M patterns to provide sparse representational freedom introduces costly overhead in hardware. To address these challenges for LLMs, we first present a flexible layer-wise outlier-density-aware N:M sparsity (FLOW) selection method. FLOW enables the identification of optimal layer-wise N and M values (from a given range) by simultaneously accounting for the presence and distribution of outliers, allowing a higher degree of representational freedom. To deploy sparse models with such N:M flexibility, we then introduce a flexible, low-overhead digital compute-in-memory architecture (FlexCiM). FlexCiM supports diverse sparsity patterns by partitioning a digital CiM (DCiM) macro into smaller sub-macros, which are adaptively aggregated and disaggregated through distribution and merging mechanisms for different N and M values. Extensive experiments on both transformer-based and recurrence-based state space foundation models (SSMs) demonstrate that FLOW outperforms existing alternatives with an accuracy improvement of up to 36%, while FlexCiM achieves up to 1.75x lower inference latency and 1.5x lower energy consumption compared to existing sparse accelerators. Code is available at: this https URL

Title: Empirical Evaluation of Knowledge Distillation from Transformers to Subquadratic Language Models

Authors: Patrick Haller, Jonas Golde, Alan Akbik
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14366
Pdf URL: https://arxiv.org/pdf/2504.14366
Copy Paste: [[2504.14366]] Empirical Evaluation of Knowledge Distillation from Transformers to Subquadratic Language Models(https://arxiv.org/abs/2504.14366)
Keywords: transformer, large language model
Abstract: Knowledge distillation is a widely used technique for compressing large language models (LLMs) by training a smaller student model to mimic a larger teacher model. Typically, both the teacher and student are Transformer-based architectures, leveraging softmax attention for sequence modeling. However, the quadratic complexity of self-attention at inference time remains a significant bottleneck, motivating the exploration of subquadratic alternatives such as structured state-space models (SSMs), linear attention, and recurrent architectures. In this work, we systematically evaluate the transferability of knowledge distillation from a Transformer teacher to nine subquadratic student architectures. Our study aims to determine which subquadratic model best aligns with the teacher's learned representations and how different architectural constraints influence the distillation process. We also investigate the impact of intelligent initialization strategies, including matrix mixing and query-key-value (QKV) copying, on the adaptation process. Our empirical results on multiple NLP benchmarks provide insights into the trade-offs between efficiency and performance, highlighting key factors for successful knowledge transfer to subquadratic architectures.

Title: Diverse Prompts: Illuminating the Prompt Space of Large Language Models with MAP-Elites

Authors: Gabriel Machado Santos, Rita Maria da Silva Julia, Marcelo Zanchetta do Nascimento
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14367
Pdf URL: https://arxiv.org/pdf/2504.14367
Copy Paste: [[2504.14367]] Diverse Prompts: Illuminating the Prompt Space of Large Language Models with MAP-Elites(https://arxiv.org/abs/2504.14367)
Keywords: large language model
Abstract: Prompt engineering is essential for optimizing large language models (LLMs), yet the link between prompt structures and task performance remains underexplored. This work introduces an evolutionary approach that combines context-free grammar (CFG) with the MAP-Elites algorithm to systematically explore the prompt space. Our method prioritizes quality and diversity, generating high-performing and structurally varied prompts while analyzing their alignment with diverse tasks by varying traits such as the number of examples (shots) and reasoning depth. By systematically mapping the phenotypic space, we reveal how structural variations influence LLM performance, offering actionable insights for task-specific and adaptable prompt design. Evaluated on seven BigBench Lite tasks across multiple LLMs, our results underscore the critical interplay of quality and diversity, advancing the effectiveness and versatility of LLMs.

Title: Do You Really Need Public Data? Surrogate Public Data for Differential Privacy on Tabular Data

Authors: Shlomi Hod, Lucas Rosenblatt, Julia Stoyanovich
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2504.14368
Pdf URL: https://arxiv.org/pdf/2504.14368
Copy Paste: [[2504.14368]] Do You Really Need Public Data? Surrogate Public Data for Differential Privacy on Tabular Data(https://arxiv.org/abs/2504.14368)
Keywords: privacy, large language model
Abstract: Differentially private (DP) machine learning often relies on the availability of public data for tasks like privacy-utility trade-off estimation, hyperparameter tuning, and pretraining. While public data assumptions may be reasonable in text and image domains, they are less likely to hold for tabular data due to tabular data heterogeneity across domains. We propose leveraging powerful priors to address this limitation; specifically, we synthesize realistic tabular data directly from schema-level specifications - such as variable names, types, and permissible ranges - without ever accessing sensitive records. To that end, this work introduces the notion of "surrogate" public data - datasets generated independently of sensitive data, which consume no privacy loss budget and are constructed solely from publicly available schema or metadata. Surrogate public data are intended to encode plausible statistical assumptions (informed by publicly available information) into a dataset with many downstream uses in private mechanisms. We automate the process of generating surrogate public data with large language models (LLMs); in particular, we propose two methods: direct record generation as CSV files, and automated structural causal model (SCM) construction for sampling records. Through extensive experiments, we demonstrate that surrogate public tabular data can effectively replace traditional public data when pretraining differentially private tabular classifiers. To a lesser extent, surrogate public data are also useful for hyperparameter tuning of DP synthetic data generators, and for estimating the privacy-utility tradeoff.

Title: Efficient Spiking Point Mamba for Point Cloud Analysis

Authors: Peixi Wu, Bosong Chai, Menghua Zheng, Wei Li, Zhangchi Hu, Jie Chen, Zheyu Zhang, Hebei Li, Xiaoyan Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14371
Pdf URL: https://arxiv.org/pdf/2504.14371
Copy Paste: [[2504.14371]] Efficient Spiking Point Mamba for Point Cloud Analysis(https://arxiv.org/abs/2504.14371)
Keywords: extraction
Abstract: Bio-inspired Spiking Neural Networks (SNNs) provide an energy-efficient way to extract 3D spatio-temporal features. However, existing 3D SNNs have struggled with long-range dependencies until the recent emergence of Mamba, which offers superior computational efficiency and sequence modeling capability. In this work, we propose Spiking Point Mamba (SPM), the first Mamba-based SNN in the 3D domain. Due to the poor performance of simply transferring Mamba to 3D SNNs, SPM is designed to utilize both the sequence modeling capabilities of Mamba and the temporal feature extraction of SNNs. Specifically, we first introduce Hierarchical Dynamic Encoding (HDE), an improved direct encoding method that effectively introduces dynamic temporal mechanism, thereby facilitating temporal interactions. Then, we propose a Spiking Mamba Block (SMB), which builds upon Mamba while learning inter-time-step features and minimizing information loss caused by spikes. Finally, to further enhance model performance, we adopt an asymmetric SNN-ANN architecture for spike-based pre-training and finetune. Compared with the previous state-of-the-art SNN models, SPM improves OA by +6.2%, +6.1%, and +7.4% on three variants of ScanObjectNN, and boosts instance mIOU by +1.9% on ShapeNetPart. Meanwhile, its energy consumption is at least 3.5x lower than that of its ANN counterpart. The code will be made publicly available.

Title: Bottom-Up Synthesis of Knowledge-Grounded Task-Oriented Dialogues with Iteratively Self-Refined Prompts

Authors: Kun Qian, Maximillian Chen, Siyan Li, Arpit Sharma, Zhou Yu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.14375
Pdf URL: https://arxiv.org/pdf/2504.14375
Copy Paste: [[2504.14375]] Bottom-Up Synthesis of Knowledge-Grounded Task-Oriented Dialogues with Iteratively Self-Refined Prompts(https://arxiv.org/abs/2504.14375)
Keywords: large language model
Abstract: Training conversational question-answering (QA) systems requires a substantial amount of in-domain data, which is often scarce in practice. A common solution to this challenge is to generate synthetic data. Traditional methods typically follow a top-down approach, where a large language model (LLM) generates multi-turn dialogues from a broad prompt. Although this method produces coherent conversations, it offers limited fine-grained control over the content and is susceptible to hallucinations. We introduce a bottom-up conversation synthesis approach, where QA pairs are generated first and then combined into a coherent dialogue. This method offers greater control and precision by dividing the process into two distinct steps, allowing refined instructions and validations to be handled separately. Additionally, this structure allows the use of non-local models in stages that do not involve proprietary knowledge, enhancing the overall quality of the generated data. Both human and automated evaluations demonstrate that our approach produces more realistic and higher-quality dialogues compared to top-down methods.

Title: Publicly Verifiable Secret Sharing: Generic Constructions and Lattice-Based Instantiations in the Standard Model

Authors: Pham Nhat Minh, Khoa Nguyen, Willy Susilo, Khuong Nguyen-An
Subjects: cs.CR, cs.IT, math.NT
Abstract URL: https://arxiv.org/abs/2504.14381
Pdf URL: https://arxiv.org/pdf/2504.14381
Copy Paste: [[2504.14381]] Publicly Verifiable Secret Sharing: Generic Constructions and Lattice-Based Instantiations in the Standard Model(https://arxiv.org/abs/2504.14381)
Keywords: security
Abstract: Publicly verifiable secret sharing (PVSS) allows a dealer to share a secret among a set of shareholders so that the secret can be reconstructed later from any set of qualified participants. In addition, any public verifier should be able to check the correctness of the sharing and reconstruction process. PVSS has been demonstrated to yield various applications, such as e-voting, distributed key generation, decentralized random number generation protocols, and multi-party computation. Although many concrete PVSS protocols have been proposed, their security is either proven in the random oracle model or relies on quantum-vulnerable assumptions such as factoring or discrete logarithm. In this work, we put forward a generic construction for PVSS that can be instantiated in the standard model under the Learning With Errors (LWE) assumption. Our instantiation provides the first post-quantum PVSS in the standard model, with a reasonable level of asymptotic efficiency.

Title: LOOPE: Learnable Optimal Patch Order in Positional Embeddings for Vision Transformers

Authors: Md Abtahi Majeed Chowdhury, Md Rifat Ur Rahman, Akil Ahmad Taki
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14386
Pdf URL: https://arxiv.org/pdf/2504.14386
Copy Paste: [[2504.14386]] LOOPE: Learnable Optimal Patch Order in Positional Embeddings for Vision Transformers(https://arxiv.org/abs/2504.14386)
Keywords: transformer
Abstract: Positional embeddings (PE) play a crucial role in Vision Transformers (ViTs) by providing spatial information otherwise lost due to the permutation invariant nature of self attention. While absolute positional embeddings (APE) have shown theoretical advantages over relative positional embeddings (RPE), particularly due to the ability of sinusoidal functions to preserve spatial inductive biases like monotonicity and shift invariance, a fundamental challenge arises when mapping a 2D grid to a 1D sequence. Existing methods have mostly overlooked or never explored the impact of patch ordering in positional embeddings. To address this, we propose LOOPE, a learnable patch-ordering method that optimizes spatial representation for a given set of frequencies, providing a principled approach to patch order optimization. Empirical results show that our PE significantly improves classification accuracy across various ViT architectures. To rigorously evaluate the effectiveness of positional embeddings, we introduce the "Three Cell Experiment", a novel benchmarking framework that assesses the ability of PEs to retain relative and absolute positional information across different ViT architectures. Unlike standard evaluations, which typically report a performance gap of 4 to 6% between models with and without PE, our method reveals a striking 30 to 35% difference, offering a more sensitive diagnostic tool to measure the efficacy of PEs. Our experimental analysis confirms that the proposed LOOPE demonstrates enhanced effectiveness in retaining both relative and absolute positional information.

Title: Balancing Fairness and Performance in Healthcare AI: A Gradient Reconciliation Approach

Authors: Xiaoyang Wang, Christopher C. Yang
Subjects: cs.LG, cs.CY
Abstract URL: https://arxiv.org/abs/2504.14388
Pdf URL: https://arxiv.org/pdf/2504.14388
Copy Paste: [[2504.14388]] Balancing Fairness and Performance in Healthcare AI: A Gradient Reconciliation Approach(https://arxiv.org/abs/2504.14388)
Keywords: fair
Abstract: The rapid growth of healthcare data and advances in computational power have accelerated the adoption of artificial intelligence (AI) in medicine. However, AI systems deployed without explicit fairness considerations risk exacerbating existing healthcare disparities, potentially leading to inequitable resource allocation and diagnostic disparities across demographic subgroups. To address this challenge, we propose FairGrad, a novel gradient reconciliation framework that automatically balances predictive performance and multi-attribute fairness optimization in healthcare AI models. Our method resolves conflicting optimization objectives by projecting each gradient vector onto the orthogonal plane of the others, thereby regularizing the optimization trajectory to ensure equitable consideration of all objectives. Evaluated on diverse real-world healthcare datasets and predictive tasks - including Substance Use Disorder (SUD) treatment and sepsis mortality - FairGrad achieved statistically significant improvements in multi-attribute fairness metrics (e.g., equalized odds) while maintaining competitive predictive accuracy. These results demonstrate the viability of harmonizing fairness and utility in mission-critical medical AI applications.

Title: Hydra: An Agentic Reasoning Approach for Enhancing Adversarial Robustness and Mitigating Hallucinations in Vision-Language Models

Authors: Chung-En (Johnny)Yu, Hsuan-Chih (Neil)Chen, Brian Jalaian, Nathaniel D. Bastian
Subjects: cs.CV, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2504.14395
Pdf URL: https://arxiv.org/pdf/2504.14395
Copy Paste: [[2504.14395]] Hydra: An Agentic Reasoning Approach for Enhancing Adversarial Robustness and Mitigating Hallucinations in Vision-Language Models(https://arxiv.org/abs/2504.14395)
Keywords: defense, attack, robust
Abstract: To develop trustworthy Vision-Language Models (VLMs), it is essential to address adversarial robustness and hallucination mitigation, both of which impact factual accuracy in high-stakes applications such as defense and healthcare. Existing methods primarily focus on either adversarial defense or hallucination post-hoc correction, leaving a gap in unified robustness strategies. We introduce \textbf{Hydra}, an adaptive agentic framework that enhances plug-in VLMs through iterative reasoning, structured critiques, and cross-model verification, improving both resilience to adversarial perturbations and intrinsic model errors. Hydra employs an Action-Critique Loop, where it retrieves and critiques visual information, leveraging Chain-of-Thought (CoT) and In-Context Learning (ICL) techniques to refine outputs dynamically. Unlike static post-hoc correction methods, Hydra adapts to both adversarial manipulations and intrinsic model errors, making it robust to malicious perturbations and hallucination-related inaccuracies. We evaluate Hydra on four VLMs, three hallucination benchmarks, two adversarial attack strategies, and two adversarial defense methods, assessing performance on both clean and adversarial inputs. Results show that Hydra surpasses plug-in VLMs and state-of-the-art (SOTA) dehallucination methods, even without explicit adversarial defenses, demonstrating enhanced robustness and factual consistency. By bridging adversarial resistance and hallucination mitigation, Hydra provides a scalable, training-free solution for improving the reliability of VLMs in real-world applications.

Title: SphereDiff: Tuning-free Omnidirectional Panoramic Image and Video Generation via Spherical Latent Representation

Authors: Minho Park, Taewoong Kang, Jooyeol Yun, Sungwon Hwang, Jaegul Choo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14396
Pdf URL: https://arxiv.org/pdf/2504.14396
Copy Paste: [[2504.14396]] SphereDiff: Tuning-free Omnidirectional Panoramic Image and Video Generation via Spherical Latent Representation(https://arxiv.org/abs/2504.14396)
Keywords: robust, diffusion
Abstract: The increasing demand for AR/VR applications has highlighted the need for high-quality 360-degree panoramic content. However, generating high-quality 360-degree panoramic images and videos remains a challenging task due to the severe distortions introduced by equirectangular projection (ERP). Existing approaches either fine-tune pretrained diffusion models on limited ERP datasets or attempt tuning-free methods that still rely on ERP latent representations, leading to discontinuities near the poles. In this paper, we introduce SphereDiff, a novel approach for seamless 360-degree panoramic image and video generation using state-of-the-art diffusion models without additional tuning. We define a spherical latent representation that ensures uniform distribution across all perspectives, mitigating the distortions inherent in ERP. We extend MultiDiffusion to spherical latent space and propose a spherical latent sampling method to enable direct use of pretrained diffusion models. Moreover, we introduce distortion-aware weighted averaging to further improve the generation quality in the projection process. Our method outperforms existing approaches in generating 360-degree panoramic content while maintaining high fidelity, making it a robust solution for immersive AR/VR applications. The code is available here. this https URL

Title: Exploring Pseudo-Token Approaches in Transformer Neural Processes

Authors: Jose Lara-Rangel, Nanze Chen, Fengzhe Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.14416
Pdf URL: https://arxiv.org/pdf/2504.14416
Copy Paste: [[2504.14416]] Exploring Pseudo-Token Approaches in Transformer Neural Processes(https://arxiv.org/abs/2504.14416)
Keywords: transformer
Abstract: Neural Processes (NPs) have gained attention in meta-learning for their ability to quantify uncertainty, together with their rapid prediction and adaptability. However, traditional NPs are prone to underfitting. Transformer Neural Processes (TNPs) significantly outperform existing NPs, yet their applicability in real-world scenarios is hindered by their quadratic computational complexity relative to both context and target data points. To address this, pseudo-token-based TNPs (PT-TNPs) have emerged as a novel NPs subset that condense context data into latent vectors or pseudo-tokens, reducing computational demands. We introduce the Induced Set Attentive Neural Processes (ISANPs), employing Induced Set Attention and an innovative query phase to improve querying efficiency. Our evaluations show that ISANPs perform competitively with TNPs and often surpass state-of-the-art models in 1D regression, image completion, contextual bandits, and Bayesian optimization. Crucially, ISANPs offer a tunable balance between performance and computational complexity, which scale well to larger datasets where TNPs face limitations.

Title: How Do Mobile Applications Enhance Security? An Exploratory Analysis of Use Cases and Provided Information

Authors: Irdin Pekaric, Clemens Sauerwein, Simon Laichner, Ruth Breu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.14421
Pdf URL: https://arxiv.org/pdf/2504.14421
Copy Paste: [[2504.14421]] How Do Mobile Applications Enhance Security? An Exploratory Analysis of Use Cases and Provided Information(https://arxiv.org/abs/2504.14421)
Keywords: security, privacy, attack
Abstract: The ubiquity of mobile applications has increased dramatically in recent years, opening up new opportunities for cyber attackers and heightening security concerns in the mobile ecosystem. As a result, researchers and practitioners have intensified their research into improving the security and privacy of mobile applications. At the same time, more and more mobile applications have appeared on the market that address the aforementioned security issues. However, both academia and industry currently lack a comprehensive overview of these mobile security applications for Android and iOS platforms, including their respective use cases and the security information they provide. To address this gap, we systematically collected a total of 410 mobile applications from both the App and Play Store. Then, we identified the 20 most widely utilized mobile security applications on both platforms that were analyzed and classified. Our results show six primary use cases and a wide range of security information provided by these applications, thus supporting the core functionalities for ensuring mobile security.

Title: Adversarial Attack for RGB-Event based Visual Object Tracking

Authors: Qiang Chen, Xiao Wang, Haowen Wang, Bo Jiang, Lin Zhu, Dawei Zhang, Yonghong Tian, Jin Tang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14423
Pdf URL: https://arxiv.org/pdf/2504.14423
Copy Paste: [[2504.14423]] Adversarial Attack for RGB-Event based Visual Object Tracking(https://arxiv.org/abs/2504.14423)
Keywords: defense, attack, robust
Abstract: Visual object tracking is a crucial research topic in the fields of computer vision and multi-modal fusion. Among various approaches, robust visual tracking that combines RGB frames with Event streams has attracted increasing attention from researchers. While striving for high accuracy and efficiency in tracking, it is also important to explore how to effectively conduct adversarial attacks and defenses on RGB-Event stream tracking algorithms, yet research in this area remains relatively scarce. To bridge this gap, in this paper, we propose a cross-modal adversarial attack algorithm for RGB-Event visual tracking. Because of the diverse representations of Event streams, and given that Event voxels and frames are more commonly used, this paper will focus on these two representations for an in-depth study. Specifically, for the RGB-Event voxel, we first optimize the perturbation by adversarial loss to generate RGB frame adversarial examples. For discrete Event voxel representations, we propose a two-step attack strategy, more in detail, we first inject Event voxels into the target region as initialized adversarial examples, then, conduct a gradient-guided optimization by perturbing the spatial location of the Event voxels. For the RGB-Event frame based tracking, we optimize the cross-modal universal perturbation by integrating the gradient information from multimodal data. We evaluate the proposed approach against attacks on three widely used RGB-Event Tracking datasets, i.e., COESOT, FE108, and VisEvent. Extensive experiments show that our method significantly reduces the performance of the tracker across numerous datasets in both unimodal and multimodal scenarios. The source code will be released on this https URL

Title: ResNetVLLM-2: Addressing ResNetVLLM's Multi-Modal Hallucinations

Authors: Ahmad Khalil, Mahmoud Khalil, Alioune Ngom
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14429
Pdf URL: https://arxiv.org/pdf/2504.14429
Copy Paste: [[2504.14429]] ResNetVLLM-2: Addressing ResNetVLLM's Multi-Modal Hallucinations(https://arxiv.org/abs/2504.14429)
Keywords: large language model
Abstract: Large Language Models (LLMs) have transformed natural language processing (NLP) tasks, but they suffer from hallucination, generating plausible yet factually incorrect content. This issue extends to Video-Language Models (VideoLLMs), where textual descriptions may inaccurately represent visual content, resulting in multi-modal hallucinations. In this paper, we address hallucination in ResNetVLLM, a video-language model combining ResNet visual encoders with LLMs. We introduce a two-step protocol: (1) a faithfulness detection strategy that uses a modified Lynx model to assess semantic alignment between generated captions and ground-truth video references, and (2) a hallucination mitigation strategy using Retrieval-Augmented Generation (RAG) with an ad-hoc knowledge base dynamically constructed during inference. Our enhanced model, ResNetVLLM-2, reduces multi-modal hallucinations by cross-verifying generated content against external knowledge, improving factual consistency. Evaluation on the ActivityNet-QA benchmark demonstrates a substantial accuracy increase from 54.8% to 65.3%, highlighting the effectiveness of our hallucination detection and mitigation strategies in enhancing video-language model reliability.

Title: ResNetVLLM -- Multi-modal Vision LLM for the Video Understanding Task

Authors: Ahmad Khalil, Mahmoud Khalil, Alioune Ngom
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14432
Pdf URL: https://arxiv.org/pdf/2504.14432
Copy Paste: [[2504.14432]] ResNetVLLM -- Multi-modal Vision LLM for the Video Understanding Task(https://arxiv.org/abs/2504.14432)
Keywords: large language model
Abstract: In this paper, we introduce ResNetVLLM (ResNet Vision LLM), a novel cross-modal framework for zero-shot video understanding that integrates a ResNet-based visual encoder with a Large Language Model (LLM. ResNetVLLM addresses the challenges associated with zero-shot video models by avoiding reliance on pre-trained video understanding models and instead employing a non-pretrained ResNet to extract visual features. This design ensures the model learns visual and semantic representations within a unified architecture, enhancing its ability to generate accurate and contextually relevant textual descriptions from video inputs. Our experimental results demonstrate that ResNetVLLM achieves state-of-the-art performance in zero-shot video understanding (ZSVU) on several benchmarks, including MSRVTT-QA, MSVD-QA, TGIF-QA FrameQA, and ActivityNet-QA.

Title: Application of Deep Reinforcement Learning for Intrusion Detection in Internet of Things: A Systematic Review

Authors: Saeid Jamshidia, Amin Nikanjama, Kawser Wazed Nafia, Foutse Khomha, Rasoul Rastab
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.14436
Pdf URL: https://arxiv.org/pdf/2504.14436
Copy Paste: [[2504.14436]] Application of Deep Reinforcement Learning for Intrusion Detection in Internet of Things: A Systematic Review(https://arxiv.org/abs/2504.14436)
Keywords: security, protect
Abstract: The Internet of Things (IoT) has significantly expanded the digital landscape, interconnecting an unprecedented array of devices, from home appliances to industrial equipment. This growth enhances functionality, e.g., automation, remote monitoring, and control, and introduces substantial security challenges, especially in defending these devices against cyber threats. Intrusion Detection Systems (IDS) are crucial for securing IoT; however, traditional IDS often struggle to adapt to IoT networks' dynamic and evolving nature and threat patterns. A potential solution is using Deep Reinforcement Learning (DRL) to enhance IDS adaptability, enabling them to learn from and react to their operational environment dynamically. This systematic review examines the application of DRL to enhance IDS in IoT settings, covering research from the past ten years. This review underscores the state-of-the-art DRL techniques employed to improve adaptive threat detection and real-time security across IoT domains by analyzing various studies. Our findings demonstrate that DRL significantly enhances IDS capabilities by enabling systems to learn and adapt from their operational environment. This adaptability allows IDS to improve threat detection accuracy and minimize false positives, making it more effective in identifying genuine threats while reducing unnecessary alerts. Additionally, this systematic review identifies critical research gaps and future research directions, emphasizing the necessity for more diverse datasets, enhanced reproducibility, and improved integration with emerging IoT technologies. This review aims to foster the development of dynamic and adaptive IDS solutions essential for protecting IoT networks against sophisticated cyber threats.

Title: LoRe: Personalizing LLMs via Low-Rank Reward Modeling

Authors: Avinandan Bose, Zhihan Xiong, Yuejie Chi, Simon Shaolei Du, Lin Xiao, Maryam Fazel
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2504.14439
Pdf URL: https://arxiv.org/pdf/2504.14439
Copy Paste: [[2504.14439]] LoRe: Personalizing LLMs via Low-Rank Reward Modeling(https://arxiv.org/abs/2504.14439)
Keywords: large language model
Abstract: Personalizing large language models (LLMs) to accommodate diverse user preferences is essential for enhancing alignment and user satisfaction. Traditional reinforcement learning from human feedback (RLHF) approaches often rely on monolithic value representations, limiting their ability to adapt to individual preferences. We introduce a novel framework that leverages low-rank preference modeling to efficiently learn and generalize user-specific reward functions. By representing reward functions in a low-dimensional subspace and modeling individual preferences as weighted combinations of shared basis functions, our approach avoids rigid user categorization while enabling scalability and few-shot adaptation. We validate our method on multiple preference datasets, demonstrating superior generalization to unseen users and improved accuracy in preference prediction tasks.

Title: WT-BCP: Wavelet Transform based Bidirectional Copy-Paste for Semi-Supervised Medical Image Segmentation

Authors: Mingya Zhang, Liang Wang, Limei Gu, Tingsheng Ling, Xianping Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14445
Pdf URL: https://arxiv.org/pdf/2504.14445
Copy Paste: [[2504.14445]] WT-BCP: Wavelet Transform based Bidirectional Copy-Paste for Semi-Supervised Medical Image Segmentation(https://arxiv.org/abs/2504.14445)
Keywords: segmentation
Abstract: Semi-supervised medical image segmentation (SSMIS) shows promise in reducing reliance on scarce labeled medical data. However, SSMIS field confronts challenges such as distribution mismatches between labeled and unlabeled data, artificial perturbations causing training biases, and inadequate use of raw image information, especially low-frequency (LF) and high-frequency (HF) this http URL address these challenges, we propose a Wavelet Transform based Bidirectional Copy-Paste SSMIS framework, named WT-BCP, which improves upon the Mean Teacher approach. Our method enhances unlabeled data understanding by copying random crops between labeled and unlabeled images and employs WT to extract LF and HF this http URL propose a multi-input and multi-output model named XNet-Plus, to receive the fused information after WT. Moreover, consistency training among multiple outputs helps to mitigate learning biases introduced by artificial perturbations. During consistency training, the mixed images resulting from WT are fed into both models, with the student model's output being supervised by pseudo-labels and ground-truth. Extensive experiments conducted on 2D and 3D datasets confirm the effectiveness of our this http URL: this https URL.

Title: Neglected Risks: The Disturbing Reality of Children's Images in Datasets and the Urgent Call for Accountability

Authors: Carlos Caetano, Gabriel O. dos Santos, Caio Petrucci, Artur Barros, Camila Laranjeira, Leo S. F. Ribeiro, Júlia F. de Mendonça, Jefersson A. dos Santos, Sandra Avila
Subjects: cs.CV, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2504.14446
Pdf URL: https://arxiv.org/pdf/2504.14446
Copy Paste: [[2504.14446]] Neglected Risks: The Disturbing Reality of Children's Images in Datasets and the Urgent Call for Accountability(https://arxiv.org/abs/2504.14446)
Keywords: privacy, protect
Abstract: Including children's images in datasets has raised ethical concerns, particularly regarding privacy, consent, data protection, and accountability. These datasets, often built by scraping publicly available images from the Internet, can expose children to risks such as exploitation, profiling, and tracking. Despite the growing recognition of these issues, approaches for addressing them remain limited. We explore the ethical implications of using children's images in AI datasets and propose a pipeline to detect and remove such images. As a use case, we built the pipeline on a Vision-Language Model under the Visual Question Answering task and tested it on the #PraCegoVer dataset. We also evaluate the pipeline on a subset of 100,000 images from the Open Images V7 dataset to assess its effectiveness in detecting and removing images of children. The pipeline serves as a baseline for future research, providing a starting point for more comprehensive tools and methodologies. While we leverage existing models trained on potentially problematic data, our goal is to expose and address this issue. We do not advocate for training or deploying such models, but instead call for urgent community reflection and action to protect children's rights. Ultimately, we aim to encourage the research community to exercise - more than an additional - care in creating new datasets and to inspire the development of tools to protect the fundamental rights of vulnerable groups, particularly children.

Title: Causal Disentanglement for Robust Long-tail Medical Image Generation

Authors: Weizhi Nie, Zichun Zhang, Weijie Wang, Bruno Lepri, Anan Liu, Nicu Seb
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14450
Pdf URL: https://arxiv.org/pdf/2504.14450
Copy Paste: [[2504.14450]] Causal Disentanglement for Robust Long-tail Medical Image Generation(https://arxiv.org/abs/2504.14450)
Keywords: robust, interpretability, diffusion, large language model
Abstract: Counterfactual medical image generation effectively addresses data scarcity and enhances the interpretability of medical images. However, due to the complex and diverse pathological features of medical images and the imbalanced class distribution in medical data, generating high-quality and diverse medical images from limited data is significantly challenging. Additionally, to fully leverage the information in limited data, such as anatomical structure information and generate more structurally stable medical images while avoiding distortion or inconsistency. In this paper, in order to enhance the clinical relevance of generated data and improve the interpretability of the model, we propose a novel medical image generation framework, which generates independent pathological and structural features based on causal disentanglement and utilizes text-guided modeling of pathological features to regulate the generation of counterfactual images. First, we achieve feature separation through causal disentanglement and analyze the interactions between features. Here, we introduce group supervision to ensure the independence of pathological and identity features. Second, we leverage a diffusion model guided by pathological findings to model pathological features, enabling the generation of diverse counterfactual images. Meanwhile, we enhance accuracy by leveraging a large language model to extract lesion severity and location from medical reports. Additionally, we improve the performance of the latent diffusion model on long-tailed categories through initial noise optimization.

Title: ParaPO: Aligning Language Models to Reduce Verbatim Reproduction of Pre-training Data

Authors: Tong Chen, Faeze Brahman, Jiacheng Liu, Niloofar Mireshghallah, Weijia Shi, Pang Wei Koh, Luke Zettlemoyer, Hannaneh Hajishirzi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.14452
Pdf URL: https://arxiv.org/pdf/2504.14452
Copy Paste: [[2504.14452]] ParaPO: Aligning Language Models to Reduce Verbatim Reproduction of Pre-training Data(https://arxiv.org/abs/2504.14452)
Keywords: privacy
Abstract: Language models (LMs) can memorize and reproduce segments from their pretraining data verbatim even in non-adversarial settings, raising concerns about copyright, plagiarism, privacy, and creativity. We introduce Paraphrase Preference Optimization (ParaPO), a post-training method that fine-tunes LMs to reduce unintentional regurgitation while preserving their overall utility. ParaPO trains LMs to prefer paraphrased versions of memorized segments over the original verbatim content from the pretraining data. To maintain the ability to recall famous quotations when appropriate, we develop a variant of ParaPO that uses system prompts to control regurgitation behavior. In our evaluation on Llama3.1-8B, ParaPO consistently reduces regurgitation across all tested datasets (e.g., reducing the regurgitation metric from 17.3 to 12.9 in creative writing), whereas unlearning methods used in prior work to mitigate regurgitation are less effective outside their targeted unlearned domain (from 17.3 to 16.9). When applied to the instruction-tuned Tulu3-8B model, ParaPO with system prompting successfully preserves famous quotation recall while reducing unintentional regurgitation (from 8.7 to 6.3 in creative writing) when prompted not to regurgitate. In contrast, without ParaPO tuning, prompting the model not to regurgitate produces only a marginal reduction (8.7 to 8.4).

Title: CoLoTa: A Dataset for Entity-based Commonsense Reasoning over Long-Tail Knowledge

Authors: Armin Toroghi, Willis Guo, Scott Sanner
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14462
Pdf URL: https://arxiv.org/pdf/2504.14462
Copy Paste: [[2504.14462]] CoLoTa: A Dataset for Entity-based Commonsense Reasoning over Long-Tail Knowledge(https://arxiv.org/abs/2504.14462)
Keywords: robust, large language model
Abstract: The rise of Large Language Models (LLMs) has redefined the AI landscape, particularly due to their ability to encode factual and commonsense knowledge, and their outstanding performance in tasks requiring reasoning. Despite these advances, hallucinations and reasoning errors remain a significant barrier to their deployment in high-stakes settings. In this work, we observe that even the most prominent LLMs, such as OpenAI-o1, suffer from high rates of reasoning errors and hallucinations on tasks requiring commonsense reasoning over obscure, long-tail entities. To investigate this limitation, we present a new dataset for Commonsense reasoning over Long-Tail entities (CoLoTa), that consists of 3,300 queries from question answering and claim verification tasks and covers a diverse range of commonsense reasoning skills. We remark that CoLoTa can also serve as a Knowledge Graph Question Answering (KGQA) dataset since the support of knowledge required to answer its queries is present in the Wikidata knowledge graph. However, as opposed to existing KGQA benchmarks that merely focus on factoid questions, our CoLoTa queries also require commonsense reasoning. Our experiments with strong LLM-based KGQA methodologies indicate their severe inability to answer queries involving commonsense reasoning. Hence, we propose CoLoTa as a novel benchmark for assessing both (i) LLM commonsense reasoning capabilities and their robustness to hallucinations on long-tail entities and (ii) the commonsense reasoning capabilities of KGQA methods.

Title: LGD: Leveraging Generative Descriptions for Zero-Shot Referring Image Segmentation

Authors: Jiachen Li, Qing Xie, Xiaohan Yu, Hongyun Wang, Jinyu Xu, Yongjian Liu, Yongsheng Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14467
Pdf URL: https://arxiv.org/pdf/2504.14467
Copy Paste: [[2504.14467]] LGD: Leveraging Generative Descriptions for Zero-Shot Referring Image Segmentation(https://arxiv.org/abs/2504.14467)
Keywords: generative, large language model, segmentation
Abstract: Zero-shot referring image segmentation aims to locate and segment the target region based on a referring expression, with the primary challenge of aligning and matching semantics across visual and textual modalities without training. Previous works address this challenge by utilizing Vision-Language Models and mask proposal networks for region-text matching. However, this paradigm may lead to incorrect target localization due to the inherent ambiguity and diversity of free-form referring expressions. To alleviate this issue, we present LGD (Leveraging Generative Descriptions), a framework that utilizes the advanced language generation capabilities of Multi-Modal Large Language Models to enhance region-text matching performance in Vision-Language Models. Specifically, we first design two kinds of prompts, the attribute prompt and the surrounding prompt, to guide the Multi-Modal Large Language Models in generating descriptions related to the crucial attributes of the referent object and the details of surrounding objects, referred to as attribute description and surrounding description, respectively. Secondly, three visual-text matching scores are introduced to evaluate the similarity between instance-level visual features and textual features, which determines the mask most associated with the referring expression. The proposed method achieves new state-of-the-art performance on three public datasets RefCOCO, RefCOCO+ and RefCOCOg, with maximum improvements of 9.97% in oIoU and 11.29% in mIoU compared to previous methods.

Title: Turbo2K: Towards Ultra-Efficient and High-Quality 2K Video Synthesis

Authors: Jingjing Ren, Wenbo Li, Zhongdao Wang, Haoze Sun, Bangzhen Liu, Haoyu Chen, Jiaqi Xu, Aoxue Li, Shifeng Zhang, Bin Shao, Yong Guo, Lei Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14470
Pdf URL: https://arxiv.org/pdf/2504.14470
Copy Paste: [[2504.14470]] Turbo2K: Towards Ultra-Efficient and High-Quality 2K Video Synthesis(https://arxiv.org/abs/2504.14470)
Keywords: diffusion, transformer, generative
Abstract: Demand for 2K video synthesis is rising with increasing consumer expectations for ultra-clear visuals. While diffusion transformers (DiTs) have demonstrated remarkable capabilities in high-quality video generation, scaling them to 2K resolution remains computationally prohibitive due to quadratic growth in memory and processing costs. In this work, we propose Turbo2K, an efficient and practical framework for generating detail-rich 2K videos while significantly improving training and inference efficiency. First, Turbo2K operates in a highly compressed latent space, reducing computational complexity and memory footprint, making high-resolution video synthesis feasible. However, the high compression ratio of the VAE and limited model size impose constraints on generative quality. To mitigate this, we introduce a knowledge distillation strategy that enables a smaller student model to inherit the generative capacity of a larger, more powerful teacher model. Our analysis reveals that, despite differences in latent spaces and architectures, DiTs exhibit structural similarities in their internal representations, facilitating effective knowledge transfer. Second, we design a hierarchical two-stage synthesis framework that first generates multi-level feature at lower resolutions before guiding high-resolution video generation. This approach ensures structural coherence and fine-grained detail refinement while eliminating redundant encoding-decoding overhead, further enhancing computational efficiency.Turbo2K achieves state-of-the-art efficiency, generating 5-second, 24fps, 2K videos with significantly reduced computational cost. Compared to existing methods, Turbo2K is up to 20$\times$ faster for inference, making high-resolution video generation more scalable and practical for real-world applications.

Title: Vision-Centric Representation-Efficient Fine-Tuning for Robust Universal Foreground Segmentation

Authors: Guoyi Zhang, Siyang Chen, Guangsheng Xu, Han Wang, Xiaohu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14481
Pdf URL: https://arxiv.org/pdf/2504.14481
Copy Paste: [[2504.14481]] Vision-Centric Representation-Efficient Fine-Tuning for Robust Universal Foreground Segmentation(https://arxiv.org/abs/2504.14481)
Keywords: robust, segmentation
Abstract: Foreground segmentation is crucial for scene understanding, yet parameter-efficient fine-tuning (PEFT) of vision foundation models (VFMs) often fails in complex scenarios, such as camouflage and infrared imagery. We attribute this challenge to the inherent texture bias in VFMs, which is exacerbated during fine-tuning and limits generalization in texture-sparse environments. To address this, we propose Ladder Shape-bias Representation Side-tuning (LSR-ST), a lightweight PEFT framework that enhances model robustness by introducing shape-biased inductive priors. LSR-ST captures shape-aware features using a simple HDConv Block, which integrates large-kernel attention and residual learning. The method satisfies three key conditions for inducing shape bias: large receptive fields, multi-order feature interactions, and sparse connectivity. Our analysis reveals that these improvements stem from representation efficiency-the ability to extract task-relevant, structurally grounded features while minimizing redundancy. We formalize this concept via Information Bottleneck theory and advocate for it as a key PEFT objective. Unlike traditional NLP paradigms that focus on optimizing parameters and memory, visual tasks require models that extract task-defined semantics, rather than just relying on pre-encoded features. This shift enables our approach to move beyond conventional trade-offs, offering more robust and generalizable solutions for vision tasks. With minimal changes to SAM2-UNet, LSR-ST achieves consistent improvements across 17 datasets and 6 tasks using only 4.719M trainable parameters. These results highlight the potential of representation efficiency for robust and adaptable VFMs within complex visual environments.

Title: STARS: Sparse Learning Correlation Filter with Spatio-temporal Regularization and Super-resolution Reconstruction for Thermal Infrared Target Tracking

Authors: Shang Zhang, Xiaobo Ding, Huanbin Zhang, Ruoyan Xiong, Yue Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14491
Pdf URL: https://arxiv.org/pdf/2504.14491
Copy Paste: [[2504.14491]] STARS: Sparse Learning Correlation Filter with Spatio-temporal Regularization and Super-resolution Reconstruction for Thermal Infrared Target Tracking(https://arxiv.org/abs/2504.14491)
Keywords: robust
Abstract: Thermal infrared (TIR) target tracking methods often adopt the correlation filter (CF) framework due to its computational efficiency. However, the low resolution of TIR images, along with tracking interference, significantly limits the perfor-mance of TIR trackers. To address these challenges, we introduce STARS, a novel sparse learning-based CF tracker that incorporates spatio-temporal regulari-zation and super-resolution reconstruction. First, we apply adaptive sparse filter-ing and temporal domain filtering to extract key features of the target while reduc-ing interference from background clutter and noise. Next, we introduce an edge-preserving sparse regularization method to stabilize target features and prevent excessive blurring. This regularization integrates multiple terms and employs the alternating direction method of multipliers to optimize the solution. Finally, we propose a gradient-enhanced super-resolution method to extract fine-grained TIR target features and improve the resolution of TIR images, addressing performance degradation in tracking caused by low-resolution sequences. To the best of our knowledge, STARS is the first to integrate super-resolution methods within a sparse learning-based CF framework. Extensive experiments on the LSOTB-TIR, PTB-TIR, VOT-TIR2015, and VOT-TIR2017 benchmarks demonstrate that STARS outperforms state-of-the-art trackers in terms of robustness.

Title: FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering

Authors: Yichen Li, Zhiting Fan, Ruizhe Chen, Xiaotang Gai, Luqi Gong, Yan Zhang, Zuozhu Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14492
Pdf URL: https://arxiv.org/pdf/2504.14492
Copy Paste: [[2504.14492]] FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering(https://arxiv.org/abs/2504.14492)
Keywords: fair, large language model
Abstract: Large language models (LLMs) are prone to capturing biases from training corpus, leading to potential negative social impacts. Existing prompt-based debiasing methods exhibit instability due to their sensitivity to prompt changes, while fine-tuning-based techniques incur substantial computational overhead and catastrophic forgetting. In this paper, we propose FairSteer, a novel inference-time debiasing framework without requiring customized prompt design or model retraining. Motivated by the linear representation hypothesis, our preliminary investigation demonstrates that fairness-related features can be encoded into separable directions in the hidden activation space. FairSteer operates in three steps: biased activation detection, debiasing steering vector (DSV) computation, and dynamic activation steering. Specifically, it first trains a lightweight linear classifier to detect bias signatures in activations, and then computes DSVs as intervention directions derived from small contrastive prompt pairs. Subsequently, it performs debiasing by adjusting activations with DSVs in the inference stage. Comprehensive evaluation with six LLMs demonstrates the superiority of FairSteer across question-answering, counterfactual input evaluation and open-ended text generation tasks. Code will be released.

Title: Functional Abstraction of Knowledge Recall in Large Language Models

Authors: Zijian Wang, Chang Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14496
Pdf URL: https://arxiv.org/pdf/2504.14496
Copy Paste: [[2504.14496]] Functional Abstraction of Knowledge Recall in Large Language Models(https://arxiv.org/abs/2504.14496)
Keywords: transformer, large language model
Abstract: Pre-trained transformer large language models (LLMs) demonstrate strong knowledge recall capabilities. This paper investigates the knowledge recall mechanism in LLMs by abstracting it into a functional structure. We propose that during knowledge recall, the model's hidden activation space implicitly entails a function execution process where specific activation vectors align with functional components (Input argument, Function body, and Return values). Specifically, activation vectors of relation-related tokens define a mapping function from subjects to objects, with subject-related token activations serving as input arguments and object-related token activations as return values. For experimental verification, we first design a patching-based knowledge-scoring algorithm to identify knowledge-aware activation vectors as independent functional components. Then, we conduct counter-knowledge testing to examine the independent functional effects of each component on knowledge recall outcomes. From this functional perspective, we improve the contextual knowledge editing approach augmented by activation patching. By rewriting incoherent activations in context, we enable improved short-term memory retention for new knowledge prompting.

Title: Fast Plaintext-Ciphertext Matrix Multiplication from Additively Homomorphic Encryption

Authors: Krishna Sai Tarun Ramapragada, Utsav Banerjee
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.14497
Pdf URL: https://arxiv.org/pdf/2504.14497
Copy Paste: [[2504.14497]] Fast Plaintext-Ciphertext Matrix Multiplication from Additively Homomorphic Encryption(https://arxiv.org/abs/2504.14497)
Keywords: secure, privacy
Abstract: Plaintext-ciphertext matrix multiplication (PC-MM) is an indispensable tool in privacy-preserving computations such as secure machine learning and encrypted signal processing. While there are many established algorithms for plaintext-plaintext matrix multiplication, efficiently computing plaintext-ciphertext (and ciphertext-ciphertext) matrix multiplication is an active area of research which has received a lot of attention. Recent literature have explored various techniques for privacy-preserving matrix multiplication using fully homomorphic encryption (FHE) schemes with ciphertext packing and Single Instruction Multiple Data (SIMD) processing. On the other hand, there hasn't been any attempt to speed up PC-MM using unpacked additively homomorphic encryption (AHE) schemes beyond the schoolbook method and Strassen's algorithm for matrix multiplication. In this work, we propose an efficient PC-MM from unpacked AHE, which applies Cussen's compression-reconstruction algorithm for plaintext-plaintext matrix multiplication in the encrypted setting. We experimentally validate our proposed technique using a concrete instantiation with the additively homomorphic elliptic curve ElGamal encryption scheme and its software implementation on a Raspberry Pi 5 edge computing platform. Our proposed approach achieves up to an order of magnitude speedup compared to state-of-the-art for large matrices with relatively small element bit-widths. Extensive measurement results demonstrate that our fast PC-MM is an excellent candidate for efficient privacy-preserving computation even in resource-constrained environments.

Title: Less is More: Adaptive Coverage for Synthetic Training Data

Authors: Sasan Tavakkol, Max Springer, Mohammadhossein Bateni, Neslihan Bulut, Vincent Cohen-Addad, MohammadTaghi Hajiaghayi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.14508
Pdf URL: https://arxiv.org/pdf/2504.14508
Copy Paste: [[2504.14508]] Less is More: Adaptive Coverage for Synthetic Training Data(https://arxiv.org/abs/2504.14508)
Keywords: large language model
Abstract: Synthetic training data generation with Large Language Models (LLMs) like Google's Gemma and OpenAI's GPT offer a promising solution to the challenge of obtaining large, labeled datasets for training classifiers. When rapid model deployment is critical, such as in classifying emerging social media trends or combating new forms of online abuse tied to current events, the ability to generate training data is invaluable. While prior research has examined the comparability of synthetic data to human-labeled data, this study introduces a novel sampling algorithm, based on the maximum coverage problem, to select a representative subset from a synthetically generated dataset. Our results demonstrate that training a classifier on this contextually sampled subset achieves superior performance compared to training on the entire dataset. This "less is more" approach not only improves accuracy but also reduces the volume of data required, leading to potentially more efficient model fine-tuning.

Title: DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning

Authors: Fulong Ye, Miao Hua, Pengze Zhang, Xinghui Li, Qichao Sun, Songtao Zhao, Qian He, Xinglong Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14509
Pdf URL: https://arxiv.org/pdf/2504.14509
Copy Paste: [[2504.14509]] DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning(https://arxiv.org/abs/2504.14509)
Keywords: robust, diffusion
Abstract: In this paper, we introduce DreamID, a diffusion-based face swapping model that achieves high levels of ID similarity, attribute preservation, image fidelity, and fast inference speed. Unlike the typical face swapping training process, which often relies on implicit supervision and struggles to achieve satisfactory results. DreamID establishes explicit supervision for face swapping by constructing Triplet ID Group data, significantly enhancing identity similarity and attribute preservation. The iterative nature of diffusion models poses challenges for utilizing efficient image-space loss functions, as performing time-consuming multi-step sampling to obtain the generated image during training is impractical. To address this issue, we leverage the accelerated diffusion model SD Turbo, reducing the inference steps to a single iteration, enabling efficient pixel-level end-to-end training with explicit Triplet ID Group supervision. Additionally, we propose an improved diffusion-based model architecture comprising SwapNet, FaceNet, and ID Adapter. This robust architecture fully unlocks the power of the Triplet ID Group explicit supervision. Finally, to further extend our method, we explicitly modify the Triplet ID Group data during training to fine-tune and preserve specific attributes, such as glasses and face shape. Extensive experiments demonstrate that DreamID outperforms state-of-the-art methods in terms of identity similarity, pose and expression preservation, and image fidelity. Overall, DreamID achieves high-quality face swapping results at 512*512 resolution in just 0.6 seconds and performs exceptionally well in challenging scenarios such as complex lighting, large angles, and occlusions.

Title: On Dimension-Free Transformer: An Application of STP to AI

Authors: Daizhan Cheng
Subjects: cs.LG, cs.AI, eess.SY
Abstract URL: https://arxiv.org/abs/2504.14514
Pdf URL: https://arxiv.org/pdf/2504.14514
Copy Paste: [[2504.14514]] On Dimension-Free Transformer: An Application of STP to AI(https://arxiv.org/abs/2504.14514)
Keywords: transformer
Abstract: The matrix expressions for every parts of a transformer are firstly described. Based on semi-tensor product (STP) of matrices the hypervectors are reconsidered and the linear transformation over hypervectors is constructed by using projection. Its properties and calculating formulas are obtained. Using projection-based transformation of hypervector (PBTH), the framework of dimension-free transformer (DFT) is proposed by verifying each linear transformation in a transformer and replacing it by a proper PBTH, which allows the inputs and outputs being of arbitrary dimensions. Using balanced information about all entries, DFT must be more efficient in dealing with signals.

Title: Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction

Authors: Weirong Chen, Ganlin Zhang, Felix Wimbauer, Rui Wang, Nikita Araslanov, Andrea Vedaldi, Daniel Cremers
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14516
Pdf URL: https://arxiv.org/pdf/2504.14516
Copy Paste: [[2504.14516]] Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction(https://arxiv.org/abs/2504.14516)
Keywords: robust
Abstract: Traditional SLAM systems, which rely on bundle adjustment, struggle with highly dynamic scenes commonly found in casual videos. Such videos entangle the motion of dynamic elements, undermining the assumption of static environments required by traditional systems. Existing techniques either filter out dynamic elements or model their motion independently. However, the former often results in incomplete reconstructions, whereas the latter can lead to inconsistent motion estimates. Taking a novel approach, this work leverages a 3D point tracker to separate the camera-induced motion from the observed motion of dynamic objects. By considering only the camera-induced component, bundle adjustment can operate reliably on all scene elements as a result. We further ensure depth consistency across video frames with lightweight post-processing based on scale maps. Our framework combines the core of traditional SLAM -- bundle adjustment -- with a robust learning-based 3D tracker front-end. Integrating motion decomposition, bundle adjustment and depth refinement, our unified framework, BA-Track, accurately tracks the camera motion and produces temporally coherent and scale-consistent dense reconstructions, accommodating both static and dynamic elements. Our experiments on challenging datasets reveal significant improvements in camera pose estimation and 3D reconstruction accuracy.

Title: SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training

Authors: Zhouyang Li, Yuliang Liu, Wei Zhang, Tailing Yuan, Bin Chen, Chengru Song, Di Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14519
Pdf URL: https://arxiv.org/pdf/2504.14519
Copy Paste: [[2504.14519]] SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training(https://arxiv.org/abs/2504.14519)
Keywords: large language model
Abstract: Pipeline Parallelism (PP) serves as a crucial technique for training Large Language Models (LLMs), owing to its capability to alleviate memory pressure from model states with relatively low communication overhead. However, in long-context scenarios, existing pipeline parallelism methods fail to address the substantial activation memory pressure, primarily due to the peak memory consumption resulting from the accumulation of activations across multiple microbatches. Moreover, these approaches inevitably introduce considerable pipeline bubbles, further hindering efficiency. To tackle these challenges, we propose SlimPipe, a novel approach to fine-grained pipeline parallelism that employs uniform sequence slicing coupled with one-forward-one-backward (1F1B) schedule. It reduces the accumulated activations from several microbatches to just one, which is split into several slices. Although the slices are evenly partitioned, the computation cost is not equal across slices due to causal attention. We develop a sophisticated workload redistribution technique to address this load imbalance. SlimPipe achieves (1) near-zero memory overhead and (2) minimal pipeline bubbles simultaneously. The effectiveness of SlimPipe has been proven by thorough testing with diverse model architectures, context window sizes, and SlimPipe-specific configurations. For example, on the Llama 70B model, compared to state-of-the-art methods, SlimPipe significantly boosts the Model FLOPs Utilization (MFU) to up to $1.57\times$ for a context length of 512K. More notably, for a context length of 2048K, it maintains over 45% utilization on 256 NVIDIA Hopper 80GB GPUs, while other approaches either suffer significant performance drops or fail entirely due to memory constraints.

Title: Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding

Authors: Tong Zeng, Longfeng Wu, Liang Shi, Dawei Zhou, Feng Guo
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2504.14526
Pdf URL: https://arxiv.org/pdf/2504.14526
Copy Paste: [[2504.14526]] Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding(https://arxiv.org/abs/2504.14526)
Keywords: robust, large language model
Abstract: Vision Large Language Models (VLLMs) have demonstrated impressive capabilities in general visual tasks such as image captioning and visual question answering. However, their effectiveness in specialized, safety-critical domains like autonomous driving remains largely unexplored. Autonomous driving systems require sophisticated scene understanding in complex environments, yet existing multimodal benchmarks primarily focus on normal driving conditions, failing to adequately assess VLLMs' performance in safety-critical scenarios. To address this, we introduce DVBench, a pioneering benchmark designed to evaluate the performance of VLLMs in understanding safety-critical driving videos. Built around a hierarchical ability taxonomy that aligns with widely adopted frameworks for describing driving scenarios used in assessing highly automated driving systems, DVBench features 10,000 multiple-choice questions with human-annotated ground-truth answers, enabling a comprehensive evaluation of VLLMs' capabilities in perception and reasoning. Experiments on 14 SOTA VLLMs, ranging from 0.5B to 72B parameters, reveal significant performance gaps, with no model achieving over 40% accuracy, highlighting critical limitations in understanding complex driving scenarios. To probe adaptability, we fine-tuned selected models using domain-specific data from DVBench, achieving accuracy gains ranging from 5.24 to 10.94 percentage points, with relative improvements of up to 43.59%. This improvement underscores the necessity of targeted adaptation to bridge the gap between general-purpose VLLMs and mission-critical driving applications. DVBench establishes an essential evaluation framework and research roadmap for developing VLLMs that meet the safety and robustness requirements for real-world autonomous systems. We released the benchmark toolbox and the fine-tuned model at: this https URL.

Title: Causality for Natural Language Processing

Authors: Zhijing Jin
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2504.14530
Pdf URL: https://arxiv.org/pdf/2504.14530
Copy Paste: [[2504.14530]] Causality for Natural Language Processing(https://arxiv.org/abs/2504.14530)
Keywords: large language model
Abstract: Causal reasoning is a cornerstone of human intelligence and a critical capability for artificial systems aiming to achieve advanced understanding and decision-making. This thesis delves into various dimensions of causal reasoning and understanding in large language models (LLMs). It encompasses a series of studies that explore the causal inference skills of LLMs, the mechanisms behind their performance, and the implications of causal and anticausal learning for natural language processing (NLP) tasks. Additionally, it investigates the application of causal reasoning in text-based computational social science, specifically focusing on political decision-making and the evaluation of scientific impact through citations. Through novel datasets, benchmark tasks, and methodological frameworks, this work identifies key challenges and opportunities to improve the causal capabilities of LLMs, providing a comprehensive foundation for future research in this evolving field.

Title: SUDO: Enhancing Text-to-Image Diffusion Models with Self-Supervised Direct Preference Optimization

Authors: Liang Peng, Boxi Wu, Haoran Cheng, Yibo Zhao, Xiaofei He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14534
Pdf URL: https://arxiv.org/pdf/2504.14534
Copy Paste: [[2504.14534]] SUDO: Enhancing Text-to-Image Diffusion Models with Self-Supervised Direct Preference Optimization(https://arxiv.org/abs/2504.14534)
Keywords: diffusion
Abstract: Previous text-to-image diffusion models typically employ supervised fine-tuning (SFT) to enhance pre-trained base models. However, this approach primarily minimizes the loss of mean squared error (MSE) at the pixel level, neglecting the need for global optimization at the image level, which is crucial for achieving high perceptual quality and structural coherence. In this paper, we introduce Self-sUpervised Direct preference Optimization (SUDO), a novel paradigm that optimizes both fine-grained details at the pixel level and global image quality. By integrating direct preference optimization into the model, SUDO generates preference image pairs in a self-supervised manner, enabling the model to prioritize global-level learning while complementing the pixel-level MSE loss. As an effective alternative to supervised fine-tuning, SUDO can be seamlessly applied to any text-to-image diffusion model. Importantly, it eliminates the need for costly data collection and annotation efforts typically associated with traditional direct preference optimization methods. Through extensive experiments on widely-used models, including Stable Diffusion 1.5 and XL, we demonstrate that SUDO significantly enhances both global and local image quality. The codes are provided at \href{this https URL}{this link}.

Title: FlowLoss: Dynamic Flow-Conditioned Loss Strategy for Video Diffusion Models

Authors: Kuanting Wu, Kei Ota, Asako Kanezaki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14535
Pdf URL: https://arxiv.org/pdf/2504.14535
Copy Paste: [[2504.14535]] FlowLoss: Dynamic Flow-Conditioned Loss Strategy for Video Diffusion Models(https://arxiv.org/abs/2504.14535)
Keywords: diffusion, generative
Abstract: Video Diffusion Models (VDMs) can generate high-quality videos, but often struggle with producing temporally coherent motion. Optical flow supervision is a promising approach to address this, with prior works commonly employing warping-based strategies that avoid explicit flow matching. In this work, we explore an alternative formulation, FlowLoss, which directly compares flow fields extracted from generated and ground-truth videos. To account for the unreliability of flow estimation under high-noise conditions in diffusion, we propose a noise-aware weighting scheme that modulates the flow loss across denoising steps. Experiments on robotic video datasets suggest that FlowLoss improves motion stability and accelerates convergence in early training stages. Our findings offer practical insights for incorporating motion-based supervision into noise-conditioned generative models.

Title: BookWorld: From Novels to Interactive Agent Societies for Creative Story Generation

Authors: Yiting Ran, Xintao Wang, Tian Qiu, Jiaqing Liang, Yanghua Xiao, Deqing Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14538
Pdf URL: https://arxiv.org/pdf/2504.14538
Copy Paste: [[2504.14538]] BookWorld: From Novels to Interactive Agent Societies for Creative Story Generation(https://arxiv.org/abs/2504.14538)
Keywords: large language model
Abstract: Recent advances in large language models (LLMs) have enabled social simulation through multi-agent systems. Prior efforts focus on agent societies created from scratch, assigning agents with newly defined personas. However, simulating established fictional worlds and characters remain largely underexplored, despite its significant practical value. In this paper, we introduce BookWorld, a comprehensive system for constructing and simulating book-based multi-agent societies. BookWorld's design covers comprehensive real-world intricacies, including diverse and dynamic characters, fictional worldviews, geographical constraints and changes, e.t.c. BookWorld enables diverse applications including story generation, interactive games and social simulation, offering novel ways to extend and explore beloved fictional works. Through extensive experiments, we demonstrate that BookWorld generates creative, high-quality stories while maintaining fidelity to the source books, surpassing previous methods with a win rate of 75.36%. The code of this paper can be found at the project page: this https URL.

Title: Towards Model Resistant to Transferable Adversarial Examples via Trigger Activation

Authors: Yi Yu, Song Xia, Xun Lin, Chenqi Kong, Wenhan Yang, Shijian Lu, Yap-Peng Tan, Alex C. Kot
Subjects: cs.CR, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.14541
Pdf URL: https://arxiv.org/pdf/2504.14541
Copy Paste: [[2504.14541]] Towards Model Resistant to Transferable Adversarial Examples via Trigger Activation(https://arxiv.org/abs/2504.14541)
Keywords: defense, attack, robust
Abstract: Adversarial examples, characterized by imperceptible perturbations, pose significant threats to deep neural networks by misleading their predictions. A critical aspect of these examples is their transferability, allowing them to deceive {unseen} models in black-box scenarios. Despite the widespread exploration of defense methods, including those on transferability, they show limitations: inefficient deployment, ineffective defense, and degraded performance on clean images. In this work, we introduce a novel training paradigm aimed at enhancing robustness against transferable adversarial examples (TAEs) in a more efficient and effective way. We propose a model that exhibits random guessing behavior when presented with clean data $\boldsymbol{x}$ as input, and generates accurate predictions when with triggered data $\boldsymbol{x}+\boldsymbol{\tau}$. Importantly, the trigger $\boldsymbol{\tau}$ remains constant for all data instances. We refer to these models as \textbf{models with trigger activation}. We are surprised to find that these models exhibit certain robustness against TAEs. Through the consideration of first-order gradients, we provide a theoretical analysis of this robustness. Moreover, through the joint optimization of the learnable trigger and the model, we achieve improved robustness to transferable attacks. Extensive experiments conducted across diverse datasets, evaluating a variety of attacking methods, underscore the effectiveness and superiority of our approach.

Title: VGNC: Reducing the Overfitting of Sparse-view 3DGS via Validation-guided Gaussian Number Control

Authors: Lifeng Lin, Rongfeng Lu, Quan Chen, Haofan Ren, Ming Lu, Yaoqi Sun, Chenggang Yan, Anke Xue
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14548
Pdf URL: https://arxiv.org/pdf/2504.14548
Copy Paste: [[2504.14548]] VGNC: Reducing the Overfitting of Sparse-view 3DGS via Validation-guided Gaussian Number Control(https://arxiv.org/abs/2504.14548)
Keywords: generative
Abstract: Sparse-view 3D reconstruction is a fundamental yet challenging task in practical 3D reconstruction applications. Recently, many methods based on the 3D Gaussian Splatting (3DGS) framework have been proposed to address sparse-view 3D reconstruction. Although these methods have made considerable advancements, they still show significant issues with overfitting. To reduce the overfitting, we introduce VGNC, a novel Validation-guided Gaussian Number Control (VGNC) approach based on generative novel view synthesis (NVS) models. To the best of our knowledge, this is the first attempt to alleviate the overfitting issue of sparse-view 3DGS with generative validation images. Specifically, we first introduce a validation image generation method based on a generative NVS model. We then propose a Gaussian number control strategy that utilizes generated validation images to determine the optimal Gaussian numbers, thereby reducing the issue of overfitting. We conducted detailed experiments on various sparse-view 3DGS baselines and datasets to evaluate the effectiveness of VGNC. Extensive experiments show that our approach not only reduces overfitting but also improves rendering quality on the test set while decreasing the number of Gaussian points. This reduction lowers storage demands and accelerates both training and rendering. The code will be released.

Title: REDEditing: Relationship-Driven Precise Backdoor Poisoning on Text-to-Image Diffusion Models

Authors: Chongye Guo, Jinhu Fu, Junfeng Fang, Kun Wang, Guorui Feng
Subjects: cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2504.14554
Pdf URL: https://arxiv.org/pdf/2504.14554
Copy Paste: [[2504.14554]] REDEditing: Relationship-Driven Precise Backdoor Poisoning on Text-to-Image Diffusion Models(https://arxiv.org/abs/2504.14554)
Keywords: security, attack, steal, diffusion, generative, large language model
Abstract: The rapid advancement of generative AI highlights the importance of text-to-image (T2I) security, particularly with the threat of backdoor poisoning. Timely disclosure and mitigation of security vulnerabilities in T2I models are crucial for ensuring the safe deployment of generative models. We explore a novel training-free backdoor poisoning paradigm through model editing, which is recently employed for knowledge updating in large language models. Nevertheless, we reveal the potential security risks posed by model editing techniques to image generation models. In this work, we establish the principles for backdoor attacks based on model editing, and propose a relationship-driven precise backdoor poisoning method, REDEditing. Drawing on the principles of equivalent-attribute alignment and stealthy poisoning, we develop an equivalent relationship retrieval and joint-attribute transfer approach that ensures consistent backdoor image generation through concept rebinding. A knowledge isolation constraint is proposed to preserve benign generation integrity. Our method achieves an 11\% higher attack success rate compared to state-of-the-art approaches. Remarkably, adding just one line of code enhances output naturalness while improving backdoor stealthiness by 24\%. This work aims to heighten awareness regarding this security vulnerability in editable image generation models.

Title: SMTT: Novel Structured Multi-task Tracking with Graph-Regularized Sparse Representation for Robust Thermal Infrared Target Tracking

Authors: Shang Zhang, HuiPan Guan, XiaoBo Ding, Ruoyan Xiong, Yue Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14566
Pdf URL: https://arxiv.org/pdf/2504.14566
Copy Paste: [[2504.14566]] SMTT: Novel Structured Multi-task Tracking with Graph-Regularized Sparse Representation for Robust Thermal Infrared Target Tracking(https://arxiv.org/abs/2504.14566)
Keywords: robust
Abstract: Thermal infrared target tracking is crucial in applications such as surveillance, autonomous driving, and military operations. In this paper, we propose a novel tracker, SMTT, which effectively addresses common challenges in thermal infrared imagery, such as noise, occlusion, and rapid target motion, by leveraging multi-task learning, joint sparse representation, and adaptive graph regularization. By reformulating the tracking task as a multi-task learning problem, the SMTT tracker independently optimizes the representation of each particle while dynamically capturing spatial and feature-level similarities using a weighted mixed-norm regularization strategy. To ensure real-time performance, we incorporate the Accelerated Proximal Gradient method for efficient optimization. Extensive experiments on benchmark datasets - including VOT-TIR, PTB-TIR, and LSOTB-TIR - demonstrate that SMTT achieves superior accuracy, robustness, and computational efficiency. These results highlight SMTT as a reliable and high-performance solution for thermal infrared target tracking in complex environments.

Title: NoWag: A Unified Framework for Shape Preserving Compression of Large Language Models

Authors: Lawrence Liu, Inesh Chakrabarti, Yixiao Li, Mengdi Wang, Tuo Zhao, Lin F. Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14569
Pdf URL: https://arxiv.org/pdf/2504.14569
Copy Paste: [[2504.14569]] NoWag: A Unified Framework for Shape Preserving Compression of Large Language Models(https://arxiv.org/abs/2504.14569)
Keywords: large language model
Abstract: Large language models (LLMs) exhibit remarkable performance across various natural language processing tasks but suffer from immense computational and memory demands, limiting their deployment in resource-constrained environments. To address this challenge, we propose NoWag: (Normalized Weight and Activation Guided Compression), a unified framework for zero-shot shape preserving compression algorithms. We compressed Llama-2 7B/13B/70B and Llama-3 8/70BB models, using two popular forms of shape-preserving compression, vector quantization NoWag-VQ (NoWag for Vector Quantization), and unstructured/semi-structured pruning NoWag-P (NoWag for Pruning). We found that NoWag-VQ significantly outperforms state-of-the-art zero shot VQ, and that NoWag-P performs competitively against state-of-the-art methods. These results suggest commonalities between these compression paradigms that could inspire future work. Our code is available at this https URL

Title: Using street view imagery and deep generative modeling for estimating the health of urban forests

Authors: Akshit Gupta, Remko Uijlenhoet
Subjects: cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2504.14583
Pdf URL: https://arxiv.org/pdf/2504.14583
Copy Paste: [[2504.14583]] Using street view imagery and deep generative modeling for estimating the health of urban forests(https://arxiv.org/abs/2504.14583)
Keywords: generative
Abstract: Healthy urban forests comprising of diverse trees and shrubs play a crucial role in mitigating climate change. They provide several key advantages such as providing shade for energy conservation, and intercepting rainfall to reduce flood runoff and soil erosion. Traditional approaches for monitoring the health of urban forests require instrumented inspection techniques, often involving a high amount of human labor and subjective evaluations. As a result, they are not scalable for cities which lack extensive resources. Recent approaches involving multi-spectral imaging data based on terrestrial sensing and satellites, are constrained respectively with challenges related to dedicated deployments and limited spatial resolutions. In this work, we propose an alternative approach for monitoring the urban forests using simplified inputs: street view imagery, tree inventory data and meteorological conditions. We propose to use image-to-image translation networks to estimate two urban forest health parameters, namely, NDVI and CTD. Finally, we aim to compare the generated results with ground truth data using an onsite campaign utilizing handheld multi-spectral and thermal imaging sensors. With the advent and expansion of street view imagery platforms such as Google Street View and Mapillary, this approach should enable effective management of urban forests for the authorities in cities at scale.

Title: Generative Auto-Bidding with Value-Guided Explorations

Authors: Jingtong Gao, Yewen Li, Shuai Mao, Peng Jiang, Nan Jiang, Yejing Wang, Qingpeng Cai, Fei Pan, Peng Jiang, Kun Gai, Bo An, Xiangyu Zhao
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2504.14587
Pdf URL: https://arxiv.org/pdf/2504.14587
Copy Paste: [[2504.14587]] Generative Auto-Bidding with Value-Guided Explorations(https://arxiv.org/abs/2504.14587)
Keywords: generative
Abstract: Auto-bidding, with its strong capability to optimize bidding decisions within dynamic and competitive online environments, has become a pivotal strategy for advertising platforms. Existing approaches typically employ rule-based strategies or Reinforcement Learning (RL) techniques. However, rule-based strategies lack the flexibility to adapt to time-varying market conditions, and RL-based methods struggle to capture essential historical dependencies and observations within Markov Decision Process (MDP) frameworks. Furthermore, these approaches often face challenges in ensuring strategy adaptability across diverse advertising objectives. Additionally, as offline training methods are increasingly adopted to facilitate the deployment and maintenance of stable online strategies, the issues of documented behavioral patterns and behavioral collapse resulting from training on fixed offline datasets become increasingly significant. To address these limitations, this paper introduces a novel offline Generative Auto-bidding framework with Value-Guided Explorations (GAVE). GAVE accommodates various advertising objectives through a score-based Return-To-Go (RTG) module. Moreover, GAVE integrates an action exploration mechanism with an RTG-based evaluation method to explore novel actions while ensuring stability-preserving updates. A learnable value function is also designed to guide the direction of action exploration and mitigate Out-of-Distribution (OOD) problems. Experimental results on two offline datasets and real-world deployments demonstrate that GAVE outperforms state-of-the-art baselines in both offline evaluations and online A/B tests. The implementation code is publicly available to facilitate reproducibility and further research.

Title: a1: Steep Test-time Scaling Law via Environment Augmented Generation

Authors: Lingrui Mei, Shenghua Liu, Yiwei Wang, Baolong Bi, Yuyao Ge, Jun Wan, Yurong Wu, Xueqi Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14597
Pdf URL: https://arxiv.org/pdf/2504.14597
Copy Paste: [[2504.14597]] a1: Steep Test-time Scaling Law via Environment Augmented Generation(https://arxiv.org/abs/2504.14597)
Keywords: large language model
Abstract: Large Language Models (LLMs) have made remarkable breakthroughs in reasoning, yet continue to struggle with hallucinations, logical errors, and inability to self-correct during complex multi-step tasks. Current approaches like chain-of-thought prompting offer limited reasoning capabilities that fail when precise step validation is required. We propose Environment Augmented Generation (EAG), a framework that enhances LLM reasoning through: (1) real-time environmental feedback validating each reasoning step, (2) dynamic branch exploration for investigating alternative solution paths when faced with errors, and (3) experience-based learning from successful reasoning trajectories. Unlike existing methods, EAG enables deliberate backtracking and strategic replanning through tight integration of execution feedback with branching exploration. Our a1-32B model achieves state-of-the-art performance among similar-sized models across all benchmarks, matching larger models like o1 on competition mathematics while outperforming comparable models by up to 24.4 percentage points. Analysis reveals EAG's distinctive scaling pattern: initial token investment in environment interaction yields substantial long-term performance dividends, with advantages amplifying proportionally to task complexity. EAG's theoretical framework demonstrates how environment interactivity and systematic branch exploration together establish a new paradigm for reliable machine reasoning, particularly for problems requiring precise multi-step calculation and logical verification.

Title: MP-Mat: A 3D-and-Instance-Aware Human Matting and Editing Framework with Multiplane Representation

Authors: Siyi Jiao, Wenzheng Zeng, Yerong Li, Huayu Zhang, Changxin Gao, Nong Sang, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14606
Pdf URL: https://arxiv.org/pdf/2504.14606
Copy Paste: [[2504.14606]] MP-Mat: A 3D-and-Instance-Aware Human Matting and Editing Framework with Multiplane Representation(https://arxiv.org/abs/2504.14606)
Keywords: interpretability
Abstract: Human instance matting aims to estimate an alpha matte for each human instance in an image, which is challenging as it easily fails in complex cases requiring disentangling mingled pixels belonging to multiple instances along hairy and thin boundary structures. In this work, we address this by introducing MP-Mat, a novel 3D-and-instance-aware matting framework with multiplane representation, where the multiplane concept is designed from two different perspectives: scene geometry level and instance level. Specifically, we first build feature-level multiplane representations to split the scene into multiple planes based on depth differences. This approach makes the scene representation 3D-aware, and can serve as an effective clue for splitting instances in different 3D positions, thereby improving interpretability and boundary handling ability especially in occlusion areas. Then, we introduce another multiplane representation that splits the scene in an instance-level perspective, and represents each instance with both matte and color. We also treat background as a special instance, which is often overlooked by existing methods. Such an instance-level representation facilitates both foreground and background content awareness, and is useful for other down-stream tasks like image editing. Once built, the representation can be reused to realize controllable instance-level image editing with high efficiency. Extensive experiments validate the clear advantage of MP-Mat in matting task. We also demonstrate its superiority in image editing tasks, an area under-explored by existing matting-focused methods, where our approach under zero-shot inference even outperforms trained specialized image editing techniques by large margins. Code is open-sourced at this https URL}.

Title: No Imputation of Missing Values In Tabular Data Classification Using Incremental Learning

Authors: Manar D. Samad, Kazi Fuad B. Akhter, Shourav B. Rabbani, Ibna Kowsar
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2504.14610
Pdf URL: https://arxiv.org/pdf/2504.14610
Copy Paste: [[2504.14610]] No Imputation of Missing Values In Tabular Data Classification Using Incremental Learning(https://arxiv.org/abs/2504.14610)
Keywords: robust
Abstract: Tabular data sets with varying missing values are prepared for machine learning using an arbitrary imputation strategy. Synthetic values generated by imputation models often concern data stakeholders about computational complexity, data quality, and data-driven outcomes. This paper eliminates these concerns by proposing no imputation incremental learning (NIIL) of tabular data with varying missing value rates and types. The proposed method incrementally learns partitions of overlapping feature sets while using attention masks to exclude missing values from attention scoring. The average classification performance rank order across 15 diverse tabular data sets highlights the superiority of NIIL over 11 state-of-the-art learning methods with or without missing value imputations. Further experiments substantiate the robustness of NIIL against varying missing value types and rates compared to methods that involve the imputation of missing values. Our empirical analysis reveals that a feature partition size of half of the original feature space is, computation-wise and accuracy-wise, the best choice for the proposed incremental learning. The proposed method is one of the first deep learning solutions that can effectively learn tabular data without requiring the imputation of missing values.

Title: VM-BHINet:Vision Mamba Bimanual Hand Interaction Network for 3D Interacting Hand Mesh Recovery From a Single RGB Image

Authors: Han Bi, Ge Yu, Yu He, Wenzhuo Liu, Zijie Zheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14618
Pdf URL: https://arxiv.org/pdf/2504.14618
Copy Paste: [[2504.14618]] VM-BHINet:Vision Mamba Bimanual Hand Interaction Network for 3D Interacting Hand Mesh Recovery From a Single RGB Image(https://arxiv.org/abs/2504.14618)
Keywords: extraction
Abstract: Understanding bimanual hand interactions is essential for realistic 3D pose and shape reconstruction. However, existing methods struggle with occlusions, ambiguous appearances, and computational inefficiencies. To address these challenges, we propose Vision Mamba Bimanual Hand Interaction Network (VM-BHINet), introducing state space models (SSMs) into hand reconstruction to enhance interaction modeling while improving computational efficiency. The core component, Vision Mamba Interaction Feature Extraction Block (VM-IFEBlock), combines SSMs with local and global feature operations, enabling deep understanding of hand interactions. Experiments on the InterHand2.6M dataset show that VM-BHINet reduces Mean per-joint position error (MPJPE) and Mean per-vertex position error (MPVPE) by 2-3%, significantly surpassing state-of-the-art methods.

Title: Translation Analytics for Freelancers: I. Introduction, Data Preparation, Baseline Evaluations

Authors: Yuri Balashov, Alex Balashov, Shiho Fukuda Koski
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14619
Pdf URL: https://arxiv.org/pdf/2504.14619
Copy Paste: [[2504.14619]] Translation Analytics for Freelancers: I. Introduction, Data Preparation, Baseline Evaluations(https://arxiv.org/abs/2504.14619)
Keywords: large language model
Abstract: This is the first in a series of papers exploring the rapidly expanding new opportunities arising from recent progress in language technologies for individual translators and language service providers with modest resources. The advent of advanced neural machine translation systems, large language models, and their integration into workflows via computer-assisted translation tools and translation management systems have reshaped the translation landscape. These advancements enable not only translation but also quality evaluation, error spotting, glossary generation, and adaptation to domain-specific needs, creating new technical opportunities for freelancers. In this series, we aim to empower translators with actionable methods to harness these advancements. Our approach emphasizes Translation Analytics, a suite of evaluation techniques traditionally reserved for large-scale industry applications but now becoming increasingly available for smaller-scale users. This first paper introduces a practical framework for adapting automatic evaluation metrics -- such as BLEU, chrF, TER, and COMET -- to freelancers' needs. We illustrate the potential of these metrics using a trilingual corpus derived from a real-world project in the medical domain and provide statistical analysis correlating human evaluations with automatic scores. Our findings emphasize the importance of proactive engagement with emerging technologies to not only adapt but thrive in the evolving professional environment.

Title: A Hierarchical Framework for Measuring Scientific Paper Innovation via Large Language Models

Authors: Hongming Tan, Shaoxiong Zhan, Fengwei Jia, Hai-Tao Zheng, Wai Kin Chan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14620
Pdf URL: https://arxiv.org/pdf/2504.14620
Copy Paste: [[2504.14620]] A Hierarchical Framework for Measuring Scientific Paper Innovation via Large Language Models(https://arxiv.org/abs/2504.14620)
Keywords: interpretability, large language model
Abstract: Measuring scientific paper innovation is both important and challenging. Existing content-based methods often overlook the full-paper context, fail to capture the full scope of innovation, and lack generalization. We propose HSPIM, a hierarchical and training-free framework based on large language models (LLMs). It introduces a Paper-to-Sections-to-QAs decomposition to assess innovation. We segment the text by section titles and use zero-shot LLM prompting to implement section classification, question-answering (QA) augmentation, and weighted novelty scoring. The generated QA pair focuses on section-level innovation and serves as additional context to improve the LLM scoring. For each chunk, the LLM outputs a novelty score and a confidence score. We use confidence scores as weights to aggregate novelty scores into a paper-level innovation score. To further improve performance, we propose a two-layer question structure consisting of common and section-specific questions, and apply a genetic algorithm to optimize the question-prompt combinations. Comprehensive experiments on scientific conference paper datasets show that HSPIM outperforms baseline methods in effectiveness, generalization, and interpretability.

Title: Talk is Not Always Cheap: Promoting Wireless Sensing Models with Text Prompts

Authors: Zhenkui Yang, Zeyi Huang, Ge Wang, Han Ding, Tony Xiao Han, Fei Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14621
Pdf URL: https://arxiv.org/pdf/2504.14621
Copy Paste: [[2504.14621]] Talk is Not Always Cheap: Promoting Wireless Sensing Models with Text Prompts(https://arxiv.org/abs/2504.14621)
Keywords: security
Abstract: Wireless signal-based human sensing technologies, such as WiFi, millimeter-wave (mmWave) radar, and Radio Frequency Identification (RFID), enable the detection and interpretation of human presence, posture, and activities, thereby providing critical support for applications in public security, healthcare, and smart environments. These technologies exhibit notable advantages due to their non-contact operation and environmental adaptability; however, existing systems often fail to leverage the textual information inherent in datasets. To address this, we propose an innovative text-enhanced wireless sensing framework, WiTalk, that seamlessly integrates semantic knowledge through three hierarchical prompt strategies-label-only, brief description, and detailed action description-without requiring architectural modifications or incurring additional data costs. We rigorously validate this framework across three public benchmark datasets: XRF55 for human action recognition (HAR), and WiFiTAL and XRFV2 for WiFi temporal action localization (TAL). Experimental results demonstrate significant performance improvements: on XRF55, accuracy for WiFi, RFID, and mmWave increases by 3.9%, 2.59%, and 0.46%, respectively; on WiFiTAL, the average performance of WiFiTAD improves by 4.98%; and on XRFV2, the mean average precision gains across various methods range from 4.02% to 13.68%. Our codes have been included in this https URL.

Title: MSAD-Net: Multiscale and Spatial Attention-based Dense Network for Lung Cancer Classification

Authors: Santanu Roy, Shweta Singh, Palak Sahu, Ashvath Suresh, Debashish Das
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14626
Pdf URL: https://arxiv.org/pdf/2504.14626
Copy Paste: [[2504.14626]] MSAD-Net: Multiscale and Spatial Attention-based Dense Network for Lung Cancer Classification(https://arxiv.org/abs/2504.14626)
Keywords: transformer
Abstract: Lung cancer, a severe form of malignant tumor that originates in the tissues of the lungs, can be fatal if not detected in its early stages. It ranks among the top causes of cancer-related mortality worldwide. Detecting lung cancer manually using chest X-Ray image or Computational Tomography (CT) scans image poses significant challenges for radiologists. Hence, there is a need for automatic diagnosis system of lung cancers from radiology images. With the recent emergence of deep learning, particularly through Convolutional Neural Networks (CNNs), the automated detection of lung cancer has become a much simpler task. Nevertheless, numerous researchers have addressed that the performance of conventional CNNs may be hindered due to class imbalance issue, which is prevalent in medical images. In this research work, we have proposed a novel CNN architecture ``Multi-Scale Dense Network (MSD-Net)'' (trained-from-scratch). The novelties we bring in the proposed model are (I) We introduce novel dense modules in the 4th block and 5th block of the CNN model. We have leveraged 3 depthwise separable convolutional (DWSC) layers, and one 1x1 convolutional layer in each dense module, in order to reduce complexity of the model considerably. (II) Additionally, we have incorporated one skip connection from 3rd block to 5th block and one parallel branch connection from 4th block to Global Average Pooling (GAP) layer. We have utilized dilated convolutional layer (with dilation rate=2) in the last parallel branch in order to extract multi-scale features. Extensive experiments reveal that our proposed model has outperformed latest CNN model ConvNext-Tiny, recent trend Vision Transformer (ViT), Pooling-based ViT (PiT), and other existing models by significant margins.

Title: Harnessing Generative LLMs for Enhanced Financial Event Entity Extraction Performance

Authors: Soo-joon Choi, Ji-jun Park
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14633
Pdf URL: https://arxiv.org/pdf/2504.14633
Copy Paste: [[2504.14633]] Harnessing Generative LLMs for Enhanced Financial Event Entity Extraction Performance(https://arxiv.org/abs/2504.14633)
Keywords: extraction, generative, large language model
Abstract: Financial event entity extraction is a crucial task for analyzing market dynamics and building financial knowledge graphs, yet it presents significant challenges due to the specialized language and complex structures in financial texts. Traditional approaches often rely on sequence labeling models, which can struggle with long-range dependencies and the inherent complexity of extracting multiple, potentially overlapping entities. Motivated by the advanced language understanding and generative capabilities of Large Language Models (LLMs), we propose a novel method that reframes financial event entity extraction as a text-to-structured-output generation task. Our approach involves fine-tuning a pre-trained LLM using Parameter-Efficient Fine-Tuning (PEFT) to directly generate a structured representation, such as a JSON object, containing the extracted entities and their precise character spans from the input text. We evaluate our method on the challenging CCKS 2019 Financial Event Entity Extraction dataset, comparing its performance against strong sequence labeling baselines, including SEBERTNets and sebertNets. Experimental results demonstrate that our generative LLM method achieves a new state-of-the-art F1 score on this benchmark, significantly outperforming previous methods. Through detailed quantitative analysis across event types, entity types, and instance complexity, as well as human evaluation, we show that our approach is more effective at handling the nuances of financial text and extracting high-quality entities. This work validates the potential of applying generative LLMs directly to complex, domain-specific information extraction tasks requiring structured output.

Title: NVSMask3D: Hard Visual Prompting with Camera Pose Interpolation for 3D Open Vocabulary Instance Segmentation

Authors: Junyuan Fang (1 and 2), Zihan Wang (1), Yejun Zhang (1), Shuzhe Wang (1), Iaroslav Melekhov (1), Juho Kannala (1 and 3) ((1) Aalto University, Espoo, Finland, (2) University of Helsinki, Helsinki, Finland, (3) University of Oulu, Oulu, Finland)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14638
Pdf URL: https://arxiv.org/pdf/2504.14638
Copy Paste: [[2504.14638]] NVSMask3D: Hard Visual Prompting with Camera Pose Interpolation for 3D Open Vocabulary Instance Segmentation(https://arxiv.org/abs/2504.14638)
Keywords: robust, segmentation
Abstract: Vision-language models (VLMs) have demonstrated impressive zero-shot transfer capabilities in image-level visual perception tasks. However, they fall short in 3D instance-level segmentation tasks that require accurate localization and recognition of individual objects. To bridge this gap, we introduce a novel 3D Gaussian Splatting based hard visual prompting approach that leverages camera interpolation to generate diverse viewpoints around target objects without any 2D-3D optimization or fine-tuning. Our method simulates realistic 3D perspectives, effectively augmenting existing hard visual prompts by enforcing geometric consistency across viewpoints. This training-free strategy seamlessly integrates with prior hard visual prompts, enriching object-descriptive features and enabling VLMs to achieve more robust and accurate 3D instance segmentation in diverse 3D scenes.

Title: Relation-R1: Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relational Comprehension

Authors: Lin Li, Wei Chen, Jiahui Li, Long Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14642
Pdf URL: https://arxiv.org/pdf/2504.14642
Copy Paste: [[2504.14642]] Relation-R1: Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relational Comprehension(https://arxiv.org/abs/2504.14642)
Keywords: large language model
Abstract: Recent advances in multi-modal large language models (MLLMs) have significantly improved object-level grounding and region captioning, but remain limited in visual relation understanding (\eg, scene graph generation), particularly in modeling \textit{N}-ary relationships that identify multiple semantic roles among an action event. Such a lack of \textit{semantic dependencies} modeling among multi-entities leads to unreliable outputs, intensifying MLLMs' hallucinations and over-reliance on language priors. To this end, we propose Relation-R1, the first unified relational comprehension framework that explicitly integrates cognitive chain-of-thought (CoT)-guided Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) within a reinforcement learning (RL) paradigm. Specifically, we first establish foundational reasoning capabilities via SFT, enforcing structured outputs with thinking processes. Then, GRPO is utilized to refine these outputs via multi-reward optimization, prioritizing visual-semantic grounding over language-induced biases, thereby improving generalization capability. Extensive experiments on widely-used PSG and SWiG datasets demonstrate that Relation-R1 achieves state-of-the-art performance in both binary and \textit{N}-ary relation understanding.

Title: Surrogate Fitness Metrics for Interpretable Reinforcement Learning

Authors: Philipp Altmann, Céline Davignon, Maximilian Zorn, Fabian Ritz, Claudia Linnhoff-Popien, Thomas Gabor
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14645
Pdf URL: https://arxiv.org/pdf/2504.14645
Copy Paste: [[2504.14645]] Surrogate Fitness Metrics for Interpretable Reinforcement Learning(https://arxiv.org/abs/2504.14645)
Keywords: interpretability, explainability
Abstract: We employ an evolutionary optimization framework that perturbs initial states to generate informative and diverse policy demonstrations. A joint surrogate fitness function guides the optimization by combining local diversity, behavioral certainty, and global population diversity. To assess demonstration quality, we apply a set of evaluation metrics, including the reward-based optimality gap, fidelity interquartile means (IQMs), fitness composition analysis, and trajectory visualizations. Hyperparameter sensitivity is also examined to better understand the dynamics of trajectory optimization. Our findings demonstrate that optimizing trajectory selection via surrogate fitness metrics significantly improves interpretability of RL policies in both discrete and continuous environments. In gridworld domains, evaluations reveal significantly enhanced demonstration fidelities compared to random and ablated baselines. In continuous control, the proposed framework offers valuable insights, particularly for early-stage policies, while fidelity-based optimization proves more effective for mature policies. By refining and systematically analyzing surrogate fitness functions, this study advances the interpretability of RL models. The proposed improvements provide deeper insights into RL decision-making, benefiting applications in safety-critical and explainability-focused domains.

Title: BLACKOUT: Data-Oblivious Computation with Blinded Capabilities

Authors: Hossam ElAtali, Merve Gülmez, Thomas Nyman, N. Asokan
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.14654
Pdf URL: https://arxiv.org/pdf/2504.14654
Copy Paste: [[2504.14654]] BLACKOUT: Data-Oblivious Computation with Blinded Capabilities(https://arxiv.org/abs/2504.14654)
Keywords: secure
Abstract: Lack of memory-safety and exposure to side channels are two prominent, persistent challenges for the secure implementation of software. Memory-safe programming languages promise to significantly reduce the prevalence of memory-safety bugs, but make it more difficult to implement side-channel-resistant code. We aim to address both memory-safety and side-channel resistance by augmenting memory-safe hardware with the ability for data-oblivious programming. We describe an extension to the CHERI capability architecture to provide blinded capabilities that allow data-oblivious computation to be carried out by userspace tasks. We also present BLACKOUT, our realization of blinded capabilities on a FPGA softcore based on the speculative out-of-order CHERI-Toooba processor and extend the CHERI-enabled Clang/LLVM compiler and the CheriBSD operating system with support for blinded capabilities. BLACKOUT makes writing side-channel-resistant code easier by making non-data-oblivious operations via blinded capabilities explicitly fault. Through rigorous evaluation we show that BLACKOUT ensures memory operated on through blinded capabilities is securely allocated, used, and reclaimed and demonstrate that, in benchmarks comparable to those used by previous work, BLACKOUT imposes only a small performance degradation (1.5% geometric mean) compared to the baseline CHERI-Toooba processor.

Title: LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs

Authors: Yunhui Xia, Wei Shen, Yan Wang, Jason Klein Liu, Huifeng Sun, Siyue Wu, Jian Hu, Xiaolong Xu
Subjects: cs.LG, cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2504.14655
Pdf URL: https://arxiv.org/pdf/2504.14655
Copy Paste: [[2504.14655]] LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs(https://arxiv.org/abs/2504.14655)
Keywords: robust
Abstract: We introduce LeetCodeDataset, a high-quality benchmark for evaluating and training code-generation models, addressing two key challenges in LLM research: the lack of reasoning-focused coding benchmarks and self-contained training testbeds. By curating LeetCode Python problems with rich metadata, broad coverage, 100+ test cases per problem, and temporal splits (pre/post July 2024), our dataset enables contamination-free evaluation and efficient supervised fine-tuning (SFT). Experiments show reasoning models significantly outperform non-reasoning counterparts, while SFT with only 2.6K model-generated solutions achieves performance comparable to 110K-sample counterparts. The dataset and evaluation framework are available on Hugging Face and Github.

Title: A Case Study Exploring the Current Landscape of Synthetic Medical Record Generation with Commercial LLMs

Authors: Yihan Lin, Zhirong Bella Yu, Simon Lee
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.14657
Pdf URL: https://arxiv.org/pdf/2504.14657
Copy Paste: [[2504.14657]] A Case Study Exploring the Current Landscape of Synthetic Medical Record Generation with Commercial LLMs(https://arxiv.org/abs/2504.14657)
Keywords: privacy, fair, large language model
Abstract: Synthetic Electronic Health Records (EHRs) offer a valuable opportunity to create privacy preserving and harmonized structured data, supporting numerous applications in healthcare. Key benefits of synthetic data include precise control over the data schema, improved fairness and representation of patient populations, and the ability to share datasets without concerns about compromising real individuals privacy. Consequently, the AI community has increasingly turned to Large Language Models (LLMs) to generate synthetic data across various domains. However, a significant challenge in healthcare is ensuring that synthetic health records reliably generalize across different hospitals, a long standing issue in the field. In this work, we evaluate the current state of commercial LLMs for generating synthetic data and investigate multiple aspects of the generation process to identify areas where these models excel and where they fall short. Our main finding from this work is that while LLMs can reliably generate synthetic health records for smaller subsets of features, they struggle to preserve realistic distributions and correlations as the dimensionality of the data increases, ultimately limiting their ability to generalize across diverse hospital settings.

Title: EmoSEM: Segment and Explain Emotion Stimuli in Visual Art

Authors: Jing Zhang, Dan Guo, Zhangbin Li, Meng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14658
Pdf URL: https://arxiv.org/pdf/2504.14658
Copy Paste: [[2504.14658]] EmoSEM: Segment and Explain Emotion Stimuli in Visual Art(https://arxiv.org/abs/2504.14658)
Keywords: segmentation
Abstract: This paper focuses on a key challenge in visual art understanding: given an art image, the model pinpoints pixel regions that trigger a specific human emotion, and generates linguistic explanations for the emotional arousal. Despite recent advances in art understanding, pixel-level emotion understanding still faces a dual challenge: first, the subjectivity of emotion makes it difficult for general segmentation models like SAM to adapt to emotion-oriented segmentation tasks; and second, the abstract nature of art expression makes it difficult for captioning models to balance pixel-level semantic understanding and emotion reasoning. To solve the above problems, this paper proposes the Emotion stimuli Segmentation and Explanation Model (EmoSEM) to endow the segmentation model SAM with emotion comprehension capability. First, to enable the model to perform segmentation under the guidance of emotional intent well, we introduce an emotional prompt with a learnable mask token as the conditional input for segmentation decoding. Then, we design an emotion projector to establish the association between emotion and visual features. Next, more importantly, to address emotion-visual stimuli alignment, we develop a lightweight prefix projector, a module that fuses the learned emotional mask with the corresponding emotion into a unified representation compatible with the language this http URL, we input the joint visual, mask, and emotional tokens into the language model and output the emotional explanations. It ensures that the generated interpretations remain semantically and emotionally coherent with the visual stimuli. The method innovatively realizes end-to-end modeling from low-level pixel features to high-level emotion interpretation, providing the first interpretable fine-grained analysis framework for artistic emotion computing. Extensive experiments validate the effectiveness of our model.

Title: Frequency-domain Learning with Kernel Prior for Blind Image Deblurring

Authors: Jixiang Sun, Fei Lei, Jiawei Zhang, Wenxiu Sun, Yujiu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14664
Pdf URL: https://arxiv.org/pdf/2504.14664
Copy Paste: [[2504.14664]] Frequency-domain Learning with Kernel Prior for Blind Image Deblurring(https://arxiv.org/abs/2504.14664)
Keywords: robust
Abstract: While achieving excellent results on various datasets, many deep learning methods for image deblurring suffer from limited generalization capabilities with out-of-domain data. This limitation is likely caused by their dependence on certain domain-specific datasets. To address this challenge, we argue that it is necessary to introduce the kernel prior into deep learning methods, as the kernel prior remains independent of the image context. For effective fusion of kernel prior information, we adopt a rational implementation method inspired by traditional deblurring algorithms that perform deconvolution in the frequency domain. We propose a module called Frequency Integration Module (FIM) for fusing the kernel prior and combine it with a frequency-based deblurring Transfomer network. Experimental results demonstrate that our method outperforms state-of-the-art methods on multiple blind image deblurring tasks, showcasing robust generalization abilities. Source code will be available soon.

Title: Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens

Authors: Kaihang Pan, Wang Lin, Zhongqi Yue, Tenglong Ao, Liyu Jia, Wei Zhao, Juncheng Li, Siliang Tang, Hanwang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14666
Pdf URL: https://arxiv.org/pdf/2504.14666
Copy Paste: [[2504.14666]] Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens(https://arxiv.org/abs/2504.14666)
Keywords: diffusion, generative, large language model
Abstract: Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation by combining LLM and diffusion models, the state-of-the-art in each task, respectively. Existing approaches rely on spatial visual tokens, where image patches are encoded and arranged according to a spatial order (e.g., raster scan). However, we show that spatial tokens lack the recursive structure inherent to languages, hence form an impossible language for LLM to master. In this paper, we build a proper visual language by leveraging diffusion timesteps to learn discrete, recursive visual tokens. Our proposed tokens recursively compensate for the progressive attribute loss in noisy images as timesteps increase, enabling the diffusion model to reconstruct the original image at any timestep. This approach allows us to effectively integrate the strengths of LLMs in autoregressive reasoning and diffusion models in precise image generation, achieving seamless multimodal comprehension and generation within a unified framework. Extensive experiments show that we achieve superior performance for multimodal comprehension and generation simultaneously compared with other MLLMs. Project Page: this https URL.

Title: Efficient Federated Split Learning for Large Language Models over Communication Networks

Authors: Kai Zhao, Zhaohui Yang
Subjects: cs.LG, cs.NI
Abstract URL: https://arxiv.org/abs/2504.14667
Pdf URL: https://arxiv.org/pdf/2504.14667
Copy Paste: [[2504.14667]] Efficient Federated Split Learning for Large Language Models over Communication Networks(https://arxiv.org/abs/2504.14667)
Keywords: privacy, federate, large language model
Abstract: Fine-tuning pre-trained large language models (LLM) in a distributed manner poses significant challenges on resource-constrained edge devices. To address this challenge, we propose FedsLLM, a novel framework that integrates split federated learning with parameter-efficient fine-tuning techniques. By leveraging model splitting and Low-Rank Adaptation (LoRA), FedsLLM reduces the computational burden on edge devices. Furthermore, the introduction of a federated server facilitates parallel training and enhances privacy. To accommodate heterogeneous communication conditions and diverse computational capabilities of edge devices, as well as the impact of LoRA rank selection on model convergence and training cost, we formulate a joint optimization problem. The formulated problem jointly optimizes subchannel allocation, power control, model splitting point selection, and LoRA rank configuration, all aimed at minimizing total training delay. An alternating optimization algorithm is developed to efficiently solve this problem and accelerate the training process. Simulation results demonstrate that the proposed FedsLLM framework achieves comparable model accuracy while significantly reducing client-side computational requirements. Furthermore, the proposed resource allocation scheme and adaptive LoRA rank selection strategy notably reduce the training latency compared to conventional approaches.

Title: Trans-Zero: Self-Play Incentivizes Large Language Models for Multilingual Translation Without Parallel Data

Authors: Wei Zou, Sen Yang, Yu Bao, Shujian Huang, Jiajun Chen, Shanbo Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14669
Pdf URL: https://arxiv.org/pdf/2504.14669
Copy Paste: [[2504.14669]] Trans-Zero: Self-Play Incentivizes Large Language Models for Multilingual Translation Without Parallel Data(https://arxiv.org/abs/2504.14669)
Keywords: robust, large language model
Abstract: The rise of Large Language Models (LLMs) has reshaped machine translation (MT), but multilingual MT still relies heavily on parallel data for supervised fine-tuning (SFT), facing challenges like data scarcity for low-resource languages and catastrophic forgetting. To address these issues, we propose TRANS-ZERO, a self-play framework that leverages only monolingual data and the intrinsic multilingual knowledge of LLM. TRANS-ZERO combines Genetic Monte-Carlo Tree Search (G-MCTS) with preference optimization, achieving strong translation performance that rivals supervised methods. Experiments demonstrate that this approach not only matches the performance of models trained on large-scale parallel data but also excels in non-English translation directions. Further analysis reveals that G-MCTS itself significantly enhances translation quality by exploring semantically consistent candidates through iterative translations, providing a robust foundation for the framework's succuss.

Title: Evaluating Temporal Plasticity in Foundation Time Series Models for Incremental Fine-tuning

Authors: Jia Liu, Cheng Jinguo, Xia Fang, Zhenyuan Ma, Yuankai Wu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14677
Pdf URL: https://arxiv.org/pdf/2504.14677
Copy Paste: [[2504.14677]] Evaluating Temporal Plasticity in Foundation Time Series Models for Incremental Fine-tuning(https://arxiv.org/abs/2504.14677)
Keywords: robust
Abstract: Time series foundation models excel at diverse time series forecasting tasks, but their capacity for continuous improvement through incremental learning remains unexplored. We present the first comprehensive study investigating these models' temporal plasticity - their ability to progressively enhance performance through continual learning while maintaining existing capabilities. Through experiments on real-world datasets exhibiting distribution shifts, we evaluate both conventional deep learning models and foundation models using a novel continual learning framework. Our findings reveal that while traditional models struggle with performance deterioration during incremental fine-tuning, foundation models like Time-MoE and Chronos demonstrate sustained improvement in predictive accuracy. This suggests that optimizing foundation model fine-tuning strategies may be more valuable than developing domain-specific small models. Our research introduces new evaluation methodologies and insights for developing foundation time series models with robust continuous learning capabilities.

Title: Seurat: From Moving Points to Depth

Authors: Seokju Cho, Jiahui Huang, Seungryong Kim, Joon-Young Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14687
Pdf URL: https://arxiv.org/pdf/2504.14687
Copy Paste: [[2504.14687]] Seurat: From Moving Points to Depth(https://arxiv.org/abs/2504.14687)
Keywords: robust, transformer
Abstract: Accurate depth estimation from monocular videos remains challenging due to ambiguities inherent in single-view geometry, as crucial depth cues like stereopsis are absent. However, humans often perceive relative depth intuitively by observing variations in the size and spacing of objects as they move. Inspired by this, we propose a novel method that infers relative depth by examining the spatial relationships and temporal evolution of a set of tracked 2D trajectories. Specifically, we use off-the-shelf point tracking models to capture 2D trajectories. Then, our approach employs spatial and temporal transformers to process these trajectories and directly infer depth changes over time. Evaluated on the TAPVid-3D benchmark, our method demonstrates robust zero-shot performance, generalizing effectively from synthetic to real-world datasets. Results indicate that our approach achieves temporally smooth, high-accuracy depth predictions across diverse domains.

Title: FarsEval-PKBETS: A new diverse benchmark for evaluating Persian large language models

Authors: Mehrnoush Shamsfard, Zahra Saaberi, Mostafa Karimi manesh, Seyed Mohammad Hossein Hashemi, Zahra Vatankhah, Motahareh Ramezani, Niki Pourazin, Tara Zare, Maryam Azimi, Sarina Chitsaz, Sama Khoraminejad, Morteza Mahdavi Mortazavi, Mohammad Mahdi Chizari, Sahar Maleki, Seyed Soroush Majd, Mostafa Masumi, Sayed Ali Musavi Khoeini, Amir Mohseni, Sogol Alipour
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14690
Pdf URL: https://arxiv.org/pdf/2504.14690
Copy Paste: [[2504.14690]] FarsEval-PKBETS: A new diverse benchmark for evaluating Persian large language models(https://arxiv.org/abs/2504.14690)
Keywords: large language model
Abstract: Research on evaluating and analyzing large language models (LLMs) has been extensive for resource-rich languages such as English, yet their performance in languages such as Persian has received considerably less attention. This paper introduces FarsEval-PKBETS benchmark, a subset of FarsEval project for evaluating large language models in Persian. This benchmark consists of 4000 questions and answers in various formats, including multiple choice, short answer and descriptive responses. It covers a wide range of domains and tasks,including medicine, law, religion, Persian language, encyclopedic knowledge, human preferences, social knowledge, ethics and bias, text generation, and respecting others' rights. This bechmark incorporates linguistics, cultural, and local considerations relevant to the Persian language and Iran. To ensure the questions are challenging for current LLMs, three models -- Llama3-70B, PersianMind, and Dorna -- were evaluated using this benchmark. Their average accuracy was below 50%, meaning they provided fully correct answers to fewer than half of the questions. These results indicate that current language models are still far from being able to solve this benchmark

Title: Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark

Authors: Enxin Song, Wenhao Chai, Weili Xu, Jianwen Xie, Yuxuan Liu, Gaoang Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14693
Pdf URL: https://arxiv.org/pdf/2504.14693
Copy Paste: [[2504.14693]] Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark(https://arxiv.org/abs/2504.14693)
Keywords: large language model
Abstract: Recent advancements in language multimodal models (LMMs) for video have demonstrated their potential for understanding video content, yet the task of comprehending multi-discipline lectures remains largely unexplored. We introduce Video-MMLU, a massive benchmark designed to evaluate the capabilities of LMMs in understanding Multi-Discipline Lectures. We evaluate over 90 open-source and proprietary models, ranging from 0.5B to 40B parameters. Our results highlight the limitations of current models in addressing the cognitive challenges presented by these lectures, especially in tasks requiring both perception and reasoning. Additionally, we explore how the number of visual tokens and the large language models influence performance, offering insights into the interplay between multimodal perception and reasoning in lecture comprehension.

Title: Learning Critically: Selective Self Distillation in Federated Learning on Non-IID Data

Authors: Yuting He, Yiqiang Chen, XiaoDong Yang, Hanchao Yu, Yi-Hua Huang, Yang Gu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14694
Pdf URL: https://arxiv.org/pdf/2504.14694
Copy Paste: [[2504.14694]] Learning Critically: Selective Self Distillation in Federated Learning on Non-IID Data(https://arxiv.org/abs/2504.14694)
Keywords: robust, federate
Abstract: Federated learning (FL) enables multiple clients to collaboratively train a global model while keeping local data decentralized. Data heterogeneity (non-IID) across clients has imposed significant challenges to FL, which makes local models re-optimize towards their own local optima and forget the global knowledge, resulting in performance degradation and convergence slowdown. Many existing works have attempted to address the non-IID issue by adding an extra global-model-based regularizing item to the local training but without an adaption scheme, which is not efficient enough to achieve high performance with deep learning models. In this paper, we propose a Selective Self-Distillation method for Federated learning (FedSSD), which imposes adaptive constraints on the local updates by self-distilling the global model's knowledge and selectively weighting it by evaluating the credibility at both the class and sample level. The convergence guarantee of FedSSD is theoretically analyzed and extensive experiments are conducted on three public benchmark datasets, which demonstrates that FedSSD achieves better generalization and robustness in fewer communication rounds, compared with other state-of-the-art FL methods.

Title: Quantitative Clustering in Mean-Field Transformer Models

Authors: Shi Chen, Zhengjiang Lin, Yury Polyanskiy, Philippe Rigollet
Subjects: cs.LG, math.AP, math.DS, stat.ML
Abstract URL: https://arxiv.org/abs/2504.14697
Pdf URL: https://arxiv.org/pdf/2504.14697
Copy Paste: [[2504.14697]] Quantitative Clustering in Mean-Field Transformer Models(https://arxiv.org/abs/2504.14697)
Keywords: transformer
Abstract: The evolution of tokens through a deep transformer models can be modeled as an interacting particle system that has been shown to exhibit an asymptotic clustering behavior akin to the synchronization phenomenon in Kuramoto models. In this work, we investigate the long-time clustering of mean-field transformer models. More precisely, we establish exponential rates of contraction to a Dirac point mass for any suitably regular initialization under some assumptions on the parameters of transformer models, any suitably regular mean-field initialization synchronizes exponentially fast with some quantitative rates.

Title: Connecting Parameter Magnitudes and Hessian Eigenspaces at Scale using Sketched Methods

Authors: Andres Fernandez, Frank Schneider, Maren Mahsereci, Philipp Hennig
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2504.14701
Pdf URL: https://arxiv.org/pdf/2504.14701
Copy Paste: [[2504.14701]] Connecting Parameter Magnitudes and Hessian Eigenspaces at Scale using Sketched Methods(https://arxiv.org/abs/2504.14701)
Keywords: interpretability
Abstract: Recently, it has been observed that when training a deep neural net with SGD, the majority of the loss landscape's curvature quickly concentrates in a tiny *top* eigenspace of the loss Hessian, which remains largely stable thereafter. Independently, it has been shown that successful magnitude pruning masks for deep neural nets emerge early in training and remain stable thereafter. In this work, we study these two phenomena jointly and show that they are connected: We develop a methodology to measure the similarity between arbitrary parameter masks and Hessian eigenspaces via Grassmannian metrics. We identify *overlap* as the most useful such metric due to its interpretability and stability. To compute *overlap*, we develop a matrix-free algorithm based on sketched SVDs that allows us to compute over 1000 Hessian eigenpairs for nets with over 10M parameters --an unprecedented scale by several orders of magnitude. Our experiments reveal an *overlap* between magnitude parameter masks and top Hessian eigenspaces consistently higher than chance-level, and that this effect gets accentuated for larger network sizes. This result indicates that *top Hessian eigenvectors tend to be concentrated around larger parameters*, or equivalently, that *larger parameters tend to align with directions of larger loss curvature*. Our work provides a methodology to approximate and analyze deep learning Hessians at scale, as well as a novel insight on the structure of their eigenspace.

Title: Evaluating BERTopic on Open-Ended Data: A Case Study with Belgian Dutch Daily Narratives

Authors: Ratna Kandala, Katie Hoemann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14707
Pdf URL: https://arxiv.org/pdf/2504.14707
Copy Paste: [[2504.14707]] Evaluating BERTopic on Open-Ended Data: A Case Study with Belgian Dutch Daily Narratives(https://arxiv.org/abs/2504.14707)
Keywords: robust
Abstract: This study explores BERTopic's potential for modeling open-ended Belgian Dutch daily narratives, contrasting its performance with Latent Dirichlet Allocation (LDA) and KMeans. Although LDA scores well on certain automated metrics, human evaluations reveal semantically irrelevant co-occurrences, highlighting the limitations of purely statistic-based methods. In contrast, BERTopic's reliance on contextual embeddings yields culturally resonant themes, underscoring the importance of hybrid evaluation frameworks that account for morphologically rich languages. KMeans performed less coherently than prior research suggested, pointing to the unique challenges posed by personal narratives. Our findings emphasize the need for robust generalization in NLP models, especially in underrepresented linguistic contexts.

Title: Time Frequency Analysis of EMG Signal for Gesture Recognition using Fine grained Features

Authors: Parshuram N. Aarotale, Ajita Rattani
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.14708
Pdf URL: https://arxiv.org/pdf/2504.14708
Copy Paste: [[2504.14708]] Time Frequency Analysis of EMG Signal for Gesture Recognition using Fine grained Features(https://arxiv.org/abs/2504.14708)
Keywords: robust
Abstract: Electromyography (EMG) based hand gesture recognition converts forearm muscle activity into control commands for prosthetics, rehabilitation, and human computer interaction. This paper proposes a novel approach to EMG-based hand gesture recognition that uses fine-grained classification and presents XMANet, which unifies low-level local and high level semantic cues through cross layer mutual attention among shallow to deep CNN experts. Using stacked spectrograms and scalograms derived from the Short Time Fourier Transform (STFT) and Wavelet Transform (WT), we benchmark XMANet against ResNet50, DenseNet-121, MobileNetV3, and EfficientNetB0. Experimental results on the Grabmyo dataset indicate that, using STFT, the proposed XMANet model outperforms the baseline ResNet50, EfficientNetB0, MobileNetV3, and DenseNet121 models with improvement of approximately 1.72%, 4.38%, 5.10%, and 2.53%, respectively. When employing the WT approach, improvements of around 1.57%, 1.88%, 1.46%, and 2.05% are observed over the same baselines. Similarly, on the FORS EMG dataset, the XMANet(ResNet50) model using STFT shows an improvement of about 5.04% over the baseline ResNet50. In comparison, the XMANet(DenseNet121) and XMANet(MobileNetV3) models yield enhancements of approximately 4.11% and 2.81%, respectively. Moreover, when using WT, the proposed XMANet achieves gains of around 4.26%, 9.36%, 5.72%, and 6.09% over the baseline ResNet50, DenseNet121, MobileNetV3, and EfficientNetB0 models, respectively. These results confirm that XMANet consistently improves performance across various architectures and signal processing techniques, demonstrating the strong potential of fine grained features for accurate and robust EMG classification.

Title: Med-2D SegNet: A Light Weight Deep Neural Network for Medical 2D Image Segmentation

Authors: Md. Sanaullah Chowdhury, Salauddin Tapu, Noyon Kumar Sarkar, Ferdous Bin Ali, Lameya Sabrin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14715
Pdf URL: https://arxiv.org/pdf/2504.14715
Copy Paste: [[2504.14715]] Med-2D SegNet: A Light Weight Deep Neural Network for Medical 2D Image Segmentation(https://arxiv.org/abs/2504.14715)
Keywords: robust, extraction, segmentation
Abstract: Accurate and efficient medical image segmentation is crucial for advancing clinical diagnostics and surgical planning, yet remains a complex challenge due to the variability in anatomical structures and the demand for low-complexity models. In this paper, we introduced Med-2D SegNet, a novel and highly efficient segmentation architecture that delivers outstanding accuracy while maintaining a minimal computational footprint. Med-2D SegNet achieves state-of-the-art performance across multiple benchmark datasets, including KVASIR-SEG, PH2, EndoVis, and GLAS, with an average Dice similarity coefficient (DSC) of 89.77% across 20 diverse datasets. Central to its success is the compact Med Block, a specialized encoder design that incorporates dimension expansion and parameter reduction, enabling precise feature extraction while keeping model parameters to a low count of just 2.07 million. Med-2D SegNet excels in cross-dataset generalization, particularly in polyp segmentation, where it was trained on KVASIR-SEG and showed strong performance on unseen datasets, demonstrating its robustness in zero-shot learning scenarios, even though we acknowledge that further improvements are possible. With top-tier performance in both binary and multi-class segmentation, Med-2D SegNet redefines the balance between accuracy and efficiency, setting a new benchmark for medical image analysis. This work paves the way for developing accessible, high-performance diagnostic tools suitable for clinical environments and resource-constrained settings, making it a step forward in the democratization of advanced medical technology.

Title: Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation

Authors: Tuhina Tripathi, Manya Wadhwa, Greg Durrett, Scott Niekum
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.14716
Pdf URL: https://arxiv.org/pdf/2504.14716
Copy Paste: [[2504.14716]] Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation(https://arxiv.org/abs/2504.14716)
Keywords: robust, large language model
Abstract: Large Language Models (LLMs) are widely used as proxies for human labelers in both training (Reinforcement Learning from AI Feedback) and large-scale response evaluation (LLM-as-a-judge). Alignment and evaluation are critical components in the development of reliable LLMs, and the choice of feedback protocol plays a central role in both but remains understudied. In this work, we show that the choice of feedback protocol (absolute scores versus relative preferences) can significantly affect evaluation reliability and induce systematic biases. In particular, we show that pairwise evaluation protocols are more vulnerable to distracted evaluation. Generator models can exploit spurious attributes (or distractor features) favored by the LLM judge, resulting in inflated scores for lower-quality outputs and misleading training signals. We find that absolute scoring is more robust to such manipulation, producing judgments that better reflect response quality and are less influenced by distractor features. Our results demonstrate that generator models can flip preferences by embedding distractor features, skewing LLM-as-a-judge comparisons and leading to inaccurate conclusions about model quality in benchmark evaluations. Pairwise preferences flip in about 35% of the cases, compared to only 9% for absolute scores. We offer recommendations for choosing feedback protocols based on dataset characteristics and evaluation objectives.

Title: TAPIP3D: Tracking Any Point in Persistent 3D Geometry

Authors: Bowei Zhang, Lei Ke, Adam W. Harley, Katerina Fragkiadaki
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.14717
Pdf URL: https://arxiv.org/pdf/2504.14717
Copy Paste: [[2504.14717]] TAPIP3D: Tracking Any Point in Persistent 3D Geometry(https://arxiv.org/abs/2504.14717)
Keywords: robust
Abstract: We introduce TAPIP3D, a novel approach for long-term 3D point tracking in monocular RGB and RGB-D videos. TAPIP3D represents videos as camera-stabilized spatio-temporal feature clouds, leveraging depth and camera motion information to lift 2D video features into a 3D world space where camera motion is effectively canceled. TAPIP3D iteratively refines multi-frame 3D motion estimates within this stabilized representation, enabling robust tracking over extended periods. To manage the inherent irregularities of 3D point distributions, we propose a Local Pair Attention mechanism. This 3D contextualization strategy effectively exploits spatial relationships in 3D, forming informative feature neighborhoods for precise 3D trajectory estimation. Our 3D-centric approach significantly outperforms existing 3D point tracking methods and even enhances 2D tracking accuracy compared to conventional 2D pixel trackers when accurate depth is available. It supports inference in both camera coordinates (i.e., unstabilized) and world coordinates, and our results demonstrate that compensating for camera motion improves tracking performance. Our approach replaces the conventional 2D square correlation neighborhoods used in prior 2D and 3D trackers, leading to more robust and accurate results across various 3D point tracking benchmarks. Project Page: this https URL

Title: SuperCL: Superpixel Guided Contrastive Learning for Medical Image Segmentation Pre-training

Authors: Shuang Zeng, Lei Zhu, Xinliang Zhang, Hangzhou He, Yanye Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14737
Pdf URL: https://arxiv.org/pdf/2504.14737
Copy Paste: [[2504.14737]] SuperCL: Superpixel Guided Contrastive Learning for Medical Image Segmentation Pre-training(https://arxiv.org/abs/2504.14737)
Keywords: segmentation
Abstract: Medical image segmentation is a critical yet challenging task, primarily due to the difficulty of obtaining extensive datasets of high-quality, expert-annotated images. Contrastive learning presents a potential but still problematic solution to this issue. Because most existing methods focus on extracting instance-level or pixel-to-pixel representation, which ignores the characteristics between intra-image similar pixel groups. Moreover, when considering contrastive pairs generation, most SOTA methods mainly rely on manually setting thresholds, which requires a large number of gradient experiments and lacks efficiency and generalization. To address these issues, we propose a novel contrastive learning approach named SuperCL for medical image segmentation pre-training. Specifically, our SuperCL exploits the structural prior and pixel correlation of images by introducing two novel contrastive pairs generation strategies: Intra-image Local Contrastive Pairs (ILCP) Generation and Inter-image Global Contrastive Pairs (IGCP) Generation. Considering superpixel cluster aligns well with the concept of contrastive pairs generation, we utilize the superpixel map to generate pseudo masks for both ILCP and IGCP to guide supervised contrastive learning. Moreover, we also propose two modules named Average SuperPixel Feature Map Generation (ASP) and Connected Components Label Generation (CCL) to better exploit the prior structural information for IGCP. Finally, experiments on 8 medical image datasets indicate our SuperCL outperforms existing 12 methods. i.e. Our SuperCL achieves a superior performance with more precise predictions from visualization figures and 3.15%, 5.44%, 7.89% DSC higher than the previous best results on MMWHS, CHAOS, Spleen with 10% annotations. Our code will be released after acceptance.

Title: PROMPTEVALS: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines

Authors: Reya Vir, Shreya Shankar, Harrison Chase, Will Fu-Hinthorn, Aditya Parameswaran
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14738
Pdf URL: https://arxiv.org/pdf/2504.14738
Copy Paste: [[2504.14738]] PROMPTEVALS: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines(https://arxiv.org/abs/2504.14738)
Keywords: large language model
Abstract: Large language models (LLMs) are increasingly deployed in specialized production data processing pipelines across diverse domains -- such as finance, marketing, and e-commerce. However, when running them in production across many inputs, they often fail to follow instructions or meet developer expectations. To improve reliability in these applications, creating assertions or guardrails for LLM outputs to run alongside the pipelines is essential. Yet, determining the right set of assertions that capture developer requirements for a task is challenging. In this paper, we introduce PROMPTEVALS, a dataset of 2087 LLM pipeline prompts with 12623 corresponding assertion criteria, sourced from developers using our open-source LLM pipeline tools. This dataset is 5x larger than previous collections. Using a hold-out test split of PROMPTEVALS as a benchmark, we evaluated closed- and open-source models in generating relevant assertions. Notably, our fine-tuned Mistral and Llama 3 models outperform GPT-4o by 20.93% on average, offering both reduced latency and improved performance. We believe our dataset can spur further research in LLM reliability, alignment, and prompt engineering.

Title: AltGDmin: Alternating GD and Minimization for Partly-Decoupled (Federated) Optimization

Authors: Namrata Vaswani
Subjects: cs.LG, math.OC, stat.ME
Abstract URL: https://arxiv.org/abs/2504.14741
Pdf URL: https://arxiv.org/pdf/2504.14741
Copy Paste: [[2504.14741]] AltGDmin: Alternating GD and Minimization for Partly-Decoupled (Federated) Optimization(https://arxiv.org/abs/2504.14741)
Keywords: robust, federate
Abstract: This article describes a novel optimization solution framework, called alternating gradient descent (GD) and minimization (AltGDmin), that is useful for many problems for which alternating minimization (AltMin) is a popular solution. AltMin is a special case of the block coordinate descent algorithm that is useful for problems in which minimization w.r.t one subset of variables keeping the other fixed is closed form or otherwise reliably solved. Denote the two blocks/subsets of the optimization variables Z by Za, Zb, i.e., Z = {Za, Zb}. AltGDmin is often a faster solution than AltMin for any problem for which (i) the minimization over one set of variables, Zb, is much quicker than that over the other set, Za; and (ii) the cost function is differentiable w.r.t. Za. Often, the reason for one minimization to be quicker is that the problem is ``decoupled" for Zb and each of the decoupled problems is quick to solve. This decoupling is also what makes AltGDmin communication-efficient for federated settings. Important examples where this assumption holds include (a) low rank column-wise compressive sensing (LRCS), low rank matrix completion (LRMC), (b) their outlier-corrupted extensions such as robust PCA, robust LRCS and robust LRMC; (c) phase retrieval and its sparse and low-rank model based extensions; (d) tensor extensions of many of these problems such as tensor LRCS and tensor completion; and (e) many partly discrete problems where GD does not apply -- such as clustering, unlabeled sensing, and mixed linear regression. LRCS finds important applications in multi-task representation learning and few shot learning, federated sketching, and accelerated dynamic MRI. LRMC and robust PCA find important applications in recommender systems, computer vision and video analytics.

Title: Advancing Video Anomaly Detection: A Bi-Directional Hybrid Framework for Enhanced Single- and Multi-Task Approaches

Authors: Guodong Shen, Yuqi Ouyang, Junru Lu, Yixuan Yang, Victor Sanchez
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2504.14753
Pdf URL: https://arxiv.org/pdf/2504.14753
Copy Paste: [[2504.14753]] Advancing Video Anomaly Detection: A Bi-Directional Hybrid Framework for Enhanced Single- and Multi-Task Approaches(https://arxiv.org/abs/2504.14753)
Keywords: transformer
Abstract: Despite the prevailing transition from single-task to multi-task approaches in video anomaly detection, we observe that many adopt sub-optimal frameworks for individual proxy tasks. Motivated by this, we contend that optimizing single-task frameworks can advance both single- and multi-task approaches. Accordingly, we leverage middle-frame prediction as the primary proxy task, and introduce an effective hybrid framework designed to generate accurate predictions for normal frames and flawed predictions for abnormal frames. This hybrid framework is built upon a bi-directional structure that seamlessly integrates both vision transformers and ConvLSTMs. Specifically, we utilize this bi-directional structure to fully analyze the temporal dimension by predicting frames in both forward and backward directions, significantly boosting the detection stability. Given the transformer's capacity to model long-range contextual dependencies, we develop a convolutional temporal transformer that efficiently associates feature maps from all context frames to generate attention-based predictions for target frames. Furthermore, we devise a layer-interactive ConvLSTM bridge that facilitates the smooth flow of low-level features across layers and time-steps, thereby strengthening predictions with fine details. Anomalies are eventually identified by scrutinizing the discrepancies between target frames and their corresponding predictions. Several experiments conducted on public benchmarks affirm the efficacy of our hybrid framework, whether used as a standalone single-task approach or integrated as a branch in a multi-task approach. These experiments also underscore the advantages of merging vision transformers and ConvLSTMs for video anomaly detection.

Title: Establishing Workload Identity for Zero Trust CI/CD: From Secrets to SPIFFE-Based Authentication

Authors: Surya Teja Avirneni
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2504.14760
Pdf URL: https://arxiv.org/pdf/2504.14760
Copy Paste: [[2504.14760]] Establishing Workload Identity for Zero Trust CI/CD: From Secrets to SPIFFE-Based Authentication(https://arxiv.org/abs/2504.14760)
Keywords: secure, attack
Abstract: CI/CD systems have become privileged automation agents in modern infrastructure, but their identity is still based on secrets or temporary credentials passed between systems. In enterprise environments, these platforms are centralized and shared across teams, often with broad cloud permissions and limited isolation. These conditions introduce risk, especially in the era of supply chain attacks, where implicit trust and static credentials leave systems exposed. This paper describes the shift from static credentials to OpenID Connect (OIDC) federation, and introduces SPIFFE (Secure Production Identity Framework for Everyone) as a runtime-issued, platform-neutral identity model for non-human actors. SPIFFE decouples identity from infrastructure, enabling strong, portable authentication across job runners and deployed workloads. We show how SPIFFE identities support policy alignment, workload attestation, and mutual authentication. The paper concludes by outlining next steps in enabling policy-based access, forming the basis of a broader Zero Trust architecture for CI/CD.

Title: Decoupling Identity from Access: Credential Broker Patterns for Secure CI/CD

Authors: Surya Teja Avirneni
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2504.14761
Pdf URL: https://arxiv.org/pdf/2504.14761
Copy Paste: [[2504.14761]] Decoupling Identity from Access: Credential Broker Patterns for Secure CI/CD(https://arxiv.org/abs/2504.14761)
Keywords: secure
Abstract: Credential brokers offer a way to separate identity from access in CI/CD systems. This paper shows how verifiable identities issued at runtime, such as those from SPIFFE, can be used with brokers to enable short-lived, policy-driven credentials for pipelines and workloads. We walk through practical design patterns, including brokers that issue tokens just in time, apply access policies, and operate across trust domains. These ideas help reduce static permissions, improve auditability, and support Zero Trust goals in deployment workflows. This is the second paper in a three-part series on secure CI/CD identity architecture.

Title: A Combinatorial Theory of Dropout: Subnetworks, Graph Geometry, and Generalization

Authors: Sahil Rajesh Dhayalkar
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14762
Pdf URL: https://arxiv.org/pdf/2504.14762
Copy Paste: [[2504.14762]] A Combinatorial Theory of Dropout: Subnetworks, Graph Geometry, and Generalization(https://arxiv.org/abs/2504.14762)
Keywords: robust
Abstract: We propose a combinatorial and graph-theoretic theory of dropout by modeling training as a random walk over a high-dimensional graph of binary subnetworks. Each node represents a masked version of the network, and dropout induces stochastic traversal across this space. We define a subnetwork contribution score that quantifies generalization and show that it varies smoothly over the graph. Using tools from spectral graph theory, PAC-Bayes analysis, and combinatorics, we prove that generalizing subnetworks form large, connected, low-resistance clusters, and that their number grows exponentially with network width. This reveals dropout as a mechanism for sampling from a robust, structured ensemble of well-generalizing subnetworks with built-in redundancy. Extensive experiments validate every theoretical claim across diverse architectures. Together, our results offer a unified foundation for understanding dropout and suggest new directions for mask-guided regularization and subnetwork optimization.

Title: Disentangling Linguistic Features with Dimension-Wise Analysis of Vector Embeddings

Authors: Saniya Karwa, Navpreet Singh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14766
Pdf URL: https://arxiv.org/pdf/2504.14766
Copy Paste: [[2504.14766]] Disentangling Linguistic Features with Dimension-Wise Analysis of Vector Embeddings(https://arxiv.org/abs/2504.14766)
Keywords: robust, interpretability
Abstract: Understanding the inner workings of neural embeddings, particularly in models such as BERT, remains a challenge because of their high-dimensional and opaque nature. This paper proposes a framework for uncovering the specific dimensions of vector embeddings that encode distinct linguistic properties (LPs). We introduce the Linguistically Distinct Sentence Pairs (LDSP-10) dataset, which isolates ten key linguistic features such as synonymy, negation, tense, and quantity. Using this dataset, we analyze BERT embeddings with various methods, including the Wilcoxon signed-rank test, mutual information, and recursive feature elimination, to identify the most influential dimensions for each LP. We introduce a new metric, the Embedding Dimension Impact (EDI) score, which quantifies the relevance of each embedding dimension to a LP. Our findings show that certain properties, such as negation and polarity, are robustly encoded in specific dimensions, while others, like synonymy, exhibit more complex patterns. This study provides insights into the interpretability of embeddings, which can guide the development of more transparent and optimized language models, with implications for model bias mitigation and the responsible deployment of AI systems.

Title: Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions

Authors: Luyang Fang, Xiaowei Yu, Jiazhang Cai, Yongkai Chen, Shushan Wu, Zhengliang Liu, Zhenyuan Yang, Haoran Lu, Xilin Gong, Yufang Liu, Terry Ma, Wei Ruan, Ali Abbasi, Jing Zhang, Tao Wang, Ehsan Latif, Wei Liu, Wei Zhang, Soheil Kolouri, Xiaoming Zhai, Dajiang Zhu, Wenxuan Zhong, Tianming Liu, Ping Ma
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2504.14772
Pdf URL: https://arxiv.org/pdf/2504.14772
Copy Paste: [[2504.14772]] Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions(https://arxiv.org/abs/2504.14772)
Keywords: generative, large language model
Abstract: The exponential growth of Large Language Models (LLMs) continues to highlight the need for efficient strategies to meet ever-expanding computational and data demands. This survey provides a comprehensive analysis of two complementary paradigms: Knowledge Distillation (KD) and Dataset Distillation (DD), both aimed at compressing LLMs while preserving their advanced reasoning capabilities and linguistic diversity. We first examine key methodologies in KD, such as task-specific alignment, rationale-based training, and multi-teacher frameworks, alongside DD techniques that synthesize compact, high-impact datasets through optimization-based gradient matching, latent space regularization, and generative synthesis. Building on these foundations, we explore how integrating KD and DD can produce more effective and scalable compression strategies. Together, these approaches address persistent challenges in model scalability, architectural heterogeneity, and the preservation of emergent LLM abilities. We further highlight applications across domains such as healthcare and education, where distillation enables efficient deployment without sacrificing performance. Despite substantial progress, open challenges remain in preserving emergent reasoning and linguistic diversity, enabling efficient adaptation to continually evolving teacher models and datasets, and establishing comprehensive evaluation protocols. By synthesizing methodological innovations, theoretical foundations, and practical insights, our survey charts a path toward sustainable, resource-efficient LLMs through the tighter integration of KD and DD principles.

Title: Novel Concept-Oriented Synthetic Data approach for Training Generative AI-Driven Crystal Grain Analysis Using Diffusion Model

Authors: Ahmed Sobhi Saleh, Kristof Croes, Hajdin Ceric, Ingrid De Wolf, Houman Zahedmanesh
Subjects: cs.LG, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2504.14782
Pdf URL: https://arxiv.org/pdf/2504.14782
Copy Paste: [[2504.14782]] Novel Concept-Oriented Synthetic Data approach for Training Generative AI-Driven Crystal Grain Analysis Using Diffusion Model(https://arxiv.org/abs/2504.14782)
Keywords: diffusion, generative
Abstract: The traditional techniques for extracting polycrystalline grain structures from microscopy images, such as transmission electron microscopy (TEM) and scanning electron microscopy (SEM), are labour-intensive, subjective, and time-consuming, limiting their scalability for high-throughput analysis. In this study, we present an automated methodology integrating edge detection with generative diffusion models to effectively identify grains, eliminate noise, and connect broken segments in alignment with predicted grain boundaries. Due to the limited availability of adequate images preventing the training of deep machine learning models, a new seven-stage methodology is employed to generate synthetic TEM images for training. This concept-oriented synthetic data approach can be extended to any field of interest where the scarcity of data is a challenge. The presented model was applied to various metals with average grain sizes down to the nanoscale, producing grain morphologies from low-resolution TEM images that are comparable to those obtained from advanced and demanding experimental techniques with an average accuracy of 97.23%.

Title: How Effective Can Dropout Be in Multiple Instance Learning ?

Authors: Wenhui Zhu, Peijie Qiu, Xiwen Chen, Zhangsihao Yang, Aristeidis Sotiras, Abolfazl Razi, Yalin Wang
Subjects: cs.CV, cs.AI, eess.IV, stat.ML
Abstract URL: https://arxiv.org/abs/2504.14783
Pdf URL: https://arxiv.org/pdf/2504.14783
Copy Paste: [[2504.14783]] How Effective Can Dropout Be in Multiple Instance Learning ?(https://arxiv.org/abs/2504.14783)
Keywords: attack
Abstract: Multiple Instance Learning (MIL) is a popular weakly-supervised method for various applications, with a particular interest in histological whole slide image (WSI) classification. Due to the gigapixel resolution of WSI, applications of MIL in WSI typically necessitate a two-stage training scheme: first, extract features from the pre-trained backbone and then perform MIL aggregation. However, it is well-known that this suboptimal training scheme suffers from "noisy" feature embeddings from the backbone and inherent weak supervision, hindering MIL from learning rich and generalizable features. However, the most commonly used technique (i.e., dropout) for mitigating this issue has yet to be explored in MIL. In this paper, we empirically explore how effective the dropout can be in MIL. Interestingly, we observe that dropping the top-k most important instances within a bag leads to better performance and generalization even under noise attack. Based on this key observation, we propose a novel MIL-specific dropout method, termed MIL-Dropout, which systematically determines which instances to drop. Experiments on five MIL benchmark datasets and two WSI datasets demonstrate that MIL-Dropout boosts the performance of current MIL methods with a negligible computational cost. The code is available at this https URL.

Title: When Cloud Removal Meets Diffusion Model in Remote Sensing

Authors: Zhenyu Yu, Mohd Yamani Idna Idris, Pei Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14785
Pdf URL: https://arxiv.org/pdf/2504.14785
Copy Paste: [[2504.14785]] When Cloud Removal Meets Diffusion Model in Remote Sensing(https://arxiv.org/abs/2504.14785)
Keywords: robust, diffusion
Abstract: Cloud occlusion significantly hinders remote sensing applications by obstructing surface information and complicating analysis. To address this, we propose DC4CR (Diffusion Control for Cloud Removal), a novel multimodal diffusion-based framework for cloud removal in remote sensing imagery. Our method introduces prompt-driven control, allowing selective removal of thin and thick clouds without relying on pre-generated cloud masks, thereby enhancing preprocessing efficiency and model adaptability. Additionally, we integrate low-rank adaptation for computational efficiency, subject-driven generation for improved generalization, and grouped learning to enhance performance on small datasets. Designed as a plug-and-play module, DC4CR seamlessly integrates into existing cloud removal models, providing a scalable and robust solution. Extensive experiments on the RICE and CUHK-CR datasets demonstrate state-of-the-art performance, achieving superior cloud removal across diverse conditions. This work presents a practical and efficient approach for remote sensing image processing with broad real-world applications.

Title: Enhanced Data-driven Topology Design Methodology with Multi-level Mesh and Correlation-based Mutation for Stress-related Multi-objective Optimization

Authors: Jun Yang, Shintaro Yamasaki
Subjects: cs.LG, math.NA
Abstract URL: https://arxiv.org/abs/2504.14790
Pdf URL: https://arxiv.org/pdf/2504.14790
Copy Paste: [[2504.14790]] Enhanced Data-driven Topology Design Methodology with Multi-level Mesh and Correlation-based Mutation for Stress-related Multi-objective Optimization(https://arxiv.org/abs/2504.14790)
Keywords: generative
Abstract: Topology optimization (TO) serves as a widely applied structural design approach to tackle various engineering problems. Nevertheless, sensitivity-based TO methods usually struggle with solving strongly nonlinear optimization problems. By leveraging high capacity of deep generative model, which is an influential machine learning technique, the sensitivity-free data-driven topology design (DDTD) methodology is regarded as an effective means of overcoming these issues. The DDTD methodology depends on initial dataset with a certain regularity, making its results highly sensitive to initial dataset quality. This limits its effectiveness and generalizability, especially for optimization problems without priori information. In this research, we proposed a multi-level mesh DDTD-based method with correlation-based mutation module to escape from the limitation of the quality of the initial dataset on the results and enhance computational efficiency. The core is to employ a correlation-based mutation module to assign new geometric features with physical meaning to the generated data, while utilizing a multi-level mesh strategy to progressively enhance the refinement of the structural representation, thus avoiding the maintenance of a high degree-of-freedom (DOF) representation throughout the iterative process. The proposed multi-level mesh DDTD-based method can be driven by a low quality initial dataset without the need for time-consuming construction of a specific dataset, thus significantly increasing generality and reducing application difficulty, while further lowering computational cost of DDTD methodology. Various comparison experiments with the traditional sensitivity-based TO methods on stress-related strongly nonlinear problems demonstrate the generality and effectiveness of the proposed method.

Title: Edge-boosted graph learning for functional brain connectivity analysis

Authors: David Yang, Mostafa Abdelmegeed, John Modl, Minjeong Kim
Subjects: cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2504.14796
Pdf URL: https://arxiv.org/pdf/2504.14796
Copy Paste: [[2504.14796]] Edge-boosted graph learning for functional brain connectivity analysis(https://arxiv.org/abs/2504.14796)
Keywords: generative
Abstract: Predicting disease states from functional brain connectivity is critical for the early diagnosis of severe neurodegenerative diseases such as Alzheimer's Disease and Parkinson's Disease. Existing studies commonly employ Graph Neural Networks (GNNs) to infer clinical diagnoses from node-based brain connectivity matrices generated through node-to-node similarities of regionally averaged fMRI signals. However, recent neuroscience studies found that such node-based connectivity does not accurately capture ``functional connections" within the brain. This paper proposes a novel approach to brain network analysis that emphasizes edge functional connectivity (eFC), shifting the focus to inter-edge relationships. Additionally, we introduce a co-embedding technique to integrate edge functional connections effectively. Experimental results on the ADNI and PPMI datasets demonstrate that our method significantly outperforms state-of-the-art GNN methods in classifying functional brain networks.

Title: Verifying Robust Unlearning: Probing Residual Knowledge in Unlearned Models

Authors: Hao Xuan, Xingyu Li
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2504.14798
Pdf URL: https://arxiv.org/pdf/2504.14798
Copy Paste: [[2504.14798]] Verifying Robust Unlearning: Probing Residual Knowledge in Unlearned Models(https://arxiv.org/abs/2504.14798)
Keywords: security, privacy, protect, attack, robust, generative
Abstract: Machine Unlearning (MUL) is crucial for privacy protection and content regulation, yet recent studies reveal that traces of forgotten information persist in unlearned models, enabling adversaries to resurface removed knowledge. Existing verification methods only confirm whether unlearning was executed, failing to detect such residual information leaks. To address this, we introduce the concept of Robust Unlearning, ensuring models are indistinguishable from retraining and resistant to adversarial recovery. To empirically evaluate whether unlearning techniques meet this security standard, we propose the Unlearning Mapping Attack (UMA), a post-unlearning verification framework that actively probes models for forgotten traces using adversarial queries. Extensive experiments on discriminative and generative tasks show that existing unlearning techniques remain vulnerable, even when passing existing verification metrics. By establishing UMA as a practical verification tool, this study sets a new standard for assessing and enhancing machine unlearning security.

Title: Automatic Evaluation Metrics for Document-level Translation: Overview, Challenges and Trends

Authors: Jiaxin GUO, Xiaoyu Chen, Zhiqiang Rao, Jinlong Yang, Zongyao Li, Hengchao Shang, Daimeng Wei, Hao Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14804
Pdf URL: https://arxiv.org/pdf/2504.14804
Copy Paste: [[2504.14804]] Automatic Evaluation Metrics for Document-level Translation: Overview, Challenges and Trends(https://arxiv.org/abs/2504.14804)
Keywords: robust, interpretability, large language model
Abstract: With the rapid development of deep learning technologies, the field of machine translation has witnessed significant progress, especially with the advent of large language models (LLMs) that have greatly propelled the advancement of document-level translation. However, accurately evaluating the quality of document-level translation remains an urgent issue. This paper first introduces the development status of document-level translation and the importance of evaluation, highlighting the crucial role of automatic evaluation metrics in reflecting translation quality and guiding the improvement of translation systems. It then provides a detailed analysis of the current state of automatic evaluation schemes and metrics, including evaluation methods with and without reference texts, as well as traditional metrics, Model-based metrics and LLM-based metrics. Subsequently, the paper explores the challenges faced by current evaluation methods, such as the lack of reference diversity, dependence on sentence-level alignment information, and the bias, inaccuracy, and lack of interpretability of the LLM-as-a-judge method. Finally, the paper looks ahead to the future trends in evaluation methods, including the development of more user-friendly document-level evaluation methods and more robust LLM-as-a-judge methods, and proposes possible research directions, such as reducing the dependency on sentence-level information, introducing multi-level and multi-granular evaluation approaches, and training models specifically for machine translation evaluation. This study aims to provide a comprehensive analysis of automatic evaluation for document-level translation and offer insights into future developments.

Title: Dynamic Contrastive Skill Learning with State-Transition Based Skill Clustering and Dynamic Length Adjustment

Authors: Jinwoo Choi, Seung-Woo Seo
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2504.14805
Pdf URL: https://arxiv.org/pdf/2504.14805
Copy Paste: [[2504.14805]] Dynamic Contrastive Skill Learning with State-Transition Based Skill Clustering and Dynamic Length Adjustment(https://arxiv.org/abs/2504.14805)
Keywords: extraction
Abstract: Reinforcement learning (RL) has made significant progress in various domains, but scaling it to long-horizon tasks with complex decision-making remains challenging. Skill learning attempts to address this by abstracting actions into higher-level behaviors. However, current approaches often fail to recognize semantically similar behaviors as the same skill and use fixed skill lengths, limiting flexibility and generalization. To address this, we propose Dynamic Contrastive Skill Learning (DCSL), a novel framework that redefines skill representation and learning. DCSL introduces three key ideas: state-transition based skill representation, skill similarity function learning, and dynamic skill length adjustment. By focusing on state transitions and leveraging contrastive learning, DCSL effectively captures the semantic context of behaviors and adapts skill lengths to match the appropriate temporal extent of behaviors. Our approach enables more flexible and adaptive skill extraction, particularly in complex or noisy datasets, and demonstrates competitive performance compared to existing methods in task completion and efficiency.

Title: On Self-improving Token Embeddings

Authors: Mario M. Kubek, Shiraj Pokharel, Thomas Böhme, Emma L. McDaniel, Herwig Unger, Armin R. Mikler
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2504.14808
Pdf URL: https://arxiv.org/pdf/2504.14808
Copy Paste: [[2504.14808]] On Self-improving Token Embeddings(https://arxiv.org/abs/2504.14808)
Keywords: large language model
Abstract: This article introduces a novel and fast method for refining pre-trained static word or, more generally, token embeddings. By incorporating the embeddings of neighboring tokens in text corpora, it continuously updates the representation of each token, including those without pre-assigned embeddings. This approach effectively addresses the out-of-vocabulary problem, too. Operating independently of large language models and shallow neural networks, it enables versatile applications such as corpus exploration, conceptual search, and word sense disambiguation. The method is designed to enhance token representations within topically homogeneous corpora, where the vocabulary is restricted to a specific domain, resulting in more meaningful embeddings compared to general-purpose pre-trained vectors. As an example, the methodology is applied to explore storm events and their impacts on infrastructure and communities using narratives from a subset of the NOAA Storm Events database. The article also demonstrates how the approach improves the representation of storm-related terms over time, providing valuable insights into the evolving nature of disaster narratives.

Title: vApps: Verifiable Applications at Internet Scale

Authors: Isaac Zhang, Ryan Zarick, Bryan Pellegrino, Tan Li, Daniel Wong, Thomas Kim, Uma Roy, John Guibas, Kshitij Kulkarni
Subjects: cs.CR, cs.CY
Abstract URL: https://arxiv.org/abs/2504.14809
Pdf URL: https://arxiv.org/pdf/2504.14809
Copy Paste: [[2504.14809]] vApps: Verifiable Applications at Internet Scale(https://arxiv.org/abs/2504.14809)
Keywords: security, robust
Abstract: Blockchain technology promises decentralized, trustless, and interoperable infrastructure. However, widespread adoption remains hindered by issues such as limited scalability, high transaction costs, and the complexity of maintaining coherent verification logic across different blockchain layers. This paper introduces Verifiable Applications (vApps), a novel development framework designed to streamline the creation and deployment of verifiable blockchain computing applications. vApps offer a unified Rust-based Domain-Specific Language (DSL) within a comprehensive SDK, featuring modular abstractions for verification, proof generation, and inter-chain connectivity. This eases the developer's burden in securing diverse software components, allowing them to focus on application logic. The DSL also ensures that applications can automatically take advantage of specialized precompiles and hardware acceleration to achieve consistently high performance with minimal developer effort, as demonstrated by benchmark results for zero-knowledge virtual machines (zkVMs). Experiments show that native Rust execution eliminates interpretation overhead, delivering up to an 832x cycle count improvement compared to EVM-based approaches. Precompiled circuits accelerate proving by over 95%, while GPU acceleration boosts throughput by up to 30x and recursion compresses proof size by up to 230x, enabling succinct and efficient verification. The framework also supports seamless integration with Web2 and Web3 systems, enabling developers to focus solely on their application logic. Through modular architecture, robust security guarantees, and composability, vApps pave the way toward a trust-minimized and verifiable Internet-scale application environment.

Title: CSI2Dig: Recovering Digit Content from Smartphone Loudspeakers Using Channel State Information

Authors: Yangyang Gu, Xianglong Li, Haolin Wu, Jing Chen, Kun He, Ruiying Du, Cong Wu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.14812
Pdf URL: https://arxiv.org/pdf/2504.14812
Copy Paste: [[2504.14812]] CSI2Dig: Recovering Digit Content from Smartphone Loudspeakers Using Channel State Information(https://arxiv.org/abs/2504.14812)
Keywords: security, extraction
Abstract: Eavesdropping on sounds emitted by mobile device loudspeakers can capture sensitive digital information, such as SMS verification codes, credit card numbers, and withdrawal passwords, which poses significant security risks. Existing schemes either require expensive specialized equipment, rely on spyware, or are limited to close-range signal acquisition. In this paper, we propose a scheme, CSI2Dig, for recovering digit content from Channel State Information (CSI) when digits are played through a smartphone loudspeaker. We observe that the electromagnetic interference caused by the audio signals from the loudspeaker affects the WiFi signals emitted by the phone's WiFi antenna. Building upon contrastive learning and denoising autoencoders, we develop a two-branch autoencoder network designed to amplify the impact of this electromagnetic interference on CSI. For feature extraction, we introduce the TS-Net, a model that captures relevant features from both the temporal and spatial dimensions of the CSI data. We evaluate our scheme across various devices, distances, volumes, and other settings. Experimental results demonstrate that our scheme can achieve an accuracy of 72.97%.

Title: A Basic Evaluation of Neural Networks Trained with the Error Diffusion Learning Algorithm

Authors: Kazuhisa Fujita
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.14814
Pdf URL: https://arxiv.org/pdf/2504.14814
Copy Paste: [[2504.14814]] A Basic Evaluation of Neural Networks Trained with the Error Diffusion Learning Algorithm(https://arxiv.org/abs/2504.14814)
Keywords: extraction, diffusion
Abstract: Artificial neural networks are powerful tools capable of addressing various tasks. Although the backpropagation algorithm has become a standard training method for these neural networks, its lack of biological plausibility has inspired the development of alternative learning approaches. One such alternative is Kaneko's Error Diffusion Learning Algorithm (EDLA), a biologically motivated approach wherein a single global error signal diffuses throughout a network composed of paired excitatory-inhibitory sublayers, thereby eliminating the necessity for layer-wise backpropagation. This study presents a contemporary formulation of the EDLA framework and evaluates its effectiveness through parity check, regression, and image classification tasks. Our experimental results indicate that EDLA networks can consistently achieve high accuracy across these benchmarks, with performance efficiency and convergence speed notably influenced by the choice of learning rate, neuron count, and network depth. Further investigation of the internal representations formed by EDLA networks reveals their capacity for meaningful feature extraction, similar to traditional neural networks. These results suggest that EDLA is a biologically motivated alternative for training feedforward networks and will motivate future work on extending this method to biologically inspired neural networks.

Title: What Lurks Within? Concept Auditing for Shared Diffusion Models at Scale

Authors: Xiaoyong Yuan, Xiaolong Ma, Linke Guo, Lan Zhang
Subjects: cs.LG, cs.AI, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2504.14815
Pdf URL: https://arxiv.org/pdf/2504.14815
Copy Paste: [[2504.14815]] What Lurks Within? Concept Auditing for Shared Diffusion Models at Scale(https://arxiv.org/abs/2504.14815)
Keywords: diffusion, generative
Abstract: Diffusion models (DMs) have revolutionized text-to-image generation, enabling the creation of highly realistic and customized images from text prompts. With the rise of parameter-efficient fine-tuning (PEFT) techniques like LoRA, users can now customize powerful pre-trained models using minimal computational resources. However, the widespread sharing of fine-tuned DMs on open platforms raises growing ethical and legal concerns, as these models may inadvertently or deliberately generate sensitive or unauthorized content, such as copyrighted material, private individuals, or harmful content. Despite the increasing regulatory attention on generative AI, there are currently no practical tools for systematically auditing these models before deployment. In this paper, we address the problem of concept auditing: determining whether a fine-tuned DM has learned to generate a specific target concept. Existing approaches typically rely on prompt-based input crafting and output-based image classification but suffer from critical limitations, including prompt uncertainty, concept drift, and poor scalability. To overcome these challenges, we introduce Prompt-Agnostic Image-Free Auditing (PAIA), a novel, model-centric concept auditing framework. By treating the DM as the object of inspection, PAIA enables direct analysis of internal model behavior, bypassing the need for optimized prompts or generated images. We evaluate PAIA on 320 controlled model and 690 real-world community models sourced from a public DM sharing platform. PAIA achieves over 90% detection accuracy while reducing auditing time by 18-40x compared to existing baselines. To our knowledge, PAIA is the first scalable and practical solution for pre-deployment concept auditing of diffusion models, providing a practical foundation for safer and more transparent diffusion model sharing.

Title: ECViT: Efficient Convolutional Vision Transformer with Local-Attention and Multi-scale Stages

Authors: Zhoujie Qian
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14825
Pdf URL: https://arxiv.org/pdf/2504.14825
Copy Paste: [[2504.14825]] ECViT: Efficient Convolutional Vision Transformer with Local-Attention and Multi-scale Stages(https://arxiv.org/abs/2504.14825)
Keywords: extraction, transformer
Abstract: Vision Transformers (ViTs) have revolutionized computer vision by leveraging self-attention to model long-range dependencies. However, ViTs face challenges such as high computational costs due to the quadratic scaling of self-attention and the requirement of a large amount of training data. To address these limitations, we propose the Efficient Convolutional Vision Transformer (ECViT), a hybrid architecture that effectively combines the strengths of CNNs and Transformers. ECViT introduces inductive biases such as locality and translation invariance, inherent to Convolutional Neural Networks (CNNs) into the Transformer framework by extracting patches from low-level features and enhancing the encoder with convolutional operations. Additionally, it incorporates local-attention and a pyramid structure to enable efficient multi-scale feature extraction and representation. Experimental results demonstrate that ECViT achieves an optimal balance between performance and efficiency, outperforming state-of-the-art models on various image classification tasks while maintaining low computational and storage requirements. ECViT offers an ideal solution for applications that prioritize high efficiency without compromising performance.

Title: Distribution-aware Dataset Distillation for Efficient Image Restoration

Authors: Zhuoran Zheng, Xin Su, Chen Wu, Xiuyi Jia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14826
Pdf URL: https://arxiv.org/pdf/2504.14826
Copy Paste: [[2504.14826]] Distribution-aware Dataset Distillation for Efficient Image Restoration(https://arxiv.org/abs/2504.14826)
Keywords: transformer
Abstract: With the exponential increase in image data, training an image restoration model is laborious. Dataset distillation is a potential solution to this problem, yet current distillation techniques are a blank canvas in the field of image restoration. To fill this gap, we propose the Distribution-aware Dataset Distillation method (TripleD), a new framework that extends the principles of dataset distillation to image restoration. Specifically, TripleD uses a pre-trained vision Transformer to extract features from images for complexity evaluation, and the subset (the number of samples is much smaller than the original training set) is selected based on complexity. The selected subset is then fed through a lightweight CNN that fine-tunes the image distribution to align with the distribution of the original dataset at the feature level. To efficiently condense knowledge, the training is divided into two stages. Early stages focus on simpler, low-complexity samples to build foundational knowledge, while later stages select more complex and uncertain samples as the model matures. Our method achieves promising performance on multiple image restoration tasks, including multi-task image restoration, all-in-one image restoration, and ultra-high-definition image restoration tasks. Note that we can train a state-of-the-art image restoration model on an ultra-high-definition (4K resolution) dataset using only one consumer-grade GPU in less than 8 hours (500 savings in computing resources and immeasurable training time).

Title: Protecting Your Voice: Temporal-aware Robust Watermarking

Authors: Yue Li, Weizhi Liu, Dongdong Lin
Subjects: cs.CR, cs.AI, cs.SD
Abstract URL: https://arxiv.org/abs/2504.14832
Pdf URL: https://arxiv.org/pdf/2504.14832
Copy Paste: [[2504.14832]] Protecting Your Voice: Temporal-aware Robust Watermarking(https://arxiv.org/abs/2504.14832)
Keywords: protect, robust, watermark, generative
Abstract: The rapid advancement of generative models has led to the synthesis of real-fake ambiguous voices. To erase the ambiguity, embedding watermarks into the frequency-domain features of synthesized voices has become a common routine. However, the robustness achieved by choosing the frequency domain often comes at the expense of fine-grained voice features, leading to a loss of fidelity. Maximizing the comprehensive learning of time-domain features to enhance fidelity while maintaining robustness, we pioneer a \textbf{\underline{t}}emporal-aware \textbf{\underline{r}}ob\textbf{\underline{u}}st wat\textbf{\underline{e}}rmarking (\emph{True}) method for protecting the speech and singing voice.

Title: Reliable Multi-Modal Object Re-Identification via Modality-Aware Graph Reasoning

Authors: Xixi Wan, Aihua Zheng, Zi Wang, Bo Jiang, Jin Tang, Jixin Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14847
Pdf URL: https://arxiv.org/pdf/2504.14847
Copy Paste: [[2504.14847]] Reliable Multi-Modal Object Re-Identification via Modality-Aware Graph Reasoning(https://arxiv.org/abs/2504.14847)
Keywords: extraction
Abstract: Multi-modal data provides abundant and diverse object information, crucial for effective modal interactions in Re-Identification (ReID) tasks. However, existing approaches often overlook the quality variations in local features and fail to fully leverage the complementary information across modalities, particularly in the case of low-quality features. In this paper, we propose to address this issue by leveraging a novel graph reasoning model, termed the Modality-aware Graph Reasoning Network (MGRNet). Specifically, we first construct modality-aware graphs to enhance the extraction of fine-grained local details by effectively capturing and modeling the relationships between patches. Subsequently, the selective graph nodes swap operation is employed to alleviate the adverse effects of low-quality local features by considering both local and global information, enhancing the representation of discriminative information. Finally, the swapped modality-aware graphs are fed into the local-aware graph reasoning module, which propagates multi-modal information to yield a reliable feature representation. Another advantage of the proposed graph reasoning approach is its ability to reconstruct missing modal information by exploiting inherent structural relationships, thereby minimizing disparities between different modalities. Experimental results on four benchmarks (RGBNT201, Market1501-MM, RGBNT100, MSVR310) indicate that the proposed method achieves state-of-the-art performance in multi-modal object ReID. The code for our method will be available upon acceptance.

Title: Object-Level Verbalized Confidence Calibration in Vision-Language Models via Semantic Perturbation

Authors: Yunpu Zhao, Rui Zhang, Junbin Xiao, Ruibo Hou, Jiaming Guo, Zihao Zhang, Yifan Hao, Yunji Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14848
Pdf URL: https://arxiv.org/pdf/2504.14848
Copy Paste: [[2504.14848]] Object-Level Verbalized Confidence Calibration in Vision-Language Models via Semantic Perturbation(https://arxiv.org/abs/2504.14848)
Keywords: interpretability
Abstract: Vision-language models (VLMs) excel in various multimodal tasks but frequently suffer from poor calibration, resulting in misalignment between their verbalized confidence and response correctness. This miscalibration undermines user trust, especially when models confidently provide incorrect or fabricated information. In this work, we propose a novel Confidence Calibration through Semantic Perturbation (CSP) framework to improve the calibration of verbalized confidence for VLMs in response to object-centric queries. We first introduce a perturbed dataset where Gaussian noise is applied to the key object regions to simulate visual uncertainty at different confidence levels, establishing an explicit mapping between visual ambiguity and confidence levels. We further enhance calibration through a two-stage training process combining supervised fine-tuning on the perturbed dataset with subsequent preference optimization. Extensive experiments on popular benchmarks demonstrate that our method significantly improves the alignment between verbalized confidence and response correctness while maintaining or enhancing overall task performance. These results highlight the potential of semantic perturbation as a practical tool for improving the reliability and interpretability of VLMs.

Title: Transparentize the Internal and External Knowledge Utilization in LLMs with Trustworthy Citation

Authors: Jiajun Shen, Tong Zhou, Yubo Chen, Delai Qiu, Shengping Liu, Kang Liu, Jun Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14856
Pdf URL: https://arxiv.org/pdf/2504.14856
Copy Paste: [[2504.14856]] Transparentize the Internal and External Knowledge Utilization in LLMs with Trustworthy Citation(https://arxiv.org/abs/2504.14856)
Keywords: large language model
Abstract: While hallucinations of large language models could been alleviated through retrieval-augmented generation and citation generation, how the model utilizes internal knowledge is still opaque, and the trustworthiness of its generated answers remains questionable. In this work, we introduce Context-Prior Augmented Citation Generation task, requiring models to generate citations considering both external and internal knowledge while providing trustworthy references, with 5 evaluation metrics focusing on 3 aspects: answer helpfulness, citation faithfulness, and trustworthiness. We introduce RAEL, the paradigm for our task, and also design INTRALIGN, an integrated method containing customary data generation and an alignment algorithm. Our experimental results show that our method achieves a better cross-scenario performance with regard to other baselines. Our extended experiments further reveal that retrieval quality, question types, and model knowledge have considerable influence on the trustworthiness in citation generation.

Title: Natural Fingerprints of Large Language Models

Authors: Teppei Suzuki, Ryokan Ri, Sho Takase
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14871
Pdf URL: https://arxiv.org/pdf/2504.14871
Copy Paste: [[2504.14871]] Natural Fingerprints of Large Language Models(https://arxiv.org/abs/2504.14871)
Keywords: fair, large language model
Abstract: Large language models (LLMs) often exhibit biases -- systematic deviations from expected norms -- in their outputs. These range from overt issues, such as unfair responses, to subtler patterns that can reveal which model produced them. We investigate the factors that give rise to identifiable characteristics in LLMs. Since LLMs model training data distribution, it is reasonable that differences in training data naturally lead to the characteristics. However, our findings reveal that even when LLMs are trained on the exact same data, it is still possible to distinguish the source model based on its generated text. We refer to these unintended, distinctive characteristics as natural fingerprints. By systematically controlling training conditions, we show that the natural fingerprints can emerge from subtle differences in the training process, such as parameter sizes, optimization settings, and even random seeds. We believe that understanding natural fingerprints offers new insights into the origins of unintended bias and ways for improving control over LLM behavior.

Title: Collaborative Enhancement Network for Low-quality Multi-spectral Vehicle Re-identification

Authors: Aihua Zheng, Yongqi Sun, Zi Wang, Chenglong Li, Jin Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14877
Pdf URL: https://arxiv.org/pdf/2504.14877
Copy Paste: [[2504.14877]] Collaborative Enhancement Network for Low-quality Multi-spectral Vehicle Re-identification(https://arxiv.org/abs/2504.14877)
Keywords: robust
Abstract: The performance of multi-spectral vehicle Re-identification (ReID) is significantly degraded when some important discriminative cues in visible, near infrared and thermal infrared spectra are lost. Existing methods generate or enhance missing details in low-quality spectra data using the high-quality one, generally called the primary spectrum, but how to justify the primary spectrum is a challenging problem. In addition, when the quality of the primary spectrum is low, the enhancement effect would be greatly degraded, thus limiting the performance of multi-spectral vehicle ReID. To address these problems, we propose the Collaborative Enhancement Network (CoEN), which generates a high-quality proxy from all spectra data and leverages it to supervise the selection of primary spectrum and enhance all spectra features in a collaborative manner, for robust multi-spectral vehicle ReID. First, to integrate the rich cues from all spectra data, we design the Proxy Generator (PG) to progressively aggregate multi-spectral features. Second, we design the Dynamic Quality Sort Module (DQSM), which sorts all spectra data by measuring their correlations with the proxy, to accurately select the primary spectra with the highest correlation. Finally, we design the Collaborative Enhancement Module (CEM) to effectively compensate for missing contents of all spectra by collaborating the primary spectra and the proxy, thereby mitigating the impact of low-quality primary spectra. Extensive experiments on three benchmark datasets are conducted to validate the efficacy of the proposed approach against other multi-spectral vehicle ReID methods. The codes will be released at this https URL.

Title: Impact of Latent Space Dimension on IoT Botnet Detection Performance: VAE-Encoder Versus ViT-Encoder

Authors: Hassan Wasswa, Aziida Nanyonga, Timothy Lynar
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14879
Pdf URL: https://arxiv.org/pdf/2504.14879
Copy Paste: [[2504.14879]] Impact of Latent Space Dimension on IoT Botnet Detection Performance: VAE-Encoder Versus ViT-Encoder(https://arxiv.org/abs/2504.14879)
Keywords: security, attack, transformer
Abstract: The rapid evolution of Internet of Things (IoT) technology has led to a significant increase in the number of IoT devices, applications, and services. This surge in IoT devices, along with their widespread presence, has made them a prime target for various cyber-attacks, particularly through IoT botnets. As a result, security has become a major concern within the IoT ecosystem. This study focuses on investigating how the latent dimension impacts the performance of different deep learning classifiers when trained on latent vector representations of the train dataset. The primary objective is to compare the outcomes of these models when encoder components from two cutting-edge architectures: the Vision Transformer (ViT) and the Variational Auto-Encoder (VAE) are utilized to project the high dimensional train dataset to the learned low dimensional latent space. The encoder components are employed to project high-dimensional structured .csv IoT botnet traffic datasets to various latent sizes. Evaluated on N-BaIoT and CICIoT2022 datasets, findings reveal that VAE-encoder based dimension reduction outperforms ViT-encoder based dimension reduction for both datasets in terms of four performance metrics including accuracy, precision, recall, and F1-score for all models which can be attributed to absence of spatial patterns in the datasets the ViT model attempts to learn and extract from image instances.

Title: Towards Fuzzing Zero-Knowledge Proof Circuits (Short Paper)

Authors: Stefanos Chaliasos, Imam Al-Fath, Alastair Donaldson
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.14881
Pdf URL: https://arxiv.org/pdf/2504.14881
Copy Paste: [[2504.14881]] Towards Fuzzing Zero-Knowledge Proof Circuits (Short Paper)(https://arxiv.org/abs/2504.14881)
Keywords: privacy
Abstract: Zero-knowledge proofs (ZKPs) have evolved from a theoretical cryptographic concept into a powerful tool for implementing privacy-preserving and verifiable applications without requiring trust assumptions. Despite significant progress in the field, implementing and using ZKPs via \emph{ZKP circuits} remains challenging, leading to numerous bugs that affect ZKP circuits in practice, and \emph{fuzzing} remains largely unexplored as a method to detect bugs in ZKP circuits. We discuss the unique challenges of applying fuzzing to ZKP circuits, examine the oracle problem and its potential solutions, and propose techniques for input generation and test harness construction. We demonstrate that fuzzing can be effective in this domain by implementing a fuzzer for \texttt{zk-regex}, a cornerstone library in modern ZKP applications. In our case study, we discovered \textit{$10$} new bugs.

Title: Some Optimizers are More Equal: Understanding the Role of Optimizers in Group Fairness

Authors: Mojtaba Kolahdouzi, Hatice Gunes, Ali Etemad
Subjects: cs.LG, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2504.14882
Pdf URL: https://arxiv.org/pdf/2504.14882
Copy Paste: [[2504.14882]] Some Optimizers are More Equal: Understanding the Role of Optimizers in Group Fairness(https://arxiv.org/abs/2504.14882)
Keywords: fair
Abstract: We study whether and how the choice of optimization algorithm can impact group fairness in deep neural networks. Through stochastic differential equation analysis of optimization dynamics in an analytically tractable setup, we demonstrate that the choice of optimization algorithm indeed influences fairness outcomes, particularly under severe imbalance. Furthermore, we show that when comparing two categories of optimizers, adaptive methods and stochastic methods, RMSProp (from the adaptive category) has a higher likelihood of converging to fairer minima than SGD (from the stochastic category). Building on this insight, we derive two new theoretical guarantees showing that, under appropriate conditions, RMSProp exhibits fairer parameter updates and improved fairness in a single optimization step compared to SGD. We then validate these findings through extensive experiments on three publicly available datasets, namely CelebA, FairFace, and MS-COCO, across different tasks as facial expression recognition, gender classification, and multi-label classification, using various backbones. Considering multiple fairness definitions including equalized odds, equal opportunity, and demographic parity, adaptive optimizers like RMSProp and Adam consistently outperform SGD in terms of group fairness, while maintaining comparable predictive accuracy. Our results highlight the role of adaptive updates as a crucial yet overlooked mechanism for promoting fair outcomes.

Title: Zero Day Malware Detection with Alpha: Fast DBI with Transformer Models for Real World Application

Authors: Matthew Gaber, Mohiuddin Ahmed, Helge Janicke
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.14886
Pdf URL: https://arxiv.org/pdf/2504.14886
Copy Paste: [[2504.14886]] Zero Day Malware Detection with Alpha: Fast DBI with Transformer Models for Real World Application(https://arxiv.org/abs/2504.14886)
Keywords: transformer
Abstract: The effectiveness of an AI model in accurately classifying novel malware hinges on the quality of the features it is trained on, which in turn depends on the effectiveness of the analysis tool used. Peekaboo, a Dynamic Binary Instrumentation (DBI) tool, defeats malware evasion techniques to capture authentic behavior at the Assembly (ASM) instruction level. This behavior exhibits patterns consistent with Zipf's law, a distribution commonly seen in natural languages, making Transformer models particularly effective for binary classification tasks. We introduce Alpha, a framework for zero day malware detection that leverages Transformer models and ASM language. Alpha is trained on malware and benign software data collected through Peekaboo, enabling it to identify entirely new samples with exceptional accuracy. Alpha eliminates any common functions from the test samples that are in the training dataset. This forces the model to rely on contextual patterns and novel ASM instruction combinations to detect malicious behavior, rather than memorizing familiar features. By combining the strengths of DBI, ASM analysis, and Transformer architectures, Alpha offers a powerful approach to proactively addressing the evolving threat of malware. Alpha demonstrates perfect accuracy for Ransomware, Worms and APTs with flawless classification for both malicious and benign samples. The results highlight the model's exceptional performance in detecting truly new malware samples.

Title: WMKA-Net: A Weighted Multi-Kernel Attention NetworkMethod for Retinal Vessel Segmentation

Authors: Xinran Xu, Yuliang Ma, Sifu Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14888
Pdf URL: https://arxiv.org/pdf/2504.14888
Copy Paste: [[2504.14888]] WMKA-Net: A Weighted Multi-Kernel Attention NetworkMethod for Retinal Vessel Segmentation(https://arxiv.org/abs/2504.14888)
Keywords: robust, segmentation
Abstract: We propose a novel retinal vessel segmentation network, the Weighted Multi-Kernel Attention Network (WMKA-Net), which aims to address the issues of insufficient multiscale feature capture, loss of contextual information, and noise sensitivity in retinal vessel segmentation. WMKA-Net significantly improves the segmentation performance of small vessels and low-contrast regions by integrating several innovative components, including the MultiKernelFeature Fusion Module (MKDC), the Progressive Feature Weighting Fusion Strategy (UDFF), and the Attention Mechanism Module (AttentionBlock). The MKDC module employs multiscale parallel convolutional kernels to extract vessel characteristics, thereby enhancing the ability to capture complex vascular structures. The UDFF strategy optimizes the transmission of feature information by weighted fusion of high- and low-level features. The AttentionBlock highlights key regions and suppresses noise interference through the attention mechanism. Experimental results demonstrate that WMKA-Net achieves excellent segmentation performance in multiple public datasets, particularly in segmentation of small vessels and processing of pathological regions. This work provides a robust and efficient new method for segmentation of the retinal vessel.

Title: Latent Bayesian Optimization via Autoregressive Normalizing Flows

Authors: Seunghun Lee, Jinyoung Park, Jaewon Chu, Minseo Yoon, Hyunwoo J. Kim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14889
Pdf URL: https://arxiv.org/pdf/2504.14889
Copy Paste: [[2504.14889]] Latent Bayesian Optimization via Autoregressive Normalizing Flows(https://arxiv.org/abs/2504.14889)
Keywords: generative
Abstract: Bayesian Optimization (BO) has been recognized for its effectiveness in optimizing expensive and complex objective functions. Recent advancements in Latent Bayesian Optimization (LBO) have shown promise by integrating generative models such as variational autoencoders (VAEs) to manage the complexity of high-dimensional and structured data spaces. However, existing LBO approaches often suffer from the value discrepancy problem, which arises from the reconstruction gap between input and latent spaces. This value discrepancy problem propagates errors throughout the optimization process, leading to suboptimal outcomes. To address this issue, we propose a Normalizing Flow-based Bayesian Optimization (NF-BO), which utilizes normalizing flow as a generative model to establish one-to-one encoding function from the input space to the latent space, along with its left-inverse decoding function, eliminating the reconstruction gap. Specifically, we introduce SeqFlow, an autoregressive normalizing flow for sequence data. In addition, we develop a new candidate sampling strategy that dynamically adjusts the exploration probability for each token based on its importance. Through extensive experiments, our NF-BO method demonstrates superior performance in molecule generation tasks, significantly outperforming both traditional and recent LBO approaches.

Title: Retrieval Augmented Generation Evaluation in the Era of Large Language Models: A Comprehensive Survey

Authors: Aoran Gan, Hao Yu, Kai Zhang, Qi Liu, Wenyu Yan, Zhenya Huang, Shiwei Tong, Guoping Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14891
Pdf URL: https://arxiv.org/pdf/2504.14891
Copy Paste: [[2504.14891]] Retrieval Augmented Generation Evaluation in the Era of Large Language Models: A Comprehensive Survey(https://arxiv.org/abs/2504.14891)
Keywords: large language model
Abstract: Recent advancements in Retrieval-Augmented Generation (RAG) have revolutionized natural language processing by integrating Large Language Models (LLMs) with external information retrieval, enabling accurate, up-to-date, and verifiable text generation across diverse applications. However, evaluating RAG systems presents unique challenges due to their hybrid architecture that combines retrieval and generation components, as well as their dependence on dynamic knowledge sources in the LLM era. In response, this paper provides a comprehensive survey of RAG evaluation methods and frameworks, systematically reviewing traditional and emerging evaluation approaches, for system performance, factual accuracy, safety, and computational efficiency in the LLM era. We also compile and categorize the RAG-specific datasets and evaluation frameworks, conducting a meta-analysis of evaluation practices in high-impact RAG research. To the best of our knowledge, this work represents the most comprehensive survey for RAG evaluation, bridging traditional and LLM-driven methods, and serves as a critical resource for advancing RAG development.

Title: Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation

Authors: Chenjie Cao, Jingkai Zhou, Shikai Li, Jingyun Liang, Chaohui Yu, Fan Wang, Xiangyang Xue, Yanwei Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14899
Pdf URL: https://arxiv.org/pdf/2504.14899
Copy Paste: [[2504.14899]] Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation(https://arxiv.org/abs/2504.14899)
Keywords: robust, generative
Abstract: Camera and human motion controls have been extensively studied for video generation, but existing approaches typically address them separately, suffering from limited data with high-quality annotations for both aspects. To overcome this, we present Uni3C, a unified 3D-enhanced framework for precise control of both camera and human motion in video generation. Uni3C includes two key contributions. First, we propose a plug-and-play control module trained with a frozen video generative backbone, PCDController, which utilizes unprojected point clouds from monocular depth to achieve accurate camera control. By leveraging the strong 3D priors of point clouds and the powerful capacities of video foundational models, PCDController shows impressive generalization, performing well regardless of whether the inference backbone is frozen or fine-tuned. This flexibility enables different modules of Uni3C to be trained in specific domains, i.e., either camera control or human motion control, reducing the dependency on jointly annotated data. Second, we propose a jointly aligned 3D world guidance for the inference phase that seamlessly integrates both scenic point clouds and SMPL-X characters to unify the control signals for camera and human motion, respectively. Extensive experiments confirm that PCDController enjoys strong robustness in driving camera motion for fine-tuned backbones of video generation. Uni3C substantially outperforms competitors in both camera controllability and human motion quality. Additionally, we collect tailored validation sets featuring challenging camera movements and human actions to validate the effectiveness of our method.

Title: CRAVE: A Conflicting Reasoning Approach for Explainable Claim Verification Using LLMs

Authors: Yingming Zheng, Xiaoliang Liu, Peng Wu, Li Pan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14905
Pdf URL: https://arxiv.org/pdf/2504.14905
Copy Paste: [[2504.14905]] CRAVE: A Conflicting Reasoning Approach for Explainable Claim Verification Using LLMs(https://arxiv.org/abs/2504.14905)
Keywords: large language model
Abstract: The rapid spread of misinformation, driven by digital media and AI-generated content, has made automatic claim verification essential. Traditional methods, which depend on expert-annotated evidence, are labor-intensive and not scalable. Although recent automated systems have improved, they still struggle with complex claims that require nuanced reasoning. To address this, we propose CRAVE, a Conflicting Reasoning Approach for explainable claim VErification, that verify the complex claims based on the conflicting rationales reasoned by large language models (LLMs). Specifically, CRAVE introduces a three-module framework. Ambiguity Elimination enchanced Evidence Retrieval module performs ambiguity elimination and entity-based search to gather relevant evidence related to claim verification from external sources like Wikipedia. Conflicting Perspective Reasoning and Preliminary Judgment module with LLMs adopts LLMs to reason rationales with conflicting stances about claim verification from retrieved evidence across four dimensions, i.e., direct evidence, semantic relationships, linguistic patterns, and logical reasoning and make a preliminary judgment. Finally, Small Language Model (SLM) based Judge module is fine-tuned to make use of preliminary judgment from LLMs to assess the confidence of the conflicting rationales and make a final authenticity judgment. This methodology allows CRAVE to capture subtle inconsistencies in complex claims, improving both the accuracy and transparency of claim verification. Extensive experiments on two public claim verification datasets demonstrate that our CRAVE model achieves much better performance than state-of-the-art methods and exhibits a superior capacity for finding relevant evidence and explaining the model predictions. The code is provided at this https URL.

Title: POLYRAG: Integrating Polyviews into Retrieval-Augmented Generation for Medical Applications

Authors: Chunjing Gan, Dan Yang, Binbin Hu, Ziqi Liu, Yue Shen, Zhiqiang Zhang, Jian Wang, Jun Zhou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.14917
Pdf URL: https://arxiv.org/pdf/2504.14917
Copy Paste: [[2504.14917]] POLYRAG: Integrating Polyviews into Retrieval-Augmented Generation for Medical Applications(https://arxiv.org/abs/2504.14917)
Keywords: large language model
Abstract: Large language models (LLMs) have become a disruptive force in the industry, introducing unprecedented capabilities in natural language processing, logical reasoning and so on. However, the challenges of knowledge updates and hallucination issues have limited the application of LLMs in medical scenarios, where retrieval-augmented generation (RAG) can offer significant assistance. Nevertheless, existing retrieve-then-read approaches generally digest the retrieved documents, without considering the timeliness, authoritativeness and commonality of retrieval. We argue that these approaches can be suboptimal, especially in real-world applications where information from different sources might conflict with each other and even information from the same source in different time scale might be different, and totally relying on this would deteriorate the performance of RAG approaches. We propose PolyRAG that carefully incorporate judges from different perspectives and finally integrate the polyviews for retrieval augmented generation in medical applications. Due to the scarcity of real-world benchmarks for evaluation, to bridge the gap we propose PolyEVAL, a benchmark consists of queries and documents collected from real-world medical scenarios (including medical policy, hospital & doctor inquiry and healthcare) with multiple tagging (e.g., timeliness, authoritativeness) on them. Extensive experiments and analysis on PolyEVAL have demonstrated the superiority of PolyRAG.

Title: GenCLIP: Generalizing CLIP Prompts for Zero-shot Anomaly Detection

Authors: Donghyeong Kim, Chaewon Park, Suhwan Cho, Hyeonjeong Lim, Minseok Kang, Jungho Lee, Sangyoun Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14919
Pdf URL: https://arxiv.org/pdf/2504.14919
Copy Paste: [[2504.14919]] GenCLIP: Generalizing CLIP Prompts for Zero-shot Anomaly Detection(https://arxiv.org/abs/2504.14919)
Keywords: robust
Abstract: Zero-shot anomaly detection (ZSAD) aims to identify anomalies in unseen categories by leveraging CLIP's zero-shot capabilities to match text prompts with visual features. A key challenge in ZSAD is learning general prompts stably and utilizing them effectively, while maintaining both generalizability and category specificity. Although general prompts have been explored in prior works, achieving their stable optimization and effective deployment remains a significant challenge. In this work, we propose GenCLIP, a novel framework that learns and leverages general prompts more effectively through multi-layer prompting and dual-branch inference. Multi-layer prompting integrates category-specific visual cues from different CLIP layers, enriching general prompts with more comprehensive and robust feature representations. By combining general prompts with multi-layer visual features, our method further enhances its generalization capability. To balance specificity and generalization, we introduce a dual-branch inference strategy, where a vision-enhanced branch captures fine-grained category-specific features, while a query-only branch prioritizes generalization. The complementary outputs from both branches improve the stability and reliability of anomaly detection across unseen categories. Additionally, we propose an adaptive text prompt filtering mechanism, which removes irrelevant or atypical class names not encountered during CLIP's training, ensuring that only meaningful textual inputs contribute to the final vision-language alignment.

Title: Fast Adversarial Training with Weak-to-Strong Spatial-Temporal Consistency in the Frequency Domain on Videos

Authors: Songping Wang, Hanqing Liu, Yueming Lyu, Xiantao Hu, Ziwen He, Wei Wang, Caifeng Shan, Liang Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14921
Pdf URL: https://arxiv.org/pdf/2504.14921
Copy Paste: [[2504.14921]] Fast Adversarial Training with Weak-to-Strong Spatial-Temporal Consistency in the Frequency Domain on Videos(https://arxiv.org/abs/2504.14921)
Keywords: attack, robust, transformer
Abstract: Adversarial Training (AT) has been shown to significantly enhance adversarial robustness via a min-max optimization approach. However, its effectiveness in video recognition tasks is hampered by two main challenges. First, fast adversarial training for video models remains largely unexplored, which severely impedes its practical applications. Specifically, most video adversarial training methods are computationally costly, with long training times and high expenses. Second, existing methods struggle with the trade-off between clean accuracy and adversarial robustness. To address these challenges, we introduce Video Fast Adversarial Training with Weak-to-Strong consistency (VFAT-WS), the first fast adversarial training method for video data. Specifically, VFAT-WS incorporates the following key designs: First, it integrates a straightforward yet effective temporal frequency augmentation (TF-AUG), and its spatial-temporal enhanced form STF-AUG, along with a single-step PGD attack to boost training efficiency and robustness. Second, it devises a weak-to-strong spatial-temporal consistency regularization, which seamlessly integrates the simpler TF-AUG and the more complex STF-AUG. Leveraging the consistency regularization, it steers the learning process from simple to complex augmentations. Both of them work together to achieve a better trade-off between clean accuracy and robustness. Extensive experiments on UCF-101 and HMDB-51 with both CNN and Transformer-based models demonstrate that VFAT-WS achieves great improvements in adversarial robustness and corruption robustness, while accelerating training by nearly 490%.

Title: TWIG: Two-Step Image Generation using Segmentation Masks in Diffusion Models

Authors: Mazharul Islam Rakib, Showrin Rahman, Joyanta Jyoti Mondal, Xi Xiao, David Lewis, Alessandra Mileo, Meem Arafat Manab
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14933
Pdf URL: https://arxiv.org/pdf/2504.14933
Copy Paste: [[2504.14933]] TWIG: Two-Step Image Generation using Segmentation Masks in Diffusion Models(https://arxiv.org/abs/2504.14933)
Keywords: protect, watermark, diffusion, generative, segmentation
Abstract: In today's age of social media and marketing, copyright issues can be a major roadblock to the free sharing of images. Generative AI models have made it possible to create high-quality images, but concerns about copyright infringement are a hindrance to their abundant use. As these models use data from training images to generate new ones, it is often a daunting task to ensure they do not violate intellectual property rights. Some AI models have even been noted to directly copy copyrighted images, a problem often referred to as source copying. Traditional copyright protection measures such as watermarks and metadata have also proven to be futile in this regard. To address this issue, we propose a novel two-step image generation model inspired by the conditional diffusion model. The first step involves creating an image segmentation mask for some prompt-based generated images. This mask embodies the shape of the image. Thereafter, the diffusion model is asked to generate the image anew while avoiding the shape in question. This approach shows a decrease in structural similarity from the training image, i.e. we are able to avoid the source copying problem using this approach without expensive retraining of the model or user-centered prompt generation techniques. This makes our approach the most computationally inexpensive approach to avoiding both copyright infringement and source copying for diffusion model-based image generation.

Title: Causal DAG Summarization (Full Version)

Authors: Anna Zeng, Michael Cafarella, Batya Kenig, Markos Markakis, Brit Youngmann, Babak Salimi
Subjects: cs.LG, cs.DB, stat.ME
Abstract URL: https://arxiv.org/abs/2504.14937
Pdf URL: https://arxiv.org/pdf/2504.14937
Copy Paste: [[2504.14937]] Causal DAG Summarization (Full Version)(https://arxiv.org/abs/2504.14937)
Keywords: robust
Abstract: Causal inference aids researchers in discovering cause-and-effect relationships, leading to scientific insights. Accurate causal estimation requires identifying confounding variables to avoid false discoveries. Pearl's causal model uses causal DAGs to identify confounding variables, but incorrect DAGs can lead to unreliable causal conclusions. However, for high dimensional data, the causal DAGs are often complex beyond human verifiability. Graph summarization is a logical next step, but current methods for general-purpose graph summarization are inadequate for causal DAG summarization. This paper addresses these challenges by proposing a causal graph summarization objective that balances graph simplification for better understanding while retaining essential causal information for reliable inference. We develop an efficient greedy algorithm and show that summary causal DAGs can be directly used for inference and are more robust to misspecification of assumptions, enhancing robustness for causal inference. Experimenting with six real-life datasets, we compared our algorithm to three existing solutions, showing its effectiveness in handling high-dimensional data and its ability to generate summary DAGs that ensure both reliable causal inference and robustness against misspecifications.

Title: PIV-FlowDiffuser:Transfer-learning-based denoising diffusion models for PIV

Authors: Qianyu Zhu, Junjie Wang, Jeremiah Hu, Jia Ai, Yong Lee
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2504.14952
Pdf URL: https://arxiv.org/pdf/2504.14952
Copy Paste: [[2504.14952]] PIV-FlowDiffuser:Transfer-learning-based denoising diffusion models for PIV(https://arxiv.org/abs/2504.14952)
Keywords: diffusion
Abstract: Deep learning algorithms have significantly reduced the computational time and improved the spatial resolution of particle image velocimetry~(PIV). However, the models trained on synthetic datasets might have a degraded performance on practical particle images due to domain gaps. As a result, special residual patterns are often observed for the vector fields of deep learning-based estimators. To reduce the special noise step-by-step, we employ a denoising diffusion model~(FlowDiffuser) for PIV analysis. And the data-hungry iterative denoising diffusion model is trained via a transfer learning strategy, resulting in our PIV-FlowDiffuser method. Specifically, (1) pre-training a FlowDiffuser model with multiple optical flow datasets of the computer vision community, such as Sintel, KITTI, etc; (2) fine-tuning the pre-trained model on synthetic PIV datasets. Note that the PIV images are upsampled by a factor of two to resolve the small-scale turbulent flow structures. The visualized results indicate that our PIV-FlowDiffuser effectively suppresses the noise patterns. Therefore, the denoising diffusion model reduces the average end-point error~($AEE$) by 59.4% over RAFT256-PIV baseline on the classic Cai's dataset. Besides, PIV-FlowDiffuser exhibits enhanced generalization performance on unseen particle images due to transfer learning. Overall, this study highlights the transfer-learning-based denoising diffusion models for PIV. And a detailed implementation is recommended for interested readers in the repository this https URL.

Title: Efficient Document Retrieval with G-Retriever

Authors: Manthankumar Solanki
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.14955
Pdf URL: https://arxiv.org/pdf/2504.14955
Copy Paste: [[2504.14955]] Efficient Document Retrieval with G-Retriever(https://arxiv.org/abs/2504.14955)
Keywords: large language model
Abstract: Textual data question answering has gained significant attention due to its growing applicability. Recently, a novel approach leveraging the Retrieval-Augmented Generation (RAG) method was introduced, utilizing the Prize-Collecting Steiner Tree (PCST) optimization for sub-graph construction. However, this method focused solely on node attributes, leading to incomplete contextual understanding. In this paper, we propose an enhanced approach that replaces the PCST method with an attention-based sub-graph construction technique, enabling more efficient and context-aware retrieval. Additionally, we encode both node and edge attributes, leading to richer graph representations. Our method also incorporates an improved projection layer and multi-head attention pooling for better alignment with Large Language Models (LLMs). Experimental evaluations on the WebQSP dataset demonstrate that our approach is competitive and achieves marginally better results compared to the original method, underscoring its potential for more accurate question answering.

Title: MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core

Authors: Dennis Liu, Zijie Yan, Xin Yao, Tong Liu, Vijay Korthikanti, Evan Wu, Shiqing Fan, Gao Deng, Hongxiao Bai, Ashwath Aithal, Michael Andersch, Mohammad Shoeybi, Jiajie Yao, Chandler Zhou, David Wu, Xipeng Li, June Yang
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2504.14960
Pdf URL: https://arxiv.org/pdf/2504.14960
Copy Paste: [[2504.14960]] MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core(https://arxiv.org/abs/2504.14960)
Keywords: transformer
Abstract: Mixture of Experts (MoE) models enhance neural network scalability by dynamically selecting relevant experts per input token, enabling larger model sizes while maintaining manageable computation costs. However, efficient training of large-scale MoE models across thousands of GPUs presents significant challenges due to limitations in existing parallelism strategies. We introduce an end-to-end training framework for large-scale MoE models that utilizes five-dimensional hybrid parallelism: Tensor Parallelism, Expert Parallelism, Context Parallelism, Data Parallelism, and Pipeline Parallelism. Central to our approach is MoE Parallel Folding, a novel strategy that decouples the parallelization of attention and MoE layers in Transformer models, allowing each layer type to adopt optimal parallel configurations. Additionally, we develop a flexible token-level dispatcher that supports both token-dropping and token-dropless MoE training across all five dimensions of parallelism. This dispatcher accommodates dynamic tensor shapes and coordinates different parallelism schemes for Attention and MoE layers, facilitating complex parallelism implementations. Our experiments demonstrate significant improvements in training efficiency and scalability. We achieve up to 49.3% Model Flops Utilization (MFU) for the Mixtral 8x22B model and 39.0% MFU for the Qwen2-57B-A14B model on H100 GPUs, outperforming existing methods. The framework scales efficiently up to 1,024 GPUs and maintains high performance with sequence lengths up to 128K tokens, validating its effectiveness for large-scale MoE model training. The code is available in Megatron-Core.

Title: Speaker Fuzzy Fingerprints: Benchmarking Text-Based Identification in Multiparty Dialogues

Authors: Rui Ribeiro, Luísa Coheur, Joao P. Carvalho
Subjects: cs.CL, cs.AI, cs.CE, cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2504.14963
Pdf URL: https://arxiv.org/pdf/2504.14963
Copy Paste: [[2504.14963]] Speaker Fuzzy Fingerprints: Benchmarking Text-Based Identification in Multiparty Dialogues(https://arxiv.org/abs/2504.14963)
Keywords: interpretability
Abstract: Speaker identification using voice recordings leverages unique acoustic features, but this approach fails when only textual data is available. Few approaches have attempted to tackle the problem of identifying speakers solely from text, and the existing ones have primarily relied on traditional methods. In this work, we explore the use of fuzzy fingerprints from large pre-trained models to improve text-based speaker identification. We integrate speaker-specific tokens and context-aware modeling, demonstrating that conversational context significantly boosts accuracy, reaching 70.6% on the Friends dataset and 67.7% on the Big Bang Theory dataset. Additionally, we show that fuzzy fingerprints can approximate full fine-tuning performance with fewer hidden units, offering improved interpretability. Finally, we analyze ambiguous utterances and propose a mechanism to detect speaker-agnostic lines. Our findings highlight key challenges and provide insights for future improvements in text-based speaker identification.

Title: A Security Framework for General Blockchain Layer 2 Protocols

Authors: Zeta Avarikioti, Matteo Maffei, Yuheng Wang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.14965
Pdf URL: https://arxiv.org/pdf/2504.14965
Copy Paste: [[2504.14965]] A Security Framework for General Blockchain Layer 2 Protocols(https://arxiv.org/abs/2504.14965)
Keywords: secure, security
Abstract: Layer 2 (L2) solutions are the cornerstone of blockchain scalability, enabling high-throughput and low-cost interactions by shifting execution off-chain while maintaining security through interactions with the underlying ledger. Despite their common goals, the principal L2 paradigms -- payment channels, rollups, and sidechains -- differ substantially in architecture and assumptions, making it difficult to comparatively analyze their security and trade-offs. To address this, we present the first general security framework for L2 protocols. Our framework is based on the IITM-based Universal Composability (iUC) framework, in which L2 protocols are modeled as stateful machines interacting with higher-level protocol users and the underlying ledger. The methodology defines a generic execution environment that captures ledger events, message passing, and adversarial scheduling, and characterizes security through trace-based predicates parameterized by adversarial capabilities and timing assumptions. By abstracting away from protocol-specific details while preserving critical interface and execution behavior, the framework enables modular, protocol-agnostic reasoning and composable security proofs across a wide range of L2 constructions. To demonstrate its applicability, we analyze an example from each of the three dominant L2 scaling paradigms: a payment channel (Brick), a sidechain (Liquid Network), and a rollup (Arbitrum). By instantiating each within our framework, we derive their security properties and expose trade-offs. These include the time for dispute resolution, distribution of off-chain storage and computation, and varying trust assumptions (e.g., reliance on honest parties or data availability). Our framework unifies the analysis of diverse L2 designs and pinpoints their strengths and limitations, providing a foundation for secure, systematic L2 development.

Title: Evaluating LLMs on Chinese Topic Constructions: A Research Proposal Inspired by Tian et al. (2024)

Authors: Xiaodong Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14969
Pdf URL: https://arxiv.org/pdf/2504.14969
Copy Paste: [[2504.14969]] Evaluating LLMs on Chinese Topic Constructions: A Research Proposal Inspired by Tian et al. (2024)(https://arxiv.org/abs/2504.14969)
Keywords: large language model
Abstract: This paper proposes a framework for evaluating large language models (LLMs) on Chinese topic constructions, focusing on their sensitivity to island constraints. Drawing inspiration from Tian et al. (2024), we outline an experimental design for testing LLMs' grammatical knowledge of Mandarin syntax. While no experiments have been conducted yet, this proposal aims to provide a foundation for future studies and invites feedback on the methodology.

Title: Cyc3D: Fine-grained Controllable 3D Generation via Cycle Consistency Regularization

Authors: Hongbin Xu, Chaohui Yu, Feng Xiao, Jiazheng Xing, Hai Ci, Weitao Chen, Ming Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14975
Pdf URL: https://arxiv.org/pdf/2504.14975
Copy Paste: [[2504.14975]] Cyc3D: Fine-grained Controllable 3D Generation via Cycle Consistency Regularization(https://arxiv.org/abs/2504.14975)
Keywords: extraction, generative
Abstract: Despite the remarkable progress of 3D generation, achieving controllability, i.e., ensuring consistency between generated 3D content and input conditions like edge and depth, remains a significant challenge. Existing methods often struggle to maintain accurate alignment, leading to noticeable discrepancies. To address this issue, we propose \name{}, a new framework that enhances controllable 3D generation by explicitly encouraging cyclic consistency between the second-order 3D content, generated based on extracted signals from the first-order generation, and its original input controls. Specifically, we employ an efficient feed-forward backbone that can generate a 3D object from an input condition and a text prompt. Given an initial viewpoint and a control signal, a novel view is rendered from the generated 3D content, from which the extracted condition is used to regenerate the 3D content. This re-generated output is then rendered back to the initial viewpoint, followed by another round of control signal extraction, forming a cyclic process with two consistency constraints. \emph{View consistency} ensures coherence between the two generated 3D objects, measured by semantic similarity to accommodate generative diversity. \emph{Condition consistency} aligns the final extracted signal with the original input control, preserving structural or geometric details throughout the process. Extensive experiments on popular benchmarks demonstrate that \name{} significantly improves controllability, especially for fine-grained details, outperforming existing methods across various conditions (e.g., +14.17\% PSNR for edge, +6.26\% PSNR for sketch).

Title: aiXamine: LLM Safety and Security Simplified

Authors: Fatih Deniz, Dorde Popovic, Yazan Boshmaf, Euisuh Jeong, Minhaj Ahmad, Sanjay Chawla, Issa Khalil
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14985
Pdf URL: https://arxiv.org/pdf/2504.14985
Copy Paste: [[2504.14985]] aiXamine: LLM Safety and Security Simplified(https://arxiv.org/abs/2504.14985)
Keywords: security, privacy, attack, robust, fair, large language model
Abstract: Evaluating Large Language Models (LLMs) for safety and security remains a complex task, often requiring users to navigate a fragmented landscape of ad hoc benchmarks, datasets, metrics, and reporting formats. To address this challenge, we present aiXamine, a comprehensive black-box evaluation platform for LLM safety and security. aiXamine integrates over 40 tests (i.e., benchmarks) organized into eight key services targeting specific dimensions of safety and security: adversarial robustness, code security, fairness and bias, hallucination, model and data privacy, out-of-distribution (OOD) robustness, over-refusal, and safety alignment. The platform aggregates the evaluation results into a single detailed report per model, providing a detailed breakdown of model performance, test examples, and rich visualizations. We used aiXamine to assess over 50 publicly available and proprietary LLMs, conducting over 2K examinations. Our findings reveal notable vulnerabilities in leading models, including susceptibility to adversarial attacks in OpenAI's GPT-4o, biased outputs in xAI's Grok-3, and privacy weaknesses in Google's Gemini 2.0. Additionally, we observe that open-source models can match or exceed proprietary models in specific services such as safety alignment, fairness and bias, and OOD robustness. Finally, we identify trade-offs between distillation strategies, model size, training methods, and architectural choices.

Title: Efficient Pretraining Length Scaling

Authors: Bohong Wu, Shen Yan, Sijun Zhang, Jianqiao Lu, Yutao Zeng, Ya Wang, Xun Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14992
Pdf URL: https://arxiv.org/pdf/2504.14992
Copy Paste: [[2504.14992]] Efficient Pretraining Length Scaling(https://arxiv.org/abs/2504.14992)
Keywords: transformer, large language model
Abstract: Recent advances in large language models have demonstrated the effectiveness of length scaling during post-training, yet its potential in pre-training remains underexplored. We present the Parallel Hidden Decoding Transformer (\textit{PHD}-Transformer), a novel framework that enables efficient length scaling during pre-training while maintaining inference efficiency. \textit{PHD}-Transformer achieves this through an innovative KV cache management strategy that distinguishes between original tokens and hidden decoding tokens. By retaining only the KV cache of original tokens for long-range dependencies while immediately discarding hidden decoding tokens after use, our approach maintains the same KV cache size as the vanilla transformer while enabling effective length scaling. To further enhance performance, we introduce two optimized variants: \textit{PHD-SWA} employs sliding window attention to preserve local dependencies, while \textit{PHD-CSWA} implements chunk-wise sliding window attention to eliminate linear growth in pre-filling time. Extensive experiments demonstrate consistent improvements across multiple benchmarks.

Title: Dual Utilization of Perturbation for Stream Data Publication under Local Differential Privacy

Authors: Rong Du, Qingqing Ye, Yaxin Xiao, Liantong Yu, Yue Fu, Haibo Hu
Subjects: cs.CR, cs.DB
Abstract URL: https://arxiv.org/abs/2504.14993
Pdf URL: https://arxiv.org/pdf/2504.14993
Copy Paste: [[2504.14993]] Dual Utilization of Perturbation for Stream Data Publication under Local Differential Privacy(https://arxiv.org/abs/2504.14993)
Keywords: privacy, robust
Abstract: Stream data from real-time distributed systems such as IoT, tele-health, and crowdsourcing has become an important data source. However, the collection and analysis of user-generated stream data raise privacy concerns due to the potential exposure of sensitive information. To address these concerns, local differential privacy (LDP) has emerged as a promising standard. Nevertheless, applying LDP to stream data presents significant challenges, as stream data often involves a large or even infinite number of values. Allocating a given privacy budget across these data points would introduce overwhelming LDP noise to the original stream data. Beyond existing approaches that merely use perturbed values for estimating statistics, our design leverages them for both perturbation and estimation. This dual utilization arises from a key observation: each user knows their own ground truth and perturbed values, enabling a precise computation of the deviation error caused by perturbation. By incorporating this deviation into the perturbation process of subsequent values, the previous noise can be calibrated. Following this insight, we introduce the Iterative Perturbation Parameterization (IPP) method, which utilizes current perturbed results to calibrate the subsequent perturbation process. To enhance the robustness of calibration and reduce sensitivity, two algorithms, namely Accumulated Perturbation Parameterization (APP) and Clipped Accumulated Perturbation Parameterization (CAPP) are further developed. We prove that these three algorithms satisfy $w$-event differential privacy while significantly improving utility. Experimental results demonstrate that our techniques outperform state-of-the-art LDP stream publishing solutions in terms of utility, while retaining the same privacy guarantee.

Title: Insert Anything: Image Insertion via In-Context Editing in DiT

Authors: Wensong Song, Hong Jiang, Zongxing Yang, Ruijie Quan, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.15009
Pdf URL: https://arxiv.org/pdf/2504.15009
Copy Paste: [[2504.15009]] Insert Anything: Image Insertion via In-Context Editing in DiT(https://arxiv.org/abs/2504.15009)
Keywords: diffusion, transformer
Abstract: This work presents Insert Anything, a unified framework for reference-based image insertion that seamlessly integrates objects from reference images into target scenes under flexible, user-specified control guidance. Instead of training separate models for individual tasks, our approach is trained once on our new AnyInsertion dataset--comprising 120K prompt-image pairs covering diverse tasks such as person, object, and garment insertion--and effortlessly generalizes to a wide range of insertion scenarios. Such a challenging setting requires capturing both identity features and fine-grained details, while allowing versatile local adaptations in style, color, and texture. To this end, we propose to leverage the multimodal attention of the Diffusion Transformer (DiT) to support both mask- and text-guided editing. Furthermore, we introduce an in-context editing mechanism that treats the reference image as contextual information, employing two prompting strategies to harmonize the inserted elements with the target scene while faithfully preserving their distinctive features. Extensive experiments on AnyInsertion, DreamBooth, and VTON-HD benchmarks demonstrate that our method consistently outperforms existing alternatives, underscoring its great potential in real-world applications such as creative content generation, virtual try-on, and scene composition.

Title: Stay Hungry, Stay Foolish: On the Extended Reading Articles Generation with LLMs

Authors: Yow-Fu Liou, Yu-Chien Tang, An-Zi Yen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.15013
Pdf URL: https://arxiv.org/pdf/2504.15013
Copy Paste: [[2504.15013]] Stay Hungry, Stay Foolish: On the Extended Reading Articles Generation with LLMs(https://arxiv.org/abs/2504.15013)
Keywords: large language model
Abstract: The process of creating educational materials is both time-consuming and demanding for educators. This research explores the potential of Large Language Models (LLMs) to streamline this task by automating the generation of extended reading materials and relevant course suggestions. Using the TED-Ed Dig Deeper sections as an initial exploration, we investigate how supplementary articles can be enriched with contextual knowledge and connected to additional learning resources. Our method begins by generating extended articles from video transcripts, leveraging LLMs to include historical insights, cultural examples, and illustrative anecdotes. A recommendation system employing semantic similarity ranking identifies related courses, followed by an LLM-based refinement process to enhance relevance. The final articles are tailored to seamlessly integrate these recommendations, ensuring they remain cohesive and informative. Experimental evaluations demonstrate that our model produces high-quality content and accurate course suggestions, assessed through metrics such as Hit Rate, semantic similarity, and coherence. Our experimental analysis highlight the nuanced differences between the generated and existing materials, underscoring the model's capacity to offer more engaging and accessible learning experiences. This study showcases how LLMs can bridge the gap between core content and supplementary learning, providing students with additional recommended resources while also assisting teachers in designing educational materials.

Title: Gaussian Shading++: Rethinking the Realistic Deployment Challenge of Performance-Lossless Image Watermark for Diffusion Models

Authors: Zijin Yang, Xin Zhang, Kejiang Chen, Kai Zeng, Qiyi Yao, Han Fang, Weiming Zhang, Nenghai Yu
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2504.15026
Pdf URL: https://arxiv.org/pdf/2504.15026
Copy Paste: [[2504.15026]] Gaussian Shading++: Rethinking the Realistic Deployment Challenge of Performance-Lossless Image Watermark for Diffusion Models(https://arxiv.org/abs/2504.15026)
Keywords: protect, attack, robust, extraction, watermark, diffusion
Abstract: Ethical concerns surrounding copyright protection and inappropriate content generation pose challenges for the practical implementation of diffusion models. One effective solution involves watermarking the generated images. Existing methods primarily focus on ensuring that watermark embedding does not degrade the model performance. However, they often overlook critical challenges in real-world deployment scenarios, such as the complexity of watermark key management, user-defined generation parameters, and the difficulty of verification by arbitrary third parties. To address this issue, we propose Gaussian Shading++, a diffusion model watermarking method tailored for real-world deployment. We propose a double-channel design that leverages pseudorandom error-correcting codes to encode the random seed required for watermark pseudorandomization, achieving performance-lossless watermarking under a fixed watermark key and overcoming key management challenges. Additionally, we model the distortions introduced during generation and inversion as an additive white Gaussian noise channel and employ a novel soft decision decoding strategy during extraction, ensuring strong robustness even when generation parameters vary. To enable third-party verification, we incorporate public key signatures, which provide a certain level of resistance against forgery attacks even when model inversion capabilities are fully disclosed. Extensive experiments demonstrate that Gaussian Shading++ not only maintains performance losslessness but also outperforms existing methods in terms of robustness, making it a more practical solution for real-world deployment.

Title: DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models

Authors: Chengyu Wang, Junbing Yan, Yuanhao Yue, Jun Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.15027
Pdf URL: https://arxiv.org/pdf/2504.15027
Copy Paste: [[2504.15027]] DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models(https://arxiv.org/abs/2504.15027)
Keywords: large language model
Abstract: Enhancing computational efficiency and reducing deployment costs for large language models (LLMs) have become critical challenges in various resource-constrained scenarios. In this work, we present DistilQwen2.5, a family of distilled, lightweight LLMs derived from the public Qwen2.5 models. These distilled models exhibit enhanced instruction-following capabilities compared to the original models based on a series of distillation techniques that incorporate knowledge from much larger LLMs. In our industrial practice, we first leverage powerful proprietary LLMs with varying capacities as multi-agent teachers to select, rewrite, and refine instruction-response pairs that are more suitable for student LLMs to learn. After standard fine-tuning, we further leverage a computationally efficient model fusion approach that enables student models to progressively integrate fine-grained hidden knowledge from their teachers. Experimental evaluations demonstrate that the distilled models possess significantly stronger capabilities than their original checkpoints. Additionally, we present use cases to illustrate the applications of our framework in real-world scenarios. To facilitate practical use, we have released all the DistilQwen2.5 models to the open-source community.

Title: DyST-XL: Dynamic Layout Planning and Content Control for Compositional Text-to-Video Generation

Authors: Weijie He, Mushui Liu, Yunlong Yu, Zhao Wang, Chao Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.15032
Pdf URL: https://arxiv.org/pdf/2504.15032
Copy Paste: [[2504.15032]] DyST-XL: Dynamic Layout Planning and Content Control for Compositional Text-to-Video Generation(https://arxiv.org/abs/2504.15032)
Keywords: diffusion, large language model
Abstract: Compositional text-to-video generation, which requires synthesizing dynamic scenes with multiple interacting entities and precise spatial-temporal relationships, remains a critical challenge for diffusion-based models. Existing methods struggle with layout discontinuity, entity identity drift, and implausible interaction dynamics due to unconstrained cross-attention mechanisms and inadequate physics-aware reasoning. To address these limitations, we propose DyST-XL, a \textbf{training-free} framework that enhances off-the-shelf text-to-video models (e.g., CogVideoX-5B) through frame-aware control. DyST-XL integrates three key innovations: (1) A Dynamic Layout Planner that leverages large language models (LLMs) to parse input prompts into entity-attribute graphs and generates physics-aware keyframe layouts, with intermediate frames interpolated via trajectory optimization; (2) A Dual-Prompt Controlled Attention Mechanism that enforces localized text-video alignment through frame-aware attention masking, achieving the precise control over individual entities; and (3) An Entity-Consistency Constraint strategy that propagates first-frame feature embeddings to subsequent frames during denoising, preserving object identity without manual annotation. Experiments demonstrate that DyST-XL excels in compositional text-to-video generation, significantly improving performance on complex prompts and bridging a crucial gap in training-free video synthesis.

Title: SOLIDO: A Robust Watermarking Method for Speech Synthesis via Low-Rank Adaptation

Authors: Yue Li, Weizhi Liu, Dongdong Lin
Subjects: cs.CR, cs.AI, cs.SD
Abstract URL: https://arxiv.org/abs/2504.15035
Pdf URL: https://arxiv.org/pdf/2504.15035
Copy Paste: [[2504.15035]] SOLIDO: A Robust Watermarking Method for Speech Synthesis via Low-Rank Adaptation(https://arxiv.org/abs/2504.15035)
Keywords: security, attack, robust, extraction, watermark, diffusion, generative
Abstract: The accelerated advancement of speech generative models has given rise to security issues, including model infringement and unauthorized abuse of content. Although existing generative watermarking techniques have proposed corresponding solutions, most methods require substantial computational overhead and training costs. In addition, some methods have limitations in robustness when handling variable-length inputs. To tackle these challenges, we propose \textsc{SOLIDO}, a novel generative watermarking method that integrates parameter-efficient fine-tuning with speech watermarking through low-rank adaptation (LoRA) for speech diffusion models. Concretely, the watermark encoder converts the watermark to align with the input of diffusion models. To achieve precise watermark extraction from variable-length inputs, the watermark decoder based on depthwise separable convolution is designed for watermark recovery. To further enhance speech generation performance and watermark extraction capability, we propose a speech-driven lightweight fine-tuning strategy, which reduces computational overhead through LoRA. Comprehensive experiments demonstrate that the proposed method ensures high-fidelity watermarked speech even at a large capacity of 2000 bps. Furthermore, against common individual and compound speech attacks, our SOLIDO achieves a maximum average extraction accuracy of 99.20\% and 98.43\%, respectively. It surpasses other state-of-the-art methods by nearly 23\% in resisting time-stretching attacks.

Title: A Call for New Recipes to Enhance Spatial Reasoning in MLLMs

Authors: Huanyu Zhang, Chengzu Li, Wenshan Wu, Shaoguang Mao, Yan xia, Ivan Vulić, Zhang Zhang, Liang Wang, Tieniu Tan, Furu Wei
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.15037
Pdf URL: https://arxiv.org/pdf/2504.15037
Copy Paste: [[2504.15037]] A Call for New Recipes to Enhance Spatial Reasoning in MLLMs(https://arxiv.org/abs/2504.15037)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive performance in general vision-language tasks. However, recent studies have exposed critical limitations in their spatial reasoning capabilities. This deficiency in spatial reasoning significantly constrains MLLMs' ability to interact effectively with the physical world, thereby limiting their broader applications. We argue that spatial reasoning capabilities will not naturally emerge from merely scaling existing architectures and training methodologies. Instead, this challenge demands dedicated attention to fundamental modifications in the current MLLM development approach. In this position paper, we first establish a comprehensive framework for spatial reasoning within the context of MLLMs. We then elaborate on its pivotal role in real-world applications. Through systematic analysis, we examine how individual components of the current methodology-from training data to reasoning mechanisms-influence spatial reasoning capabilities. This examination reveals critical limitations while simultaneously identifying promising avenues for advancement. Our work aims to direct the AI research community's attention toward these crucial yet underexplored aspects. By highlighting these challenges and opportunities, we seek to catalyze progress toward achieving human-like spatial reasoning capabilities in MLLMs.

Title: RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search

Authors: Quy-Anh Dang, Chris Ngo, Truong-Son Hy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.15047
Pdf URL: https://arxiv.org/pdf/2504.15047
Copy Paste: [[2504.15047]] RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search(https://arxiv.org/abs/2504.15047)
Keywords: attack, large language model
Abstract: Large Language Models (LLMs) exhibit remarkable capabilities but are susceptible to adversarial prompts that exploit vulnerabilities to produce unsafe or biased outputs. Existing red-teaming methods often face scalability challenges, resource-intensive requirements, or limited diversity in attack strategies. We propose RainbowPlus, a novel red-teaming framework rooted in evolutionary computation, enhancing adversarial prompt generation through an adaptive quality-diversity (QD) search that extends classical evolutionary algorithms like MAP-Elites with innovations tailored for language models. By employing a multi-element archive to store diverse high-quality prompts and a comprehensive fitness function to evaluate multiple prompts concurrently, RainbowPlus overcomes the constraints of single-prompt archives and pairwise comparisons in prior QD methods like Rainbow Teaming. Experiments comparing RainbowPlus to QD methods across six benchmark datasets and four open-source LLMs demonstrate superior attack success rate (ASR) and diversity (Diverse-Score $\approx 0.84$), generating up to 100 times more unique prompts (e.g., 10,418 vs. 100 for Ministral-8B-Instruct-2410). Against nine state-of-the-art methods on the HarmBench dataset with twelve LLMs (ten open-source, two closed-source), RainbowPlus achieves an average ASR of 81.1%, surpassing AutoDAN-Turbo by 3.9%, and is 9 times faster (1.45 vs. 13.50 hours). Our open-source implementation fosters further advancements in LLM safety, offering a scalable tool for vulnerability assessment. Code and resources are publicly available at this https URL, supporting reproducibility and future research in LLM red-teaming.

Title: ScanEdit: Hierarchically-Guided Functional 3D Scan Editing

Authors: Mohamed el amine Boudjoghra, Ivan Laptev, Angela Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.15049
Pdf URL: https://arxiv.org/pdf/2504.15049
Copy Paste: [[2504.15049]] ScanEdit: Hierarchically-Guided Functional 3D Scan Editing(https://arxiv.org/abs/2504.15049)
Keywords: large language model
Abstract: With the fast pace of 3D capture technology and resulting abundance of 3D data, effective 3D scene editing becomes essential for a variety of graphics applications. In this work we present ScanEdit, an instruction-driven method for functional editing of complex, real-world 3D scans. To model large and interdependent sets of ob- jectswe propose a hierarchically-guided approach. Given a 3D scan decomposed into its object instances, we first construct a hierarchical scene graph representation to enable effective, tractable editing. We then leverage reason- ing capabilities of Large Language Models (LLMs) and translate high-level language instructions into actionable commands applied hierarchically to the scene graph. Fi- nally, ScanEdit integrates LLM-based guidance with ex- plicit physical constraints and generates realistic scenes where object arrangements obey both physics and common sense. In our extensive experimental evaluation ScanEdit outperforms state of the art and demonstrates excellent re- sults for a variety of real-world scenes and input instruc- tions.

Title: Testing LLMs' Capabilities in Annotating Translations Based on an Error Typology Designed for LSP Translation: First Experiments with ChatGPT

Authors: Joachim Minder, Guillaume Wisniewski, Natalie Kübler
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2504.15052
Pdf URL: https://arxiv.org/pdf/2504.15052
Copy Paste: [[2504.15052]] Testing LLMs' Capabilities in Annotating Translations Based on an Error Typology Designed for LSP Translation: First Experiments with ChatGPT(https://arxiv.org/abs/2504.15052)
Keywords: large language model
Abstract: This study investigates the capabilities of large language models (LLMs), specifically ChatGPT, in annotating MT outputs based on an error typology. In contrast to previous work focusing mainly on general language, we explore ChatGPT's ability to identify and categorise errors in specialised translations. By testing two different prompts and based on a customised error typology, we compare ChatGPT annotations with human expert evaluations of translations produced by DeepL and ChatGPT itself. The results show that, for translations generated by DeepL, recall and precision are quite high. However, the degree of accuracy in error categorisation depends on the prompt's specific features and its level of detail, ChatGPT performing very well with a detailed prompt. When evaluating its own translations, ChatGPT achieves significantly poorer results, revealing limitations with self-assessment. These results highlight both the potential and the limitations of LLMs for translation evaluation, particularly in specialised domains. Our experiments pave the way for future research on open-source LLMs, which could produce annotations of comparable or even higher quality. In the future, we also aim to test the practical effectiveness of this automated evaluation in the context of translation training, particularly by optimising the process of human evaluation by teachers and by exploring the impact of annotations by LLMs on students' post-editing and translation learning.

Title: Structure-guided Diffusion Transformer for Low-Light Image Enhancement

Authors: Xiangchen Yin, Zhenda Yu, Longtao Jiang, Xin Gao, Xiao Sun, Zhi Liu, Xun Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.15054
Pdf URL: https://arxiv.org/pdf/2504.15054
Copy Paste: [[2504.15054]] Structure-guided Diffusion Transformer for Low-Light Image Enhancement(https://arxiv.org/abs/2504.15054)
Keywords: diffusion, transformer
Abstract: While the diffusion transformer (DiT) has become a focal point of interest in recent years, its application in low-light image enhancement remains a blank area for exploration. Current methods recover the details from low-light images while inevitably amplifying the noise in images, resulting in poor visual quality. In this paper, we firstly introduce DiT into the low-light enhancement task and design a novel Structure-guided Diffusion Transformer based Low-light image enhancement (SDTL) framework. We compress the feature through wavelet transform to improve the inference efficiency of the model and capture the multi-directional frequency band. Then we propose a Structure Enhancement Module (SEM) that uses structural prior to enhance the texture and leverages an adaptive fusion strategy to achieve more accurate enhancement effect. In Addition, we propose a Structure-guided Attention Block (SAB) to pay more attention to texture-riched tokens and avoid interference from noisy areas in noise prediction. Extensive qualitative and quantitative experiments demonstrate that our method achieves SOTA performance on several popular datasets, validating the effectiveness of SDTL in improving image quality and the potential of DiT in low-light enhancement tasks.

Title: Mining Characteristics of Vulnerable Smart Contracts Across Lifecycle Stages

Authors: Hongli Peng, Xiaoqi Li, Wenkai Li
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2504.15063
Pdf URL: https://arxiv.org/pdf/2504.15063
Copy Paste: [[2504.15063]] Mining Characteristics of Vulnerable Smart Contracts Across Lifecycle Stages(https://arxiv.org/abs/2504.15063)
Keywords: security
Abstract: Smart contracts are the cornerstone of decentralized applications and financial protocols, which extend the application of digital currency transactions. The applications and financial protocols introduce significant security challenges, resulting in substantial economic losses. Existing solutions predominantly focus on code vulnerabilities within smart contracts, accounting for only 50% of security incidents. Therefore, a more comprehensive study of security issues related to smart contracts is imperative. The existing empirical research realizes the static analysis of smart contracts from the perspective of the lifecycle and gives the corresponding measures for each stage. However, they lack the characteristic analysis of vulnerabilities in each stage and the distinction between the vulnerabilities. In this paper, we present the first empirical study on the security of smart contracts throughout their lifecycle, including deployment and execution, upgrade, and destruction stages. It delves into the security issues at each stage and provides at least seven feature descriptions. Finally, utilizing these seven features, five machine-learning classification models are used to identify vulnerabilities at different stages. The classification results reveal that vulnerable contracts exhibit distinct transaction features and ego network properties at various stages.

Title: Think2SQL: Reinforce LLM Reasoning Capabilities for Text2SQL

Authors: Simone Papicchio, Simone Rossi, Luca Cagliero, Paolo Papotti
Subjects: cs.LG, cs.DB
Abstract URL: https://arxiv.org/abs/2504.15077
Pdf URL: https://arxiv.org/pdf/2504.15077
Copy Paste: [[2504.15077]] Think2SQL: Reinforce LLM Reasoning Capabilities for Text2SQL(https://arxiv.org/abs/2504.15077)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown impressive capabilities in transforming natural language questions about relational databases into SQL queries. Despite recent improvements, small LLMs struggle to handle questions involving multiple tables and complex SQL patterns under a Zero-Shot Learning (ZSL) setting. Supervised Fine-Tuning (SFT) partially compensate the knowledge deficits in pretrained models but falls short while dealing with queries involving multi-hop reasoning. To bridge this gap, different LLM training strategies to reinforce reasoning capabilities have been proposed, ranging from leveraging a thinking process within ZSL, including reasoning traces in SFT, or adopt Reinforcement Learning (RL) strategies. However, the influence of reasoning on Text2SQL performance is still largely unexplored. This paper investigates to what extent LLM reasoning capabilities influence their Text2SQL performance on four benchmark datasets. To this end, it considers the following LLM settings: (1) ZSL, including general-purpose reasoning or not; (2) SFT, with and without task-specific reasoning traces; (3) RL, leveraging execution accuracy as primary reward function; (4) SFT+RL, i.e, a two-stage approach that combines SFT and RL. The results show that general-purpose reasoning under ZSL proves to be ineffective in tackling complex Text2SQL cases. Small LLMs benefit from SFT with reasoning much more than larger ones, bridging the gap of their (weaker) model pretraining. RL is generally beneficial across all tested models and datasets, particularly when SQL queries involve multi-hop reasoning and multiple tables. Small LLMs with SFT+RL excel on most complex datasets thanks to a strategic balance between generality of the reasoning process and optimization of the execution accuracy. Thanks to RL, the7B Qwen-Coder-2.5 model performs on par with 100+ Billion ones on the Bird dataset.

Title: Federated Latent Factor Model for Bias-Aware Recommendation with Privacy-Preserving

Authors: Junxiang Gao, Yixin Ran, Jia Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.15090
Pdf URL: https://arxiv.org/pdf/2504.15090
Copy Paste: [[2504.15090]] Federated Latent Factor Model for Bias-Aware Recommendation with Privacy-Preserving(https://arxiv.org/abs/2504.15090)
Keywords: secure, privacy, federate
Abstract: A recommender system (RS) aims to provide users with personalized item recommendations, enhancing their overall experience. Traditional RSs collect and process all user data on a central server. However, this centralized approach raises significant privacy concerns, as it increases the risk of data breaches and privacy leakages, which are becoming increasingly unacceptable to privacy-sensitive users. To address these privacy challenges, federated learning has been integrated into RSs, ensuring that user data remains secure. In centralized RSs, the issue of rating bias is effectively addressed by jointly analyzing all users' raw interaction data. However, this becomes a significant challenge in federated RSs, as raw data is no longer accessible due to privacy-preserving constraints. To overcome this problem, we propose a Federated Bias-Aware Latent Factor (FBALF) model. In FBALF, training bias is explicitly incorporated into every local model's loss function, allowing for the effective elimination of rating bias without compromising data privacy. Extensive experiments conducted on three real-world datasets demonstrate that FBALF achieves significantly higher recommendation accuracy compared to other state-of-the-art federated RSs.

Title: Rethinking the Potential of Multimodality in Collaborative Problem Solving Diagnosis with Large Language Models

Authors: K. Wong, B. Wu, S. Bulathwela, M. Cukurova
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.15093
Pdf URL: https://arxiv.org/pdf/2504.15093
Copy Paste: [[2504.15093]] Rethinking the Potential of Multimodality in Collaborative Problem Solving Diagnosis with Large Language Models(https://arxiv.org/abs/2504.15093)
Keywords: transformer, large language model
Abstract: Detecting collaborative and problem-solving behaviours from digital traces to interpret students' collaborative problem solving (CPS) competency is a long-term goal in the Artificial Intelligence in Education (AIEd) field. Although multimodal data and advanced models are argued to have the potential to detect complex CPS behaviours, empirical evidence on their value remains limited with some contrasting evidence. In this study, we investigated the potential of multimodal data to improve model performance in diagnosing 78 secondary school students' CPS subskills and indicators in authentic educational settings. In particular, text embeddings from verbal data and acoustic embeddings from audio data were used in a multimodal classification model for CPS diagnosis. Both unimodal and multimodal transformer-based models outperformed traditional models in detecting CPS classes. Although the inclusion of multimodality did not improve the performance of traditional unimodal models, its integration into transformer-based models demonstrated improved performance for diagnosing social-cognitive CPS classes compared to unimodal transformer-based models. Based on the results, the paper argues that multimodality and the selection of a particular modelling technique should not be taken for granted to achieve the best performance in the automated detection of every CPS subskill and indicator. Rather, their value is limited to certain types of CPS indicators, affected by the complexity of the labels, and dependent on the composition of indicators in the dataset. We conclude the paper by discussing the required nuance when considering the value of LLMs and multimodality in automated CPS diagnosis, highlighting the need for human-AI complementarity, and proposing the exploration of relevant model architectures and techniques to improve CPS diagnosis in authentic educational contexts.

Title: VistaDepth: Frequency Modulation With Bias Reweighting For Enhanced Long-Range Depth Estimation

Authors: Mingxia Zhan, Li Zhang, XiaoMeng Chu, Beibei Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.15095
Pdf URL: https://arxiv.org/pdf/2504.15095
Copy Paste: [[2504.15095]] VistaDepth: Frequency Modulation With Bias Reweighting For Enhanced Long-Range Depth Estimation(https://arxiv.org/abs/2504.15095)
Keywords: diffusion
Abstract: Monocular depth estimation (MDE) aims to predict per-pixel depth values from a single RGB image. Recent advancements have positioned diffusion models as effective MDE tools by framing the challenge as a conditional image generation task. Despite their progress, these methods often struggle with accurately reconstructing distant depths, due largely to the imbalanced distribution of depth values and an over-reliance on spatial-domain features. To overcome these limitations, we introduce VistaDepth, a novel framework that integrates adaptive frequency-domain feature enhancements with an adaptive weight-balancing mechanism into the diffusion process. Central to our approach is the Latent Frequency Modulation (LFM) module, which dynamically refines spectral responses in the latent feature space, thereby improving the preservation of structural details and reducing noisy artifacts. Furthermore, we implement an adaptive weighting strategy that modulates the diffusion loss in real-time, enhancing the model's sensitivity towards distant depth reconstruction. These innovations collectively result in superior depth perception performance across both distance and detail. Experimental evaluations confirm that VistaDepth achieves state-of-the-art performance among diffusion-based MDE techniques, particularly excelling in the accurate reconstruction of distant regions.

Title: Fast-Slow Co-advancing Optimizer: Toward Harmonious Adversarial Training of GAN

Authors: Lin Wang, Xiancheng Wang, Rui Wang, Zhibo Zhang, Minghang Zhao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.15099
Pdf URL: https://arxiv.org/pdf/2504.15099
Copy Paste: [[2504.15099]] Fast-Slow Co-advancing Optimizer: Toward Harmonious Adversarial Training of GAN(https://arxiv.org/abs/2504.15099)
Keywords: generative
Abstract: Up to now, the training processes of typical Generative Adversarial Networks (GANs) are still particularly sensitive to data properties and hyperparameters, which may lead to severe oscillations, difficulties in convergence, or even failures to converge, especially when the overall variances of the training sets are large. These phenomena are often attributed to the training characteristics of such networks. Aiming at the problem, this paper develops a new intelligent optimizer, Fast-Slow Co-advancing Optimizer (FSCO), which employs reinforcement learning in the training process of GANs to make training easier. Specifically, this paper allows the training step size to be controlled by an agent to improve training stability, and makes the training process more intelligent with variable learning rates, making GANs less sensitive to step size. Experiments have been conducted on three benchmark datasets to verify the effectiveness of the developed FSCO.

Title: Kuwain 1.5B: An Arabic SLM via Language Injection

Authors: Khalil Hennara, Sara Chrouf, Mohamed Motaism Hamed, Zeina Aldallal, Omar Hadid, Safwan AlModhayan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.15120
Pdf URL: https://arxiv.org/pdf/2504.15120
Copy Paste: [[2504.15120]] Kuwain 1.5B: An Arabic SLM via Language Injection(https://arxiv.org/abs/2504.15120)
Keywords: large language model
Abstract: Enhancing existing models with new knowledge is a crucial aspect of AI development. This paper introduces a novel method for integrating a new language into a large language model (LLM). Our approach successfully incorporates a previously unseen target language into an existing LLM without compromising its prior knowledge. We trained a tiny model with 1.5 billion parameters named Kuwain by injecting the Arabic language into a small open-source model mainly trained in English. Our method demonstrates significant improvements in Arabic language performance, with an average 8% improvement across various benchmarks, while retaining the model's existing knowledge with a minimum amount of the original model's data. This offers a cost-effective alternative to training a comprehensive model in both English and Arabic. The results highlight the potential for efficient, targeted language model expansion without extensive retraining or resource-intensive processes.

Title: Robust and Real-time Surface Normal Estimation from Stereo Disparities using Affine Transformations

Authors: Csongor Csanad Kariko, Muhammad Rafi Faisal, Levente Hajder
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.15121
Pdf URL: https://arxiv.org/pdf/2504.15121
Copy Paste: [[2504.15121]] Robust and Real-time Surface Normal Estimation from Stereo Disparities using Affine Transformations(https://arxiv.org/abs/2504.15121)
Keywords: robust
Abstract: This work introduces a novel method for surface normal estimation from rectified stereo image pairs, leveraging affine transformations derived from disparity values to achieve fast and accurate results. We demonstrate how the rectification of stereo image pairs simplifies the process of surface normal estimation by reducing computational complexity. To address noise reduction, we develop a custom algorithm inspired by convolutional operations, tailored to process disparity data efficiently. We also introduce adaptive heuristic techniques for efficiently detecting connected surface components within the images, further improving the robustness of the method. By integrating these methods, we construct a surface normal estimator that is both fast and accurate, producing a dense, oriented point cloud as the final output. Our method is validated using both simulated environments and real-world stereo images from the Middlebury and Cityscapes datasets, demonstrating significant improvements in real-time performance and accuracy when implemented on a GPU. Upon acceptance, the shader source code will be made publicly available to facilitate further research and reproducibility.

Title: MoBGS: Motion Deblurring Dynamic 3D Gaussian Splatting for Blurry Monocular Video

Authors: Minh-Quan Viet Bui, Jongmin Park, Juan Luis Gonzalez Bello, Jaeho Moon, Jihyong Oh, Munchurl Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.15122
Pdf URL: https://arxiv.org/pdf/2504.15122
Copy Paste: [[2504.15122]] MoBGS: Motion Deblurring Dynamic 3D Gaussian Splatting for Blurry Monocular Video(https://arxiv.org/abs/2504.15122)
Keywords: robust
Abstract: We present MoBGS, a novel deblurring dynamic 3D Gaussian Splatting (3DGS) framework capable of reconstructing sharp and high-quality novel spatio-temporal views from blurry monocular videos in an end-to-end manner. Existing dynamic novel view synthesis (NVS) methods are highly sensitive to motion blur in casually captured videos, resulting in significant degradation of rendering quality. While recent approaches address motion-blurred inputs for NVS, they primarily focus on static scene reconstruction and lack dedicated motion modeling for dynamic objects. To overcome these limitations, our MoBGS introduces a novel Blur-adaptive Latent Camera Estimation (BLCE) method for effective latent camera trajectory estimation, improving global camera motion deblurring. In addition, we propose a physically-inspired Latent Camera-induced Exposure Estimation (LCEE) method to ensure consistent deblurring of both global camera and local object motion. Our MoBGS framework ensures the temporal consistency of unseen latent timestamps and robust motion decomposition of static and dynamic regions. Extensive experiments on the Stereo Blur dataset and real-world blurry videos show that our MoBGS significantly outperforms the very recent advanced methods (DyBluRF and Deblur4DGS), achieving state-of-the-art performance for dynamic NVS under motion blur.

Title: EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models

Authors: Ziwen Xu, Shuxun Wang, Kewei Xu, Haoming Xu, Mengru Wang, Xinle Deng, Yunzhi Yao, Guozhou Zheng, Huajun Chen, Ningyu Zhang
Subjects: cs.CL, cs.AI, cs.CV, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2504.15133
Pdf URL: https://arxiv.org/pdf/2504.15133
Copy Paste: [[2504.15133]] EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models(https://arxiv.org/abs/2504.15133)
Keywords: large language model
Abstract: In this paper, we introduce EasyEdit2, a framework designed to enable plug-and-play adjustability for controlling Large Language Model (LLM) behaviors. EasyEdit2 supports a wide range of test-time interventions, including safety, sentiment, personality, reasoning patterns, factuality, and language features. Unlike its predecessor, EasyEdit2 features a new architecture specifically designed for seamless model steering. It comprises key modules such as the steering vector generator and the steering vector applier, which enable automatic generation and application of steering vectors to influence the model's behavior without modifying its parameters. One of the main advantages of EasyEdit2 is its ease of use-users do not need extensive technical knowledge. With just a single example, they can effectively guide and adjust the model's responses, making precise control both accessible and efficient. Empirically, we report model steering performance across different LLMs, demonstrating the effectiveness of these techniques. We have released the source code on GitHub at this https URL along with a demonstration notebook. In addition, we provide a demo video at this https URL for a quick introduction.

Title: GIFDL: Generated Image Fluctuation Distortion Learning for Enhancing Steganographic Security

Authors: Xiangkun Wang, Kejiang Chen, Yuang Qi, Ruiheng Liu, Weiming Zhang, Nenghai Yu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.15139
Pdf URL: https://arxiv.org/pdf/2504.15139
Copy Paste: [[2504.15139]] GIFDL: Generated Image Fluctuation Distortion Learning for Enhancing Steganographic Security(https://arxiv.org/abs/2504.15139)
Keywords: security
Abstract: Minimum distortion steganography is currently the mainstream method for modification-based steganography. A key issue in this method is how to define steganographic distortion. With the rapid development of deep learning technology, the definition of distortion has evolved from manual design to deep learning design. Concurrently, rapid advancements in image generation have made generated images viable as cover media. However, existing distortion design methods based on machine learning do not fully leverage the advantages of generated cover media, resulting in suboptimal security performance. To address this issue, we propose GIFDL (Generated Image Fluctuation Distortion Learning), a steganographic distortion learning method based on the fluctuations in generated images. Inspired by the idea of natural steganography, we take a series of highly similar fluctuation images as the input to the steganographic distortion generator and introduce a new GAN training strategy to disguise stego images as fluctuation images. Experimental results demonstrate that GIFDL, compared with state-of-the-art GAN-based distortion learning methods, exhibits superior resistance to steganalysis, increasing the detection error rates by an average of 3.30% across three steganalyzers.

Title: Landmark-Free Preoperative-to-Intraoperative Registration in Laparoscopic Liver Resection

Authors: Jun Zhou, Bingchen Gao, Kai Wang, Jialun Pei, Pheng-Ann Heng, Jing Qin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.15152
Pdf URL: https://arxiv.org/pdf/2504.15152
Copy Paste: [[2504.15152]] Landmark-Free Preoperative-to-Intraoperative Registration in Laparoscopic Liver Resection(https://arxiv.org/abs/2504.15152)
Keywords: robust, transformer
Abstract: Liver registration by overlaying preoperative 3D models onto intraoperative 2D frames can assist surgeons in perceiving the spatial anatomy of the liver clearly for a higher surgical success rate. Existing registration methods rely heavily on anatomical landmark-based workflows, which encounter two major limitations: 1) ambiguous landmark definitions fail to provide efficient markers for registration; 2) insufficient integration of intraoperative liver visual information in shape deformation modeling. To address these challenges, in this paper, we propose a landmark-free preoperative-to-intraoperative registration framework utilizing effective self-supervised learning, termed \ourmodel. This framework transforms the conventional 3D-2D workflow into a 3D-3D registration pipeline, which is then decoupled into rigid and non-rigid registration subtasks. \ourmodel~first introduces a feature-disentangled transformer to learn robust correspondences for recovering rigid transformations. Further, a structure-regularized deformation network is designed to adjust the preoperative model to align with the intraoperative liver surface. This network captures structural correlations through geometry similarity modeling in a low-rank transformer network. To facilitate the validation of the registration performance, we also construct an in-vivo registration dataset containing liver resection videos of 21 patients, called \emph{P2I-LReg}, which contains 346 keyframes that provide a global view of the liver together with liver mask annotations and calibrated camera intrinsic parameters. Extensive experiments and user studies on both synthetic and in-vivo datasets demonstrate the superiority and potential clinical applicability of our method.

Title: Acquire and then Adapt: Squeezing out Text-to-Image Model for Image Restoration

Authors: Junyuan Deng, Xinyi Wu, Yongxing Yang, Congchao Zhu, Song Wang, Zhenyao Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.15159
Pdf URL: https://arxiv.org/pdf/2504.15159
Copy Paste: [[2504.15159]] Acquire and then Adapt: Squeezing out Text-to-Image Model for Image Restoration(https://arxiv.org/abs/2504.15159)
Keywords: privacy, diffusion, transformer, generative
Abstract: Recently, pre-trained text-to-image (T2I) models have been extensively adopted for real-world image restoration because of their powerful generative prior. However, controlling these large models for image restoration usually requires a large number of high-quality images and immense computational resources for training, which is costly and not privacy-friendly. In this paper, we find that the well-trained large T2I model (i.e., Flux) is able to produce a variety of high-quality images aligned with real-world distributions, offering an unlimited supply of training samples to mitigate the above issue. Specifically, we proposed a training data construction pipeline for image restoration, namely FluxGen, which includes unconditional image generation, image selection, and degraded image simulation. A novel light-weighted adapter (FluxIR) with squeeze-and-excitation layers is also carefully designed to control the large Diffusion Transformer (DiT)-based T2I model so that reasonable details can be restored. Experiments demonstrate that our proposed method enables the Flux model to adapt effectively to real-world image restoration tasks, achieving superior scores and visual quality on both synthetic and real-world degradation datasets - at only about 8.5\% of the training cost compared to current approaches.

Title: The Synthetic Imputation Approach: Generating Optimal Synthetic Texts For Underrepresented Categories In Supervised Classification Tasks

Authors: Joan C. Timoneda
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.15160
Pdf URL: https://arxiv.org/pdf/2504.15160
Copy Paste: [[2504.15160]] The Synthetic Imputation Approach: Generating Optimal Synthetic Texts For Underrepresented Categories In Supervised Classification Tasks(https://arxiv.org/abs/2504.15160)
Keywords: generative, large language model
Abstract: Encoder-decoder Large Language Models (LLMs), such as BERT and RoBERTa, require that all categories in an annotation task be sufficiently represented in the training data for optimal performance. However, it is often difficult to find sufficient examples for all categories in a task when building a high-quality training set. In this article, I describe this problem and propose a solution, the synthetic imputation approach. Leveraging a generative LLM (GPT-4o), this approach generates synthetic texts based on careful prompting and five original examples drawn randomly with replacement from the sample. This approach ensures that new synthetic texts are sufficiently different from the original texts to reduce overfitting, but retain the underlying substantive meaning of the examples to maximize out-of-sample performance. With 75 original examples or more, synthetic imputation's performance is on par with a full sample of original texts, and overfitting remains low, predictable and correctable with 50 original samples. The synthetic imputation approach provides a novel role for generative LLMs in research and allows applied researchers to balance their datasets for best performance.

Title: Survey of Loss Augmented Knowledge Tracing

Authors: Altun Shukurlu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.15163
Pdf URL: https://arxiv.org/pdf/2504.15163
Copy Paste: [[2504.15163]] Survey of Loss Augmented Knowledge Tracing(https://arxiv.org/abs/2504.15163)
Keywords: robust
Abstract: The training of artificial neural networks is heavily dependent on the careful selection of an appropriate loss function. While commonly used loss functions, such as cross-entropy and mean squared error (MSE), generally suffice for a broad range of tasks, challenges often emerge due to limitations in data quality or inefficiencies within the learning process. In such circumstances, the integration of supplementary terms into the loss function can serve to address these challenges, enhancing both model performance and robustness. Two prominent techniques, loss regularization and contrastive learning, have been identified as effective strategies for augmenting the capacity of loss functions in artificial neural networks. Knowledge tracing is a compelling area of research that leverages predictive artificial intelligence to facilitate the automation of personalized and efficient educational experiences for students. In this paper, we provide a comprehensive review of the deep learning-based knowledge tracing (DKT) algorithms trained using advanced loss functions and discuss their improvements over prior techniques. We discuss contrastive knowledge tracing algorithms, such as Bi-CLKT, CL4KT, SP-CLKT, CoSKT, and prediction-consistent DKT, providing performance benchmarks and insights into real-world deployment challenges. The survey concludes with future research directions, including hybrid loss strategies and context-aware modeling.

Title: An Efficient Aerial Image Detection with Variable Receptive Fields

Authors: Liu Wenbin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.15165
Pdf URL: https://arxiv.org/pdf/2504.15165
Copy Paste: [[2504.15165]] An Efficient Aerial Image Detection with Variable Receptive Fields(https://arxiv.org/abs/2504.15165)
Keywords: transformer
Abstract: Aerial object detection using unmanned aerial vehicles (UAVs) faces critical challenges including sub-10px targets, dense occlusions, and stringent computational constraints. Existing detectors struggle to balance accuracy and efficiency due to rigid receptive fields and redundant architectures. To address these limitations, we propose Variable Receptive Field DETR (VRF-DETR), a transformer-based detector incorporating three key components: 1) Multi-Scale Context Fusion (MSCF) module that dynamically recalibrates features through adaptive spatial attention and gated multi-scale fusion, 2) Gated Convolution (GConv) layer enabling parameter-efficient local-context modeling via depthwise separable operations and dynamic gating, and 3) Gated Multi-scale Fusion (GMCF) Bottleneck that hierarchically disentangles occluded objects through cascaded global-local interactions. Experiments on VisDrone2019 demonstrate VRF-DETR achieves 51.4\% mAP\textsubscript{50} and 31.8\% mAP\textsubscript{50:95} with only 13.5M parameters. This work establishes a new efficiency-accuracy Pareto frontier for UAV-based detection tasks.

Title: Audio-Visual Class-Incremental Learning for Fish Feeding intensity Assessment in Aquaculture

Authors: Meng Cui, Xianghu Yue, Xinyuan Qian, Jinzheng Zhao, Haohe Liu, Xubo Liu, Daoliang Li, Wenwu Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.15171
Pdf URL: https://arxiv.org/pdf/2504.15171
Copy Paste: [[2504.15171]] Audio-Visual Class-Incremental Learning for Fish Feeding intensity Assessment in Aquaculture(https://arxiv.org/abs/2504.15171)
Keywords: robust
Abstract: Fish Feeding Intensity Assessment (FFIA) is crucial in industrial aquaculture management. Recent multi-modal approaches have shown promise in improving FFIA robustness and efficiency. However, these methods face significant challenges when adapting to new fish species or environments due to catastrophic forgetting and the lack of suitable datasets. To address these limitations, we first introduce AV-CIL-FFIA, a new dataset comprising 81,932 labelled audio-visual clips capturing feeding intensities across six different fish species in real aquaculture environments. Then, we pioneer audio-visual class incremental learning (CIL) for FFIA and demonstrate through benchmarking on AV-CIL-FFIA that it significantly outperforms single-modality methods. Existing CIL methods rely heavily on historical data. Exemplar-based approaches store raw samples, creating storage challenges, while exemplar-free methods avoid data storage but struggle to distinguish subtle feeding intensity variations across different fish species. To overcome these limitations, we introduce HAIL-FFIA, a novel audio-visual class-incremental learning framework that bridges this gap with a prototype-based approach that achieves exemplar-free efficiency while preserving essential knowledge through compact feature representations. Specifically, HAIL-FFIA employs hierarchical representation learning with a dual-path knowledge preservation mechanism that separates general intensity knowledge from fish-specific characteristics. Additionally, it features a dynamic modality balancing system that adaptively adjusts the importance of audio versus visual information based on feeding behaviour stages. Experimental results show that HAIL-FFIA is superior to SOTA methods on AV-CIL-FFIA, achieving higher accuracy with lower storage needs while effectively mitigating catastrophic forgetting in incremental fish species learning.

Title: DSPO: Direct Semantic Preference Optimization for Real-World Image Super-Resolution

Authors: Miaomiao Cai, Simiao Li, Wei Li, Xudong Huang, Hanting Chen, Jie Hu, Yunhe Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.15176
Pdf URL: https://arxiv.org/pdf/2504.15176
Copy Paste: [[2504.15176]] DSPO: Direct Semantic Preference Optimization for Real-World Image Super-Resolution(https://arxiv.org/abs/2504.15176)
Keywords: diffusion, large language model
Abstract: Recent advances in diffusion models have improved Real-World Image Super-Resolution (Real-ISR), but existing methods lack human feedback integration, risking misalignment with human preference and may leading to artifacts, hallucinations and harmful content generation. To this end, we are the first to introduce human preference alignment into Real-ISR, a technique that has been successfully applied in Large Language Models and Text-to-Image tasks to effectively enhance the alignment of generated outputs with human preferences. Specifically, we introduce Direct Preference Optimization (DPO) into Real-ISR to achieve alignment, where DPO serves as a general alignment technique that directly learns from the human preference dataset. Nevertheless, unlike high-level tasks, the pixel-level reconstruction objectives of Real-ISR are difficult to reconcile with the image-level preferences of DPO, which can lead to the DPO being overly sensitive to local anomalies, leading to reduced generation quality. To resolve this dichotomy, we propose Direct Semantic Preference Optimization (DSPO) to align instance-level human preferences by incorporating semantic guidance, which is through two strategies: (a) semantic instance alignment strategy, implementing instance-level alignment to ensure fine-grained perceptual consistency, and (b) user description feedback strategy, mitigating hallucinations through semantic textual feedback on instance-level images. As a plug-and-play solution, DSPO proves highly effective in both one-step and multi-step SR frameworks.

Title: FaceCraft4D: Animated 3D Facial Avatar Generation from a Single Image

Authors: Fei Yin, Mallikarjun B R, Chun-Han Yao, Rafał Mantiuk, Varun Jampani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.15179
Pdf URL: https://arxiv.org/pdf/2504.15179
Copy Paste: [[2504.15179]] FaceCraft4D: Animated 3D Facial Avatar Generation from a Single Image(https://arxiv.org/abs/2504.15179)
Keywords: diffusion
Abstract: We present a novel framework for generating high-quality, animatable 4D avatar from a single image. While recent advances have shown promising results in 4D avatar creation, existing methods either require extensive multiview data or struggle with shape accuracy and identity consistency. To address these limitations, we propose a comprehensive system that leverages shape, image, and video priors to create full-view, animatable avatars. Our approach first obtains initial coarse shape through 3D-GAN inversion. Then, it enhances multiview textures using depth-guided warping signals for cross-view consistency with the help of the image diffusion model. To handle expression animation, we incorporate a video prior with synchronized driving signals across viewpoints. We further introduce a Consistent-Inconsistent training to effectively handle data inconsistencies during 4D reconstruction. Experimental results demonstrate that our method achieves superior quality compared to the prior art, while maintaining consistency across different viewpoints and expressions.

Title: Tiger200K: Manually Curated High Visual Quality Video Dataset from UGC Platform

Authors: Xianpan Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.15182
Pdf URL: https://arxiv.org/pdf/2504.15182
Copy Paste: [[2504.15182]] Tiger200K: Manually Curated High Visual Quality Video Dataset from UGC Platform(https://arxiv.org/abs/2504.15182)
Keywords: generative
Abstract: The recent surge in open-source text-to-video generation models has significantly energized the research community, yet their dependence on proprietary training datasets remains a key constraint. While existing open datasets like Koala-36M employ algorithmic filtering of web-scraped videos from early platforms, they still lack the quality required for fine-tuning advanced video generation models. We present Tiger200K, a manually curated high visual quality video dataset sourced from User-Generated Content (UGC) platforms. By prioritizing visual fidelity and aesthetic quality, Tiger200K underscores the critical role of human expertise in data curation, and providing high-quality, temporally consistent video-text pairs for fine-tuning and optimizing video generation architectures through a simple but effective pipeline including shot boundary detection, OCR, border detecting, motion filter and fine bilingual caption. The dataset will undergo ongoing expansion and be released as an open-source initiative to advance research and applications in video generative models. Project page: this https URL

Title: Automated Measurement of Eczema Severity with Self-Supervised Learning

Authors: Neelesh Kumar, Oya Aran
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.15193
Pdf URL: https://arxiv.org/pdf/2504.15193
Copy Paste: [[2504.15193]] Automated Measurement of Eczema Severity with Self-Supervised Learning(https://arxiv.org/abs/2504.15193)
Keywords: extraction, transformer, segmentation
Abstract: Automated diagnosis of eczema using images acquired from digital camera can enable individuals to self-monitor their recovery. The process entails first segmenting out the eczema region from the image and then measuring the severity of eczema in the segmented region. The state-of-the-art methods for automated eczema diagnosis rely on deep neural networks such as convolutional neural network (CNN) and have shown impressive performance in accurately measuring the severity of eczema. However, these methods require massive volume of annotated data to train which can be hard to obtain. In this paper, we propose a self-supervised learning framework for automated eczema diagnosis under limited training data regime. Our framework consists of two stages: i) Segmentation, where we use an in-context learning based algorithm called SegGPT for few-shot segmentation of eczema region from the image; ii) Feature extraction and classification, where we extract DINO features from the segmented regions and feed it to a multi-layered perceptron (MLP) for 4-class classification of eczema severity. When evaluated on a dataset of annotated "in-the-wild" eczema images, we show that our method outperforms (Weighted F1: 0.67 $\pm$ 0.01) the state-of-the-art deep learning methods such as finetuned Resnet-18 (Weighted F1: 0.44 $\pm$ 0.16) and Vision Transformer (Weighted F1: 0.40 $\pm$ 0.22). Our results show that self-supervised learning can be a viable solution for automated skin diagnosis where labeled data is scarce.

Title: Extending the ElGamal Cryptosystem to the Third Group of Units of $\Z_{n}$

Authors: Jana Hamza, Mohammad EL Hindi, Seifeddine Kadri, Therrar Kadri, Yahya Awad
Subjects: cs.CR, cs.IT, math.GR, math.PR, math.RA
Abstract URL: https://arxiv.org/abs/2504.15202
Pdf URL: https://arxiv.org/pdf/2504.15202
Copy Paste: [[2504.15202]] Extending the ElGamal Cryptosystem to the Third Group of Units of $\Z_{n}$(https://arxiv.org/abs/2504.15202)
Keywords: secure, security
Abstract: In this paper, we extend the ElGamal cryptosystem to the third group of units of the ring $\Z_{n}$, which we prove to be more secure than the previous extensions. We describe the arithmetic needed in the new setting. We also provide some numerical simulations that shows the security and efficiency of our proposed cryptosystem.

Title: Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges

Authors: Nandan Thakur, Ronak Pradeep, Shivani Upadhyay, Daniel Campos, Nick Craswell, Jimmy Lin
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2504.15205
Pdf URL: https://arxiv.org/pdf/2504.15205
Copy Paste: [[2504.15205]] Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges(https://arxiv.org/abs/2504.15205)
Keywords: large language model
Abstract: Retrieval-augmented generation (RAG) enables large language models (LLMs) to generate answers with citations from source documents containing "ground truth", thereby reducing system hallucinations. A crucial factor in RAG evaluation is "support", whether the information in the cited documents supports the answer. To this end, we conducted a large-scale comparative study of 45 participant submissions on 36 topics to the TREC 2024 RAG Track, comparing an automatic LLM judge (GPT-4o) against human judges for support assessment. We considered two conditions: (1) fully manual assessments from scratch and (2) manual assessments with post-editing of LLM predictions. Our results indicate that for 56% of the manual from-scratch assessments, human and GPT-4o predictions match perfectly (on a three-level scale), increasing to 72% in the manual with post-editing condition. Furthermore, by carefully analyzing the disagreements in an unbiased study, we found that an independent human judge correlates better with GPT-4o than a human judge, suggesting that LLM judges can be a reliable alternative for support assessment. To conclude, we provide a qualitative analysis of human and GPT-4o errors to help guide future iterations of support assessment.

Title: How Global Calibration Strengthens Multiaccuracy

Authors: Sílvia Casacuberta, Parikshit Gopalan, Varun Kanade, Omer Reingold
Subjects: cs.LG, cs.CC
Abstract URL: https://arxiv.org/abs/2504.15206
Pdf URL: https://arxiv.org/pdf/2504.15206
Copy Paste: [[2504.15206]] How Global Calibration Strengthens Multiaccuracy(https://arxiv.org/abs/2504.15206)
Keywords: fair
Abstract: Multiaccuracy and multicalibration are multigroup fairness notions for prediction that have found numerous applications in learning and computational complexity. They can be achieved from a single learning primitive: weak agnostic learning. Here we investigate the power of multiaccuracy as a learning primitive, both with and without the additional assumption of calibration. We find that multiaccuracy in itself is rather weak, but that the addition of global calibration (this notion is called calibrated multiaccuracy) boosts its power substantially, enough to recover implications that were previously known only assuming the stronger notion of multicalibration. We give evidence that multiaccuracy might not be as powerful as standard weak agnostic learning, by showing that there is no way to post-process a multiaccurate predictor to get a weak learner, even assuming the best hypothesis has correlation $1/2$. Rather, we show that it yields a restricted form of weak agnostic learning, which requires some concept in the class to have correlation greater than $1/2$ with the labels. However, by also requiring the predictor to be calibrated, we recover not just weak, but strong agnostic learning. A similar picture emerges when we consider the derivation of hardcore measures from predictors satisfying multigroup fairness notions. On the one hand, while multiaccuracy only yields hardcore measures of density half the optimal, we show that (a weighted version of) calibrated multiaccuracy achieves optimal density. Our results yield new insights into the complementary roles played by multiaccuracy and calibration in each setting. They shed light on why multiaccuracy and global calibration, although not particularly powerful by themselves, together yield considerably stronger notions.

Title: Compute-Optimal LLMs Provably Generalize Better With Scale

Authors: Marc Finzi, Sanyam Kapoor, Diego Granziol, Anming Gu, Christopher De Sa, J. Zico Kolter, Andrew Gordon Wilson
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.15208
Pdf URL: https://arxiv.org/pdf/2504.15208
Copy Paste: [[2504.15208]] Compute-Optimal LLMs Provably Generalize Better With Scale(https://arxiv.org/abs/2504.15208)
Keywords: large language model
Abstract: Why do larger language models generalize better? To investigate this question, we develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal regime, as described by the Chinchilla scaling laws. We introduce a novel, fully empirical Freedman-type martingale concentration inequality that tightens existing bounds by accounting for the variance of the loss function. This generalization bound can be decomposed into three interpretable components: the number of parameters per token, the loss variance, and the quantization error at a fixed bitrate. As compute-optimal language models are scaled up, the number of parameters per data point remains constant; however, both the loss variance and the quantization error decrease, implying that larger models should have smaller generalization gaps. We examine why larger models tend to be more quantizable from an information theoretic perspective, showing that the rate at which they can integrate new information grows more slowly than their capacity on the compute-optimal frontier. From these findings we produce a scaling law for the generalization gap, with bounds that become predictably stronger with scale.

Title: A Causal Convolutional Low-rank Representation Model for Imputation of Water Quality Data

Authors: Xin Liao, Bing Yang, Tan Dongli, Cai Yu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.15209
Pdf URL: https://arxiv.org/pdf/2504.15209
Copy Paste: [[2504.15209]] A Causal Convolutional Low-rank Representation Model for Imputation of Water Quality Data(https://arxiv.org/abs/2504.15209)
Keywords: protect
Abstract: The monitoring of water quality is a crucial part of environmental protection, and a large number of monitors are widely deployed to monitor water quality. Due to unavoidable factors such as data acquisition breakdowns, sensors and communication failures, water quality monitoring data suffers from missing values over time, resulting in High-Dimensional and Sparse (HDS) Water Quality Data (WQD). The simple and rough filling of the missing values leads to inaccurate results and affects the implementation of relevant measures. Therefore, this paper proposes a Causal convolutional Low-rank Representation (CLR) model for imputing missing WQD to improve the completeness of the WQD, which employs a two-fold idea: a) applying causal convolutional operation to consider the temporal dependence of the low-rank representation, thus incorporating temporal information to improve the imputation accuracy; and b) implementing a hyperparameters adaptation scheme to automatically adjust the best hyperparameters during model training, thereby reducing the tedious manual adjustment of hyper-parameters. Experimental studies on three real-world water quality datasets demonstrate that the proposed CLR model is superior to some of the existing state-of-the-art imputation models in terms of imputation accuracy and time cost, as well as indicating that the proposed model provides more reliable decision support for environmental monitoring.

Title: EvalAgent: Discovering Implicit Evaluation Criteria from the Web

Authors: Manya Wadhwa, Zayne Sprague, Chaitanya Malaviya, Philippe Laban, Junyi Jessy Li, Greg Durrett
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.15219
Pdf URL: https://arxiv.org/pdf/2504.15219
Copy Paste: [[2504.15219]] EvalAgent: Discovering Implicit Evaluation Criteria from the Web(https://arxiv.org/abs/2504.15219)
Keywords: large language model
Abstract: Evaluation of language model outputs on structured writing tasks is typically conducted with a number of desirable criteria presented to human evaluators or large language models (LLMs). For instance, on a prompt like "Help me draft an academic talk on coffee intake vs research productivity", a model response may be evaluated for criteria like accuracy and coherence. However, high-quality responses should do more than just satisfy basic task requirements. An effective response to this query should include quintessential features of an academic talk, such as a compelling opening, clear research questions, and a takeaway. To help identify these implicit criteria, we introduce EvalAgent, a novel framework designed to automatically uncover nuanced and task-specific criteria. EvalAgent first mines expert-authored online guidance. It then uses this evidence to propose diverse, long-tail evaluation criteria that are grounded in reliable external sources. Our experiments demonstrate that the grounded criteria produced by EvalAgent are often implicit (not directly stated in the user's prompt), yet specific (high degree of lexical precision). Further, EvalAgent criteria are often not satisfied by initial responses but they are actionable, such that responses can be refined to satisfy them. Finally, we show that combining LLM-generated and EvalAgent criteria uncovers more human-valued criteria than using LLMs alone.

Title: A Deep Learning Framework for Sequence Mining with Bidirectional LSTM and Multi-Scale Attention

Authors: Tao Yang, Yu Cheng, Yaokun Ren, Yujia Lou, Minggu Wei, Honghui Xin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.15223
Pdf URL: https://arxiv.org/pdf/2504.15223
Copy Paste: [[2504.15223]] A Deep Learning Framework for Sequence Mining with Bidirectional LSTM and Multi-Scale Attention(https://arxiv.org/abs/2504.15223)
Keywords: robust
Abstract: This paper addresses the challenges of mining latent patterns and modeling contextual dependencies in complex sequence data. A sequence pattern mining algorithm is proposed by integrating Bidirectional Long Short-Term Memory (BiLSTM) with a multi-scale attention mechanism. The BiLSTM captures both forward and backward dependencies in sequences, enhancing the model's ability to perceive global contextual structures. At the same time, the multi-scale attention module assigns adaptive weights to key feature regions under different window sizes. This improves the model's responsiveness to both local and global important information. Extensive experiments are conducted on a publicly available multivariate time series dataset. The proposed model is compared with several mainstream sequence modeling methods. Results show that it outperforms existing models in terms of accuracy, precision, and recall. This confirms the effectiveness and robustness of the proposed architecture in complex pattern recognition tasks. Further ablation studies and sensitivity analyses are carried out to investigate the effects of attention scale and input sequence length on model performance. These results provide empirical support for structural optimization of the model.

Title: A Review on Privacy in DAG-Based DLTs

Authors: Mayank Raikwar
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.15233
Pdf URL: https://arxiv.org/pdf/2504.15233
Copy Paste: [[2504.15233]] A Review on Privacy in DAG-Based DLTs(https://arxiv.org/abs/2504.15233)
Keywords: privacy
Abstract: Directed Acyclic Graph (DAG)-based Distributed Ledger Technologies (DLTs) have emerged as a promising solution to the scalability issues inherent in traditional blockchains. However, amidst the focus on scalability, the crucial aspect of privacy within DAG-based DLTs has been largely overlooked. This paper seeks to address this gap by providing a comprehensive examination of privacy notions and challenges within DAG-based DLTs. We delve into potential methodologies to enhance privacy within these systems, while also analyzing the associated hurdles and real-world implementations within state-of-the-art DAG-based DLTs. By exploring these methodologies, we not only illuminate the current landscape of privacy in DAG-based DLTs but also outline future research directions in this evolving field.

Title: Values in the Wild: Discovering and Analyzing Values in Real-World Language Model Interactions

Authors: Saffron Huang, Esin Durmus, Miles McCain, Kunal Handa, Alex Tamkin, Jerry Hong, Michael Stern, Arushi Somani, Xiuruo Zhang, Deep Ganguli
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2504.15236
Pdf URL: https://arxiv.org/pdf/2504.15236
Copy Paste: [[2504.15236]] Values in the Wild: Discovering and Analyzing Values in Real-World Language Model Interactions(https://arxiv.org/abs/2504.15236)
Keywords: privacy
Abstract: AI assistants can impart value judgments that shape people's decisions and worldviews, yet little is known empirically about what values these systems rely on in practice. To address this, we develop a bottom-up, privacy-preserving method to extract the values (normative considerations stated or demonstrated in model responses) that Claude 3 and 3.5 models exhibit in hundreds of thousands of real-world interactions. We empirically discover and taxonomize 3,307 AI values and study how they vary by context. We find that Claude expresses many practical and epistemic values, and typically supports prosocial human values while resisting values like "moral nihilism". While some values appear consistently across contexts (e.g. "transparency"), many are more specialized and context-dependent, reflecting the diversity of human interlocutors and their varied contexts. For example, "harm prevention" emerges when Claude resists users, "historical accuracy" when responding to queries about controversial events, "healthy boundaries" when asked for relationship advice, and "human agency" in technology ethics discussions. By providing the first large-scale empirical mapping of AI values in deployment, our work creates a foundation for more grounded evaluation and design of values in AI systems.

Title: Conformalized-KANs: Uncertainty Quantification with Coverage Guarantees for Kolmogorov-Arnold Networks (KANs) in Scientific Machine Learning

Authors: Amirhossein Mollaali, Christian Bolivar Moya, Amanda A. Howard, Alexander Heinlein, Panos Stinis, Guang Lin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.15240
Pdf URL: https://arxiv.org/pdf/2504.15240
Copy Paste: [[2504.15240]] Conformalized-KANs: Uncertainty Quantification with Coverage Guarantees for Kolmogorov-Arnold Networks (KANs) in Scientific Machine Learning(https://arxiv.org/abs/2504.15240)
Keywords: robust, interpretability
Abstract: This paper explores uncertainty quantification (UQ) methods in the context of Kolmogorov-Arnold Networks (KANs). We apply an ensemble approach to KANs to obtain a heuristic measure of UQ, enhancing interpretability and robustness in modeling complex functions. Building on this, we introduce Conformalized-KANs, which integrate conformal prediction, a distribution-free UQ technique, with KAN ensembles to generate calibrated prediction intervals with guaranteed coverage. Extensive numerical experiments are conducted to evaluate the effectiveness of these methods, focusing particularly on the robustness and accuracy of the prediction intervals under various hyperparameter settings. We show that the conformal KAN predictions can be applied to recent extensions of KANs, including Finite Basis KANs (FBKANs) and multifideilty KANs (MFKANs). The results demonstrate the potential of our approaches to improve the reliability and applicability of KANs in scientific machine learning.

Title: MR. Guard: Multilingual Reasoning Guardrail using Curriculum Learning

Authors: Yahan Yang, Soham Dan, Shuo Li, Dan Roth, Insup Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.15241
Pdf URL: https://arxiv.org/pdf/2504.15241
Copy Paste: [[2504.15241]] MR. Guard: Multilingual Reasoning Guardrail using Curriculum Learning(https://arxiv.org/abs/2504.15241)
Keywords: attack, large language model
Abstract: Large Language Models (LLMs) are susceptible to adversarial attacks such as jailbreaking, which can elicit harmful or unsafe behaviors. This vulnerability is exacerbated in multilingual setting, where multilingual safety-aligned data are often limited. Thus, developing a guardrail capable of detecting and filtering unsafe content across diverse languages is critical for deploying LLMs in real-world applications. In this work, we propose an approach to build a multilingual guardrail with reasoning. Our method consists of: (1) synthetic multilingual data generation incorporating culturally and linguistically nuanced variants, (2) supervised fine-tuning, and (3) a curriculum-guided Group Relative Policy Optimization (GRPO) framework that further improves performance. Experimental results demonstrate that our multilingual guardrail consistently outperforms recent baselines across both in-domain and out-of-domain languages. The multilingual reasoning capability of our guardrail enables it to generate multilingual explanations, which are particularly useful for understanding language-specific risks and ambiguities in multilingual content moderation.

Title: Single-loop Algorithms for Stochastic Non-convex Optimization with Weakly-Convex Constraints

Authors: Ming Yang, Gang Li, Quanqi Hu, Qihang Lin, Tianbao Yang
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2504.15243
Pdf URL: https://arxiv.org/pdf/2504.15243
Copy Paste: [[2504.15243]] Single-loop Algorithms for Stochastic Non-convex Optimization with Weakly-Convex Constraints(https://arxiv.org/abs/2504.15243)
Keywords: fair
Abstract: Constrained optimization with multiple functional inequality constraints has significant applications in machine learning. This paper examines a crucial subset of such problems where both the objective and constraint functions are weakly convex. Existing methods often face limitations, including slow convergence rates or reliance on double-loop algorithmic designs. To overcome these challenges, we introduce a novel single-loop penalty-based stochastic algorithm. Following the classical exact penalty method, our approach employs a {\bf hinge-based penalty}, which permits the use of a constant penalty parameter, enabling us to achieve a {\bf state-of-the-art complexity} for finding an approximate Karush-Kuhn-Tucker (KKT) solution. We further extend our algorithm to address finite-sum coupled compositional objectives, which are prevalent in artificial intelligence applications, establishing improved complexity over existing approaches. Finally, we validate our method through experiments on fair learning with receiver operating characteristic (ROC) fairness constraints and continual learning with non-forgetting constraints.

Title: A Refreshment Stirred, Not Shaken (III): Can Swapping Be Differentially Private?

Authors: James Bailie, Ruobin Gong, Xiao-Li Meng
Subjects: cs.CR, stat.OT
Abstract URL: https://arxiv.org/abs/2504.15246
Pdf URL: https://arxiv.org/pdf/2504.15246
Copy Paste: [[2504.15246]] A Refreshment Stirred, Not Shaken (III): Can Swapping Be Differentially Private?(https://arxiv.org/abs/2504.15246)
Keywords: privacy
Abstract: The quest for a precise and contextually grounded answer to the question in the present paper's title resulted in this stirred-not-shaken triptych, a phrase that reflects our desire to deepen the theoretical basis, broaden the practical applicability, and reduce the misperception of differential privacy (DP)$\unicode{x2014}$all without shaking its core foundations. Indeed, given the existence of more than 200 formulations of DP (and counting), before even attempting to answer the titular question one must first precisely specify what it actually means to be DP. Motivated by this observation, a theoretical investigation into DP's fundamental essence resulted in Part I of this trio, which introduces a five-building-block system explicating the who, where, what, how and how much aspects of DP. Instantiating this system in the context of the United States Decennial Census, Part II then demonstrates the broader applicability and relevance of DP by comparing a swapping strategy like that used in 2010 with the TopDown Algorithm$\unicode{x2014}$a DP method adopted in the 2020 Census. This paper provides nontechnical summaries of the preceding two parts as well as new discussion$\unicode{x2014}$for example, on how greater awareness of the five building blocks can thwart privacy theatrics; how our results bridging traditional SDC and DP allow a data custodian to reap the benefits of both these fields; how invariants impact disclosure risk; and how removing the implicit reliance on aleatoric uncertainty could lead to new generalizations of DP.

Title: Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators

Authors: Yilun Zhou, Austin Xu, Peifeng Wang, Caiming Xiong, Shafiq Joty
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.15253
Pdf URL: https://arxiv.org/pdf/2504.15253
Copy Paste: [[2504.15253]] Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators(https://arxiv.org/abs/2504.15253)
Keywords: generative, large language model
Abstract: Scaling test-time computation, or affording a generator large language model (LLM) extra compute during inference, typically employs the help of external non-generative evaluators (i.e., reward models). Concurrently, LLM-judges, models trained to generate evaluations and critiques (explanations) in natural language, are becoming increasingly popular in automatic evaluation. Despite judge empirical successes, their effectiveness as evaluators in test-time scaling settings is largely unknown. In this paper, we introduce the Judge Evaluation for Test-Time Scaling (JETTS) benchmark, which evaluates judge performance in three domains (math reasoning, code generation, and instruction following) under three task settings: response reranking, step-level beam search, and critique-based response refinement. We evaluate 10 different judge models (7B-70B parameters) for 8 different base generator models (6.7B-72B parameters). Our benchmark shows that while judges are competitive with outcome reward models in reranking, they are consistently worse than process reward models in beam search procedures. Furthermore, though unique to LLM-judges, their natural language critiques are currently ineffective in guiding the generator towards better responses.

Title: Bringing Diversity from Diffusion Models to Semantic-Guided Face Asset Generation

Authors: Yunxuan Cai, Sitao Xiang, Zongjian Li, Haiwei Chen, Yajie Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.15259
Pdf URL: https://arxiv.org/pdf/2504.15259
Copy Paste: [[2504.15259]] Bringing Diversity from Diffusion Models to Semantic-Guided Face Asset Generation(https://arxiv.org/abs/2504.15259)
Keywords: diffusion, generative
Abstract: Digital modeling and reconstruction of human faces serve various applications. However, its availability is often hindered by the requirements of data capturing devices, manual labor, and suitable actors. This situation restricts the diversity, expressiveness, and control over the resulting models. This work aims to demonstrate that a semantically controllable generative network can provide enhanced control over the digital face modeling process. To enhance diversity beyond the limited human faces scanned in a controlled setting, we introduce a novel data generation pipeline that creates a high-quality 3D face database using a pre-trained diffusion model. Our proposed normalization module converts synthesized data from the diffusion model into high-quality scanned data. Using the 44,000 face models we obtained, we further developed an efficient GAN-based generator. This generator accepts semantic attributes as input, and generates geometry and albedo. It also allows continuous post-editing of attributes in the latent space. Our asset refinement component subsequently creates physically-based facial assets. We introduce a comprehensive system designed for creating and editing high-quality face assets. Our proposed model has undergone extensive experiment, comparison and evaluation. We also integrate everything into a web-based interactive tool. We aim to make this tool publicly available with the release of the paper.

Title: Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction

Authors: Vaishnavh Nagarajan, Chen Henry Wu, Charles Ding, Aditi Raghunathan
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2504.15266
Pdf URL: https://arxiv.org/pdf/2504.15266
Copy Paste: [[2504.15266]] Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction(https://arxiv.org/abs/2504.15266)
Keywords: diffusion, transformer
Abstract: We design a suite of minimal algorithmic tasks that are a loose abstraction of open-ended real-world tasks. This allows us to cleanly and controllably quantify the creative limits of the present-day language model. Much like real-world tasks that require a creative, far-sighted leap of thought, our tasks require an implicit, open-ended stochastic planning step that either (a) discovers new connections in an abstract knowledge graph (like in wordplay, drawing analogies, or research) or (b) constructs new patterns (like in designing math problems or new proteins). In these tasks, we empirically and conceptually argue how next-token learning is myopic and memorizes excessively; comparatively, multi-token approaches, namely teacherless training and diffusion models, excel in producing diverse and original output. Secondly, in our tasks, we find that to elicit randomness from the Transformer without hurting coherence, it is better to inject noise right at the input layer (via a method we dub hash-conditioning) rather than defer to temperature sampling from the output layer. Thus, our work offers a principled, minimal test-bed for analyzing open-ended creative skills, and offers new arguments for going beyond next-token learning and softmax-based sampling. We make part of the code available under this https URL

Title: Diffusion Bridge Models for 3D Medical Image Translation

Authors: Shaorong Zhang, Tamoghna Chattopadhyay, Sophia I. Thomopoulos, Jose-Luis Ambite, Paul M. Thompson, Greg Ver Steeg
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.15267
Pdf URL: https://arxiv.org/pdf/2504.15267
Copy Paste: [[2504.15267]] Diffusion Bridge Models for 3D Medical Image Translation(https://arxiv.org/abs/2504.15267)
Keywords: diffusion
Abstract: Diffusion tensor imaging (DTI) provides crucial insights into the microstructure of the human brain, but it can be time-consuming to acquire compared to more readily available T1-weighted (T1w) magnetic resonance imaging (MRI). To address this challenge, we propose a diffusion bridge model for 3D brain image translation between T1w MRI and DTI modalities. Our model learns to generate high-quality DTI fractional anisotropy (FA) images from T1w images and vice versa, enabling cross-modality data augmentation and reducing the need for extensive DTI acquisition. We evaluate our approach using perceptual similarity, pixel-level agreement, and distributional consistency metrics, demonstrating strong performance in capturing anatomical structures and preserving information on white matter integrity. The practical utility of the synthetic data is validated through sex classification and Alzheimer's disease classification tasks, where the generated images achieve comparable performance to real data. Our diffusion bridge model offers a promising solution for improving neuroimaging datasets and supporting clinical decision-making, with the potential to significantly impact neuroimaging research and clinical practice.

Title: Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models

Authors: Guo Chen, Zhiqi Li, Shihao Wang, Jindong Jiang, Yicheng Liu, Lidong Lu, De-An Huang, Wonmin Byeon, Matthieu Le, Tuomas Rintamaki, Tyler Poon, Max Ehrlich, Tuomas Rintamaki, Tyler Poon, Tong Lu, Limin Wang, Bryan Catanzaro, Jan Kautz, Andrew Tao, Zhiding Yu, Guilin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.15271
Pdf URL: https://arxiv.org/pdf/2504.15271
Copy Paste: [[2504.15271]] Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models(https://arxiv.org/abs/2504.15271)
Keywords: robust
Abstract: We introduce Eagle 2.5, a family of frontier vision-language models (VLMs) for long-context multimodal learning. Our work addresses the challenges in long video comprehension and high-resolution image understanding, introducing a generalist framework for both tasks. The proposed training framework incorporates Automatic Degrade Sampling and Image Area Preservation, two techniques that preserve contextual integrity and visual details. The framework also includes numerous efficiency optimizations in the pipeline for long-context data training. Finally, we propose Eagle-Video-110K, a novel dataset that integrates both story-level and clip-level annotations, facilitating long-video understanding. Eagle 2.5 demonstrates substantial improvements on long-context multimodal benchmarks, providing a robust solution to the limitations of existing VLMs. Notably, our best model Eagle 2.5-8B achieves 72.4% on Video-MME with 512 input frames, matching the results of top-tier commercial model such as GPT-4o and large-scale open-source models like Qwen2.5-VL-72B and InternVL2.5-78B.

Title: VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models

Authors: Weiye Xu, Jiahao Wang, Weiyun Wang, Zhe Chen, Wengang Zhou, Aijun Yang, Lewei Lu, Houqiang Li, Xiaohua Wang, Xizhou Zhu, Wenhai Wang, Jifeng Dai, Jinguo Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.15279
Pdf URL: https://arxiv.org/pdf/2504.15279
Copy Paste: [[2504.15279]] VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models(https://arxiv.org/abs/2504.15279)
Keywords: large language model
Abstract: Visual reasoning is a core component of human intelligence and a critical capability for advanced multimodal models. Yet current reasoning evaluations of multimodal large language models (MLLMs) often rely on text descriptions and allow language-based reasoning shortcuts, failing to measure genuine vision-centric reasoning. To address this, we introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories (e.g., quantitative shifts, spatial relations, attribute comparisons). These various types of questions can be evaluated to assess the visual reasoning capabilities of MLLMs from multiple perspectives. We evaluate leading MLLMs on this benchmark and analyze their results to identify common failure modes. Most models score below 30% accuracy-only slightly above the 25% random baseline and far below the 51.4% achieved by humans-revealing significant gaps in visual reasoning. Furthermore, we provide a supplementary training dataset and a reinforcement-learning baseline to support further progress.

Title: Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs

Authors: Chun-Hsiao Yeh, Chenyu Wang, Shengbang Tong, Ta-Ying Cheng, Rouyu Wang, Tianzhe Chu, Yuexiang Zhai, Yubei Chen, Shenghua Gao, Yi Ma
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2504.15280
Pdf URL: https://arxiv.org/pdf/2504.15280
Copy Paste: [[2504.15280]] Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs(https://arxiv.org/abs/2504.15280)
Keywords: large language model
Abstract: Multi-view understanding, the ability to reconcile visual information across diverse viewpoints for effective navigation, manipulation, and 3D scene comprehension, is a fundamental challenge in Multi-Modal Large Language Models (MLLMs) to be used as embodied agents. While recent MLLMs have shown impressive advances in high-level reasoning and planning, they frequently fall short when confronted with multi-view geometric consistency and cross-view correspondence. To comprehensively evaluate the challenges of MLLMs in multi-view scene reasoning, we propose All-Angles Bench, a benchmark of over 2,100 human carefully annotated multi-view question-answer pairs across 90 diverse real-world scenes. Our six tasks (counting, attribute identification, relative distance, relative direction, object manipulation, and camera pose estimation) specifically test model's geometric correspondence and the capacity to align information consistently across views. Our extensive experiments, benchmark on 27 representative MLLMs including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o against human evaluators reveals a substantial performance gap, indicating that current MLLMs remain far from human-level proficiency. Through in-depth analysis, we show that MLLMs are particularly underperforming under two aspects: (1) cross-view correspondence for partially occluded views and (2) establishing the coarse camera poses. These findings highlight the necessity of domain-specific refinements or modules that embed stronger multi-view awareness. We believe that our All-Angles Bench offers valuable insights and contribute to bridging the gap between MLLMs and human-level multi-view understanding. The project and benchmark are publicly available at this https URL.

Title: StyleMe3D: Stylization with Disentangled Priors by Multiple Encoders on 3D Gaussians

Authors: Cailin Zhuang, Yaoqi Hu, Xuanyang Zhang, Wei Cheng, Jiacheng Bao, Shengqi Liu, Yiying Yang, Xianfang Zeng, Gang Yu, Ming Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.15281
Pdf URL: https://arxiv.org/pdf/2504.15281
Copy Paste: [[2504.15281]] StyleMe3D: Stylization with Disentangled Priors by Multiple Encoders on 3D Gaussians(https://arxiv.org/abs/2504.15281)
Keywords: diffusion
Abstract: 3D Gaussian Splatting (3DGS) excels in photorealistic scene reconstruction but struggles with stylized scenarios (e.g., cartoons, games) due to fragmented textures, semantic misalignment, and limited adaptability to abstract aesthetics. We propose StyleMe3D, a holistic framework for 3D GS style transfer that integrates multi-modal style conditioning, multi-level semantic alignment, and perceptual quality enhancement. Our key insights include: (1) optimizing only RGB attributes preserves geometric integrity during stylization; (2) disentangling low-, medium-, and high-level semantics is critical for coherent style transfer; (3) scalability across isolated objects and complex scenes is essential for practical deployment. StyleMe3D introduces four novel components: Dynamic Style Score Distillation (DSSD), leveraging Stable Diffusion's latent space for semantic alignment; Contrastive Style Descriptor (CSD) for localized, content-aware texture transfer; Simultaneously Optimized Scale (SOS) to decouple style details and structural coherence; and 3D Gaussian Quality Assessment (3DG-QA), a differentiable aesthetic prior trained on human-rated data to suppress artifacts and enhance visual harmony. Evaluated on NeRF synthetic dataset (objects) and tandt db (scenes) datasets, StyleMe3D outperforms state-of-the-art methods in preserving geometric details (e.g., carvings on sculptures) and ensuring stylistic consistency across scenes (e.g., coherent lighting in landscapes), while maintaining real-time rendering. This work bridges photorealistic 3D GS and artistic stylization, unlocking applications in gaming, virtual worlds, and digital art.