2025-02-24

Title: KKA: Improving Vision Anomaly Detection through Anomaly-related Knowledge from Large Language Models

Authors: Dong Chen, Zhengqing Hu, Peiguang Fan, Yueting Zhuang, Yafei Li, Qidong Liu, Xiaoheng Jiang, Mingliang Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14880
Pdf URL: https://arxiv.org/pdf/2502.14880
Copy Paste: [[2502.14880]] KKA: Improving Vision Anomaly Detection through Anomaly-related Knowledge from Large Language Models(https://arxiv.org/abs/2502.14880)
Keywords: large language model
Abstract: Vision anomaly detection, particularly in unsupervised settings, often struggles to distinguish between normal samples and anomalies due to the wide variability in anomalies. Recently, an increasing number of studies have focused on generating anomalies to help detectors learn more effective boundaries between normal samples and anomalies. However, as the generated anomalies are often derived from random factors, they frequently lack realism. Additionally, randomly generated anomalies typically offer limited support in constructing effective boundaries, as most differ substantially from normal samples and lie far from the boundary. To address these challenges, we propose Key Knowledge Augmentation (KKA), a method that extracts anomaly-related knowledge from large language models (LLMs). More specifically, KKA leverages the extensive prior knowledge of LLMs to generate meaningful anomalies based on normal samples. Then, KKA classifies the generated anomalies as easy anomalies and hard anomalies according to their similarity to normal samples. Easy anomalies exhibit significant differences from normal samples, whereas hard anomalies closely resemble normal samples. KKA iteratively updates the generated anomalies, gradually increasing the proportion of hard anomalies to enable the detector to learn a more effective boundary. Experimental results show that the proposed method significantly improves the performance of various vision anomaly detectors while maintaining low generation costs. The code for CMG can be found at this https URL.
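
The easy/hard split and the growing share of hard anomalies described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the cosine-similarity criterion, the `threshold` value, and the linear schedule in `hard_ratio` are all assumptions chosen for illustration, and the two-dimensional vectors stand in for real image embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def split_anomalies(normal_center, anomalies, threshold=0.9):
    """Split generated anomalies into (easy, hard).

    An anomaly is 'hard' when it is close to the normal samples
    (similarity >= threshold) and 'easy' otherwise. The threshold
    is a hypothetical value, not taken from the paper.
    """
    easy, hard = [], []
    for emb in anomalies:
        if cosine(normal_center, emb) >= threshold:
            hard.append(emb)
        else:
            easy.append(emb)
    return easy, hard


def hard_ratio(round_idx, total_rounds, start=0.2, end=0.8):
    """Target share of hard anomalies, rising linearly over rounds (assumed schedule)."""
    if total_rounds <= 1:
        return end
    t = round_idx / (total_rounds - 1)
    return start + t * (end - start)


# Toy embeddings: the center of the normal samples and four generated anomalies.
center = [1.0, 0.0]
generated = [[0.99, 0.05], [0.0, 1.0], [0.9, 0.3], [-1.0, 0.2]]
easy, hard = split_anomalies(center, generated)
schedule = [hard_ratio(i, 5) for i in range(5)]
```

In a training loop, each round would sample from `hard` and `easy` according to `schedule`, so the detector sees progressively more near-boundary examples.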

Title: A Survey of Safety on Large Vision-Language Models: Attacks, Defenses and Evaluations

Authors: Mang Ye, Xuankun Rong, Wenke Huang, Bo Du, Nenghai Yu, Dacheng Tao
Subjects: cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2502.14881
Pdf URL: https://arxiv.org/pdf/2502.14881
Copy Paste: [[2502.14881]] A Survey of Safety on Large Vision-Language Models: Attacks, Defenses and Evaluations(https://arxiv.org/abs/2502.14881)
Keywords: secure, security, defense, attack, robust
Abstract: With the rapid advancement of Large Vision-Language Models (LVLMs), ensuring their safety has emerged as a crucial area of research. This survey provides a comprehensive analysis of LVLM safety, covering key aspects such as attacks, defenses, and evaluation methods. We introduce a unified framework that integrates these interrelated components, offering a holistic perspective on the vulnerabilities of LVLMs and the corresponding mitigation strategies. Through an analysis of the LVLM lifecycle, we introduce a classification framework that distinguishes between inference and training phases, with further subcategories to provide deeper insights. Furthermore, we highlight limitations in existing research and outline future directions aimed at strengthening the robustness of LVLMs. As part of our research, we conduct a set of safety evaluations on the latest LVLM, Deepseek Janus-Pro, and provide a theoretical analysis of the results. Our findings provide strategic recommendations for advancing LVLM safety and ensuring their secure and reliable deployment in high-stakes, real-world applications. This survey aims to serve as a cornerstone for future research, facilitating the development of models that not only push the boundaries of multimodal intelligence but also adhere to the highest standards of security and ethical integrity. Furthermore, to aid the growing research in this field, we have created a public repository to continuously compile and update the latest work on LVLM safety: this https URL.

Title: From 16-Bit to 1-Bit: Visual KV Cache Quantization for Memory-Efficient Multimodal Large Language Models

Authors: Zeliang Zhang, Yifan Zhu, Susan Liang, Zhiyuan Wang, Jiani Liu, Haiting Lin, Mingjie Zhao, Chenliang Xu, Kun Wan, Wentian Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.14882
Pdf URL: https://arxiv.org/pdf/2502.14882
Copy Paste: [[2502.14882]] From 16-Bit to 1-Bit: Visual KV Cache Quantization for Memory-Efficient Multimodal Large Language Models(https://arxiv.org/abs/2502.14882)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success across various applications, yet their computational overhead during deployment remains a critical challenge. While Key-Value (KV) caching improves inference efficiency by trading memory for computation, the growing memory footprint from storing extensive KV caches reduces throughput and limits long-term execution on devices with constrained GPU memory. Existing approaches primarily focus on dropping unimportant tokens to reduce the KV cache size, mitigating memory constraints at the cost of potential information loss. In contrast, we propose a simple yet effective visual quantization strategy that preserves all visual tokens while significantly reducing memory consumption. To achieve an extreme quantization ratio, i.e., 1-bit quantization, we propose group-specific quantization and quantile-based quantization approaches, motivated by the inherent patterns of the KV cache. Our method is plug-and-play, enabling seamless integration into various MLLMs to improve memory efficiency without architectural modifications. Extensive experiments demonstrate that our approach effectively reduces memory overhead while maintaining computational efficiency and preserving multimodal performance.
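
To make the idea of group-wise, quantile-based 1-bit quantization concrete, here is a minimal sketch. The abstract does not define the groups or the quantile used, so the fixed-size groups, the median threshold, and the per-group mean reconstruction levels below are assumptions rather than the paper's method.

```python
import statistics


def quantize_group_1bit(values):
    """Quantize one group to 1 bit per value.

    The threshold is the group's median (the 0.5 quantile, an assumed choice).
    Two floats per group are kept as reconstruction levels.
    """
    threshold = statistics.median(values)
    bits = [1 if v > threshold else 0 for v in values]
    low = [v for v, b in zip(values, bits) if b == 0]
    high = [v for v, b in zip(values, bits) if b == 1]
    low_level = statistics.fmean(low) if low else threshold
    high_level = statistics.fmean(high) if high else threshold
    return bits, low_level, high_level


def quantize_1bit(vector, group_size):
    """Split a cache row into fixed-size groups and quantize each independently."""
    groups = [vector[i:i + group_size] for i in range(0, len(vector), group_size)]
    return [quantize_group_1bit(g) for g in groups]


def dequantize(packed):
    """Rebuild an approximate row from bits and per-group levels."""
    out = []
    for bits, low_level, high_level in packed:
        out.extend(high_level if b else low_level for b in bits)
    return out


# Two groups with very different value ranges, as a toy stand-in for a KV cache row.
cache_row = [0.1, 0.2, 0.9, 1.1, -5.0, -4.0, 4.0, 6.0]
packed = quantize_1bit(cache_row, group_size=4)
restored = dequantize(packed)
```

Giving each group its own levels keeps a small-magnitude group from being flattened by a large-magnitude one, which is the usual motivation for group-specific quantization.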

Title: SEM-CLIP: Precise Few-Shot Learning for Nanoscale Defect Detection in Scanning Electron Microscope Image

Authors: Qian Jin, Yuqi Jiang, Xudong Lu, Yumeng Liu, Yining Chen, Dawei Gao, Qi Sun, Cheng Zhuo
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2502.14884
Pdf URL: https://arxiv.org/pdf/2502.14884
Copy Paste: [[2502.14884]] SEM-CLIP: Precise Few-Shot Learning for Nanoscale Defect Detection in Scanning Electron Microscope Image(https://arxiv.org/abs/2502.14884)
Keywords: segmentation
Abstract: In the field of integrated circuit manufacturing, the detection and classification of nanoscale wafer defects are critical for subsequent root cause analysis and yield enhancement. The complex background patterns observed in scanning electron microscope (SEM) images and the diverse textures of the defects pose significant challenges. Traditional methods usually suffer from insufficient data and labels, as well as poor transferability. In this paper, we propose a novel few-shot learning approach, SEM-CLIP, for accurate defect classification and segmentation. SEM-CLIP customizes the Contrastive Language-Image Pretraining (CLIP) model to better focus on defect areas and minimize background distractions, thereby enhancing segmentation accuracy. We employ text prompts enriched with domain knowledge as prior information to assist in precise analysis. Additionally, our approach incorporates feature engineering with textual guidance to categorize defects more effectively. SEM-CLIP requires little annotated data, substantially reducing labor demands in the semiconductor industry. Extensive experimental validation demonstrates that our model achieves impressive classification and segmentation results under few-shot learning scenarios.

Title: Surgical Scene Understanding in the Era of Foundation AI Models: A Comprehensive Review

Authors: Ufaq Khan, Umair Nawaz, Adnan Qayyum, Shazad Ashraf, Muhammad Bilal, Junaid Qadir
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.14886
Pdf URL: https://arxiv.org/pdf/2502.14886
Copy Paste: [[2502.14886]] Surgical Scene Understanding in the Era of Foundation AI Models: A Comprehensive Review(https://arxiv.org/abs/2502.14886)
Keywords: transformer, segmentation
Abstract: Recent advancements in machine learning (ML) and deep learning (DL), particularly through the introduction of foundational models (FMs), have significantly enhanced surgical scene understanding within minimally invasive surgery (MIS). This paper surveys the integration of state-of-the-art ML and DL technologies, including Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and foundational models like the Segment Anything Model (SAM), into surgical workflows. These technologies improve segmentation accuracy, instrument tracking, and phase recognition in surgical endoscopic video analysis. The paper explores the challenges these technologies face, such as data variability and computational demands, and discusses ethical considerations and integration hurdles in clinical settings. Highlighting the roles of FMs, we bridge the technological capabilities with clinical needs and outline future research directions to enhance the adaptability, efficiency, and ethical alignment of AI applications in surgery. Our findings suggest that substantial progress has been made; however, more focused efforts are required to achieve seamless integration of these technologies into clinical workflows, ensuring they complement surgical practice by enhancing precision, reducing risks, and optimizing patient outcomes.

Title: Vision-Enhanced Time Series Forecasting via Latent Diffusion Models

Authors: Weilin Ruan, Siru Zhong, Haomin Wen, Yuxuan Liang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14887
Pdf URL: https://arxiv.org/pdf/2502.14887
Copy Paste: [[2502.14887]] Vision-Enhanced Time Series Forecasting via Latent Diffusion Models(https://arxiv.org/abs/2502.14887)
Keywords: extraction, diffusion
Abstract: Diffusion models have recently emerged as powerful frameworks for generating high-quality images. While recent studies have explored their application to time series forecasting, these approaches face significant challenges in cross-modal modeling and transforming visual information effectively to capture temporal patterns. In this paper, we propose LDM4TS, a novel framework that leverages the powerful image reconstruction capabilities of latent diffusion models for vision-enhanced time series forecasting. Instead of introducing external visual data, we are the first to use complementary transformation techniques to convert time series into multi-view visual representations, allowing the model to exploit the rich feature extraction capabilities of the pre-trained vision encoder. Subsequently, these representations are reconstructed using a latent diffusion model with a cross-modal conditioning mechanism as well as a fusion module. Experimental results demonstrate that LDM4TS outperforms various specialized forecasting models for time series forecasting tasks.

Title: The Multi-Faceted Monosemanticity in Multimodal Representations

Authors: Hanqi Yan, Xiangxiang Cui, Lu Yin, Paul Pu Liang, Yulan He, Yifei Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14888
Pdf URL: https://arxiv.org/pdf/2502.14888
Copy Paste: [[2502.14888]] The Multi-Faceted Monosemanticity in Multimodal Representations(https://arxiv.org/abs/2502.14888)
Keywords: defense, attack, interpretability
Abstract: In this paper, we leverage recent advancements in feature monosemanticity to extract interpretable features from deep multimodal models, offering a data-driven understanding of modality gaps. Specifically, we investigate CLIP (Contrastive Language-Image Pretraining), a prominent visual-language representation model trained on extensive image-text pairs. Building upon interpretability tools developed for single-modal models, we extend these methodologies to assess multi-modal interpretability of CLIP features. Additionally, we introduce the Modality Dominance Score (MDS) to attribute the interpretability of each feature to its respective modality. Next, we transform CLIP features into a more interpretable space, enabling us to categorize them into three distinct classes: vision features (single-modal), language features (single-modal), and visual-language features (cross-modal). Our findings reveal that this categorization aligns closely with human cognitive understandings of different modalities. We also demonstrate significant use cases of these modality-specific features, including gender bias detection, adversarial attack defense, and text-to-image model editing. These results indicate that large-scale multimodal models, equipped with task-agnostic interpretability tools, offer valuable insights into key connections and distinctions between different modalities.

Title: Narrowing Information Bottleneck Theory for Multimodal Image-Text Representations Interpretability

Authors: Zhiyu Zhu, Zhibo Jin, Jiayu Zhang, Nan Yang, Jiahao Huang, Jianlong Zhou, Fang Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14889
Pdf URL: https://arxiv.org/pdf/2502.14889
Copy Paste: [[2502.14889]] Narrowing Information Bottleneck Theory for Multimodal Image-Text Representations Interpretability(https://arxiv.org/abs/2502.14889)
Keywords: robust, interpretability
Abstract: The task of identifying multimodal image-text representations has garnered increasing attention, particularly with models such as CLIP (Contrastive Language-Image Pretraining), which demonstrate exceptional performance in learning complex associations between images and text. Despite these advancements, ensuring the interpretability of such models is paramount for their safe deployment in real-world applications, such as healthcare. While numerous interpretability methods have been developed for unimodal tasks, these approaches often fail to transfer effectively to multimodal contexts due to inherent differences in the representation structures. Bottleneck methods, well-established in information theory, have been applied to enhance CLIP's interpretability. However, they are often hindered by strong assumptions or intrinsic randomness. To overcome these challenges, we propose the Narrowing Information Bottleneck Theory, a novel framework that fundamentally redefines the traditional bottleneck approach. This theory is specifically designed to satisfy contemporary attribution axioms, providing a more robust and reliable solution for improving the interpretability of multimodal models. In our experiments, compared to state-of-the-art methods, our approach enhances image interpretability by an average of 9%, text interpretability by an average of 58.83%, and accelerates processing speed by 63.95%. Our code is publicly accessible at this https URL.

Title: WeedVision: Multi-Stage Growth and Classification of Weeds using DETR and RetinaNet for Precision Agriculture

Authors: Taminul Islam, Toqi Tahamid Sarker, Khaled R Ahmed, Cristiana Bernardi Rankrape, Karla Gage
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.14890
Pdf URL: https://arxiv.org/pdf/2502.14890
Copy Paste: [[2502.14890]] WeedVision: Multi-Stage Growth and Classification of Weeds using DETR and RetinaNet for Precision Agriculture(https://arxiv.org/abs/2502.14890)
Keywords: robust, transformer
Abstract: Weed management remains a critical challenge in agriculture, where weeds compete with crops for essential resources, leading to significant yield losses. Accurate detection of weeds at various growth stages is crucial for effective management yet challenging for farmers, as it requires identifying different species at multiple growth phases. This research addresses these challenges by utilizing advanced object detection models, specifically, the Detection Transformer (DETR) with a ResNet50 backbone and RetinaNet with a ResNeXt101 backbone, to identify and classify 16 weed species of economic concern across 174 classes, spanning their 11-week growth stages from seedling to maturity. A robust dataset comprising 203,567 images was developed, meticulously labeled by species and growth stage. The models were rigorously trained and evaluated, with RetinaNet demonstrating superior performance, achieving a mean Average Precision (mAP) of 0.907 on the training set and 0.904 on the test set, compared to DETR's mAP of 0.854 and 0.840, respectively. RetinaNet also outperformed DETR in recall and in inference speed (7.28 FPS), making it more suitable for real-time applications. Both models showed improved accuracy as plants matured. This research provides crucial insights for developing precise, sustainable, and automated weed management strategies, paving the way for real-time, species-specific detection systems and advancing AI-assisted agriculture through continued innovation in model development and early detection accuracy.

Title: CoDiff: Conditional Diffusion Model for Collaborative 3D Object Detection

Authors: Zhe Huang, Shuo Wang, Yongcai Wang, Lei Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14891
Pdf URL: https://arxiv.org/pdf/2502.14891
Copy Paste: [[2502.14891]] CoDiff: Conditional Diffusion Model for Collaborative 3D Object Detection(https://arxiv.org/abs/2502.14891)
Keywords: robust, diffusion
Abstract: Collaborative 3D object detection holds significant importance in the field of autonomous driving, as it greatly enhances the perception capabilities of each individual agent by facilitating information exchange among multiple agents. However, in practice, due to pose estimation errors and time delays, the fusion of information across agents often results in feature representations with spatial and temporal noise, leading to detection errors. Diffusion models naturally have the ability to denoise noisy samples to the ideal data, which motivates us to explore the use of diffusion models to address the noise problem between multi-agent systems. In this work, we propose CoDiff, a novel robust collaborative perception framework that leverages the potential of diffusion models to generate more comprehensive and clearer feature representations. To the best of our knowledge, this is the first work to apply diffusion models to multi-agent collaborative perception. Specifically, we project high-dimensional feature maps into the latent space of a powerful pre-trained autoencoder. Within this space, individual agent information serves as a condition to guide the diffusion model's sampling. This process denoises coarse feature maps and progressively refines the fused features. Experimental studies on both simulated and real-world datasets demonstrate that the proposed framework CoDiff consistently outperforms existing relevant methods in terms of collaborative object detection performance, and exhibits highly desirable robustness when the pose and delay information of agents contains high-level noise.

Title: NOTA: Multimodal Music Notation Understanding for Visual Large Language Model

Authors: Mingni Tang, Jiajia Li, Lu Yang, Zhiqiang Zhang, Jinghao Tian, Zuchao Li, Lefei Zhang, Ping Wang
Subjects: cs.CV, cs.AI, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2502.14893
Pdf URL: https://arxiv.org/pdf/2502.14893
Copy Paste: [[2502.14893]] NOTA: Multimodal Music Notation Understanding for Visual Large Language Model(https://arxiv.org/abs/2502.14893)
Keywords: extraction, large language model
Abstract: Symbolic music is represented in two distinct forms: two-dimensional, visually intuitive score images, and one-dimensional, standardized text annotation sequences. While large language models have shown extraordinary potential in music, current research has primarily focused on unimodal symbol sequence text. Existing general-domain visual language models still lack the ability to understand music notation. Recognizing this gap, we propose NOTA, the first large-scale comprehensive multimodal music notation dataset. It consists of 1,019,237 records, from 3 regions of the world, and contains 3 tasks. Based on the dataset, we trained NotaGPT, a music notation visual large language model. Specifically, we include a pre-alignment training phase for cross-modal alignment between the musical notes depicted in music score images and their textual representation in ABC notation. Subsequent training phases focus on foundational music information extraction, followed by training on music notation analysis. Experimental results demonstrate that our NotaGPT-7B achieves significant improvement on music understanding, showcasing the effectiveness of NOTA and the training pipeline. Our datasets are open-sourced at this https URL.

Title: FOCUS on Contamination: A Geospatial Deep Learning Framework with a Noise-Aware Loss for Surface Water PFAS Prediction

Authors: Jowaria Khan, Alexa Friedman, Sydney Evans, Runzi Wang, Kaley Beins, David Andrews, Elizabeth Bondi-Kelly
Subjects: cs.CV, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2502.14894
Pdf URL: https://arxiv.org/pdf/2502.14894
Copy Paste: [[2502.14894]] FOCUS on Contamination: A Geospatial Deep Learning Framework with a Noise-Aware Loss for Surface Water PFAS Prediction(https://arxiv.org/abs/2502.14894)
Keywords: protect, segmentation
Abstract: Per- and polyfluoroalkyl substances (PFAS), chemicals found in products like non-stick cookware, are unfortunately persistent environmental pollutants with severe health risks. Accurately mapping PFAS contamination is crucial for guiding targeted remediation efforts and protecting public and environmental health, yet detection across large regions remains challenging due to the cost of testing and the difficulty of simulating their spread. In this work, we introduce FOCUS, a geospatial deep learning framework with a label noise-aware loss function, to predict PFAS contamination in surface water over large regions. By integrating hydrological flow data, land cover information, and proximity to known PFAS sources, our approach leverages both spatial and environmental context to improve prediction accuracy. We evaluate the performance of our approach through extensive ablation studies and comparative analyses against baselines like sparse segmentation, as well as existing scientific methods, including Kriging and pollutant transport simulations. Results highlight our framework's potential for scalable PFAS monitoring.

Title: A Comprehensive Survey on Concept Erasure in Text-to-Image Diffusion Models

Authors: Changhoon Kim, Yanjun Qi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14896
Pdf URL: https://arxiv.org/pdf/2502.14896
Copy Paste: [[2502.14896]] A Comprehensive Survey on Concept Erasure in Text-to-Image Diffusion Models(https://arxiv.org/abs/2502.14896)
Keywords: defense, attack, robust, diffusion
Abstract: Text-to-Image (T2I) models have made remarkable progress in generating high-quality, diverse visual content from natural language prompts. However, their ability to reproduce copyrighted styles, sensitive imagery, and harmful content raises significant ethical and legal concerns. Concept erasure offers a proactive alternative to external filtering by modifying T2I models to prevent the generation of undesired content. In this survey, we provide a structured overview of concept erasure, categorizing existing methods based on their optimization strategies and the architectural components they modify. We categorize concept erasure methods into fine-tuning for parameter updates, closed-form solutions for efficient edits, and inference-time interventions for content restriction without weight modification. Additionally, we explore adversarial attacks that bypass erasure techniques and discuss emerging defenses. To support further research, we consolidate key datasets, evaluation metrics, and benchmarks for assessing erasure effectiveness and model robustness. This survey serves as a comprehensive resource, offering insights into the evolving landscape of concept erasure, its challenges, and future directions.

Title: Retrieval-augmented systems can be dangerous medical communicators

Authors: Lionel Wong, Ayman Ali, Raymond Xiong, Shannon Zeijang Shen, Yoon Kim, Monica Agrawal
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2502.14898
Pdf URL: https://arxiv.org/pdf/2502.14898
Copy Paste: [[2502.14898]] Retrieval-augmented systems can be dangerous medical communicators(https://arxiv.org/abs/2502.14898)
Keywords: generative
Abstract: Patients have long sought health information online, and increasingly, they are turning to generative AI to answer their health-related queries. Given the high stakes of the medical domain, techniques like retrieval-augmented generation and citation grounding have been widely promoted as methods to reduce hallucinations and improve the accuracy of AI-generated responses and have been widely adopted into search engines. This paper argues that even when these methods produce literally accurate content drawn from source documents sans hallucinations, they can still be highly misleading. Patients may derive significantly different interpretations from AI-generated outputs than they would from reading the original source material, let alone consulting a knowledgeable clinician. Through a large-scale query analysis on topics including disputed diagnoses and procedure safety, we support our argument with quantitative and qualitative evidence of the suboptimal answers resulting from current systems. In particular, we highlight how these models tend to decontextualize facts, omit critical relevant sources, and reinforce patient misconceptions or biases. We propose a series of recommendations -- such as the incorporation of communication pragmatics and enhanced comprehension of source documents -- that could help mitigate these issues and extend beyond the medical domain.

Title: Can AI mimic the human ability to define neologisms?

Authors: Georgios P. Georgiou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14900
Pdf URL: https://arxiv.org/pdf/2502.14900
Copy Paste: [[2502.14900]] Can AI mimic the human ability to define neologisms?(https://arxiv.org/abs/2502.14900)
Keywords: fair
Abstract: One ongoing debate in linguistics is whether Artificial Intelligence (AI) can effectively mimic human performance in language-related tasks. While much research has focused on various linguistic abilities of AI, little attention has been given to how it defines neologisms formed through different word formation processes. This study addresses this gap by examining the degree of agreement between human and AI-generated responses in defining three types of Greek neologisms: blends, compounds, and derivatives. The study employed an online experiment in which human participants selected the most appropriate definitions for neologisms, while ChatGPT received identical prompts. The results revealed fair agreement between human and AI responses for blends and derivatives but no agreement for compounds. However, when considering the majority response among humans, agreement with AI was high for blends and derivatives. These findings highlight the complexity of human language and the challenges AI still faces in capturing its nuances. In particular, they suggest a need for integrating more advanced semantic networks and contextual learning mechanisms into AI models to improve their interpretation of complex word formations, especially compounds.

Title: PathRAG: Pruning Graph-based Retrieval Augmented Generation with Relational Paths

Authors: Boyu Chen, Zirui Guo, Zidan Yang, Yuluo Chen, Junze Chen, Zhenghao Liu, Chuan Shi, Cheng Yang
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2502.14902
Pdf URL: https://arxiv.org/pdf/2502.14902
Copy Paste: [[2502.14902]] PathRAG: Pruning Graph-based Retrieval Augmented Generation with Relational Paths(https://arxiv.org/abs/2502.14902)
Keywords: large language model
Abstract: Retrieval-augmented generation (RAG) improves the response quality of large language models (LLMs) by retrieving knowledge from external databases. Typical RAG approaches split the text database into chunks, organizing them in a flat structure for efficient searches. To better capture the inherent dependencies and structured relationships across the text database, researchers propose to organize textual information into an indexing graph, known as graph-based RAG. However, we argue that the limitation of current graph-based RAG methods lies in the redundancy of the retrieved information, rather than its insufficiency. Moreover, previous methods use a flat structure to organize retrieved information within the prompts, leading to suboptimal performance. To overcome these limitations, we propose PathRAG, which retrieves key relational paths from the indexing graph, and converts these paths into textual form for prompting LLMs. Specifically, PathRAG effectively reduces redundant information with flow-based pruning, while guiding LLMs to generate more logical and coherent responses with path-based prompting. Experimental results show that PathRAG consistently outperforms state-of-the-art baselines across six datasets and five evaluation dimensions. The code is available at the following link: this https URL
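
The abstract does not spell out the pruning rule, so the following is only a toy illustration of one generic flow-style scheme, not PathRAG's algorithm. One unit of "resource" starts at a source node, is attenuated by a hypothetical `decay` factor and divided among outgoing edges, and branches whose share falls below a hypothetical `threshold` are dropped.

```python
from collections import deque


def propagate_flow(graph, source, decay=0.8, threshold=0.05):
    """Spread resource from `source` over a directed graph, pruning weak branches.

    graph: dict mapping node -> list of successor nodes.
    Returns a dict of node -> resource for every node that survived pruning.
    `decay` and `threshold` are illustrative parameters, not values from the paper.
    """
    resource = {source: 1.0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        neighbors = graph.get(node, [])
        if not neighbors:
            continue
        # Each successor receives an equal, decayed share of this node's resource.
        share = decay * resource[node] / len(neighbors)
        if share < threshold:
            continue  # prune: the flow along these edges is too weak to keep
        for nxt in neighbors:
            if nxt not in resource:  # keep the first (breadth-first) assignment only
                resource[nxt] = share
                queue.append(nxt)
    return resource


# Node C fans out to eight neighbors, so each share (0.04) falls below the threshold.
toy_graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E", "F", "G", "H", "I", "J", "K", "L"],
}
kept = propagate_flow(toy_graph, "A")
```

Here the high-fan-out branch under `C` is discarded while the focused path `A -> B -> D` is kept, which mirrors the abstract's point that redundancy, not insufficiency, is the problem to address.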

Title: Think Inside the JSON: Reinforcement Strategy for Strict LLM Schema Adherence

Authors: Bhavik Agarwal, Ishan Joshi, Viktoria Rojkova
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.14905
Pdf URL: https://arxiv.org/pdf/2502.14905
Copy Paste: [[2502.14905]] Think Inside the JSON: Reinforcement Strategy for Strict LLM Schema Adherence(https://arxiv.org/abs/2502.14905)
Keywords: robust, large language model
Abstract: In this paper, we address the challenge of enforcing strict schema adherence in large language model (LLM) generation by leveraging LLM reasoning capabilities. Building on the DeepSeek R1 reinforcement learning framework, our approach trains structured reasoning skills of a 1.5B parameter model through a novel pipeline that combines synthetic reasoning dataset construction with custom reward functions under Group Relative Policy Optimization (GRPO). Specifically, we first perform R1 reinforcement learning on a 20K sample unstructured-to-structured dataset, mirroring the original DeepSeek R1 methods, to establish core reasoning abilities. Subsequently, we perform supervised fine-tuning on a separate 10K reasoning sample dataset, focusing on refining schema adherence for downstream tasks. Despite the relatively modest training scope, requiring approximately 20 hours on an 8xH100 GPU cluster for GRPO training and 3 hours on 1xA100 for SFT, our model demonstrates robust performance in enforcing schema consistency. We compare our ThinkJSON approach against the original DeepSeek R1 (671B), distilled versions of DeepSeek R1 (Qwen-1.5B and Qwen-7B), and Gemini 2.0 Flash (70B), showcasing its effectiveness in real-world applications. Our results underscore the practical utility of a resource-efficient framework for schema-constrained text generation.
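
A custom reward for schema adherence, combined with GRPO-style group normalization, might look roughly like the sketch below. The reward shape (fraction of matching fields minus a small penalty per extra key) and the flat `{field: type}` schema format are assumptions for illustration, not the paper's reward functions. The advantage step follows the standard GRPO idea of normalizing each sampled output's reward against its group's mean and standard deviation.

```python
import json
import statistics


def schema_reward(output_text, schema):
    """Score a model output against a flat {field: python_type} schema.

    Returns 0.0 for unparseable or non-object output; otherwise the fraction
    of schema fields present with the expected type, minus 0.1 per extra key.
    The scoring weights are hypothetical.
    """
    if not schema:
        raise ValueError("schema must contain at least one field")
    try:
        obj = json.loads(output_text)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(obj, dict):
        return 0.0
    matched = sum(
        1 for key, typ in schema.items()
        if key in obj and isinstance(obj[key], typ)
    )
    extra = sum(1 for key in obj if key not in schema)
    return max(0.0, matched / len(schema) - 0.1 * extra)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled outputs (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


schema = {"name": str, "age": int}
samples = [
    '{"name": "Ada", "age": 36}',    # fully valid
    '{"name": "Ada"}',               # missing a field
    'not json at all',               # unparseable
    '{"name": "Ada", "age": "36"}',  # wrong type for age
]
rewards = [schema_reward(s, schema) for s in samples]
advantages = group_relative_advantages(rewards)
```

Because the advantages are relative to the group, an output is rewarded for adhering to the schema better than its siblings sampled from the same prompt, with no separate value model required.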

Title: Beyond Words: Exploring Cultural Value Sensitivity in Multimodal Models

Authors: Srishti Yadav, Zhi Zhang, Daniel Hershcovich, Ekaterina Shutova
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14906
Pdf URL: https://arxiv.org/pdf/2502.14906
Copy Paste: [[2502.14906]] Beyond Words: Exploring Cultural Value Sensitivity in Multimodal Models(https://arxiv.org/abs/2502.14906)
Keywords: large language model
Abstract: Investigating value alignment in Large Language Models (LLMs) based on cultural context has become a critical area of research. However, similar biases have not been extensively explored in large vision-language models (VLMs). As the scale of multimodal models continues to grow, it becomes increasingly important to assess whether images can serve as reliable proxies for culture and how these values are embedded through the integration of both visual and textual data. In this paper, we conduct a thorough evaluation of multimodal models at different scales, focusing on their alignment with cultural values. Our findings reveal that, much like LLMs, VLMs exhibit sensitivity to cultural values, but their performance in aligning with these values is highly context-dependent. While VLMs show potential in improving value understanding through the use of images, this alignment varies significantly across contexts, highlighting the complexities and underexplored challenges in the alignment of multimodal models.

Title: GneissWeb: Preparing High Quality Data for LLMs at Scale

Authors: Hajar Emami Gohari, Swanand Ravindra Kadhe, Syed Yousaf Shah, Constantin Adam, Abdulhamid Adebayo, Praneet Adusumilli, Farhan Ahmed, Nathalie Baracaldo Angel, Santosh Borse, Yuan-Chi Chang, Xuan-Hong Dang, Nirmit Desai, Ravital Eres, Ran Iwamoto, Alexei Karve, Yan Koyfman, Wei-Han Lee, Changchang Liu, Boris Lublinsky, Takuyo Ohko, Pablo Pesce, Maroun Touma, Shiqiang Wang, Shalisha Witherspoon, Herbert Woisetschlager, David Wood, Kun-Lung Wu, Issei Yoshida, Syed Zawad, Petros Zerfos, Yi Zhou, Bishwaranjan Bhattacharjee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14907
Pdf URL: https://arxiv.org/pdf/2502.14907
Copy Paste: [[2502.14907]] GneissWeb: Preparing High Quality Data for LLMs at Scale(https://arxiv.org/abs/2502.14907)
Keywords: large language model
Abstract: Data quantity and quality play a vital role in determining the performance of Large Language Models (LLMs). High-quality data, in particular, can significantly boost the LLM's ability to generalize on a wide range of downstream tasks. Large pre-training datasets for leading LLMs remain inaccessible to the public, whereas many open datasets are small in size (less than 5 trillion tokens), limiting their suitability for training large models. In this paper, we introduce GneissWeb, a large dataset yielding around 10 trillion tokens that caters to the data quality and quantity requirements of training LLMs. Our GneissWeb recipe that produced the dataset consists of sharded exact sub-string deduplication and a judiciously constructed ensemble of quality filters. GneissWeb achieves a favorable trade-off between data quality and quantity, producing models that outperform models trained on state-of-the-art open large datasets (5+ trillion tokens). We show that models trained using the GneissWeb dataset outperform those trained on FineWeb-V1.1.0 by 2.73 percentage points in terms of average score computed on a set of 11 commonly used benchmarks (both zero-shot and few-shot) for pre-training dataset evaluation. When the evaluation set is extended to 20 benchmarks (both zero-shot and few-shot), models trained using GneissWeb still achieve a 1.75 percentage point advantage over those trained on FineWeb-V1.1.0.

Title: KOALA: Knowledge Conflict Augmentations for Robustness in Vision Language Models

Authors: Peter Carragher, Nikitha Rao, Abhinand Jha, R Raghav, Kathleen M. Carley
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.14908
Pdf URL: https://arxiv.org/pdf/2502.14908
Copy Paste: [[2502.14908]] KOALA: Knowledge Conflict Augmentations for Robustness in Vision Language Models(https://arxiv.org/abs/2502.14908)
Keywords: robust, large language model
Abstract: The robustness of large language models (LLMs) against knowledge conflicts in unimodal question answering systems has been well studied. However, the effect of conflicts in information sources on vision language models (VLMs) in multimodal settings has not yet been explored. In this work, we propose \segsub, a framework that applies targeted perturbations to image sources to study and improve the robustness of VLMs against three different types of knowledge conflicts, namely parametric, source, and counterfactual conflicts. Contrary to prior findings that showed that LLMs are sensitive to parametric conflicts arising from textual perturbations, we find VLMs are largely robust to image perturbation. On the other hand, VLMs perform poorly on counterfactual examples (<30% accuracy) and fail to reason over source conflicts (<1% accuracy). We also find a link between hallucinations and image context, with GPT-4o prone to hallucination when presented with highly contextualized counterfactual examples. While challenges persist with source conflicts, finetuning models significantly improves reasoning over counterfactual samples. Our findings highlight the need for VLM training methodologies that enhance their reasoning capabilities, particularly in addressing complex knowledge conflicts between multimodal sources.

Title: EvoP: Robust LLM Inference via Evolutionary Pruning

Authors: Shangyu Wu, Hongchao Du, Ying Xiong, Shuai Chen, Tei-wei Kuo, Nan Guan, Chun Jason Xue
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14910
Pdf URL: https://arxiv.org/pdf/2502.14910
Copy Paste: [[2502.14910]] EvoP: Robust LLM Inference via Evolutionary Pruning(https://arxiv.org/abs/2502.14910)
Keywords: robust, large language model
Abstract: Large Language Models (LLMs) have achieved remarkable success in natural language processing tasks, but their massive size and computational demands hinder their deployment in resource-constrained environments. Existing structured pruning methods address this issue by removing redundant structures (e.g., elements, channels, layers) from the model. However, these methods employ a heuristic pruning strategy, which leads to suboptimal performance. Besides, they also ignore the data characteristics when pruning the model. To overcome these limitations, we propose EvoP, an evolutionary pruning framework for robust LLM inference. EvoP first presents a cluster-based calibration dataset sampling (CCDS) strategy for creating a more diverse calibration dataset. EvoP then introduces an evolutionary pruning pattern searching (EPPS) method to find the optimal pruning pattern. Compared to existing structured pruning techniques, EvoP achieves the best performance while maintaining the best efficiency. Experiments across different LLMs and different downstream tasks validate the effectiveness of the proposed EvoP, making it a practical and scalable solution for deploying LLMs in real-world applications.

Title: Batayan: A Filipino NLP benchmark for evaluating Large Language Models

Authors: Jann Railey Montalan, Jimson Paulo Layacan, David Demitri Africa, Richell Isaiah Flores, Michael T. Lopez II, Theresa Denise Magsajo, Anjanette Cayabyab, William Chandra Tjhi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14911
Pdf URL: https://arxiv.org/pdf/2502.14911
Copy Paste: [[2502.14911]] Batayan: A Filipino NLP benchmark for evaluating Large Language Models(https://arxiv.org/abs/2502.14911)
Keywords: large language model
Abstract: Recent advances in large language models (LLMs) have demonstrated remarkable capabilities on widely benchmarked high-resource languages; however, linguistic nuances of under-resourced languages remain unexplored. We introduce Batayan, a holistic Filipino benchmark designed to systematically evaluate LLMs across three key natural language processing (NLP) competencies: understanding, reasoning, and generation. Batayan consolidates eight tasks, covering both Tagalog and code-switched Taglish utterances. Our rigorous, native-speaker-driven annotation process ensures fluency and authenticity to the complex morphological and syntactic structures of Filipino, alleviating a pervasive translationese bias in existing Filipino corpora. We report empirical results on a variety of multilingual LLMs, highlighting significant performance gaps that signal the under-representation of Filipino in pretraining corpora, the unique hurdles in modeling Filipino's rich morphology and construction, and the importance of explicit Filipino language support and instruction tuning. Moreover, we discuss the practical challenges encountered in dataset construction and propose principled solutions for building culturally and linguistically-faithful resources in under-represented languages. We also provide a public benchmark and leaderboard as a clear foundation for iterative, community-driven progress in Filipino NLP.

Title: Universal Semantic Embeddings of Chemical Elements for Enhanced Materials Inference and Discovery

Authors: Yunze Jia, Yuehui Xian, Yangyang Xu, Pengfei Dang, Xiangdong Ding, Jun Sun, Yumei Zhou, Dezhen Xue
Subjects: cs.CL, cond-mat.mtrl-sci, cs.LG
Abstract URL: https://arxiv.org/abs/2502.14912
Pdf URL: https://arxiv.org/pdf/2502.14912
Copy Paste: [[2502.14912]] Universal Semantic Embeddings of Chemical Elements for Enhanced Materials Inference and Discovery(https://arxiv.org/abs/2502.14912)
Keywords: robust
Abstract: We present a framework for generating universal semantic embeddings of chemical elements to advance materials inference and discovery. This framework leverages ElementBERT, a domain-specific BERT-based natural language processing model trained on 1.29 million abstracts of alloy-related scientific papers, to capture latent knowledge and contextual relationships specific to alloys. These semantic embeddings serve as robust elemental descriptors, consistently outperforming traditional empirical descriptors with significant improvements across multiple downstream tasks. These include predicting mechanical and transformation properties, classifying phase structures, and optimizing materials properties via Bayesian optimization. Applications to titanium alloys, high-entropy alloys, and shape memory alloys demonstrate up to 23% gains in prediction accuracy. Our results show that ElementBERT surpasses general-purpose BERT variants by encoding specialized alloy knowledge. By bridging contextual insights from scientific literature with quantitative inference, our framework accelerates the discovery and optimization of advanced materials, with potential applications extending beyond alloys to other material classes.

Title: OpenSearch-SQL: Enhancing Text-to-SQL with Dynamic Few-shot and Consistency Alignment

Authors: Xiangjin Xie, Guangwei Xu, Lingyan Zhao, Ruijie Guo
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2502.14913
Pdf URL: https://arxiv.org/pdf/2502.14913
Copy Paste: [[2502.14913]] OpenSearch-SQL: Enhancing Text-to-SQL with Dynamic Few-shot and Consistency Alignment(https://arxiv.org/abs/2502.14913)
Keywords: extraction, large language model
Abstract: Although multi-agent collaborative Large Language Models (LLMs) have achieved significant breakthroughs in the Text-to-SQL task, their performance is still constrained by various factors. These factors include the incompleteness of the framework, failure to follow instructions, and model hallucination problems. To address these problems, we propose OpenSearch-SQL, which divides the Text-to-SQL task into four main modules: Preprocessing, Extraction, Generation, and Refinement, along with an Alignment module based on a consistency alignment mechanism. This architecture aligns the inputs and outputs of agents through the Alignment module, reducing failures in instruction following and hallucination. Additionally, we designed an intermediate language called SQL-Like and optimized the structured CoT based on SQL-Like. Meanwhile, we developed a dynamic few-shot strategy in the form of self-taught Query-CoT-SQL. These methods have significantly improved the performance of LLMs in the Text-to-SQL task. In terms of model selection, we directly applied the base LLMs without any post-training, thereby simplifying the task chain and enhancing the framework's portability. Experimental results show that OpenSearch-SQL achieves an execution accuracy (EX) of 69.3% on the BIRD development set, 72.28% on the test set, and a reward-based validity efficiency score (R-VES) of 69.36%, with all three metrics ranking first at the time of submission. These results demonstrate the comprehensive advantages of the proposed method in both effectiveness and efficiency.

Title: What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Coverage of MLLMs

Authors: Zhihang Liu, Chen-Wei Xie, Bin Wen, Feiwu Yu, Jixuan Chen, Boqiang Zhang, Nianzu Yang, Pandeng Li, Yun Zheng, Hongtao Xie
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.14914
Pdf URL: https://arxiv.org/pdf/2502.14914
Copy Paste: [[2502.14914]] What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Coverage of MLLMs(https://arxiv.org/abs/2502.14914)
Keywords: large language model
Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have rendered traditional visual captioning benchmarks obsolete, as they primarily evaluate short descriptions with outdated metrics. While recent benchmarks address these limitations by decomposing captions into visual elements and adopting model-based evaluation, they remain incomplete, overlooking critical aspects while providing vague, non-explanatory scores. To bridge this gap, we propose CV-CapBench, a Comprehensive Visual Caption Benchmark that systematically evaluates caption quality across 6 views and 13 dimensions. CV-CapBench introduces precision, recall, and hit rate metrics for each dimension, uniquely assessing both correctness and coverage. Experiments on leading MLLMs reveal significant capability gaps, particularly in dynamic and knowledge-intensive dimensions. These findings provide actionable insights for future research. The code and data will be released.

Title: Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning

Authors: Rui Zhao, Qirui Yuan, Jinyu Li, Haofeng Hu, Yun Li, Chengyuan Zheng, Fei Gao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14917
Pdf URL: https://arxiv.org/pdf/2502.14917
Copy Paste: [[2502.14917]] Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning(https://arxiv.org/abs/2502.14917)
Keywords: robust, large language model
Abstract: End-to-end autonomous driving, which directly maps raw sensor inputs to low-level vehicle controls, is an important part of Embodied AI. Despite successes in applying Multimodal Large Language Models (MLLMs) for high-level traffic scene semantic understanding, it remains challenging to effectively translate these conceptual semantic understandings into low-level motion control commands and achieve generalization and consensus in cross-scene driving. We introduce Sce2DriveX, a human-like driving chain-of-thought (CoT) reasoning MLLM framework. Sce2DriveX utilizes multimodal joint learning from local scene videos and global BEV maps to deeply understand long-range spatiotemporal relationships and road topology, enhancing its comprehensive perception and reasoning capabilities in 3D dynamic/static scenes and achieving driving generalization across scenes. Building on this, it reconstructs the implicit cognitive chain inherent in human driving, covering scene understanding, meta-action reasoning, behavior interpretation analysis, motion planning and control, thereby further bridging the gap between autonomous driving and human thought processes. To elevate model performance, we have developed the first extensive Visual Question Answering (VQA) driving instruction dataset tailored for 3D spatial understanding and long-axis task reasoning. Extensive experiments demonstrate that Sce2DriveX achieves state-of-the-art performance from scene understanding to end-to-end driving, as well as robust generalization on the CARLA Bench2Drive benchmark.

Title: RAPTOR: Refined Approach for Product Table Object Recognition

Authors: Eliott Thomas, Mickael Coustaty, Aurelie Joseph, Elodie Carel, Vincent Poulain D'Andecy, Jean-Marc Ogier
Subjects: cs.CV, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2502.14918
Pdf URL: https://arxiv.org/pdf/2502.14918
Copy Paste: [[2502.14918]] RAPTOR: Refined Approach for Product Table Object Recognition(https://arxiv.org/abs/2502.14918)
Keywords: extraction, transformer
Abstract: Extracting tables from documents is a critical task across various industries, especially on business documents like invoices and reports. Existing systems based on DEtection TRansformer (DETR) such as TAble TRansformer (TATR), offer solutions for Table Detection (TD) and Table Structure Recognition (TSR) but face challenges with diverse table formats and common errors like incorrect area detection and overlapping columns. This research introduces RAPTOR, a modular post-processing system designed to enhance state-of-the-art models for improved table extraction, particularly for product tables. RAPTOR addresses recurrent TD and TSR issues, improving both precision and structural predictions. For TD, we use DETR (trained on ICDAR 2019) and TATR (trained on PubTables-1M and FinTabNet), while TSR only relies on TATR. A Genetic Algorithm is incorporated to optimize RAPTOR's module parameters, using a private dataset of product tables to align with industrial needs. We evaluate our method on two private datasets of product tables, the public DOCILE dataset (which contains tables similar to our target product tables), and the ICDAR 2013 and ICDAR 2019 datasets. The results demonstrate that while our approach excels at product tables, it also maintains reasonable performance across diverse table formats. An ablation study further validates the contribution of each module in our system.

Title: The Canary's Echo: Auditing Privacy Risks of LLM-Generated Synthetic Text

Authors: Matthieu Meeus, Lukas Wutschitz, Santiago Zanella-Béguelin, Shruti Tople, Reza Shokri
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2502.14921
Pdf URL: https://arxiv.org/pdf/2502.14921
Copy Paste: [[2502.14921]] The Canary's Echo: Auditing Privacy Risks of LLM-Generated Synthetic Text(https://arxiv.org/abs/2502.14921)
Keywords: privacy, attack, membership infer, large language model
Abstract: How much information about training samples can be gleaned from synthetic data generated by Large Language Models (LLMs)? Overlooking the subtleties of information flow in synthetic data generation pipelines can lead to a false sense of privacy. In this paper, we design membership inference attacks (MIAs) that target data used to fine-tune pre-trained LLMs that are then used to synthesize data, particularly when the adversary does not have access to the fine-tuned model but only to the synthetic data. We show that such data-based MIAs do significantly better than a random guess, meaning that synthetic data leaks information about the training data. Further, we find that canaries crafted to maximize vulnerability to model-based MIAs are sub-optimal for privacy auditing when only synthetic data is released. Such out-of-distribution canaries have limited influence on the model's output when prompted to generate useful, in-distribution synthetic data, which drastically reduces their vulnerability. To tackle this problem, we leverage the mechanics of auto-regressive models to design canaries with an in-distribution prefix and a high-perplexity suffix that leave detectable traces in synthetic data. This enhances the power of data-based MIAs and provides a better assessment of the privacy risks of releasing synthetic data generated by LLMs.

Title: SIFT: Grounding LLM Reasoning in Contexts via Stickers

Authors: Zihao Zeng, Xuyao Huang, Boxiu Li, Zhijie Deng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14922
Pdf URL: https://arxiv.org/pdf/2502.14922
Copy Paste: [[2502.14922]] SIFT: Grounding LLM Reasoning in Contexts via Stickers(https://arxiv.org/abs/2502.14922)
Keywords: large language model
Abstract: This paper identifies that misinterpretation of the context can be a significant issue during the reasoning process of large language models, spanning from smaller models like Llama3.2-3B-Instruct to cutting-edge ones like DeepSeek-R1. For example, in the phrase "10 dollars per kilo," LLMs might not recognize that "per" means "for each," leading to calculation errors. We introduce a novel, post-training approach called Stick to the Facts (SIFT) to tackle this. SIFT leverages increasing inference-time compute to ground LLM reasoning in contexts. At the core of SIFT lies the Sticker, which is generated by the model itself to explicitly emphasize the key information within the context. Given the curated Sticker, SIFT generates two predictions: one from the original query and one from the query augmented with the Sticker. If they differ, the Sticker is sequentially refined via forward optimization (to better align the extracted facts with the query) and inverse generation (to conform with the model's inherent tendencies) for more faithful reasoning outcomes. Studies across diverse models (from 3B to 100B+) and benchmarks (e.g., GSM8K, MATH-500) reveal consistent performance improvements. Notably, SIFT improves the pass@1 accuracy of DeepSeek-R1 on AIME2024 from 78.33% to 85.67%, establishing a new state-of-the-art in the open-source community. The code is available at this https URL.
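
The compare-then-refine control flow described in the abstract can be sketched roughly as below. This is not the released SIFT code: the function name `sift_sketch`, the three injected callables, the way the Sticker is appended to the query, and the `max_rounds` cap are all assumptions made for illustration, and a single `refine_sticker` callable stands in for both the forward optimization and the inverse generation steps.

```python
from typing import Callable


def sift_sketch(
    query: str,
    predict: Callable[[str], str],              # hypothetical model call
    make_sticker: Callable[[str], str],         # hypothetical Sticker generator
    refine_sticker: Callable[[str, str], str],  # stands in for forward + inverse steps
    max_rounds: int = 3,                        # assumed cap, not from the abstract
) -> str:
    """Refine the Sticker while the two predictions disagree."""
    plain = predict(query)  # prediction from the original query alone
    sticker = make_sticker(query)
    for _ in range(max_rounds):
        augmented = predict(f"{query}\n{sticker}")  # query plus Sticker
        if plain == augmented:
            return augmented  # the two predictions agree
        sticker = refine_sticker(query, sticker)
    # Still disagreeing after max_rounds: fall back to the augmented prediction.
    return predict(f"{query}\n{sticker}")
```

Passing the model calls in as plain callables keeps the sketch independent of any particular LLM client.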

Title: A Tale of Two Structures: Do LLMs Capture the Fractal Complexity of Language?

Authors: Ibrahim Alabdulmohsin, Andreas Steiner
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14924
Pdf URL: https://arxiv.org/pdf/2502.14924
Copy Paste: [[2502.14924]] A Tale of Two Structures: Do LLMs Capture the Fractal Complexity of Language?(https://arxiv.org/abs/2502.14924)
Keywords: robust, large language model
Abstract: Language exhibits a fractal structure in its information-theoretic complexity (i.e. bits per token), with self-similarity across scales and long-range dependence (LRD). In this work, we investigate whether large language models (LLMs) can replicate such fractal characteristics and identify conditions, such as temperature setting and prompting method, under which they may fail. Moreover, we find that the fractal parameters observed in natural language are contained within a narrow range, whereas those of LLMs' output vary widely, suggesting that fractal parameters might prove helpful in detecting a non-trivial portion of LLM-generated texts. Notably, these findings, and many others reported in this work, are robust to the choice of the architecture; e.g. Gemini 1.0 Pro, Mistral-7B and Gemma-2B. We also release a dataset comprising over 240,000 articles generated by various LLMs (both pretrained and instruction-tuned) with different decoding temperatures and prompting methods, along with their corresponding human-generated texts. We hope that this work highlights the complex interplay between fractal properties, prompting, and statistical mimicry in LLMs, offering insights for generating, evaluating and detecting synthetic texts.

Title: Learning to Retrieve and Reason on Knowledge Graph through Active Self-Reflection

Authors: Han Zhang, Langshi Zhou, Hanfang Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14932
Pdf URL: https://arxiv.org/pdf/2502.14932
Copy Paste: [[2502.14932]] Learning to Retrieve and Reason on Knowledge Graph through Active Self-Reflection(https://arxiv.org/abs/2502.14932)
Keywords: interpretability, large language model
Abstract: Extensive research has investigated the integration of large language models (LLMs) with knowledge graphs to enhance the reasoning process. However, understanding how models perform reasoning utilizing structured graph knowledge remains underexplored. Most existing approaches rely on LLMs or retrievers to make binary judgments regarding the utilization of knowledge, which is too coarse. Meanwhile, there is still a lack of feedback mechanisms for reflection and correction throughout the entire reasoning path. This paper proposes an Active self-Reflection framework for knowledge Graph reasoning (ARG), introducing for the first time an end-to-end training approach to achieve iterative reasoning grounded on structured graphs. Within the framework, the model leverages special tokens to actively determine whether knowledge retrieval is necessary, performs reflective critique based on the retrieved knowledge, and iteratively reasons over the knowledge graph. The reasoning paths generated by the model exhibit high interpretability, enabling deeper exploration of the model's understanding of structured knowledge. Ultimately, the proposed model achieves outstanding results compared to existing baselines in knowledge graph reasoning tasks.

Title: Online hand gesture recognition using Continual Graph Transformers

Authors: Rim Slama, Wael Rabah, Hazem Wannous
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14939
Pdf URL: https://arxiv.org/pdf/2502.14939
Copy Paste: [[2502.14939]] Online hand gesture recognition using Continual Graph Transformers(https://arxiv.org/abs/2502.14939)
Keywords: robust, extraction, transformer
Abstract: Online continuous action recognition has emerged as a critical research area due to its practical implications in real-world applications, such as human-computer interaction, healthcare, and robotics. Among various modalities, skeleton-based approaches have gained significant popularity, demonstrating their effectiveness in capturing 3D temporal data while ensuring robustness to environmental variations. However, most existing works focus on segment-based recognition, making them unsuitable for real-time, continuous recognition scenarios. In this paper, we propose a novel online recognition system designed for real-time skeleton sequence streaming. Our approach leverages a hybrid architecture combining Spatial Graph Convolutional Networks (S-GCN) for spatial feature extraction and a Transformer-based Graph Encoder (TGE) for capturing temporal dependencies across frames. Additionally, we introduce a continual learning mechanism to enhance model adaptability to evolving data distributions, ensuring robust recognition in dynamic environments. We evaluate our method on the SHREC'21 benchmark dataset, demonstrating its superior performance in online hand gesture recognition. Our approach not only achieves state-of-the-art accuracy but also significantly reduces false positive rates, making it a compelling solution for real-time applications. The proposed system can be seamlessly integrated into various domains, including human-robot collaboration and assistive technologies, where natural and intuitive interaction is crucial.

Title: FacaDiffy: Inpainting Unseen Facade Parts Using Diffusion Models

Authors: Thomas Froech, Olaf Wysocki, Yan Xia, Junyu Xie, Benedikt Schwab, Daniel Cremers, Thomas H. Kolbe
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14940
Pdf URL: https://arxiv.org/pdf/2502.14940
Copy Paste: [[2502.14940]] FacaDiffy: Inpainting Unseen Facade Parts Using Diffusion Models(https://arxiv.org/abs/2502.14940)
Keywords: diffusion
Abstract: High-detail semantic 3D building models are frequently utilized in robotics, geoinformatics, and computer vision. One key aspect of creating such models is employing 2D conflict maps that detect openings' locations in building facades. Yet, in reality, these maps are often incomplete due to obstacles encountered during laser scanning. To address this challenge, we introduce FacaDiffy, a novel method for inpainting unseen facade parts by completing conflict maps with a personalized Stable Diffusion model. Specifically, we first propose a deterministic ray analysis approach to derive 2D conflict maps from existing 3D building models and corresponding laser scanning point clouds. Furthermore, we facilitate the inpainting of unseen facade objects into these 2D conflict maps by leveraging the potential of personalizing a Stable Diffusion model. To complement the scarcity of real-world training data, we also develop a scalable pipeline to produce synthetic conflict maps using random city model generators and annotated facade images. Extensive experiments demonstrate that FacaDiffy achieves state-of-the-art performance in conflict map completion compared to various inpainting baselines and increases the detection rate by 22% when applying the completed conflict maps for high-definition 3D semantic building reconstruction. The code is publicly available in the corresponding GitHub repository: this https URL

Title: KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding

Authors: Ahmed Heakl, Abdullah Sohail, Mukul Ranjan, Rania Hossam, Ghazi Ahmed, Mohamed El-Geish, Omar Maher, Zhiqiang Shen, Fahad Khan, Salman Khan
Subjects: cs.CV, cs.AI, cs.CL, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2502.14949
Pdf URL: https://arxiv.org/pdf/2502.14949
Copy Paste: [[2502.14949]] KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding(https://arxiv.org/abs/2502.14949)
Keywords: robust, extraction
Abstract: With the growing adoption of Retrieval-Augmented Generation (RAG) in document processing, robust text recognition has become increasingly critical for knowledge extraction. While OCR (Optical Character Recognition) for English and other languages benefits from large datasets and well-established benchmarks, Arabic OCR faces unique challenges due to its cursive script, right-to-left text flow, and complex typographic and calligraphic features. We present KITAB-Bench, a comprehensive Arabic OCR benchmark that fills the gaps in current evaluation systems. Our benchmark comprises 8,809 samples across 9 major domains and 36 sub-domains, encompassing diverse document types including handwritten text, structured tables, and specialized coverage of 21 chart types for business intelligence. Our findings show that modern vision-language models (such as GPT-4, Gemini, and Qwen) outperform traditional OCR approaches (like EasyOCR, PaddleOCR, and Surya) by an average of 60% in Character Error Rate (CER). Furthermore, we highlight significant limitations of current Arabic OCR models, particularly in PDF-to-Markdown conversion, where the best model Gemini-2.0-Flash achieves only 65% accuracy. This underscores the challenges in accurately recognizing Arabic text, including issues with complex fonts, numeral recognition errors, word elongation, and table structure detection. This work establishes a rigorous evaluation framework that can drive improvements in Arabic document analysis methods and bridge the performance gap with English OCR technologies.
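
Character Error Rate, the metric cited above, is conventionally defined as the character-level edit distance between the OCR output and the reference, divided by the reference length. The sketch below follows that standard definition. The function names are illustrative, and the benchmark's own text normalization (for example, how diacritics are handled) is not described in the abstract, so none is assumed here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Python strings are sequences of Unicode code points, so Arabic diacritics count as separate characters in this sketch. CER can also exceed 1.0 when the hypothesis is much longer than the reference.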

Title: CyberSentinel: An Emergent Threat Detection System for AI Security

Authors: Krti Tallam
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14966
Pdf URL: https://arxiv.org/pdf/2502.14966
Copy Paste: [[2502.14966]] CyberSentinel: An Emergent Threat Detection System for AI Security(https://arxiv.org/abs/2502.14966)
Keywords: security, defense, attack
Abstract: The rapid advancement of artificial intelligence (AI) has significantly expanded the attack surface for AI-driven cybersecurity threats, necessitating adaptive defense strategies. This paper introduces CyberSentinel, a unified, single-agent system for emergent threat detection, designed to identify and mitigate novel security risks in real time. CyberSentinel integrates: (1) Brute-force attack detection through SSH log analysis, (2) Phishing threat assessment using domain blacklists and heuristic URL scoring, and (3) Emergent threat detection via machine learning-based anomaly detection. By continuously adapting to evolving adversarial tactics, CyberSentinel strengthens proactive cybersecurity defense, addressing critical vulnerabilities in AI security.
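
The abstract does not describe how its SSH log analysis works. One common approach is to count failed password attempts per source address and flag addresses above a threshold. The sketch below assumes OpenSSH-style "Failed password for ... from <ip>" log lines, IPv4 addresses only, and an arbitrary threshold; the function name and the threshold value are illustrative, not taken from CyberSentinel.

```python
import re
from collections import Counter

# Matches OpenSSH-style failure lines such as
# "Failed password for invalid user admin from 203.0.113.7 port 22 ssh2".
# IPv4 only; this pattern is an assumption about the log format.
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?\S+ from (\d{1,3}(?:\.\d{1,3}){3})"
)


def brute_force_suspects(log_lines, threshold=5):
    """Return {ip: failure_count} for IPs with at least `threshold` failures."""
    counts = Counter()
    for line in log_lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(1)] += 1
    return {ip: n for ip, n in counts.items() if n >= threshold}
```

A production detector would typically also bound the count to a time window. That is omitted here to keep the sketch short.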

Title: Beyond No: Quantifying AI Over-Refusal and Emotional Attachment Boundaries

Authors: David Noever, Grant Rosario
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14975
Pdf URL: https://arxiv.org/pdf/2502.14975
Copy Paste: [[2502.14975]] Beyond No: Quantifying AI Over-Refusal and Emotional Attachment Boundaries(https://arxiv.org/abs/2502.14975)
Keywords: large language model
Abstract: We present an open-source benchmark and evaluation framework for assessing emotional boundary handling in Large Language Models (LLMs). Using a dataset of 1156 prompts across six languages, we evaluated three leading LLMs (GPT-4o, Claude-3.5 Sonnet, and Mistral-large) on their ability to maintain appropriate emotional boundaries through pattern-matched response analysis. Our framework quantifies responses across seven key patterns: direct refusal, apology, explanation, deflection, acknowledgment, boundary setting, and emotional awareness. Results demonstrate significant variation in boundary-handling approaches, with Claude-3.5 achieving the highest overall score (8.69/10) and producing longer, more nuanced responses (86.51 words on average). We identified a substantial performance gap between English (average score 25.62) and non-English interactions (< 0.22), with English responses showing markedly higher refusal rates (43.20% vs. < 1% for non-English). Pattern analysis revealed model-specific strategies, such as Mistral's preference for deflection (4.2%) and consistently low empathy scores across all models (< 0.06). Limitations include potential oversimplification through pattern matching, lack of contextual understanding in response analysis, and binary classification of complex emotional responses. Future work should explore more nuanced scoring methods, expand language coverage, and investigate cultural variations in emotional boundary expectations. Our benchmark and methodology provide a foundation for systematic evaluation of LLM emotional intelligence and boundary-setting capabilities.

Title: EigenShield: Causal Subspace Filtering via Random Matrix Theory for Adversarially Robust Vision-Language Models

Authors: Nastaran Darabi, Devashri Naik, Sina Tayebati, Dinithi Jayasuriya, Ranganath Krishnan, Amit Ranjan Trivedi
Subjects: cs.LG, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2502.14976
Pdf URL: https://arxiv.org/pdf/2502.14976
Copy Paste: [[2502.14976]] EigenShield: Causal Subspace Filtering via Random Matrix Theory for Adversarially Robust Vision-Language Models(https://arxiv.org/abs/2502.14976)
Keywords: defense, attack, robust, large language model
Abstract: Vision-Language Models (VLMs) inherit adversarial vulnerabilities of Large Language Models (LLMs), which are further exacerbated by their multimodal nature. Existing defenses, including adversarial training, input transformations, and heuristic detection, are computationally expensive, architecture-dependent, and fragile against adaptive attacks. We introduce EigenShield, an inference-time defense leveraging Random Matrix Theory to quantify adversarial disruptions in high-dimensional VLM representations. Unlike prior methods that rely on empirical heuristics, EigenShield employs the spiked covariance model to detect structured spectral deviations. Using a Robustness-based Nonconformity Score (RbNS) and quantile-based thresholding, it separates causal eigenvectors, which encode semantic information, from correlational eigenvectors that are susceptible to adversarial artifacts. By projecting embeddings onto the causal subspace, EigenShield filters adversarial noise without modifying model parameters or requiring adversarial training. This architecture-independent, attack-agnostic approach significantly reduces the attack success rate, establishing spectral analysis as a principled alternative to conventional defenses. Our results demonstrate that EigenShield consistently outperforms all existing defenses, including adversarial training, UNIGUARD, and CIDER.

Title: LAVID: An Agentic LVLM Framework for Diffusion-Generated Video Detection

Authors: Qingyuan Liu, Yun-Yun Tsai, Ruijian Zha, Victoria Li, Pengyuan Shi, Chengzhi Mao, Junfeng Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.14994
Pdf URL: https://arxiv.org/pdf/2502.14994
Copy Paste: [[2502.14994]] LAVID: An Agentic LVLM Framework for Diffusion-Generated Video Detection(https://arxiv.org/abs/2502.14994)
Keywords: privacy, diffusion, generative
Abstract: The impressive achievements of generative models in creating high-quality videos have raised concerns about digital integrity and privacy vulnerabilities. AI-generated content detection has been widely studied in the image field (e.g., deepfakes), yet the video field remains unexplored. Large Vision Language Models (LVLMs) have become an emerging tool for AI-generated content detection thanks to their strong reasoning and multimodal capabilities. They overcome limitations of traditional deep-learning-based methods, such as a lack of transparency and an inability to recognize new artifacts. Motivated by this, we propose LAVID, a novel LVLM-based AI-generated video detection framework with explicit knowledge enhancement. Our insights are as follows: (1) The leading LVLMs can call external tools to extract useful information to facilitate their own video detection task; (2) Structuring the prompt can affect an LVLM's reasoning ability to interpret information in video content. Our proposed pipeline automatically selects a set of explicit knowledge tools for detection, and then adaptively adjusts the structure prompt by self-rewriting. Different from prior SOTA that trains additional detectors, our method is fully training-free and only requires inference of the LVLM for detection. To facilitate our research, we also create a new benchmark \vidfor with high-quality videos generated from multiple sources of video generation tools. Evaluation results show that LAVID improves F1 scores by 6.2 to 30.2% over the top baselines on our datasets across four SOTA LVLMs.

Title: Generative Modeling of Individual Behavior at Scale

Authors: Nabil Omi, Lucas Caccia, Anurag Sarkar, Jordan T. Ash, Siddhartha Sen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.14998
Pdf URL: https://arxiv.org/pdf/2502.14998
Copy Paste: [[2502.14998]] Generative Modeling of Individual Behavior at Scale(https://arxiv.org/abs/2502.14998)
Keywords: generative
Abstract: There has been a growing interest in using AI to model human behavior, particularly in domains where humans interact with this technology. While most existing work models human behavior at an aggregate level, our goal is to model behavior at the individual level. Recent approaches to behavioral stylometry -- or the task of identifying a person from their actions alone -- have shown promise in domains like chess, but these approaches are either not scalable (e.g., fine-tune a separate model for each person) or not generative, in that they cannot generate actions. We address these limitations by framing behavioral stylometry as a multi-task learning problem -- where each task represents a distinct person -- and use parameter-efficient fine-tuning (PEFT) methods to learn an explicit style vector for each person. Style vectors are generative: they selectively activate shared "skill" parameters to generate actions in the style of each person. They also induce a latent space that we can interpret and manipulate algorithmically. In particular, we develop a general technique for style steering that allows us to steer a player's style vector towards a desired property. We apply our approach to two very different games, at unprecedented scales: chess (47,864 players) and Rocket League (2,000 players). We also show generality beyond gaming by applying our method to image generation, where we learn style vectors for 10,177 celebrities and use these vectors to steer their images.

Title: LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers

Authors: Anton Razzhigaev, Matvey Mikhalchuk, Temurbek Rahmatullaev, Elizaveta Goncharova, Polina Druzhinina, Ivan Oseledets, Andrey Kuznetsov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15007
Pdf URL: https://arxiv.org/pdf/2502.15007
Copy Paste: [[2502.15007]] LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers(https://arxiv.org/abs/2502.15007)
Keywords: transformer, large language model
Abstract: We introduce methods to quantify how Large Language Models (LLMs) encode and store contextual information, revealing that tokens often seen as minor (e.g., determiners, punctuation) carry surprisingly high context. Notably, removing these tokens -- especially stopwords, articles, and commas -- consistently degrades performance on MMLU and BABILong-4k, even if removing only irrelevant tokens. Our analysis also shows a strong correlation between contextualization and linearity, where linearity measures how closely the transformation from one layer's embeddings to the next can be approximated by a single linear mapping. These findings underscore the hidden importance of filler tokens in maintaining context. For further exploration, we present LLM-Microscope, an open-source toolkit that assesses token-level nonlinearity, evaluates contextual memory, visualizes intermediate layer contributions (via an adapted Logit Lens), and measures the intrinsic dimensionality of representations. This toolkit illuminates how seemingly trivial tokens can be critical for long-range understanding.

Title: Contextualizing Search Queries In-Context Learning for Conversational Rewriting with LLMs

Authors: Raymond Wilson, Chase Carter, Cole Graham
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15009
Pdf URL: https://arxiv.org/pdf/2502.15009
Copy Paste: [[2502.15009]] Contextualizing Search Queries In-Context Learning for Conversational Rewriting with LLMs(https://arxiv.org/abs/2502.15009)
Keywords: large language model
Abstract: Conversational query rewriting is crucial for effective conversational search, yet traditional supervised methods require substantial labeled data, which is scarce in low-resource settings. This paper introduces Prompt-Guided In-Context Learning, a novel approach that leverages the in-context learning capabilities of Large Language Models (LLMs) for few-shot conversational query rewriting. Our method employs carefully designed prompts, incorporating task descriptions, input/output format specifications, and a small set of illustrative examples, to guide pre-trained LLMs to generate context-independent queries without explicit fine-tuning. Extensive experiments on benchmark datasets, TREC and Taskmaster-1, demonstrate that our approach significantly outperforms strong baselines, including supervised models and contrastive co-training methods, across various evaluation metrics such as BLEU, ROUGE-L, Success Rate, and MRR. Ablation studies confirm the importance of in-context examples, and human evaluations further validate the superior fluency, relevance, and context utilization of our generated rewrites. The results highlight the potential of prompt-guided in-context learning as an efficient and effective paradigm for low-resource conversational query rewriting, reducing the reliance on extensive labeled data and complex training procedures.

Title: Obliviate: Efficient Unmemorization for Protecting Intellectual Property in Large Language Models

Authors: Mark Russinovich, Ahmed Salem
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2502.15010
Pdf URL: https://arxiv.org/pdf/2502.15010
Copy Paste: [[2502.15010]] Obliviate: Efficient Unmemorization for Protecting Intellectual Property in Large Language Models(https://arxiv.org/abs/2502.15010)
Keywords: protect, large language model
Abstract: Recent copyright agreements between AI companies and content creators have highlighted the need for precise control over language models' ability to reproduce copyrighted content. While existing approaches rely on either complete concept removal through unlearning or simple output filtering, we propose Obliviate, a novel post-training technique that selectively prevents verbatim reproduction of specific text while preserving semantic understanding. Obliviate operates by selecting tokens within memorized sequences and modifying the model's probability distribution to prevent exact reproduction while maintaining contextual understanding. We evaluate Obliviate on multiple large language models (LLaMA-3.1 8B, LLaMA-3.1-instruct 8B, Qwen-2.5-7B, and Yi-1.5 6B) across both synthetic memorization tasks and organic copyright content. Our results demonstrate that Obliviate achieves orders of magnitude reduction, e.g., 100x, in verbatim memorization while maintaining model performance within 1% of baseline on standard benchmarks (HellaSwag, MMLU, TruthfulQA, and Winogrande). This makes Obliviate particularly suitable for practical deployment scenarios where companies need to efficiently address copyright concerns in pretrained models without compromising their general capabilities.

Title: CrossOver: 3D Scene Cross-Modal Alignment

Authors: Sayan Deb Sarkar, Ondrej Miksik, Marc Pollefeys, Daniel Barath, Iro Armeni
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.15011
Pdf URL: https://arxiv.org/pdf/2502.15011
Copy Paste: [[2502.15011]] CrossOver: 3D Scene Cross-Modal Alignment(https://arxiv.org/abs/2502.15011)
Keywords: robust
Abstract: Multi-modal 3D object understanding has gained significant attention, yet current approaches often assume complete data availability and rigid alignment across all modalities. We present CrossOver, a novel framework for cross-modal 3D scene understanding via flexible, scene-level modality alignment. Unlike traditional methods that require aligned modality data for every object instance, CrossOver learns a unified, modality-agnostic embedding space for scenes by aligning modalities - RGB images, point clouds, CAD models, floorplans, and text descriptions - with relaxed constraints and without explicit object semantics. Leveraging dimensionality-specific encoders, a multi-stage training pipeline, and emergent cross-modal behaviors, CrossOver supports robust scene retrieval and object localization, even with missing modalities. Evaluations on ScanNet and 3RScan datasets show its superior performance across diverse metrics, highlighting adaptability for real-world applications in 3D scene understanding.

Title: Graph in the Vault: Protecting Edge GNN Inference with Trusted Execution Environment

Authors: Ruyi Ding, Tianhong Xu, Aidong Adam Ding, Yunsi Fei
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15012
Pdf URL: https://arxiv.org/pdf/2502.15012
Copy Paste: [[2502.15012]] Graph in the Vault: Protecting Edge GNN Inference with Trusted Execution Environment(https://arxiv.org/abs/2502.15012)
Keywords: secure, privacy, protect, attack, steal
Abstract: Wide deployment of machine learning models on edge devices has rendered the model intellectual property (IP) and data privacy vulnerable. We propose GNNVault, the first secure Graph Neural Network (GNN) deployment strategy based on a Trusted Execution Environment (TEE). GNNVault follows the design of 'partition-before-training' and includes a private GNN rectifier to complement a public backbone model. This way, both critical GNN model parameters and the private graph used during inference are protected within secure TEE compartments. Real-world implementations with Intel SGX demonstrate that GNNVault safeguards GNN inference against state-of-the-art link stealing attacks with negligible accuracy degradation (<2%).

Title: Accelerating Neural Network Training: An Analysis of the AlgoPerf Competition

Authors: Priya Kasimbeg, Frank Schneider, Runa Eschenhagen, Juhan Bae, Chandramouli Shama Sastry, Mark Saroufim, Boyuan Feng, Less Wright, Edward Z. Yang, Zachary Nado, Sourabh Medapati, Philipp Hennig, Michael Rabbat, George E. Dahl
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2502.15015
Pdf URL: https://arxiv.org/pdf/2502.15015
Copy Paste: [[2502.15015]] Accelerating Neural Network Training: An Analysis of the AlgoPerf Competition(https://arxiv.org/abs/2502.15015)
Keywords: robust, fair
Abstract: The goal of the AlgoPerf: Training Algorithms competition is to evaluate practical speed-ups in neural network training achieved solely by improving the underlying training algorithms. In the external tuning ruleset, submissions must provide workload-agnostic hyperparameter search spaces, while in the self-tuning ruleset they must be completely hyperparameter-free. In both rulesets, submissions are compared on time-to-result across multiple deep learning workloads, training on fixed hardware. This paper presents the inaugural AlgoPerf competition's results, which drew 18 diverse submissions from 10 teams. Our investigation reveals several key findings: (1) The winning submission in the external tuning ruleset, using Distributed Shampoo, demonstrates the effectiveness of non-diagonal preconditioning over popular methods like Adam, even when compared on wall-clock runtime. (2) The winning submission in the self-tuning ruleset, based on the Schedule Free AdamW algorithm, demonstrates a new level of effectiveness for completely hyperparameter-free training algorithms. (3) The top-scoring submissions were surprisingly robust to workload changes. We also discuss the engineering challenges encountered in ensuring a fair comparison between different training algorithms. These results highlight both the significant progress so far, and the considerable room for further improvements.

Title: TimeDistill: Efficient Long-Term Time Series Forecasting with MLP via Cross-Architecture Distillation

Authors: Juntong Ni, Zewen Liu, Shiyu Wang, Ming Jin, Wei Jin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.15016
Pdf URL: https://arxiv.org/pdf/2502.15016
Copy Paste: [[2502.15016]] TimeDistill: Efficient Long-Term Time Series Forecasting with MLP via Cross-Architecture Distillation(https://arxiv.org/abs/2502.15016)
Keywords: transformer
Abstract: Transformer-based and CNN-based methods demonstrate strong performance in long-term time series forecasting. However, their high computational and storage requirements can hinder large-scale deployment. To address this limitation, we propose integrating lightweight MLP with advanced architectures using knowledge distillation (KD). Our preliminary study reveals different models can capture complementary patterns, particularly multi-scale and multi-period patterns in the temporal and frequency domains. Based on this observation, we introduce TimeDistill, a cross-architecture KD framework that transfers these patterns from teacher models (e.g., Transformers, CNNs) to MLP. Additionally, we provide a theoretical analysis, demonstrating that our KD approach can be interpreted as a specialized form of mixup data augmentation. TimeDistill improves MLP performance by up to 18.6%, surpassing teacher models on eight datasets. It also achieves up to 7X faster inference and requires 130X fewer parameters. Furthermore, we conduct extensive evaluations to highlight the versatility and effectiveness of TimeDistill.

Title: Interpreting Adversarial Attacks and Defences using Architectures with Enhanced Interpretability

Authors: Akshay G Rao, Chandrashekhar Lakshminarayanan, Arun Rajkumar
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2502.15017
Pdf URL: https://arxiv.org/pdf/2502.15017
Copy Paste: [[2502.15017]] Interpreting Adversarial Attacks and Defences using Architectures with Enhanced Interpretability(https://arxiv.org/abs/2502.15017)
Keywords: attack, robust, interpretability
Abstract: Adversarial attacks in deep learning represent a significant threat to the integrity and reliability of machine learning models. Adversarial training has been a popular defence technique against these adversarial attacks. In this work, we capitalize on a network architecture, namely Deep Linearly Gated Networks (DLGN), which has better interpretation capabilities than regular deep network architectures. Using this architecture, we interpret robust models trained using PGD adversarial training and compare them with standard training. Feature networks in DLGN act as feature extractors, making them the only medium through which an adversary can attack the model. We analyze the feature network of DLGN with fully connected layers with respect to properties like alignment of the hyperplanes, hyperplane relation with PCA, and sub-network overlap among classes and compare these properties between robust and standard models. We also consider this architecture having CNN layers wherein we qualitatively (using visualizations) and quantitatively contrast gating patterns between robust and standard models. We uncover insights into hyperplanes resembling principal components in PGD-AT and STD-TR models, with PGD-AT hyperplanes aligned farther from the data points. We use path activity analysis to show that PGD-AT models create diverse, non-overlapping active subnetworks across classes, preventing attack-induced gating overlaps. Our visualization ideas show the nature of representations learnt by PGD-AT and STD-TR models.

Title: Using tournaments to calculate AUROC for zero-shot classification with LLMs

Authors: Wonjin Yoon, Ian Bulovic, Timothy A. Miller
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15018
Pdf URL: https://arxiv.org/pdf/2502.15018
Copy Paste: [[2502.15018]] Using tournaments to calculate AUROC for zero-shot classification with LLMs(https://arxiv.org/abs/2502.15018)
Keywords: fair, large language model
Abstract: Large language models perform surprisingly well on many zero-shot classification tasks, but are difficult to fairly compare to supervised classifiers due to the lack of a modifiable decision boundary. In this work, we propose and evaluate a method that converts binary classification tasks into pairwise comparison tasks, obtaining relative rankings from LLMs. Repeated pairwise comparisons can be used to score instances using the Elo rating system (used in chess and other competitions), inducing a confidence ordering over instances in a dataset. We evaluate scheduling algorithms for their ability to minimize comparisons, and show that our proposed algorithm leads to improved classification performance, while also providing more information than traditional zero-shot classification.
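
The idea in the abstract above, scoring instances with Elo from pairwise comparisons and then computing AUROC over the resulting ordering, can be sketched as follows. This is a minimal illustration under stated assumptions: the `a_beats_b` callback stands in for an LLM pairwise judgment, the fixed round-robin schedule is a placeholder (the paper evaluates scheduling algorithms that are not described here), and `K = 32` with a starting rating of 1000 are conventional Elo defaults rather than values from the source.

```python
import itertools


def expected_score(r_a, r_b):
    """Standard Elo expected score of item A against item B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a, r_b, outcome_a, k=32.0):
    """Return updated (r_a, r_b); outcome_a is 1.0 if A wins, 0.0 if B wins."""
    delta = k * (outcome_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


def run_tournament(n_items, a_beats_b, rounds=5, start=1000.0):
    """Round-robin schedule (an assumption, not the paper's scheduler)."""
    ratings = [start] * n_items
    for _ in range(rounds):
        for i, j in itertools.combinations(range(n_items), 2):
            outcome = 1.0 if a_beats_b(i, j) else 0.0
            ratings[i], ratings[j] = elo_update(ratings[i], ratings[j], outcome)
    return ratings


def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: `hidden` plays the role of the judge's latent preference.
labels = [1, 0, 1, 0, 0, 1]
hidden = [0.9, 0.2, 0.7, 0.4, 0.1, 0.6]
ratings = run_tournament(len(labels), lambda i, j: hidden[i] > hidden[j])
print(auroc(ratings, labels))
```

Because the Elo ratings give a continuous score per instance, any threshold can be applied afterwards, which is what makes a ROC curve computable for a model that otherwise only emits hard labels.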

Title: MACPruning: Dynamic Operation Pruning to Mitigate Side-Channel DNN Model Extraction

Authors: Ruyi Ding, Cheng Gongye, Davis Ranney, Aidong Adam Ding, Yunsi Fei
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2502.15020
Pdf URL: https://arxiv.org/pdf/2502.15020
Copy Paste: [[2502.15020]] MACPruning: Dynamic Operation Pruning to Mitigate Side-Channel DNN Model Extraction(https://arxiv.org/abs/2502.15020)
Keywords: security, defense, attack, robust, extraction
Abstract: As deep learning gains popularity, edge IoT devices have seen proliferating deployment of pre-trained Deep Neural Network (DNN) models. These DNNs represent valuable intellectual property and face significant confidentiality threats from side-channel analysis (SCA), particularly non-invasive Differential Electromagnetic (EM) Analysis (DEMA), which retrieves individual model parameters from EM traces collected during model inference. Traditional SCA mitigation methods, such as masking and shuffling, can still be applied to DNN inference, but will incur significant performance degradation due to the large volume of operations and parameters. Based on the insight that DNN models have high redundancy and are robust to input variation, we introduce MACPruning, a novel lightweight defense against DEMA-based parameter extraction attacks, exploiting specific characteristics of DNN execution. The design principle of MACPruning is to randomly deactivate input pixels and prune the operations (typically multiply-accumulate, or MAC) on those pixels. The technique removes certain leakages and overall redistributes weight-dependent EM leakages temporally, and thus effectively mitigates DEMA. To maintain DNN performance, we propose an importance-aware pixel map that preserves critical input pixels, keeping randomness in the defense while minimizing its impact on DNN performance due to operation pruning. We conduct a comprehensive security analysis of MACPruning on various datasets for DNNs on edge devices. Our evaluations demonstrate that MACPruning effectively reduces EM leakages with minimal impact on the model accuracy and negligible computational overhead.

Title: Simpler Fast Vision Transformers with a Jumbo CLS Token

Authors: Anthony Fuller, Yousef Yassin, Daniel G. Kyrollos, Evan Shelhamer, James R. Green
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.15021
Pdf URL: https://arxiv.org/pdf/2502.15021
Copy Paste: [[2502.15021]] Simpler Fast Vision Transformers with a Jumbo CLS Token(https://arxiv.org/abs/2502.15021)
Keywords: transformer
Abstract: We introduce a simple enhancement to the global processing of vision transformers (ViTs) to improve accuracy while maintaining throughput. Our approach, Jumbo, creates a wider CLS token, which is split to match the patch token width before attention, processed with self-attention, and reassembled. After attention, Jumbo applies a dedicated, wider FFN to this token. Jumbo significantly improves over ViT+Registers on ImageNet-1K at high speeds (by 3.2% for ViT-tiny and 13.5% for ViT-nano); these Jumbo models even outperform specialized compute-efficient models while preserving the architectural advantages of plain ViTs. Although Jumbo sees no gains for ViT-small on ImageNet-1K, it gains 3.4% on ImageNet-21K over ViT+Registers. Both findings indicate that Jumbo is most helpful when the ViT is otherwise too narrow for the task. Finally, we show that Jumbo can be easily adapted to excel on data beyond images, e.g., time series.
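
The split-and-reassemble step described in the abstract above can be sketched as shape bookkeeping on plain Python lists. This is only an illustration of how a wider CLS token could be cut into patch-width pieces and joined back; the function names are hypothetical, attention is replaced by an identity placeholder, and the dedicated wider FFN is not implemented.

```python
def split_jumbo(jumbo_cls, patch_width):
    """Cut a wide CLS vector into pieces that match the patch token width."""
    if len(jumbo_cls) % patch_width != 0:
        raise ValueError("jumbo width must be a multiple of the patch width")
    return [jumbo_cls[i:i + patch_width]
            for i in range(0, len(jumbo_cls), patch_width)]


def reassemble_jumbo(pieces):
    """Concatenate the pieces back into one wide vector."""
    return [x for piece in pieces for x in piece]


def build_attention_input(jumbo_cls, patches):
    """Token sequence seen by self-attention: CLS pieces, then patch tokens."""
    patch_width = len(patches[0])
    return split_jumbo(jumbo_cls, patch_width) + patches


# Toy sizes (assumptions): patch width 4, jumbo CLS three times as wide.
jumbo = [float(i) for i in range(12)]
patches = [[0.0] * 4, [1.0] * 4]
tokens = build_attention_input(jumbo, patches)

# Placeholder for self-attention: identity, so only the shapes are shown.
attended = tokens
n_pieces = len(jumbo) // len(patches[0])
wide_cls = reassemble_jumbo(attended[:n_pieces])  # input to the wider FFN
print(len(tokens), len(wide_cls))
```

Since every piece has the same width as a patch token, the attention layer itself needs no change; only the per-token FFN for the reassembled CLS vector operates at the larger width.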

Title: A Meta-Evaluation of Style and Attribute Transfer Metrics

Authors: Amalie Brogaard Pauli, Isabelle Augenstein, Ira Assent
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15022
Pdf URL: https://arxiv.org/pdf/2502.15022
Copy Paste: [[2502.15022]] A Meta-Evaluation of Style and Attribute Transfer Metrics(https://arxiv.org/abs/2502.15022)
Keywords: fair
Abstract: LLMs make it easy to rewrite text in any style, be it more polite, persuasive, or more positive. We present a large-scale study of evaluation metrics for style and attribute transfer with a focus on content preservation, meaning that content not attributed to the style shift is preserved. The de facto evaluation approach uses lexical or semantic similarity metrics, often between source sentences and rewrites. While these metrics are not designed to distinguish between style or content differences, empirical meta-evaluation shows a reasonable correlation to human judgment. In fact, recent works find that LLMs prompted as evaluators are only comparable to semantic similarity metrics, even though intuitively, the LLM approach should better fit the task. To investigate this discrepancy, we benchmark 8 metrics for evaluating content preservation on existing datasets and additionally construct a new test set that better aligns with the meta-evaluation aim. Indeed, we then find that the empirical conclusion aligns with the intuition: content preservation metrics for style/attribute transfer must be conditional on the style shift. To support this, we propose a new efficient zero-shot evaluation method using the likelihood of the next token. We hope our meta-evaluation can foster more research on evaluating content preservation metrics and help ensure fair evaluation of methods for conducting style transfer.

Title: GeoAggregator: An Efficient Transformer Model for Geo-Spatial Tabular Data

Authors: Rui Deng, Ziqi Li, Mingshu Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.15032
Pdf URL: https://arxiv.org/pdf/2502.15032
Copy Paste: [[2502.15032]] GeoAggregator: An Efficient Transformer Model for Geo-Spatial Tabular Data(https://arxiv.org/abs/2502.15032)
Keywords: transformer
Abstract: Modeling geospatial tabular data with deep learning has become a promising alternative to traditional statistical and machine learning approaches. However, existing deep learning models often face challenges related to scalability and flexibility as datasets grow. To this end, this paper introduces GeoAggregator, an efficient and lightweight algorithm based on transformer architecture designed specifically for geospatial tabular data modeling. GeoAggregators explicitly account for spatial autocorrelation and spatial heterogeneity through Gaussian-biased local attention and global positional awareness. Additionally, we introduce a new attention mechanism that uses the Cartesian product to manage the size of the model while maintaining strong expressive power. We benchmark GeoAggregator against spatial statistical models, XGBoost, and several state-of-the-art geospatial deep learning methods using both synthetic and empirical geospatial datasets. The results demonstrate that GeoAggregators achieve the best or second-best performance compared to their competitors on nearly all datasets. GeoAggregator's efficiency is underscored by its reduced model size, making it both scalable and lightweight. Moreover, ablation experiments offer insights into the effectiveness of the Gaussian bias and Cartesian attention mechanism, providing recommendations for further optimizing the GeoAggregator's performance.

Title: Reducing Hallucinations of Medical Multimodal Large Language Models with Visual Retrieval-Augmented Generation

Authors: Yun-Wei Chu, Kai Zhang, Christopher Malon, Martin Renqiang Min
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15040
Pdf URL: https://arxiv.org/pdf/2502.15040
Copy Paste: [[2502.15040]] Reducing Hallucinations of Medical Multimodal Large Language Models with Visual Retrieval-Augmented Generation(https://arxiv.org/abs/2502.15040)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) have shown impressive performance in vision and text tasks. However, hallucination remains a major challenge, especially in fields like healthcare where details are critical. In this work, we show how MLLMs may be enhanced to support Visual RAG (V-RAG), a retrieval-augmented generation framework that incorporates both text and visual data from retrieved images. On the MIMIC-CXR chest X-ray report generation and Multicare medical image caption generation datasets, we show that Visual RAG improves the accuracy of entity probing, which asks whether a medical entity is grounded by an image. We show that the improvements extend both to frequent and rare entities, the latter of which may have less positive training data. Downstream, we apply V-RAG with entity probing to correct hallucinations and generate more clinically accurate X-ray reports, obtaining a higher RadGraph-F1 score.

Title: Benchmarking Android Malware Detection: Rethinking the Role of Traditional and Deep Learning Models

Authors: Guojun Liu, Doina Caragea, Xinming Ou, Sankardas Roy
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2502.15041
Pdf URL: https://arxiv.org/pdf/2502.15041
Copy Paste: [[2502.15041]] Benchmarking Android Malware Detection: Rethinking the Role of Traditional and Deep Learning Models(https://arxiv.org/abs/2502.15041)
Keywords: robust
Abstract: Android malware detection has been extensively studied using both traditional machine learning (ML) and deep learning (DL) approaches. While many state-of-the-art detection models, particularly those based on DL, claim superior performance, they often rely on limited comparisons, lacking comprehensive benchmarking against traditional ML models across diverse datasets. This raises concerns about the robustness of DL-based approaches' performance and the potential oversight of simpler, more efficient ML models. In this paper, we conduct a systematic evaluation of Android malware detection models across four datasets: three recently published, publicly available datasets and a large-scale dataset we systematically collected. We implement a range of traditional ML models, including Random Forests (RF) and CatBoost, alongside advanced DL models such as Capsule Graph Neural Networks (CapsGNN), BERT-based models, and ExcelFormer-based models. Our results reveal that while advanced DL models can achieve strong performance, they are often compared against an insufficient number of traditional ML baselines. In many cases, simpler and more computationally efficient ML models achieve comparable or even superior performance. These findings highlight the need for rigorous benchmarking in Android malware detection research. We encourage future studies to conduct more comprehensive benchmarking comparisons between traditional and advanced models to ensure a more accurate assessment of detection capabilities. To facilitate further research, we provide access to our dataset, including app IDs, hash values, and labels.

Title: Rare Disease Differential Diagnosis with Large Language Models at Scale: From Abdominal Actinomycosis to Wilson's Disease

Authors: Elliot Schumacher, Dhruv Naik, Anitha Kannan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15069
Pdf URL: https://arxiv.org/pdf/2502.15069
Copy Paste: [[2502.15069]] Rare Disease Differential Diagnosis with Large Language Models at Scale: From Abdominal Actinomycosis to Wilson's Disease(https://arxiv.org/abs/2502.15069)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated impressive capabilities in disease diagnosis. However, their effectiveness in identifying rarer diseases, which are inherently more challenging to diagnose, remains an open question. Rare disease performance is critical with the increasing use of LLMs in healthcare settings. This is especially true if a primary care physician needs to make a rarer prognosis from only a patient conversation so that they can take the appropriate next step. To that end, several clinical decision support systems are designed to support providers in rare disease identification. Yet their utility is limited due to their lack of knowledge of common disorders and difficulty of use. In this paper, we propose RareScale to combine the knowledge of LLMs with expert systems. We jointly use an expert system and an LLM to simulate rare disease chats. This data is used to train a rare disease candidate predictor model. Candidates from this smaller model are then used as additional inputs to a black-box LLM to make the final differential diagnosis. Thus, RareScale allows for a balance between rare and common diagnoses. We present results on over 575 rare diseases, beginning with Abdominal Actinomycosis and ending with Wilson's Disease. Our approach significantly improves the baseline performance of black-box LLMs by over 17% in Top-5 accuracy. We also find that our candidate generation performance is high (e.g. 88.8% on gpt-4o generated chats).

Title: Visualizing Machine Learning Models for Enhanced Financial Decision-Making and Risk Management

Authors: Priyam Ganguly, Ramakrishna Garine, Isha Mukherjee
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.15073
Pdf URL: https://arxiv.org/pdf/2502.15073
Copy Paste: [[2502.15073]] Visualizing Machine Learning Models for Enhanced Financial Decision-Making and Risk Management(https://arxiv.org/abs/2502.15073)
Keywords: extraction, interpretability
Abstract: This study emphasizes how crucial it is to visualize machine learning models, especially for the banking industry, in order to improve interpretability and support predictions in high-stakes financial settings. Visual tools enable performance improvements and support the creation of innovative financial models by offering crucial insights into the algorithmic decision-making processes. Within a financial machine learning framework, the research uses visually guided experiments to make important concepts, such as risk assessment and portfolio allocation, more understandable. The study also examines variations in trading tactics and how they relate to risk appetite, coming to the conclusion that the frequency of portfolio rebalancing is negatively correlated with risk tolerance. Finding these ideas is made possible in large part by visualization. The study concludes by presenting a novel method of locally stochastic asset weighing, where visualization facilitates data extraction and validation. This highlights the usefulness of these methods in furthering the field of financial machine learning research.

Title: More for Keys, Less for Values: Adaptive KV Cache Quantization

Authors: Mohsen Hariri, Lam Nguyen, Sixu Chen, Shaochen Zhong, Qifan Wang, Xia Hu, Xiaotian Han, Vipin Chaudhary
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.15075
Pdf URL: https://arxiv.org/pdf/2502.15075
Copy Paste: [[2502.15075]] More for Keys, Less for Values: Adaptive KV Cache Quantization(https://arxiv.org/abs/2502.15075)
Keywords: transformer, large language model
Abstract: This paper introduces an information-aware quantization framework that adaptively compresses the key-value (KV) cache in large language models (LLMs). Although prior work has underscored the distinct roles of key and value cache during inference, our systematic analysis -- examining singular value distributions, spectral norms, and Frobenius norms -- reveals, for the first time, that key matrices consistently exhibit higher norm values and are more sensitive to quantization than value matrices. Furthermore, our theoretical analysis shows that matrices with higher spectral norms amplify quantization errors more significantly. Motivated by these insights, we propose a mixed-precision quantization strategy, KV-AdaQuant, which allocates more bit-width for keys and fewer for values since key matrices have higher norm values. With the same total KV bit budget, this approach effectively mitigates error propagation across transformer layers while achieving significant memory savings. Our extensive experiments on multiple LLMs (1B--70B) demonstrate that our mixed-precision quantization scheme maintains high model accuracy even under aggressive compression. For instance, using 4-bit for Key and 2-bit for Value achieves an accuracy of 75.2%, whereas reversing the assignment (2-bit for Key and 4-bit for Value) yields only 54.7% accuracy. The code is available at this https URL
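
The bit-allocation idea in this abstract can be illustrated with a toy sketch. It is not the KV-AdaQuant implementation: the uniform min-max quantizer, the helper names (`quantize_dequantize`, `kv_error`), and the synthetic data in which keys are given a 4x larger spread than values are all assumptions made for illustration. It only shows why, under a fixed total bit budget, spending more bits on the larger-magnitude tensor can lower the overall reconstruction error.

```python
import random


def quantize_dequantize(values, bits):
    """Uniform min-max quantization to 2**bits levels, then map back to floats."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)  # constant input needs no quantization
    step = (hi - lo) / ((1 << bits) - 1)
    return [lo + round((v - lo) / step) * step for v in values]


def mean_abs_error(original, restored):
    """Average absolute difference between two equally long sequences."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


def kv_error(keys, values, key_bits, value_bits):
    """Summed mean reconstruction error of keys and values under a bit split."""
    key_err = mean_abs_error(keys, quantize_dequantize(keys, key_bits))
    value_err = mean_abs_error(values, quantize_dequantize(values, value_bits))
    return key_err + value_err


if __name__ == "__main__":
    random.seed(0)
    # Assumption: synthetic keys are drawn with 4x the spread of values.
    keys = [random.gauss(0.0, 4.0) for _ in range(512)]
    values = [random.gauss(0.0, 1.0) for _ in range(512)]
    print("K4/V2 error:", kv_error(keys, values, 4, 2))
    print("K2/V4 error:", kv_error(keys, values, 2, 4))
```

Both configurations spend six bits per key-value pair; only the split differs. Real KV caches would be quantized per group or per channel on tensors, which this scalar-list sketch deliberately omits.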

Title: Hardware-Friendly Static Quantization Method for Video Diffusion Transformers

Authors: Sanghyun Yi, Qingfeng Liu, Mostafa El-Khamy
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15077
Pdf URL: https://arxiv.org/pdf/2502.15077
Copy Paste: [[2502.15077]] Hardware-Friendly Static Quantization Method for Video Diffusion Transformers(https://arxiv.org/abs/2502.15077)
Keywords: diffusion, transformer, generative
Abstract: Diffusion Transformers for video generation have gained significant research interest since the impressive performance of SORA. Efficient deployment of such generative-AI models on GPUs has been demonstrated with dynamic quantization. However, resource-constrained devices cannot support dynamic quantization, and need static quantization of the models for their efficient deployment on AI processors. In this paper, we propose a novel method for the post-training quantization of OpenSora, a Video Diffusion Transformer, without relying on dynamic quantization techniques. Our approach employs static quantization, achieving video quality comparable to FP16 and dynamically quantized ViDiT-Q methods, as measured by CLIP and VQA metrics. In particular, we utilize per-step calibration data to adequately provide a post-training statically quantized model for each time step, incorporating channel-wise quantization for weights and tensor-wise quantization for activations. By further applying the smooth-quantization technique, we can obtain high-quality video outputs with the statically quantized models. Extensive experimental results demonstrate that static quantization can be a viable alternative to dynamic quantization for video diffusion transformers, offering a more efficient approach without sacrificing performance.
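
The combination named in this abstract (channel-wise weight scales, one tensor-wise activation scale per time step fixed from calibration data) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: symmetric integer quantization, max-abs calibration, and every function name below are hypothetical choices, and the smooth-quantization step is left out.

```python
def symmetric_qmax(bits):
    """Largest integer magnitude for symmetric quantization, e.g. 127 for 8 bits."""
    return (1 << (bits - 1)) - 1


def scale_for(max_abs, bits):
    """Scale that maps max_abs onto qmax; falls back to 1.0 for all-zero data."""
    return max_abs / symmetric_qmax(bits) if max_abs > 0 else 1.0


def quantize(values, scale, bits=8):
    """Round onto the integer grid and clamp to the representable range."""
    qmax = symmetric_qmax(bits)
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def quantize_weight_per_channel(weight, bits=8):
    """Channel-wise: each output channel (row) gets its own scale."""
    scales = [scale_for(max(abs(w) for w in row), bits) for row in weight]
    return [quantize(row, s, bits) for row, s in zip(weight, scales)], scales


def calibrate_static_scales(calibration, bits=8):
    """Tensor-wise and static: one fixed activation scale per time step.

    `calibration` maps a time step to a list of activation samples
    (each sample a list of floats) collected offline.
    """
    return {
        step: scale_for(max(abs(a) for sample in samples for a in sample), bits)
        for step, samples in calibration.items()
    }
```

At inference time the stored scale for the current step is looked up rather than recomputed, so an activation outside the calibrated range is simply clamped, e.g. `quantize(x, scales[step])`. That lookup is what distinguishes static from dynamic quantization in this sketch.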

Title: UPCORE: Utility-Preserving Coreset Selection for Balanced Unlearning

Authors: Vaidehi Patil, Elias Stengel-Eskin, Mohit Bansal
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2502.15082
Pdf URL: https://arxiv.org/pdf/2502.15082
Copy Paste: [[2502.15082]] UPCORE: Utility-Preserving Coreset Selection for Balanced Unlearning(https://arxiv.org/abs/2502.15082)
Keywords: large language model
Abstract: User specifications or legal frameworks often require information to be removed from pretrained models, including large language models (LLMs). This requires deleting or "forgetting" a set of data points from an already-trained model, which typically degrades its performance on other data points. Thus, a balance must be struck between removing information and keeping the model's other abilities intact, with a failure to balance this trade-off leading to poor deletion or an unusable model. To this end, we propose UPCORE (Utility-Preserving Coreset Selection), a method-agnostic data selection framework for mitigating collateral damage during unlearning. Finding that the model damage is correlated with the variance of the model's representations on the forget set, we selectively prune the forget set to remove outliers, thereby minimizing model degradation after unlearning. We evaluate UPCORE across three standard unlearning methods, consistently achieving a superior balance between the competing objectives of deletion efficacy and model preservation. To better evaluate this trade-off, we introduce a new metric, measuring the area-under-the-curve (AUC) across standard metrics. We find that UPCORE improves both standard metrics and AUC, benefitting from positive transfer between the coreset and pruned points while reducing negative transfer from the forget set to points outside of it.

Title: Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models

Authors: Yeonjun In, Wonjoong Kim, Kanghoon Yoon, Sungchul Kim, Mehrab Tanjim, Kibum Kim, Chanyoung Park
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15086
Pdf URL: https://arxiv.org/pdf/2502.15086
Copy Paste: [[2502.15086]] Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models(https://arxiv.org/abs/2502.15086)
Keywords: large language model
Abstract: As the use of large language model (LLM) agents continues to grow, their safety vulnerabilities have become increasingly evident. Extensive benchmarks evaluate various aspects of LLM safety but define safety largely in terms of general standards, overlooking user-specific standards. However, safety standards for LLMs may vary with user-specific profiles rather than being universally consistent across all users. This raises a critical research question: Do LLM agents act safely when considering user-specific safety standards? Despite its importance for safe LLM use, no benchmark datasets currently exist to evaluate the user-specific safety of LLMs. To address this gap, we introduce U-SAFEBENCH, the first benchmark designed to assess the user-specific aspect of LLM safety. Our evaluation of 18 widely used LLMs reveals that current LLMs fail to act safely when considering user-specific safety standards, marking a new discovery in this field. To address this vulnerability, we propose a simple remedy based on chain-of-thought, demonstrating its effectiveness in improving user-specific safety. Our benchmark and code are available at this https URL.

Title: Analyze the Neurons, not the Embeddings: Understanding When and Where LLM Representations Align with Humans

Authors: Masha Fedzechkina, Eleonora Gualdoni, Sinead Williamson, Katherine Metcalf, Skyler Seto, Barry-John Theobald
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15090
Pdf URL: https://arxiv.org/pdf/2502.15090
Copy Paste: [[2502.15090]] Analyze the Neurons, not the Embeddings: Understanding When and Where LLM Representations Align with Humans(https://arxiv.org/abs/2502.15090)
Keywords: large language model
Abstract: Modern large language models (LLMs) achieve impressive performance on some tasks, while exhibiting distinctly non-human-like behaviors on others. This raises the question of how well the LLM's learned representations align with human representations. In this work, we introduce a novel approach to the study of representation alignment: we adopt a method from research on activation steering to identify neurons responsible for specific concepts (e.g., 'cat') and then analyze the corresponding activation patterns. Our findings reveal that LLM representations closely align with human representations inferred from behavioral data. Notably, this alignment surpasses that of word embeddings, which have been center stage in prior work on human and model alignment. Additionally, our approach enables a more granular view of how LLMs represent concepts. Specifically, we show that LLMs organize concepts in a way that reflects hierarchical relationships interpretable to humans (e.g., 'animal'-'dog').

Title: Optimizing Singular Spectrum for Large Language Model Compression

Authors: Dengjie Li, Tiancheng Shen, Yao Zhou, Baisong Yang, Zhongying Liu, Masheng Yang, Bernard Ghanem, Yibo Yang, Yujie Zhong, Ming-Hsuan Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15092
Pdf URL: https://arxiv.org/pdf/2502.15092
Copy Paste: [[2502.15092]] Optimizing Singular Spectrum for Large Language Model Compression(https://arxiv.org/abs/2502.15092)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities, yet prohibitive parameter complexity often hinders their deployment. Existing singular value decomposition (SVD) based compression methods simply deem singular values as importance scores of decomposed components. However, this importance ordered by singular values does not necessarily correlate with the performance of a downstream task. In this work, we introduce SoCo (Singular spectrum optimization for large language model Compression), a novel compression framework that learns to rescale the decomposed components of SVD in a data-driven manner. Concretely, we employ a learnable diagonal matrix to assign importance scores for singular spectrum and develop a three-stage training process that progressively refines these scores from initial coarse compression to fine-grained sparsification, thereby striking an effective balance between aggressive model compression and performance preservation. Thanks to the learnable singular spectrum, SoCo adaptively prunes components according to the sparsified importance scores, rather than relying on the fixed order of singular values. More importantly, the remaining components with amplified importance scores can compensate for the loss of the pruned ones. Experimental evaluations across multiple LLMs and benchmarks demonstrate that SoCo surpasses the state-of-the-art methods in model compression.

Title: Judging It, Washing It: Scoring and Greenwashing Corporate Climate Disclosures using Large Language Models

Authors: Marianne Chuang, Gabriel Chuang, Cheryl Chuang, John Chuang
Subjects: cs.CL, stat.AP
Abstract URL: https://arxiv.org/abs/2502.15094
Pdf URL: https://arxiv.org/pdf/2502.15094
Copy Paste: [[2502.15094]] Judging It, Washing It: Scoring and Greenwashing Corporate Climate Disclosures using Large Language Models(https://arxiv.org/abs/2502.15094)
Keywords: robust, large language model
Abstract: We study the use of large language models (LLMs) to both evaluate and greenwash corporate climate disclosures. First, we investigate the use of the LLM-as-a-Judge (LLMJ) methodology for scoring company-submitted reports on emissions reduction targets and progress. Second, we probe the behavior of an LLM when it is prompted to greenwash a response subject to accuracy and length constraints. Finally, we test the robustness of the LLMJ methodology against responses that may be greenwashed using an LLM. We find that two LLMJ scoring systems, numerical rating and pairwise comparison, are effective in distinguishing high-performing companies from others, with the pairwise comparison system showing greater robustness against LLM-greenwashed responses.

Title: LUME: LLM Unlearning with Multitask Evaluations

Authors: Anil Ramakrishna, Yixin Wan, Xiaomeng Jin, Kai-Wei Chang, Zhiqi Bu, Bhanukiran Vinzamuri, Volkan Cevher, Mingyi Hong, Rahul Gupta
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.15097
Pdf URL: https://arxiv.org/pdf/2502.15097
Copy Paste: [[2502.15097]] LUME: LLM Unlearning with Multitask Evaluations(https://arxiv.org/abs/2502.15097)
Keywords: large language model
Abstract: Unlearning aims to remove copyrighted, sensitive, or private content from large language models (LLMs) without a full retraining. In this work, we develop a multi-task unlearning benchmark (LUME) which features three tasks: (1) unlearn synthetically generated creative short novels, (2) unlearn synthetic biographies with sensitive information, and (3) unlearn a collection of public biographies. We further release two fine-tuned LLMs of 1B and 7B parameter sizes as the target models. We conduct detailed evaluations of several recently proposed unlearning algorithms and present results on carefully crafted metrics to understand their behavior and limitations.

Title: Leveraging ChatGPT for Sponsored Ad Detection and Keyword Extraction in YouTube Videos

Authors: Brice Valentin Kok-Shun, Johnny Chan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.15102
Pdf URL: https://arxiv.org/pdf/2502.15102
Copy Paste: [[2502.15102]] Leveraging ChatGPT for Sponsored Ad Detection and Keyword Extraction in YouTube Videos(https://arxiv.org/abs/2502.15102)
Keywords: extraction
Abstract: This work-in-progress paper presents a novel approach to detecting sponsored advertisement segments in YouTube videos and comparing the advertisement with the main content. Our methodology involves the collection of 421 auto-generated and manual transcripts which are then fed into a prompt-engineered GPT-4o for ad detection, a KeyBERT for keyword extraction, and another iteration of ChatGPT for category identification. The results revealed a significant prevalence of product-related ads across various educational topics, with categories refined using GPT-4o into 9 succinct content categories and 4 advertisement categories. This approach provides a scalable and efficient alternative to traditional ad detection methods while offering new insights into the types and relevance of ads embedded within educational content. This study highlights the potential of LLMs in transforming ad detection processes and improving our understanding of advertisement strategies in digital media.

Title: Assessing a Single Student's Concentration on Learning Platforms: A Machine Learning-Enhanced EEG-Based Framework

Authors: Zewen Zhuo, Mohamad Najafi, Hazem Zein, Amine Nait-Ali
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2502.15107
Pdf URL: https://arxiv.org/pdf/2502.15107
Copy Paste: [[2502.15107]] Assessing a Single Student's Concentration on Learning Platforms: A Machine Learning-Enhanced EEG-Based Framework(https://arxiv.org/abs/2502.15107)
Keywords: extraction
Abstract: This study introduces a specialized pipeline designed to classify the concentration state of an individual student during online learning sessions by training a custom-tailored machine learning model. Detailed protocols for acquiring and preprocessing EEG data are outlined, along with the extraction of fifty statistical features from five EEG signal bands: alpha, beta, theta, delta, and gamma. Following feature extraction, a thorough feature selection process was conducted to optimize the data inputs for a personalized analysis. The study also explores the benefits of hyperparameter fine-tuning to enhance the classification accuracy of the student's concentration state. EEG signals were captured from the student using a Muse headband (Gen 2), equipped with five electrodes (TP9, AF7, AF8, TP10, and a reference electrode NZ), during engagement with educational content on computer-based e-learning platforms. Employing a random forest model customized to the student's data, we achieved remarkable classification performance, with test accuracies of 97.6% in the computer-based learning setting and 98% in the virtual reality setting. These results underscore the effectiveness of our approach in delivering personalized insights into student concentration during online educational activities.

Title: Unveiling Reasoning Thresholds in Language Models: Scaling, Fine-Tuning, and Interpretability through Attention Maps

Authors: Yen-Che Hsiao, Abhishek Dutta
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15120
Pdf URL: https://arxiv.org/pdf/2502.15120
Copy Paste: [[2502.15120]] Unveiling Reasoning Thresholds in Language Models: Scaling, Fine-Tuning, and Interpretability through Attention Maps(https://arxiv.org/abs/2502.15120)
Keywords: interpretability, transformer
Abstract: This study investigates the in-context learning capabilities of various decoder-only transformer-based language models with different model sizes and training data, including GPT2, SmolLM2, OpenELM, TinyLlama, Stable LM, and Gemma 2. We identify a critical parameter threshold (~1.6 billion), beyond which reasoning performance improves significantly in tasks such as commonsense reasoning in multiple-choice question answering and deductive reasoning. Specifically, models above this threshold achieve better success rates in chain-of-thought (CoT) prompting for deductive reasoning tasks, especially those requiring longer reasoning chains, such as proof by contradiction and disjunction elimination. To address limitations in sub-threshold models, we demonstrate that fine-tuning with task-specific exemplars substantially enhances reasoning performance, enabling accurate CoT generation even without additional exemplars in the prompt for tasks with shorter reasoning chains. Finally, our analysis of attention maps reveals that models capable of generating correct CoTs exhibit higher token-level attention scores on subsequent correct tokens and the correct parts of speech, providing interpretability insights into reasoning processes. These findings collectively advance understanding of reasoning capabilities in decoder-only transformer-based models. The code is available at: this https URL.

Title: DAM-Seg: Anatomically accurate cardiac segmentation using Dense Associative Networks

Authors: Zahid Ullah, Jihie Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.15128
Pdf URL: https://arxiv.org/pdf/2502.15128
Copy Paste: [[2502.15128]] DAM-Seg: Anatomically accurate cardiac segmentation using Dense Associative Networks(https://arxiv.org/abs/2502.15128)
Keywords: robust, transformer, segmentation
Abstract: Deep learning-based cardiac segmentation has seen significant advancements over the years. Many studies have tackled the challenge of anatomically incorrect segmentation predictions by introducing auxiliary modules. These modules either post-process segmentation outputs or enforce consistency between specific points to ensure anatomical correctness. However, such approaches often increase network complexity, require separate training for these modules, and may lack robustness in scenarios with poor visibility. To address these limitations, we propose a novel transformer-based architecture that leverages dense associative networks to learn and retain specific patterns inherent to cardiac inputs. Unlike traditional methods, our approach restricts the network to memorize a limited set of patterns. During forward propagation, a weighted sum of these patterns is used to enforce anatomical correctness in the output. Since these patterns are input-independent, the model demonstrates enhanced robustness, even in cases with poor visibility. The proposed pipeline was evaluated on two publicly available datasets, CAMUS and CardiacNet. Experimental results indicate that our model consistently outperforms baseline approaches across all metrics, highlighting its effectiveness and reliability for cardiac segmentation tasks.

Title: TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba

Authors: Xiuwei Chen, Sihao Lin, Xiao Dong, Zisheng Chen, Meng Cao, Jianhua Han, Hang Xu, Xiaodan Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.15130
Pdf URL: https://arxiv.org/pdf/2502.15130
Copy Paste: [[2502.15130]] TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba(https://arxiv.org/abs/2502.15130)
Keywords: transformer
Abstract: Transformers have been favored in both uni-modal and multi-modal foundation models for their flexible scalability in attention modules. Consequently, a number of pre-trained Transformer models, e.g., LLaVA, CLIP, and DEIT, are publicly available. Recent research has introduced subquadratic architectures like Mamba, which enables global awareness with linear complexity. Nevertheless, training specialized subquadratic architectures from scratch for certain tasks is both resource-intensive and time-consuming. As a motivator, we explore cross-architecture training to transfer the ready knowledge in existing Transformer models to the alternative Mamba architecture, an approach termed TransMamba. Our approach employs a two-stage strategy to expedite the training of new Mamba models, ensuring effectiveness across uni-modal and cross-modal tasks. Concerning architecture disparities, we project the intermediate features into an aligned latent space before transferring knowledge. On top of that, a Weight Subcloning and Adaptive Bidirectional distillation method (WSAB) is introduced for knowledge transfer without limitations on varying layer counts. For cross-modal learning, we propose a cross-Mamba module that integrates language awareness into Mamba's visual features, enhancing the cross-modal interaction capabilities of the Mamba architecture. Despite using less than 75% of the training data typically required for training from scratch, TransMamba boasts substantially stronger performance across various network architectures and downstream tasks, including image classification, visual question answering, and text-video retrieval. The code will be publicly available.

Title: CoT-ICL Lab: A Petri Dish for Studying Chain-of-Thought Learning from In-Context Demonstrations

Authors: Vignesh Kothapalli, Hamed Firooz, Maziar Sanjabi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.15132
Pdf URL: https://arxiv.org/pdf/2502.15132
Copy Paste: [[2502.15132]] CoT-ICL Lab: A Petri Dish for Studying Chain-of-Thought Learning from In-Context Demonstrations(https://arxiv.org/abs/2502.15132)
Keywords: transformer
Abstract: We introduce CoT-ICL Lab, a framework and methodology to generate synthetic tokenized datasets and systematically study chain-of-thought (CoT) in-context learning (ICL) in language models. CoT-ICL Lab allows fine-grained control over the complexity of in-context examples by decoupling (1) the causal structure involved in chain token generation from (2) the underlying token processing functions. We train decoder-only transformers (up to 700M parameters) on these datasets and show that CoT accelerates the accuracy transition to higher values across model sizes. In particular, we find that model depth is crucial for leveraging CoT with limited in-context examples, while more examples help shallow models match deeper model performance. Additionally, limiting the diversity of token processing functions throughout training improves causal structure learning via ICL. We also interpret these transitions by analyzing transformer embeddings and attention maps. Overall, CoT-ICL Lab serves as a simple yet powerful testbed for theoretical and empirical insights into ICL and CoT in language models.

Title: Chain-of-Rank: Enhancing Large Language Models for Domain-Specific RAG in Edge Device

Authors: Juntae Lee, Jihwan Bang, Seunghan Yang, Kyuhong Shim, Simyung Chang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15134
Pdf URL: https://arxiv.org/pdf/2502.15134
Copy Paste: [[2502.15134]] Chain-of-Rank: Enhancing Large Language Models for Domain-Specific RAG in Edge Device(https://arxiv.org/abs/2502.15134)
Keywords: large language model
Abstract: Retrieval-augmented generation (RAG) with large language models (LLMs) is especially valuable in specialized domains, where precision is critical. To further specialize LLMs for a target domain, domain-specific RAG has recently been developed by allowing the LLM to access the target domain early via finetuning. Domain-specific RAG makes more sense in resource-constrained environments like edge devices, as they should perform a specific task (e.g. personalization) reliably using only small-scale LLMs. While domain-specific RAG is well-aligned with edge devices in this respect, it often relies on widely-used reasoning techniques like chain-of-thought (CoT). The reasoning step is useful to understand the given external knowledge, yet it is computationally expensive and difficult for small-scale LLMs to learn. To tackle this, we propose Chain of Rank (CoR), which shifts the focus from intricate lengthy reasoning to simple ranking of the reliability of input external documents. CoR thereby reduces computational complexity while maintaining high accuracy, making it particularly suited for resource-constrained environments. We attain state-of-the-art (SOTA) results on benchmarks and analyze the method's efficacy.

Title: Do LLMs Make Mistakes Like Students? Exploring Natural Alignment between Language Models and Human Error Patterns

Authors: Naiming Liu, Shashank Sonkar, Richard G. Baraniuk
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2502.15140
Pdf URL: https://arxiv.org/pdf/2502.15140
Copy Paste: [[2502.15140]] Do LLMs Make Mistakes Like Students? Exploring Natural Alignment between Language Models and Human Error Patterns(https://arxiv.org/abs/2502.15140)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in various educational tasks, yet their alignment with human learning patterns, particularly in predicting which incorrect options students are most likely to select in multiple-choice questions (MCQs), remains underexplored. Our work investigates the relationship between LLM generation likelihood and student response distributions in MCQs with a specific focus on distractor selections. We collect a comprehensive dataset of MCQs with real-world student response distributions to explore two fundamental research questions: (1) RQ1: Do the distractors that students more frequently select correspond to those that LLMs assign higher generation likelihood to? (2) RQ2: When an LLM selects an incorrect choice, does it choose the same distractor that most students pick? Our experiments reveal moderate correlations between LLM-assigned probabilities and student selection patterns for distractors in MCQs. Additionally, when LLMs make mistakes, they are more likely to select the same incorrect answers that commonly mislead students, which is a pattern consistent across both small and large language models. Our work provides empirical evidence that despite LLMs' strong performance on generating educational content, there remains a gap between LLM's underlying reasoning process and human cognitive processes in identifying confusing distractors. Our findings also have significant implications for educational assessment development. The smaller language models could be efficiently utilized for automated distractor generation as they demonstrate similar patterns in identifying confusing answer choices as larger language models. This observed alignment between LLMs and student misconception patterns opens new opportunities for generating high-quality distractors that complement traditional human-designed distractors.
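
The two research questions in this abstract map onto two simple measurements, sketched below. The data layout (a dict per question with `answer`, `llm_probs`, and `student_rates`), the choice of Pearson correlation, and the function names are assumptions made for illustration; the abstract does not say which correlation statistic or data format the authors use.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def distractor_correlation(questions):
    """RQ1: correlate LLM probability with student selection rate over distractors."""
    llm, students = [], []
    for q in questions:
        for option, prob in q["llm_probs"].items():
            if option != q["answer"]:
                llm.append(prob)
                students.append(q["student_rates"][option])
    return pearson(llm, students)


def top_distractor_match_rate(questions):
    """RQ2: among questions the LLM gets wrong, how often does its choice
    equal the distractor most students picked?"""
    hits = total = 0
    for q in questions:
        llm_pick = max(q["llm_probs"], key=q["llm_probs"].get)
        if llm_pick == q["answer"]:
            continue  # only wrong answers are relevant to RQ2
        distractors = [o for o in q["student_rates"] if o != q["answer"]]
        student_pick = max(distractors, key=lambda o: q["student_rates"][o])
        hits += llm_pick == student_pick
        total += 1
    return hits / total if total else float("nan")
```

Here `llm_probs` is assumed to hold the model's normalized likelihood for each answer option and `student_rates` the fraction of students who chose it.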

Title: Confidence-Weighted Boundary-Aware Learning for Semi-Supervised Semantic Segmentation

Authors: Ebenezer Tarubinga, Jenifer Kalafatovich Espinoza
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15152
Pdf URL: https://arxiv.org/pdf/2502.15152
Copy Paste: [[2502.15152]] Confidence-Weighted Boundary-Aware Learning for Semi-Supervised Semantic Segmentation(https://arxiv.org/abs/2502.15152)
Keywords: segmentation
Abstract: Semi-supervised semantic segmentation (SSSS) aims to improve segmentation performance by utilising unlabeled data alongside limited labeled samples. Existing SSSS methods often face challenges such as coupling, where over-reliance on initial labeled data leads to suboptimal learning; confirmation bias, where incorrect predictions reinforce themselves repeatedly; and boundary blur caused by insufficient boundary-awareness and ambiguous edge information. To address these issues, we propose CW-BASS, a novel framework for SSSS. In order to mitigate the impact of incorrect predictions, we assign confidence weights to pseudo-labels. Additionally, we leverage boundary-delineation techniques, which, despite being extensively explored in weakly-supervised semantic segmentation (WSSS), remain under-explored in SSSS. Specifically, our approach: (1) reduces coupling through a confidence-weighted loss function that adjusts the influence of pseudo-labels based on their predicted confidence scores, (2) mitigates confirmation bias with a dynamic thresholding mechanism that learns to filter out pseudo-labels based on model performance, (3) resolves boundary blur with a boundary-aware module that enhances segmentation accuracy near object boundaries, and (4) reduces label noise with a confidence decay strategy that progressively refines pseudo-labels during training. Extensive experiments on the Pascal VOC 2012 and Cityscapes demonstrate that our method achieves state-of-the-art performance. Moreover, using only 1/8 or 12.5% of labeled data, our method achieves an mIoU of 75.81 on Pascal VOC 2012, highlighting its effectiveness in limited-label settings.
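
Components (1) and (2) of this abstract, a confidence-weighted loss plus threshold-based filtering of pseudo-labels, can be sketched as follows. This is an assumed formulation, not the CW-BASS loss itself: the weight is taken to be the teacher's maximum class probability, the threshold is a fixed argument rather than a learned dynamic one, and the per-pixel probabilities are plain Python lists.

```python
import math


def confidence_weighted_loss(teacher_probs, student_probs, threshold=0.0):
    """Cross-entropy against pseudo-labels, weighted by pseudo-label confidence.

    teacher_probs / student_probs: one class-probability list per pixel.
    Pixels whose teacher confidence falls below `threshold` are skipped.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for teacher, student in zip(teacher_probs, student_probs):
        confidence = max(teacher)
        label = teacher.index(confidence)  # pseudo-label = teacher argmax
        if confidence < threshold:
            continue  # filtered out as unreliable
        weighted_sum += confidence * -math.log(student[label])
        weight_total += confidence
    return weighted_sum / weight_total if weight_total else 0.0
```

Normalizing by the summed weights keeps the loss scale comparable as the number of retained pixels changes. A training loop would update `threshold` over time to mimic the dynamic thresholding the abstract describes.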

Title: Investigating the Adaptive Robustness with Knowledge Conflicts in LLM-based Multi-Agent Systems

Authors: Tianjie Ju, Bowen Wang, Hao Fei, Mong-Li Lee, Wynne Hsu, Yun Li, Qianren Wang, Pengzhou Cheng, Zongru Wu, Zhuosheng Zhang, Gongshen Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15153
Pdf URL: https://arxiv.org/pdf/2502.15153
Copy Paste: [[2502.15153]] Investigating the Adaptive Robustness with Knowledge Conflicts in LLM-based Multi-Agent Systems(https://arxiv.org/abs/2502.15153)
Keywords: robust, large language model
Abstract: Recent advances in Large Language Models (LLMs) have upgraded them from sophisticated text generators to autonomous agents capable of cooperation and tool use in multi-agent systems (MASs). However, the robustness of these LLM-based MASs, especially under knowledge conflicts, remains unclear. In this paper, we design four comprehensive metrics to investigate the robustness of MASs when facing mild or task-critical knowledge conflicts. We first analyze mild knowledge conflicts introduced by heterogeneous agents and find that they do not harm system robustness but instead improve collaborative decision-making. Next, we investigate task-critical knowledge conflicts by synthesizing knowledge conflicts and embedding them into one of the agents. Our results show that these conflicts have surprisingly little to no impact on MAS robustness. Furthermore, we observe that MASs demonstrate certain self-repairing capabilities by reducing their reliance on knowledge conflicts and adopting alternative solution paths to maintain stability. Finally, we conduct ablation studies on the knowledge conflict number, agent number, and interaction rounds, finding that the self-repairing capability of MASs has intrinsic limits, and all findings hold consistently across various factors. Our code is publicly available at this https URL.

Title: Extreme Speech Classification in the Era of LLMs: Exploring Open-Source and Proprietary Models

Authors: Sarthak Mahajan, Nimmi Rangaswamy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15155
Pdf URL: https://arxiv.org/pdf/2502.15155
Copy Paste: [[2502.15155]] Extreme Speech Classification in the Era of LLMs: Exploring Open-Source and Proprietary Models(https://arxiv.org/abs/2502.15155)
Keywords: large language model
Abstract: In recent years, widespread internet adoption and the growth in the user base of various social media platforms have led to an increase in the proliferation of extreme speech online. While traditional language models have demonstrated proficiency in distinguishing between neutral text and non-neutral text (i.e. extreme speech), categorizing the diverse types of extreme speech presents significant challenges. The task of extreme speech classification is particularly nuanced, as it requires a deep understanding of socio-cultural contexts to accurately interpret the intent of the language used by the speaker. Even human annotators often disagree on the appropriate classification of such content, emphasizing the complex and subjective nature of this task. The use of human moderators also presents a scaling issue, creating a need for automated systems for extreme speech classification. The recent launch of ChatGPT has drawn global attention to the potential applications of Large Language Models (LLMs) across a diverse variety of tasks. Trained on vast and diverse corpora, and demonstrating the ability to effectively capture and encode contextual information, LLMs emerge as highly promising tools for tackling this specific task of extreme speech classification. In this paper, we leverage the Indian subset of the extreme speech dataset from Maronikolakis et al. (2022) to develop an effective classification framework using LLMs. We evaluate open-source Llama models against closed-source OpenAI models, finding that while pre-trained LLMs show moderate efficacy, fine-tuning with domain-specific data significantly enhances performance, highlighting their adaptability to linguistic and contextual nuances. Although GPT-based models outperform Llama models in zero-shot settings, the performance gap disappears after fine-tuning.

Title: M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment

Authors: Chuan Cui, Kejiang Chen, Zhihua Wei, Wen Shen, Weiming Zhang, Nenghai Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.15167
Pdf URL: https://arxiv.org/pdf/2502.15167
Copy Paste: [[2502.15167]] M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment(https://arxiv.org/abs/2502.15167)
Keywords: large language model
Abstract: The rapid advancement of AI-generated image (AGI) models has introduced significant challenges in evaluating their quality, which requires considering multiple dimensions such as perceptual quality, prompt correspondence, and authenticity. To address these challenges, we propose M3-AGIQA, a comprehensive framework for AGI quality assessment that is Multimodal, Multi-Round, and Multi-Aspect. Our approach leverages the capabilities of Multimodal Large Language Models (MLLMs) as joint text and image encoders and distills advanced captioning capabilities from online MLLMs into a local model via Low-Rank Adaptation (LoRA) fine-tuning. The framework includes a structured multi-round evaluation mechanism, where intermediate image descriptions are generated to provide deeper insights into the quality, correspondence, and authenticity aspects. To align predictions with human perceptual judgments, a predictor constructed by an xLSTM and a regression head is incorporated to process sequential logits and predict Mean Opinion Scores (MOSs). Extensive experiments conducted on multiple benchmark datasets demonstrate that M3-AGIQA achieves state-of-the-art performance, effectively capturing nuanced aspects of AGI quality. Furthermore, cross-dataset validation confirms its strong generalizability. The code is available at this https URL.

Title: Methods and Trends in Detecting Generated Images: A Comprehensive Review

Authors: Arpan Mahara, Naphtali Rishe
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15176
Pdf URL: https://arxiv.org/pdf/2502.15176
Copy Paste: [[2502.15176]] Methods and Trends in Detecting Generated Images: A Comprehensive Review(https://arxiv.org/abs/2502.15176)
Keywords: attack, diffusion, generative
Abstract: The proliferation of generative models, such as Generative Adversarial Networks (GANs), Diffusion Models, and Variational Autoencoders (VAEs), has enabled the synthesis of high-quality multimedia data. However, these advancements have also raised significant concerns regarding adversarial attacks, unethical usage, and societal harm. Recognizing these challenges, researchers have increasingly focused on developing methodologies to detect synthesized data effectively, aiming to mitigate potential risks. Prior reviews have primarily focused on deepfake detection and often lack coverage of recent advancements in synthetic image detection, particularly methods leveraging multimodal frameworks for improved forensic analysis. To address this gap, the present survey provides a comprehensive review of state-of-the-art methods for detecting and classifying synthetic images generated by advanced generative AI models. This review systematically examines core detection methodologies, identifies commonalities among approaches, and categorizes them into meaningful taxonomies. Furthermore, given the crucial role of large-scale datasets in this field, we present an overview of publicly available datasets that facilitate further research and benchmarking in synthetic data detection.

Title: Optimizing Product Provenance Verification using Data Valuation Methods

Authors: Raquib Bin Yousuf, Hoang Anh Just, Shengzhe Xu, Brian Mayer, Victor Deklerck, Jakub Truszkowski, John C. Simeone, Jade Saunders, Chang-Tien Lu, Ruoxi Jia, Naren Ramakrishnan
Subjects: cs.LG, cs.CY
Abstract URL: https://arxiv.org/abs/2502.15177
Pdf URL: https://arxiv.org/pdf/2502.15177
Copy Paste: [[2502.15177]] Optimizing Product Provenance Verification using Data Valuation Methods(https://arxiv.org/abs/2502.15177)
Keywords: robust
Abstract: Determining and verifying product provenance remains a critical challenge in global supply chains, particularly as geopolitical conflicts and shifting borders create new incentives for misrepresentation of commodities, such as hiding the origin of illegally harvested timber or stolen agricultural products. Stable Isotope Ratio Analysis (SIRA), combined with Gaussian process regression-based isoscapes, has emerged as a powerful tool for geographic origin verification. However, the effectiveness of these models is often constrained by data scarcity and suboptimal dataset selection. In this work, we introduce a novel data valuation framework designed to enhance the selection and utilization of training data for machine learning models applied in SIRA. By prioritizing highly informative samples, our approach improves model robustness and predictive accuracy across diverse datasets and geographies. We validate our methodology with extensive experiments, demonstrating its potential to significantly enhance provenance verification, mitigate fraudulent trade practices, and strengthen regulatory enforcement of global supply chains.

Title: Nonlinear Dynamical Systems for Automatic Face Annotation in Head Tracking and Pose Estimation

Authors: Thoa Thieu, Roderick Melnik
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.15179
Pdf URL: https://arxiv.org/pdf/2502.15179
Copy Paste: [[2502.15179]] Nonlinear Dynamical Systems for Automatic Face Annotation in Head Tracking and Pose Estimation(https://arxiv.org/abs/2502.15179)
Keywords: robust
Abstract: Facial landmark tracking plays a vital role in applications such as facial recognition, expression analysis, and medical diagnostics. In this paper, we consider the performance of the Extended Kalman Filter (EKF) and Unscented Kalman Filter (UKF) in tracking 3D facial motion in both deterministic and stochastic settings. We first analyze a noise-free environment where the state transition is purely deterministic, demonstrating that UKF outperforms EKF by achieving lower mean squared error (MSE) due to its ability to capture higher-order nonlinearities. However, when stochastic noise is introduced, EKF exhibits superior robustness, maintaining a lower MSE than UKF, which becomes more sensitive to measurement noise and occlusions. Our results highlight that UKF is preferable for high-precision applications in controlled environments, whereas EKF is better suited for real-world scenarios with unpredictable noise. These findings provide practical insights for selecting the appropriate filtering technique in 3D facial tracking applications, such as motion capture and facial recognition.

Title: Hierarchical Context Transformer for Multi-level Semantic Scene Understanding

Authors: Luoying Hao, Yan Hu, Yang Yue, Li Wu, Huazhu Fu, Jinming Duan, Jiang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.15184
Pdf URL: https://arxiv.org/pdf/2502.15184
Copy Paste: [[2502.15184]] Hierarchical Context Transformer for Multi-level Semantic Scene Understanding(https://arxiv.org/abs/2502.15184)
Keywords: transformer
Abstract: A comprehensive and explicit understanding of surgical scenes plays a vital role in developing context-aware computer-assisted systems in the operating theatre. However, few works provide systematic analysis to enable hierarchical surgical scene understanding. In this work, we propose to represent the task set [phase recognition --> step recognition --> action and instrument detection] as multi-level semantic scene understanding (MSSU). For this target, we propose a novel hierarchical context transformer (HCT) network and thoroughly explore the relations across the different level tasks. Specifically, a hierarchical relation aggregation module (HRAM) is designed to concurrently relate entries inside multi-level interaction information and then augment task-specific features. To further boost the representation learning of the different tasks, inter-task contrastive learning (ICL) is presented to guide the model to learn task-wise features via absorbing complementary information from other tasks. Furthermore, considering the computational costs of the transformer, we propose HCT+ to integrate the spatial and temporal adapter to achieve competitive performance with substantially fewer tunable parameters. Extensive experiments on our cataract dataset and a publicly available endoscopic PSI-AVA dataset demonstrate the outstanding performance of our method, consistently exceeding the state-of-the-art methods by a large margin. The code is available at this https URL.

Title: Image Translation-Based Unsupervised Cross-Modality Domain Adaptation for Medical Image Segmentation

Authors: Tao Yang, Lisheng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.15193
Pdf URL: https://arxiv.org/pdf/2502.15193
Copy Paste: [[2502.15193]] Image Translation-Based Unsupervised Cross-Modality Domain Adaptation for Medical Image Segmentation(https://arxiv.org/abs/2502.15193)
Keywords: segmentation
Abstract: Supervised deep learning usually faces more challenges in medical images than in natural images, since annotations in medical images require the expertise of doctors and are more time-consuming and expensive. Thus, some researchers turn to unsupervised learning methods, which usually face inevitable performance drops. In addition, medical images may have been acquired at different medical centers with different scanners and under different image acquisition protocols, so the modalities of the medical images are often inconsistent. This modality difference (domain shift) also reduces the applicability of deep learning methods. In this regard, we propose an unsupervised cross-modality domain adaptation method based on image translation by transforming the source modality image with annotation into the unannotated target modality and using its annotation to achieve supervised learning of the target modality. In addition, the subtle differences between translated pseudo images and real images are overcome by self-training methods to further improve the task performance of deep learning. The proposed method showed mean Dice Similarity Coefficient (DSC) and Average Symmetric Surface Distance (ASSD) of $0.8351 \pm 0.1152$ and $1.6712 \pm 2.1948$ for vestibular schwannoma (VS), and $0.8098 \pm 0.0233$ and $0.2317 \pm 0.1577$ for cochlea, on the VS and cochlea segmentation task of the Cross-Modality Domain Adaptation (crossMoDA 2022) challenge validation phase leaderboard.
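
For readers unfamiliar with the DSC metric reported above, here is a minimal sketch of the standard definition, DSC = 2|A ∩ B| / (|A| + |B|), on flattened binary masks. The function name and the convention of returning 1.0 for two empty masks are assumptions for illustration; challenge evaluation code may handle that edge case differently.

```python
def dice_coefficient(pred, target):
    """Dice Similarity Coefficient between two flattened binary masks.

    pred, target: equal-length sequences of 0/1 (or False/True) values.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / size_sum
```

A DSC of 1 means the predicted and reference masks coincide exactly, and 0 means they do not overlap at all.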

Title: TETRIS: Optimal Draft Token Selection for Batch Speculative Decoding

Authors: Zhaoxuan Wu, Zijian Zhou, Arun Verma, Alok Prakash, Daniela Rus, Bryan Kian Hsiang Low
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15197
Pdf URL: https://arxiv.org/pdf/2502.15197
Copy Paste: [[2502.15197]] TETRIS: Optimal Draft Token Selection for Batch Speculative Decoding(https://arxiv.org/abs/2502.15197)
Keywords: large language model
Abstract: We propose TETRIS, a novel method that optimizes the total throughput of batch speculative decoding in multi-request settings. Unlike existing methods that optimize for a single request or a group of requests as a whole, TETRIS actively selects the most promising draft tokens (for every request in a batch) to be accepted when verified in parallel, resulting in fewer rejected tokens and hence less wasted computing resources. Such effective resource utilization for fast inference in large language models (LLMs) is especially important to service providers with limited inference capacity. Compared to baseline speculative decoding, TETRIS yields a consistently higher acceptance rate and more effective utilization of the limited inference capacity. We show theoretically and empirically that TETRIS outperforms baseline speculative decoding and existing methods that dynamically select draft tokens, leading to more efficient batch inference in LLMs.

Title: UrbanSAM: Learning Invariance-Inspired Adapters for Segment Anything Models in Urban Construction

Authors: Chenyu Li, Danfeng Hong, Bing Zhang, Yuxuan Li, Gustau Camps-Valls, Xiao Xiang Zhu, Jocelyn Chanussot
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.15199
Pdf URL: https://arxiv.org/pdf/2502.15199
Copy Paste: [[2502.15199]] UrbanSAM: Learning Invariance-Inspired Adapters for Segment Anything Models in Urban Construction(https://arxiv.org/abs/2502.15199)
Keywords: extraction, segmentation
Abstract: Object extraction and segmentation from remote sensing (RS) images is a critical yet challenging task in urban environment monitoring. Urban morphology is inherently complex, with irregular objects of diverse shapes and varying scales. These challenges are amplified by heterogeneity and scale disparities across RS data sources, including sensors, platforms, and modalities, making accurate object segmentation particularly demanding. While the Segment Anything Model (SAM) has shown significant potential in segmenting complex scenes, its performance in handling form-varying objects remains limited due to manual-interactive prompting. To this end, we propose UrbanSAM, a customized version of SAM specifically designed to analyze complex urban environments while tackling scaling effects from remotely sensed observations. Inspired by multi-resolution analysis (MRA) theory, UrbanSAM incorporates a novel learnable prompter equipped with a Uscaling-Adapter that adheres to the invariance criterion, enabling the model to capture multiscale contextual information of objects and adapt to arbitrary scale variations with theoretical guarantees. Furthermore, features from the Uscaling-Adapter and the trunk encoder are aligned through a masked cross-attention operation, allowing the trunk encoder to inherit the adapter's multiscale aggregation capability. This synergy enhances the segmentation performance, resulting in more powerful and accurate outputs, supported by the learned adapter. Extensive experimental results demonstrate the flexibility and superior segmentation performance of the proposed UrbanSAM on a global-scale dataset, encompassing scale-varying urban objects such as buildings, roads, and water.

Title: FlipConcept: Tuning-Free Multi-Concept Personalization for Text-to-Image Generation

Authors: Young Beom Woo, Sun Eung Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15203
Pdf URL: https://arxiv.org/pdf/2502.15203
Copy Paste: [[2502.15203]] FlipConcept: Tuning-Free Multi-Concept Personalization for Text-to-Image Generation(https://arxiv.org/abs/2502.15203)
Keywords: protect
Abstract: Recently, methods that integrate multiple personalized concepts into a single image have garnered significant attention in the field of text-to-image (T2I) generation. However, existing methods experience performance degradation in complex scenes with multiple objects due to distortions in non-personalized regions. To address this issue, we propose FlipConcept, a novel approach that seamlessly integrates multiple personalized concepts into a single image without requiring additional tuning. We introduce guided appearance attention to accurately mimic the appearance of a personalized concept as intended. Additionally, we introduce mask-guided noise mixing to protect non-personalized regions during editing. Lastly, we apply background dilution to minimize attribute leakage, which is the undesired blending of personalized concept attributes with other objects in the image. In our experiments, we demonstrate that the proposed method, despite not requiring tuning, outperforms existing models in both single and multiple personalized concept inference.

Title: Unveiling Attractor Cycles in Large Language Models: A Dynamical Systems View of Successive Paraphrasing

Authors: Zhilin Wang, Yafu Li, Jianhao Yan, Yu Cheng, Yue Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15208
Pdf URL: https://arxiv.org/pdf/2502.15208
Copy Paste: [[2502.15208]] Unveiling Attractor Cycles in Large Language Models: A Dynamical Systems View of Successive Paraphrasing(https://arxiv.org/abs/2502.15208)
Keywords: generative, large language model
Abstract: Dynamical systems theory provides a framework for analyzing iterative processes and evolution over time. Within such systems, repetitive transformations can lead to stable configurations, known as attractors, including fixed points and limit cycles. Applying this perspective to large language models (LLMs), which iteratively map input text to output text, provides a principled approach to characterizing long-term behaviors. Successive paraphrasing serves as a compelling testbed for exploring such dynamics, as paraphrases re-express the same underlying meaning with linguistic variation. Although LLMs are expected to explore a diverse set of paraphrases in the text space, our study reveals that successive paraphrasing converges to stable periodic states, such as 2-period attractor cycles, limiting linguistic diversity. This phenomenon is attributed to the self-reinforcing nature of LLMs, as they iteratively favour and amplify certain textual forms over others. This pattern persists with increasing generation randomness or alternating prompts and LLMs. These findings underscore inherent constraints in LLM generative capability, while offering a novel dynamical systems perspective for studying their expressive potential.
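
The notion of a periodic attractor can be made concrete with a small sketch that iterates a map until a state repeats and reports the cycle length. The `toy_paraphrase` map is a hypothetical stand-in for an LLM paraphraser, and exact string equality is a simplification; a study of real paraphrases would more plausibly compare texts with a similarity measure.

```python
def find_cycle(step, start, max_iters=1000):
    """Iterate `step` from `start` until a previously seen state recurs.

    Returns (transient_length, period), or None if no state repeats
    within `max_iters` applications of `step`.
    """
    first_seen = {start: 0}  # state -> index of its first occurrence
    state = start
    for i in range(1, max_iters + 1):
        state = step(state)
        if state in first_seen:
            return first_seen[state], i - first_seen[state]
        first_seen[state] = i
    return None

# Hypothetical toy "paraphraser": a -> b -> c -> b -> c -> ...
toy_paraphrase = {"a": "b", "b": "c", "c": "b"}.get

result = find_cycle(toy_paraphrase, "a")
```

Here `result` is `(1, 2)`: after one transient step the sequence alternates between two states, which corresponds to a 2-period cycle.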

Title: The Evolving Landscape of LLM- and VLM-Integrated Reinforcement Learning

Authors: Sheila Schoepp, Masoud Jafaripour, Yingyue Cao, Tianpei Yang, Fatemeh Abdollahi, Shadan Golestan, Zahin Sufiyan, Osmar R. Zaiane, Matthew E. Taylor
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2502.15214
Pdf URL: https://arxiv.org/pdf/2502.15214
Copy Paste: [[2502.15214]] The Evolving Landscape of LLM- and VLM-Integrated Reinforcement Learning(https://arxiv.org/abs/2502.15214)
Keywords: large language model
Abstract: Reinforcement learning (RL) has shown impressive results in sequential decision-making tasks. Meanwhile, Large Language Models (LLMs) and Vision-Language Models (VLMs) have emerged, exhibiting impressive capabilities in multimodal understanding and reasoning. These advances have led to a surge of research integrating LLMs and VLMs into RL. In this survey, we review representative works in which LLMs and VLMs are used to overcome key challenges in RL, such as lack of prior knowledge, long-horizon planning, and reward design. We present a taxonomy that categorizes these LLM/VLM-assisted RL approaches into three roles: agent, planner, and reward. We conclude by exploring open problems, including grounding, bias mitigation, improved representations, and action advice. By consolidating existing research and identifying future directions, this survey establishes a framework for integrating LLMs and VLMs into RL, advancing approaches that unify natural language and visual understanding with sequential decision-making.

Title: Auto-Bench: An Automated Benchmark for Scientific Discovery in LLMs

Authors: Tingting Chen, Srinivas Anumasa, Beibei Lin, Vedant Shah, Anirudh Goyal, Dianbo Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15224
Pdf URL: https://arxiv.org/pdf/2502.15224
Copy Paste: [[2502.15224]] Auto-Bench: An Automated Benchmark for Scientific Discovery in LLMs(https://arxiv.org/abs/2502.15224)
Keywords: large language model
Abstract: Given the remarkable performance of Large Language Models (LLMs), an important question arises: Can LLMs conduct human-like scientific research and discover new knowledge, and act as an AI scientist? Scientific discovery is an iterative process that demands efficient knowledge updating and encoding. It involves understanding the environment, identifying new hypotheses, and reasoning about actions; however, no standardized benchmark specifically designed for scientific discovery exists for LLM agents. In response to these limitations, we introduce a novel benchmark, Auto-Bench, that encompasses necessary aspects to evaluate LLMs for scientific discovery in both natural and social sciences. Our benchmark is based on the principles of causal graph discovery. It challenges models to uncover hidden structures and make optimal decisions, which includes generating valid justifications. By engaging interactively with an oracle, the models iteratively refine their understanding of underlying interactions, the chemistry and social interactions, through strategic interventions. We evaluate state-of-the-art LLMs, including GPT-4, Gemini, Qwen, Claude, and Llama, and observe a significant performance drop as the problem complexity increases, which suggests an important gap between machine and human intelligence that future development of LLMs needs to take into consideration.

Title: Understand User Opinions of Large Language Models via LLM-Powered In-the-Moment User Experience Interviews

Authors: Mengqiao Liu, Tevin Wang, Cassandra A. Cohen, Sarah Li, Chenyan Xiong
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2502.15226
Pdf URL: https://arxiv.org/pdf/2502.15226
Copy Paste: [[2502.15226]] Understand User Opinions of Large Language Models via LLM-Powered In-the-Moment User Experience Interviews(https://arxiv.org/abs/2502.15226)
Keywords: large language model
Abstract: Which large language model (LLM) is better? Every evaluation tells a story, but what do users really think about current LLMs? This paper presents CLUE, an LLM-powered interviewer that conducts in-the-moment user experience interviews, right after users interact with LLMs, and automatically gathers insights about user opinions from massive interview logs. We conduct a study with thousands of users to understand user opinions on mainstream LLMs, recruiting users to first chat with a target LLM and then be interviewed by CLUE. Our experiments demonstrate that CLUE captures interesting user opinions, for example, the bipolar views on the displayed reasoning process of DeepSeek-R1 and demands for information freshness and multi-modality. Our collected chat-and-interview logs will be released.

Title: AutoMR: A Universal Time Series Motion Recognition Pipeline

Authors: Likun Zhang, Sicheng Yang, Zhuo Wang, Haining Liang, Junxiao Shen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15228
Pdf URL: https://arxiv.org/pdf/2502.15228
Copy Paste: [[2502.15228]] AutoMR: A Universal Time Series Motion Recognition Pipeline(https://arxiv.org/abs/2502.15228)
Keywords: robust
Abstract: In this paper, we present an end-to-end automated motion recognition (AutoMR) pipeline designed for multimodal datasets. The proposed framework seamlessly integrates data preprocessing, model training, hyperparameter tuning, and evaluation, enabling robust performance across diverse scenarios. Our approach addresses two primary challenges: 1) variability in sensor data formats and parameters across datasets, which traditionally requires task-specific machine learning implementations, and 2) the complexity and time consumption of hyperparameter tuning for optimal model performance. Our library features an all-in-one solution incorporating QuartzNet as the core model, automated hyperparameter tuning, and comprehensive metrics tracking. Extensive experiments demonstrate its effectiveness on 10 diverse datasets, achieving state-of-the-art performance. This work lays a solid foundation for deploying motion-capture solutions across varied real-world applications.

Title: A General Pseudonymization Framework for Cloud-Based LLMs: Replacing Privacy Information in Controlled Text Generation

Authors: Shilong Hou, Ruilin Shang, Zi Long, Xianghua Fu, Yin Chen
Subjects: cs.CR, cs.CL
Abstract URL: https://arxiv.org/abs/2502.15233
Pdf URL: https://arxiv.org/pdf/2502.15233
Copy Paste: [[2502.15233]] A General Pseudonymization Framework for Cloud-Based LLMs: Replacing Privacy Information in Controlled Text Generation(https://arxiv.org/abs/2502.15233)
Keywords: privacy, protect, large language model
Abstract: An increasing number of companies have begun providing services that leverage cloud-based large language models (LLMs), such as ChatGPT. However, this development raises substantial privacy concerns, as users' prompts are transmitted to and processed by the model providers. Among the various privacy protection methods for LLMs, those implemented during the pre-training and fine-tuning phases fail to mitigate the privacy risks associated with the remote use of cloud-based LLMs by users. On the other hand, methods applied during the inference phase are primarily effective in scenarios where the LLM's inference does not rely on privacy-sensitive information. In this paper, we outline the process of remote user interaction with LLMs and, for the first time, propose a detailed definition of a general pseudonymization framework applicable to cloud-based LLMs. The experimental results demonstrate that the proposed framework strikes an optimal balance between privacy protection and utility. The code for our method is available to the public at this https URL.

Title: Multi-agent Multi-armed Bandits with Minimum Reward Guarantee Fairness

Authors: Piyushi Manupriya, Himanshu, SakethaNath Jagarlapudi, Ganesh Ghalme
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2502.15240
Pdf URL: https://arxiv.org/pdf/2502.15240
Copy Paste: [[2502.15240]] Multi-agent Multi-armed Bandits with Minimum Reward Guarantee Fairness(https://arxiv.org/abs/2502.15240)
Keywords: fair
Abstract: We investigate the problem of maximizing social welfare while ensuring fairness in a multi-agent multi-armed bandit (MA-MAB) setting. In this problem, a centralized decision-maker takes actions over time, generating random rewards for various agents. Our goal is to maximize the sum of expected cumulative rewards, a.k.a. social welfare, while ensuring that each agent receives an expected reward that is at least a constant fraction of the maximum possible expected reward. Our proposed algorithm, RewardFairUCB, leverages the Upper Confidence Bound (UCB) technique to achieve sublinear regret bounds for both fairness and social welfare. The fairness regret measures the positive difference between the minimum reward guarantee and the expected reward of a given policy, whereas the social welfare regret measures the difference between the social welfare of the optimal fair policy and that of the given policy. We show that the RewardFairUCB algorithm achieves instance-independent social welfare regret guarantees of $\tilde{O}(T^{1/2})$ and a fairness regret upper bound of $\tilde{O}(T^{3/4})$. We also give the lower bound of $\Omega(\sqrt{T})$ for both social welfare and fairness regret. We evaluate RewardFairUCB's performance against various baseline and heuristic algorithms using simulated data and real-world data, highlighting trade-offs between fairness and social welfare regrets.
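
As background on the UCB technique mentioned above, the sketch below shows the classic single-agent UCB1 index (empirical mean plus an exploration bonus). It is not RewardFairUCB: the fairness constraint and multi-agent reward structure are omitted, and the function names are hypothetical.

```python
import math

def ucb1_index(mean_reward, pulls, t):
    """Classic UCB1 index: empirical mean plus an exploration bonus.

    mean_reward: empirical mean reward of the arm so far.
    pulls: number of times the arm has been played.
    t: current round (1-based).
    """
    if pulls == 0:
        return math.inf  # unplayed arms are tried first
    return mean_reward + math.sqrt(2.0 * math.log(t) / pulls)

def select_arm(means, pulls, t):
    """Return the index of the arm with the largest UCB1 index."""
    indices = [ucb1_index(m, n, t) for m, n in zip(means, pulls)]
    return max(range(len(indices)), key=indices.__getitem__)
```

The bonus shrinks as an arm is played more often, so rarely played arms keep being explored while well-estimated arms are chosen on their empirical mean.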

Title: Real-Time Moving Flock Detection in Pedestrian Trajectories Using Sequential Deep Learning Models

Authors: Amartaivan Sanjjamts, Hiroshi Morita, Togootogtokh Enkhtogtokh
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.15252
Pdf URL: https://arxiv.org/pdf/2502.15252
Copy Paste: [[2502.15252]] Real-Time Moving Flock Detection in Pedestrian Trajectories Using Sequential Deep Learning Models(https://arxiv.org/abs/2502.15252)
Keywords: robust, transformer
Abstract: Understanding collective pedestrian movement is crucial for applications in crowd management, autonomous navigation, and human-robot interaction. This paper investigates the use of sequential deep learning models, including Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Transformers, for real-time flock detection in multi-pedestrian trajectories. Our proposed approach consists of a two-stage process: first, a pre-trained binary classification model is used for pairwise trajectory classification, and second, the learned representations are applied to identify multi-agent flocks dynamically. We validate our method using real-world group movement datasets, demonstrating its robustness across varying sequence lengths and diverse movement patterns. Experimental results indicate that our model consistently detects pedestrian flocks with high accuracy and stability, even in dynamic and noisy environments. Furthermore, we extend our approach to identify other forms of collective motion, such as convoys and swarms, paving the way for more comprehensive multi-agent behavior analysis.

Title: LightMamba: Efficient Mamba Acceleration on FPGA with Quantization and Hardware Co-design

Authors: Renjie Wei, Songqiang Xu, Linfeng Zhong, Zebin Yang, Qingyu Guo, Yuan Wang, Runsheng Wang, Meng Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15260
Pdf URL: https://arxiv.org/pdf/2502.15260
Copy Paste: [[2502.15260]] LightMamba: Efficient Mamba Acceleration on FPGA with Quantization and Hardware Co-design(https://arxiv.org/abs/2502.15260)
Keywords: transformer, large language model
Abstract: State space models (SSMs) like Mamba have recently attracted much attention. Compared to Transformer-based large language models (LLMs), Mamba achieves computation complexity that is linear in the sequence length and demonstrates superior performance. However, Mamba is hard to accelerate due to the scattered activation outliers and the complex computation dependency, rendering existing LLM accelerators inefficient. In this paper, we propose LightMamba that co-designs the quantization algorithm and FPGA accelerator architecture for efficient Mamba inference. We first propose an FPGA-friendly post-training quantization algorithm that features rotation-assisted quantization and power-of-two SSM quantization to reduce the majority of computation to 4-bit. We further design an FPGA accelerator that partially unrolls the Mamba computation to balance the efficiency and hardware costs. Through computation reordering as well as fine-grained tiling and fusion, the hardware utilization and memory efficiency of the accelerator are drastically improved. We implement LightMamba on Xilinx Versal VCK190 FPGA and achieve 4.65x to 6.06x higher energy efficiency over the GPU baseline. When evaluated on Alveo U280 FPGA, LightMamba reaches 93 tokens/s, which is 1.43x that of the GPU baseline.
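
To illustrate why power-of-two quantization is hardware-friendly, the sketch below quantizes values to signed integers using a scale restricted to 2^k, so that rescaling reduces to a bit shift. This is a generic illustration and not LightMamba's algorithm: the function name, the symmetric per-tensor scheme, and the rounding choice are all assumptions.

```python
import math

def pow2_scale_quantize(values, bits=4):
    """Symmetric quantization with a power-of-two scale.

    Returns (q, exp) where each value is approximated by q[i] * 2**exp
    and q[i] lies in the signed `bits`-bit integer range.
    """
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for 4-bit
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0:
        return [0] * len(values), 0
    # Smallest power-of-two scale that keeps max_abs within qmax steps.
    exp = math.ceil(math.log2(max_abs / qmax))
    scale = 2.0 ** exp
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, exp

q, exp = pow2_scale_quantize([3.5, -1.0, 0.4], bits=4)
```

With these inputs `q` is `[7, -2, 1]` and `exp` is `-1`, so dequantization is a multiplication by 2^-1, which hardware can implement as a one-bit right shift rather than a general multiply.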

Title: Corrections Meet Explanations: A Unified Framework for Explainable Grammatical Error Correction

Authors: Jingheng Ye, Shang Qin, Yinghui Li, Hai-Tao Zheng, Shen Wang, Qingsong Wen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15261
Pdf URL: https://arxiv.org/pdf/2502.15261
Copy Paste: [[2502.15261]] Corrections Meet Explanations: A Unified Framework for Explainable Grammatical Error Correction(https://arxiv.org/abs/2502.15261)
Keywords: explainability, generative
Abstract: Grammatical Error Correction (GEC) faces a critical challenge concerning explainability, notably when GEC systems are designed for language learners. Existing research predominantly focuses on explaining grammatical errors extracted in advance, thus neglecting the relationship between explanations and corrections. To address this gap, we introduce EXGEC, a unified explainable GEC framework that integrates explanation and correction tasks in a generative manner, advocating that these tasks mutually reinforce each other. Experiments have been conducted on EXPECT, a recent human-labeled dataset for explainable GEC, comprising around 20k samples. Moreover, we detect significant noise within EXPECT, potentially compromising model training and evaluation. Therefore, we introduce an alternative dataset named EXPECT-denoised, ensuring a more objective framework for training and evaluation. Results on various NLP models (BART, T5, and Llama3) show that EXGEC models surpass single-task baselines in both tasks, demonstrating the effectiveness of our approach.

Title: Retrieval-Augmented Speech Recognition Approach for Domain Challenges

Authors: Peng Shen, Xugang Lu, Hisashi Kawai
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2502.15264
Pdf URL: https://arxiv.org/pdf/2502.15264
Copy Paste: [[2502.15264]] Retrieval-Augmented Speech Recognition Approach for Domain Challenges(https://arxiv.org/abs/2502.15264)
Keywords: large language model
Abstract: Speech recognition systems often face challenges due to domain mismatch, particularly in real-world applications where domain-specific data is unavailable because of data accessibility and confidentiality constraints. Inspired by Retrieval-Augmented Generation (RAG) techniques for large language models (LLMs), this paper introduces an LLM-based retrieval-augmented speech recognition method that incorporates domain-specific textual data at the inference stage to enhance recognition performance. Rather than relying on domain-specific textual data during the training phase, our model is trained to learn how to utilize textual information provided in prompts to the LLM decoder to improve speech recognition performance. Benefiting from the advantages of the RAG retrieval mechanism, our approach efficiently accesses locally available domain-specific documents, ensuring a convenient and effective process for solving domain mismatch problems. Experiments conducted on the CSJ database demonstrate that the proposed method significantly improves speech recognition accuracy and achieves state-of-the-art results on the CSJ dataset, even without relying on the full training data.

Title: A Training-free LLM-based Approach to General Chinese Character Error Correction

Authors: Houquan Zhou, Bo Zhang, Zhenghua Li, Ming Yan, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15266
Pdf URL: https://arxiv.org/pdf/2502.15266
Copy Paste: [[2502.15266]] A Training-free LLM-based Approach to General Chinese Character Error Correction(https://arxiv.org/abs/2502.15266)
Keywords: large language model
Abstract: Chinese spelling correction (CSC) is a crucial task that aims to correct character errors in Chinese text. While conventional CSC focuses on character substitution errors caused by mistyping, two other common types of character errors, missing and redundant characters, have received less attention. These errors are often excluded from CSC datasets during the annotation process or ignored during evaluation, even when they have been annotated. This issue limits the practicality of the CSC task. To address this issue, we introduce the task of General Chinese Character Error Correction (C2EC), which focuses on all three types of character errors. We construct a high-quality C2EC benchmark by combining and manually verifying data from CCTC and Lemon datasets. We extend the training-free prompt-free CSC method to C2EC by using Levenshtein distance for handling length changes and leveraging an additional prompt-based large language model (LLM) to improve performance. Experiments show that our method enables a 14B-parameter LLM to be on par with models nearly 50 times larger on both conventional CSC and C2EC tasks, without any fine-tuning.
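
The abstract relies on Levenshtein distance to handle length changes (missing and redundant characters). As a minimal sketch of that metric alone, not of the authors' correction method, the standard dynamic-programming computation looks like this; the function name `levenshtein` is chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca (a redundant character)
                curr[j - 1] + 1,     # insert cb (a missing character)
                prev[j - 1] + cost,  # keep or substitute
            ))
        prev = curr
    return prev[-1]

# One missing character between the two strings gives distance 1.
distance = levenshtein("我喜欢苹果", "我喜欢吃苹果")
```

Unlike a position-by-position comparison, this distance stays well defined when the input and the corrected text differ in length, which is what the three error types require.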

Title: On the (In)Security of Non-resettable Device Identifiers in Custom Android Systems

Authors: Zikan Dong, Liu Wang, Guoai Xu, Haoyu Wang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2502.15270
Pdf URL: https://arxiv.org/pdf/2502.15270
Copy Paste: [[2502.15270]] On the (In)Security of Non-resettable Device Identifiers in Custom Android Systems(https://arxiv.org/abs/2502.15270)
Keywords: security, privacy
Abstract: User tracking is critical in the mobile ecosystem, which relies on device identifiers to build clear user profiles. In the past, Android allowed third-party apps easy access to non-resettable device identifiers such as device serial numbers and IMEI for user tracking. As privacy concerns grew, Google tightened restrictions on these identifiers in native Android. Despite this, stakeholders in custom Android systems seek consistent and stable user tracking capabilities across different systems and device models, and they have introduced covert channels (e.g., system properties and settings) in customized systems to access identifiers, which undoubtedly increases the risk of user privacy breaches. This paper examines the introduction of non-resettable identifiers through system customization and their vulnerability due to poor access control. We present IDRadar, a scalable and accurate approach for identifying vulnerable properties and settings on custom Android ROMs. Applying our approach to 1,814 custom ROMs, we have identified 8,192 system properties and 3,620 settings that store non-resettable identifiers, with 3,477 properties and 1,336 settings lacking adequate access control, which can be abused by third-party apps to track users without permissions. Our large-scale analysis identifies two orders of magnitude more security issues than existing techniques. We further investigate the root causes of these access control deficiencies. Validation on 32 devices through a remote testing service confirmed our results. Additionally, we observe that the vulnerable properties and settings occur in devices of the same OEMs. We have reported our findings to the vendors and received positive confirmations. Our work underscores the need for greater scrutiny of covert access channels to device identifiers and better solutions to safeguard user privacy.

Title: Analyzing the Inner Workings of Transformers in Compositional Generalization

Authors: Ryoma Kumon, Hitomi Yanaka
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15277
Pdf URL: https://arxiv.org/pdf/2502.15277
Copy Paste: [[2502.15277]] Analyzing the Inner Workings of Transformers in Compositional Generalization(https://arxiv.org/abs/2502.15277)
Keywords: transformer
Abstract: The compositional generalization abilities of neural models have been sought after for human-like linguistic competence. The popular method to evaluate such abilities is to assess the models' input-output behavior. However, that does not reveal the internal mechanisms, and the underlying competence of such models in compositional generalization remains unclear. To address this problem, we explore the inner workings of a Transformer model by finding an existing subnetwork that contributes to the generalization performance and by performing causal analyses on how the model utilizes syntactic features. We find that the model depends on syntactic features to output the correct answer, but that the subnetwork with much better generalization performance than the whole model relies on a non-compositional algorithm in addition to the syntactic features. We also show that the subnetwork improves its generalization performance relatively slowly during the training compared to the in-distribution one, and the non-compositional solution is acquired in the early stages of the training.

Title: CopyJudge: Automated Copyright Infringement Identification and Mitigation in Text-to-Image Diffusion Models

Authors: Shunchang Liu, Zhuan Shi, Lingjuan Lyu, Yaochu Jin, Boi Faltings
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15278
Pdf URL: https://arxiv.org/pdf/2502.15278
Copy Paste: [[2502.15278]] CopyJudge: Automated Copyright Infringement Identification and Mitigation in Text-to-Image Diffusion Models(https://arxiv.org/abs/2502.15278)
Keywords: interpretability, diffusion
Abstract: Assessing whether AI-generated images are substantially similar to copyrighted works is a crucial step in resolving copyright disputes. In this paper, we propose CopyJudge, an automated copyright infringement identification framework that leverages large vision-language models (LVLMs) to simulate practical court processes for determining substantial similarity between copyrighted images and those generated by text-to-image diffusion models. Specifically, we employ an abstraction-filtration-comparison test framework with multi-LVLM debate to assess the likelihood of infringement and provide detailed judgment rationales. Based on the judgments, we further introduce a general LVLM-based mitigation strategy that automatically optimizes infringing prompts by avoiding sensitive expressions while preserving the non-infringing content. Besides, our approach can be enhanced by exploring non-infringing noise vectors within the diffusion latent space via reinforcement learning, even without modifying the original prompts. Experimental results show that our identification method achieves comparable state-of-the-art performance, while offering superior generalization and interpretability across various forms of infringement, and that our mitigation method could more effectively mitigate memorization and IP infringement without losing non-infringing expressions.

Title: DITING: A Static Analyzer for Identifying Bad Partitioning Issues in TEE Applications

Authors: Chengyan Ma, Ruidong Han, Ye Liu, Yuqing Niu, Di Lu, Chuang Tian, Jianfeng Ma, Debin Gao, David Lo
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2502.15281
Pdf URL: https://arxiv.org/pdf/2502.15281
Copy Paste: [[2502.15281]] DITING: A Static Analyzer for Identifying Bad Partitioning Issues in TEE Applications(https://arxiv.org/abs/2502.15281)
Keywords: secure, security
Abstract: Trusted Execution Environment (TEE) enhances the security of mobile applications and cloud services by isolating sensitive code in the secure world from the non-secure normal world. However, TEE applications are still confronted with vulnerabilities stemming from bad partitioning. Bad partitioning can lead to critical security problems in TEE, such as leaking sensitive data to the normal world or being adversely affected by malicious inputs from the normal world. To address this, we propose an approach to detect partitioning issues in TEE applications. First, we conducted a survey of TEE vulnerabilities caused by bad partitioning and found that the parameters exchanged between the secure and normal worlds are often used insecurely in badly partitioned implementations. Second, we developed a tool named DITING that analyzes the data flows of these parameters and identifies violations of the security rules we defined in order to find bad partitioning issues. Unlike existing research that focuses only on malicious input to TEE, we assess the partitioning issues more comprehensively through input/output and shared memory. Finally, we created the first benchmark targeting bad partitioning, consisting of 110 test cases. Experiments demonstrate that DITING achieves an F1 score of 0.90 in identifying bad partitioning issues.

Title: Soybean pod and seed counting in both outdoor fields and indoor laboratories using unions of deep neural networks

Authors: Tianyou Jiang, Mingshun Shao, Tianyi Zhang, Xiaoyu Liu, Qun Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.15286
Pdf URL: https://arxiv.org/pdf/2502.15286
Copy Paste: [[2502.15286]] Soybean pod and seed counting in both outdoor fields and indoor laboratories using unions of deep neural networks(https://arxiv.org/abs/2502.15286)
Keywords: robust, transformer
Abstract: Automatically counting soybean pods and seeds in outdoor fields allows for rapid yield estimation before harvesting, while indoor laboratory counting offers greater accuracy. Both methods can significantly accelerate the breeding process. However, it remains challenging to accurately count pods and seeds in outdoor fields, and there are still no sufficiently accurate tools for counting pods and seeds in laboratories. In this study, we developed efficient deep learning models for counting soybean pods and seeds in both outdoor fields and indoor laboratories. For outdoor fields, annotating occluded seeds in addition to visible ones enables YOLO to estimate the number of occluded soybean seeds. Moreover, we enhanced the YOLO architecture by integrating it with HQ-SAM (YOLO-SAM) and with domain adaptation techniques (YOLO-DA) to improve model robustness and generalization across soybean images taken in outdoor fields. Testing on soybean images from the outdoor field, we achieved a mean absolute error (MAE) of 6.13 for pod counting and 10.05 for seed counting. For the indoor setting, we utilized Mask-RCNN supplemented with a Swin Transformer module (Mask-RCNN-Swin); the models were trained exclusively on synthetic training images generated from a small set of labeled data. This approach resulted in near-perfect accuracy, with an MAE of 1.07 for pod counting and 1.33 for seed counting across actual laboratory images from two distinct studies.

Title: Round Attention: A Novel Round-Level Attention Mechanism to Accelerate LLM Inference

Authors: Yaohua Tang, Zhicheng Hu, Kun Cheng, Fan Mo, Qiheng Lv, Hua Wang, Zhi Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15294
Pdf URL: https://arxiv.org/pdf/2502.15294
Copy Paste: [[2502.15294]] Round Attention: A Novel Round-Level Attention Mechanism to Accelerate LLM Inference(https://arxiv.org/abs/2502.15294)
Keywords: large language model
Abstract: The increasing context window size in large language models (LLMs) has improved their ability to handle complex, long-text tasks. However, as the number of conversation rounds grows, a large amount of KV cache must be stored in GPU memory, which significantly affects the efficiency and even the availability of model serving systems. This paper analyzes dialogue data from real users and discovers that LLM inference exhibits a watershed layer, after which the distribution of round-level attention shows notable similarity. We propose Round Attention, a novel round-level attention mechanism that only recalls and computes the KV cache of the most relevant rounds. The experiments show that our method saves 55% of memory usage without compromising model performance.

Title: SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention

Authors: Hong Yankun, Li Xing, Zhen Hui-Ling, Yu Xianzhi, Liu Wulong, Yuan Mingxuan
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2502.15304
Pdf URL: https://arxiv.org/pdf/2502.15304
Copy Paste: [[2502.15304]] SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention(https://arxiv.org/abs/2502.15304)
Keywords: large language model
Abstract: For the efficient inference of Large Language Models (LLMs), the effective compression of key-value (KV) cache is essential. Three main types of KV cache compression techniques, namely sparsity, channel compression, and quantization, have been identified. This study presents SVDq, a Singular Value Decomposition (SVD) - based mixed precision quantization method for K cache. Initially, K cache is transformed into latent channels using SVD basis representations. Since the values in latent channels decay rapidly and become negligible after only a few latent channels, our method then incorporates importance-aware quantization and compression for latent channels. This enables the effective allocation of higher precision to more significant channels. Theoretically, we prove that SVDq results in quantization errors (x0.1 or even lower) that are much lower than those of per-channel key quantization in the original space. Our findings based on RULER and LongBench benchmarks demonstrate that SVDq can achieve an equivalent key cache precision as low as 1.25-bit. When combined with key sparsity, it can reach a key compression ratio of up to 410x for attention computation, all while maintaining comparable model performance. Notably, our method is nearly lossless for LongBench datasets. This indicates that SVDq enables high-precision low-bit quantization, providing a more efficient solution for KV cache compression in LLMs.

Title: Road Traffic Sign Recognition method using Siamese network Combining Efficient-CNN based Encoder

Authors: Zhenghao Xi, Yuchao Shao, Yang Zheng, Xiang Liu, Yaqi Liu, Yitong Cai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15307
Pdf URL: https://arxiv.org/pdf/2502.15307
Copy Paste: [[2502.15307]] Road Traffic Sign Recognition method using Siamese network Combining Efficient-CNN based Encoder(https://arxiv.org/abs/2502.15307)
Keywords: robust
Abstract: Traffic sign recognition (TSR) plays an essential role in assisted driving and intelligent transportation systems. However, noise in complex environments may lead to motion blur or occlusion, which makes accurate and robust real-time recognition challenging. In this article, we propose the IECES-network, which combines improved encoders with a Siamese network. Our three-stage approach comprises Efficient-CNN based encoders, a Siamese backbone, and fully-connected layers. We first use convolutional encoders to extract and encode the traffic sign features of augmented training samples and standard images. Then, we design a Siamese neural network with an Efficient-CNN based encoder and a contrastive loss function, which can be trained to improve the robustness of TSR on motion-blurred and occluded samples by computing the distance between inputs and templates. Additionally, the template branch of the proposed network can be disabled when executing recognition tasks after training, which raises the processing speed of our real-time model and reduces its computational cost and parameter scale. Finally, we combine the feature code with a fully-connected layer and a SoftMax function to classify the codes of samples and recognize the category of traffic signs. Experiments on the Tsinghua-Tencent 100K dataset and the German Traffic Sign Recognition Benchmark dataset demonstrate the performance of the proposed IECES-network. Compared with other state-of-the-art methods under motion blur and occlusion, the proposed method achieves competitive performance, with precision, recall, and accuracy averaging 88.1%, 86.43%, and 86.1%, respectively, at a lightweight 2.9M parameter scale. Moreover, the processing time of our model is 0.1s per frame, a speed increase of 1.5 times over existing methods.
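
The abstract describes training a Siamese network with a contrastive loss over the distance between inputs and templates, but does not state the loss formula. The sketch below assumes the common margin-based formulation (squared distance for matching pairs, squared hinge on the margin for non-matching pairs); the function names, the Euclidean distance, and the default margin are assumptions for illustration, not the paper's definition.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(sample, template, same_class, margin=1.0):
    """Margin-based contrastive loss for one (sample, template) pair.

    Assumed formulation, not taken from the paper:
      matching pair:      d**2                  (pull embeddings together)
      non-matching pair:  max(0, margin - d)**2 (push apart up to the margin)
    """
    d = euclidean(sample, template)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2

# A matching pair is penalized by its squared distance.
loss_match = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_class=True)
# A non-matching pair already farther apart than the margin costs nothing.
loss_far = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_class=False)
```

Under this formulation, non-matching pairs stop contributing once they are separated by more than the margin, so training effort concentrates on pairs that are still confusable.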

Title: Tight Clusters Make Specialized Experts

Authors: Stefan K. Nielsen, Rachel S.Y. Teo, Laziz U. Abdullaev, Tan M. Nguyen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.15315
Pdf URL: https://arxiv.org/pdf/2502.15315
Copy Paste: [[2502.15315]] Tight Clusters Make Specialized Experts(https://arxiv.org/abs/2502.15315)
Keywords: robust
Abstract: Sparse Mixture-of-Experts (MoE) architectures have emerged as a promising approach to decoupling model capacity from computational cost. At the core of the MoE model is the router, which learns the underlying clustering structure of the input distribution in order to send input tokens to appropriate experts. However, latent clusters may be unidentifiable in high dimension, which causes slow convergence, susceptibility to data contamination, and overall degraded representations as the router is unable to perform appropriate token-expert matching. We examine the router through the lens of clustering optimization and derive optimal feature weights that maximally identify the latent clusters. We use these weights to compute the token-expert routing assignments in an adaptively transformed space that promotes well-separated clusters, which helps identify the best-matched expert for each token. In particular, for each expert cluster, we compute a set of weights that scales features according to whether that expert clusters tightly along that feature. We term this novel router the Adaptive Clustering (AC) router. Our AC router enables the MoE model to obtain three connected benefits: 1) faster convergence, 2) better robustness to data corruption, and 3) overall performance improvement, as experts are specialized in semantically distinct regions of the input space. We empirically demonstrate the advantages of our AC router over baseline routing methods when applied on a variety of MoE backbones for language modeling and image recognition tasks in both clean and corrupted settings.

Title: SentiFormer: Metadata Enhanced Transformer for Image Sentiment Analysis

Authors: Bin Feng, Shulan Ruan, Mingzheng Yang, Dongxuan Han, Huijie Liu, Kai Zhang, Qi Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15322
Pdf URL: https://arxiv.org/pdf/2502.15322
Copy Paste: [[2502.15322]] SentiFormer: Metadata Enhanced Transformer for Image Sentiment Analysis(https://arxiv.org/abs/2502.15322)
Keywords: transformer
Abstract: As more and more internet users post images online to express their daily emotions, image sentiment analysis has attracted increasing attention. Recently, researchers generally tend to design different neural networks to extract visual features from images for sentiment analysis. Despite the significant progress, metadata, the data (e.g., text descriptions and keyword tags) for describing the image, has not been sufficiently explored in this task. In this paper, we propose a novel Metadata Enhanced Transformer for sentiment analysis (SentiFormer) to fuse multiple metadata and the corresponding image into a unified framework. Specifically, we first obtain multiple metadata of the image and unify the representations of diverse data. To adaptively learn the appropriate weights for each metadata, we then design an adaptive relevance learning module to highlight more effective information while suppressing weaker ones. Moreover, we further develop a cross-modal fusion module to fuse the adaptively learned representations and make the final prediction. Extensive experiments on three publicly available datasets demonstrate the superiority and rationality of our proposed method.

Title: Detecting Future-related Contexts of Entity Mentions

Authors: Puneet Prashar, Krishna Mohan Shukla, Adam Jatowt
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2502.15332
Pdf URL: https://arxiv.org/pdf/2502.15332
Copy Paste: [[2502.15332]] Detecting Future-related Contexts of Entity Mentions(https://arxiv.org/abs/2502.15332)
Keywords: large language model
Abstract: The ability to automatically identify whether an entity is referenced in a future context has multiple applications, including decision making, planning, and trend forecasting. This paper focuses on detecting implicit future references in entity-centric texts, addressing the growing need for automated temporal analysis in information processing. We first present a novel dataset of 19,540 sentences built around popular entities sourced from Wikipedia, which consists of future-related and non-future-related contexts in which those entities appear. As a second contribution, we evaluate the performance of several language models, including Large Language Models (LLMs), on the task of distinguishing future-oriented content in the absence of explicit temporal references.

Title: Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment

Authors: Pedram Zaree, Md Abdullah Al Mamun, Quazi Mishkatul Alam, Yue Dong, Ihsen Alouani, Nael Abu-Ghazaleh
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.15334
Pdf URL: https://arxiv.org/pdf/2502.15334
Copy Paste: [[2502.15334]] Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment(https://arxiv.org/abs/2502.15334)
Keywords: defense, attack, large language model
Abstract: Recent research has shown that carefully crafted jailbreak inputs can induce large language models to produce harmful outputs, despite safety measures such as alignment. It is important to anticipate the range of potential jailbreak attacks to guide effective defenses and accurate assessment of model safety. In this paper, we present a new approach for generating highly effective jailbreak attacks that manipulate the attention of the model to selectively strengthen or weaken attention among different parts of the prompt. By harnessing attention loss, we develop more effective jailbreak attacks that are also transferable. The attacks amplify the success rate of existing jailbreak algorithms including GCG, AutoDAN, and ReNeLLM, while lowering their generation cost (for example, the amplified GCG attack achieves 91.2% ASR, vs. 67.9% for the original attack on Llama2-7B/AdvBench, using less than a third of the generation time).

Title: Stepwise Informativeness Search for Improving LLM Reasoning

Authors: Siyuan Wang, Enda Zhao, Zhongyu Wei, Xiang Ren
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15335
Pdf URL: https://arxiv.org/pdf/2502.15335
Copy Paste: [[2502.15335]] Stepwise Informativeness Search for Improving LLM Reasoning(https://arxiv.org/abs/2502.15335)
Keywords: large language model
Abstract: Advances in Large Language Models (LLMs) have significantly improved multi-step reasoning through generating free-text rationales. However, recent studies show that LLMs tend to lose focus over the middle of long contexts. This raises concerns that as reasoning progresses, LLMs may overlook information in earlier steps when decoding subsequent steps, leading them to generate unreliable and redundant rationales. To address this, we propose guiding LLMs to generate more accurate and concise step-by-step rationales by (1) proactively referencing information from underutilized prior steps, and (2) minimizing redundant information between new and existing steps. We introduce stepwise informativeness search, an inference-time tree search framework incorporating two selection heuristics: grounding-guided selection, which prioritizes steps paying higher attention to underutilized steps; and novelty-guided selection, which encourages steps with novel conclusions. During rationale generation, we use a self-grounding strategy that prompts LLMs to explicitly reference relevant prior steps to provide premises before deduction at each step. Experimental results on four reasoning datasets demonstrate that our approach improves reasoning accuracy by generating higher-quality rationales with reduced errors and redundancy.

Title: Learning with Limited Shared Information in Multi-agent Multi-armed Bandit

Authors: Junning Shao, Siwei Wang, Zhixuan Fang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.15338
Pdf URL: https://arxiv.org/pdf/2502.15338
Copy Paste: [[2502.15338]] Learning with Limited Shared Information in Multi-agent Multi-armed Bandit(https://arxiv.org/abs/2502.15338)
Keywords: privacy
Abstract: Multi-agent multi-armed bandit (MAMAB) is a classic collaborative learning model and has gained much attention in recent years. However, existing studies do not consider the case where an agent may refuse to share all her information with others, e.g., when some of the data contains personal privacy. In this paper, we propose a novel limited shared information multi-agent multi-armed bandit (LSI-MAMAB) model in which each agent only shares the information that she is willing to share, and propose the Balanced-ETC algorithm to help multiple agents collaborate efficiently with limited shared information. Our analysis shows that Balanced-ETC is asymptotically optimal and its average regret (on each agent) approaches a constant when there are sufficient agents involved. Moreover, to encourage agents to participate in this collaborative learning, an incentive mechanism is proposed to make sure each agent can benefit from the collaboration system. Finally, we present experimental results to validate our theoretical results.

Title: PFSD: A Multi-Modal Pedestrian-Focus Scene Dataset for Rich Tasks in Semi-Structured Environments

Authors: Yueting Liu, Hanshi Wang, Yunfei Lei, Zhengjun Zha, Weiming Hu, Jin Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.15342
Pdf URL: https://arxiv.org/pdf/2502.15342
Copy Paste: [[2502.15342]] PFSD: A Multi-Modal Pedestrian-Focus Scene Dataset for Rich Tasks in Semi-Structured Environments(https://arxiv.org/abs/2502.15342)
Keywords: segmentation
Abstract: Recent advancements in autonomous driving perception have revealed exceptional capabilities within structured environments dominated by vehicular traffic. However, current perception models exhibit significant limitations in semi-structured environments, where dynamic pedestrians with more diverse, irregular movement and occlusion prevail. We attribute this shortcoming to the scarcity of high-quality datasets in semi-structured scenes, particularly concerning pedestrian perception and prediction. In this work, we present the multi-modal Pedestrian-Focused Scene Dataset (PFSD), rigorously annotated in semi-structured scenes in the nuScenes format. PFSD provides comprehensive multi-modal data annotations with point cloud segmentation, detection, and object IDs for tracking. It encompasses over 130,000 pedestrian instances captured across various scenarios with varying densities, movement patterns, and occlusions. Furthermore, to demonstrate the importance of addressing the challenges posed by more diverse and complex semi-structured environments, we propose a novel Hybrid Multi-Scale Fusion Network (HMFN). Specifically, to detect pedestrians in densely populated and occluded scenarios, our method effectively captures and fuses multi-scale features using a meticulously designed hybrid framework that integrates sparse and vanilla convolutions. Extensive experiments on PFSD demonstrate that HMFN attains improvement in mean Average Precision (mAP) over existing methods, thereby underscoring its efficacy in addressing the challenges of 3D pedestrian detection in complex semi-structured environments. Code and benchmark are available.

Title: Tokenization is Sensitive to Language Variation

Authors: Anna Wegmann, Dong Nguyen, David Jurgens
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15343
Pdf URL: https://arxiv.org/pdf/2502.15343
Copy Paste: [[2502.15343]] Tokenization is Sensitive to Language Variation(https://arxiv.org/abs/2502.15343)
Keywords: robust
Abstract: Variation in language is ubiquitous and often systematically linked to regional, social, and contextual factors. Tokenizers split texts into smaller units and might behave differently for less common linguistic forms. This might affect downstream LLM performance differently on two types of tasks: Tasks where the model should be robust to language variation (e.g., for semantic tasks like NLI, labels do not depend on whether a text uses British or American spelling) and tasks where the model should be sensitive to language variation (e.g., for form-based tasks like authorship verification, labels depend on whether a text uses British or American spelling). We pre-train BERT base models for the popular Byte-Pair Encoding algorithm to investigate how key algorithmic design choices impact downstream models' performances: fitting corpus, pre-tokenizer and vocabulary size. We find that the best tokenizer varies on the two task types -- with the pre-tokenizer having the biggest impact on performance. Further, we introduce a new approach to estimate tokenizer impact on downstream LLM performance, showing significant improvement over techniques like Rényi efficiency. We encourage more work on language variation and its relation to tokenizers and thus LLM performance.

Title: Efficiently Solving Discounted MDPs with Predictions on Transition Matrices

Authors: Lixing Lyu, Jiashuo Jiang, Wang Chi Cheung
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.15345
Pdf URL: https://arxiv.org/pdf/2502.15345
Copy Paste: [[2502.15345]] Efficiently Solving Discounted MDPs with Predictions on Transition Matrices(https://arxiv.org/abs/2502.15345)
Keywords: generative
Abstract: We study infinite-horizon Discounted Markov Decision Processes (DMDPs) under a generative model. Motivated by the Algorithm with Advice framework (Mitzenmacher and Vassilvitskii, 2022), we propose a novel framework to investigate how a prediction on the transition matrix can enhance the sample efficiency in solving DMDPs and improve sample complexity bounds. We focus on DMDPs with $N$ state-action pairs and discount factor $\gamma$. First, we provide an impossibility result that, without prior knowledge of the prediction accuracy, no sampling policy can compute an $\epsilon$-optimal policy with a sample complexity bound better than $\tilde{O}((1-\gamma)^{-3} N\epsilon^{-2})$, which matches the state-of-the-art minimax sample complexity bound with no prediction. As a complement, we propose an algorithm based on minimax optimization techniques that leverages the prediction on the transition matrix. Our algorithm achieves a sample complexity bound depending on the prediction error, and the bound is uniformly better than $\tilde{O}((1-\gamma)^{-4} N \epsilon^{-2})$, the previous best result derived from convex optimization methods. These theoretical findings are further supported by our numerical experiments.
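
As a worked illustration (not taken from the paper), the two bounds quoted in the abstract can be compared directly. Ignoring the constants and logarithmic factors hidden by $\tilde{O}$, they differ by a factor of $(1-\gamma)^{-1}$; the value $\gamma = 0.99$ below is an arbitrary example.

$$\frac{(1-\gamma)^{-4} N \epsilon^{-2}}{(1-\gamma)^{-3} N \epsilon^{-2}} = \frac{1}{1-\gamma}, \qquad \gamma = 0.99 \;\Rightarrow\; \frac{1}{1-\gamma} = 100.$$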

Title: Constructing a Norm for Children's Scientific Drawing: Distribution Features Based on Semantic Similarity of Large Language Models

Authors: Yi Zhang, Fan Wei, Jingyi Li, Yan Wang, Yanyan Yu, Jianli Chen, Zipo Cai, Xinyu Liu, Wei Wang, Peng Wang, Zhong Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15348
Pdf URL: https://arxiv.org/pdf/2502.15348
Copy Paste: [[2502.15348]] Constructing a Norm for Children's Scientific Drawing: Distribution Features Based on Semantic Similarity of Large Language Models(https://arxiv.org/abs/2502.15348)
Keywords: large language model
Abstract: The use of children's drawings to examine their conceptual understanding has been proven to be an effective method, but there are two major problems with previous research: 1. The content of the drawings heavily relies on the task, and the ecological validity of the conclusions is low; 2. The interpretation of drawings relies too much on the subjective feelings of the researchers. To address these issues, this study uses a Large Language Model (LLM) to identify 1420 children's scientific drawings (covering 9 scientific themes/concepts), and uses the word2vec algorithm to calculate their semantic similarity. The study explores whether there are consistent drawing representations for children on the same theme, and attempts to establish a norm for children's scientific drawings, providing a baseline reference for follow-up children's drawing research. The results show that the representations of most drawings are consistent, with most semantic similarity scores greater than 0.8. At the same time, it was found that the consistency of the representation is independent of the accuracy (of the LLM's recognition), indicating the existence of consistency bias. In the subsequent exploration of influencing factors, we used the Kendall rank correlation coefficient to investigate the effects of Sample Size, Abstract Degree, and Focus Points on drawings, and used word frequency statistics to explore whether children represented abstract themes/concepts by reproducing what was taught in class.

Title: AttentionEngine: A Versatile Framework for Efficient Attention Mechanisms on Diverse Hardware Platforms

Authors: Feiyang Chen, Yu Cheng, Lei Wang, Yuqing Xia, Ziming Miao, Lingxiao Ma, Fan Yang, Jilong Xue, Zhi Yang, Mao Yang, Haibo Chen
Subjects: cs.CL, cs.LG, cs.PF
Abstract URL: https://arxiv.org/abs/2502.15349
Pdf URL: https://arxiv.org/pdf/2502.15349
Copy Paste: [[2502.15349]] AttentionEngine: A Versatile Framework for Efficient Attention Mechanisms on Diverse Hardware Platforms(https://arxiv.org/abs/2502.15349)
Keywords: robust, transformer, large language model
Abstract: Transformers and large language models (LLMs) have revolutionized machine learning, with attention mechanisms at the core of their success. As the landscape of attention variants expands, so too do the challenges of optimizing their performance, particularly across different hardware platforms. Current optimization strategies are often narrowly focused, requiring extensive manual intervention to accommodate changes in model configurations or hardware environments. In this paper, we introduce AttentionEngine, a comprehensive framework designed to streamline the optimization of attention mechanisms across heterogeneous hardware backends. By decomposing attention computation into modular operations with customizable components, AttentionEngine enables flexible adaptation to diverse algorithmic requirements. The framework further automates kernel optimization through a combination of programmable templates and a robust cross-platform scheduling strategy. Empirical results reveal performance gains of up to 10x on configurations beyond the reach of existing methods. AttentionEngine offers a scalable, efficient foundation for developing and deploying attention mechanisms with minimal manual tuning. Our code has been open-sourced and is available at this https URL.

Title: Evaluating Social Biases in LLM Reasoning

Authors: Xuyang Wu, Jinming Nian, Zhiqiang Tao, Yi Fang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15361
Pdf URL: https://arxiv.org/pdf/2502.15361
Copy Paste: [[2502.15361]] Evaluating Social Biases in LLM Reasoning(https://arxiv.org/abs/2502.15361)
Keywords: large language model
Abstract: In the recent development of AI reasoning, large language models (LLMs) are trained to automatically generate chain-of-thought reasoning steps, which have demonstrated compelling performance on math and coding tasks. However, when bias is mixed within the reasoning process to form strong logical arguments, it could cause even more harmful results and further induce hallucinations. In this paper, we evaluate the 8B and 32B variants of DeepSeek-R1 against their instruction-tuned counterparts on the BBQ dataset, and investigate the bias that is elicited and amplified through reasoning steps. To the best of our knowledge, this empirical study is the first to assess bias issues in LLM reasoning.

Title: Weakly Supervised Video Scene Graph Generation via Natural Language Supervision

Authors: Kibum Kim, Kanghoon Yoon, Yeonjun In, Jaehyeong Jeon, Jinyoung Moon, Donghyun Kim, Chanyoung Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.15370
Pdf URL: https://arxiv.org/pdf/2502.15370
Copy Paste: [[2502.15370]] Weakly Supervised Video Scene Graph Generation via Natural Language Supervision(https://arxiv.org/abs/2502.15370)
Keywords: large language model, segmentation
Abstract: Existing Video Scene Graph Generation (VidSGG) studies are trained in a fully supervised manner, which requires all frames in a video to be annotated, thereby incurring high annotation cost compared to Image Scene Graph Generation (ImgSGG). Although the annotation cost of VidSGG can be alleviated by adopting a weakly supervised approach commonly used for ImgSGG (WS-ImgSGG) that uses image captions, there are two key reasons that hinder such a naive adoption: 1) Temporality within video captions, i.e., unlike image captions, video captions include temporal markers (e.g., before, while, then, after) that indicate time related details, and 2) Variability in action duration, i.e., unlike human actions in image captions, human actions in video captions unfold over varying duration. To address these issues, we propose a Natural Language-based Video Scene Graph Generation (NL-VSGG) framework that only utilizes the readily available video captions for training a VidSGG model. NL-VSGG consists of two key modules: Temporality-aware Caption Segmentation (TCS) module and Action Duration Variability-aware caption-frame alignment (ADV) module. Specifically, TCS segments the video captions into multiple sentences in a temporal order based on a Large Language Model (LLM), and ADV aligns each segmented sentence with appropriate frames considering the variability in action duration. Our approach leads to a significant enhancement in performance compared to simply applying the WS-ImgSGG pipeline to VidSGG on the Action Genome dataset. As a further benefit of utilizing the video captions as weak supervision, we show that the VidSGG model trained by NL-VSGG is able to predict a broader range of action classes that are not included in the training data, which makes our framework practical in reality.

Title: MOVE: A Mixture-of-Vision-Encoders Approach for Domain-Focused Vision-Language Processing

Authors: Matvey Skripkin, Elizaveta Goncharova, Dmitrii Tarasov, Andrey Kuznetsov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.15381
Pdf URL: https://arxiv.org/pdf/2502.15381
Copy Paste: [[2502.15381]] MOVE: A Mixture-of-Vision-Encoders Approach for Domain-Focused Vision-Language Processing(https://arxiv.org/abs/2502.15381)
Keywords: large language model
Abstract: Multimodal language models (MLMs) integrate visual and textual information by coupling a vision encoder with a large language model through a specific adapter. While existing approaches commonly rely on a single pre-trained vision encoder, there is a wide variety of specialized encoders that can boost a model's performance in distinct domains. In this work, we propose MOVE (Mixture of Vision Encoders), a simple yet effective approach to leverage multiple pre-trained encoders for specialized multimodal tasks. MOVE automatically routes inputs to the most appropriate encoder among candidates such as Unichat, InternViT, and Texify, thereby enhancing performance across a diverse set of benchmarks, including ChartQA, MMBench, and MMMU. Experimental results demonstrate that MOVE achieves competitive accuracy without incurring the complexities of image slicing for high-resolution images.

Title: Enhancing Vehicle Make and Model Recognition with 3D Attention Modules

Authors: Narges Semiromizadeh, Omid Nejati Manzari, Shahriar B. Shokouhi, Sattar Mirzakuchaki
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15398
Pdf URL: https://arxiv.org/pdf/2502.15398
Copy Paste: [[2502.15398]] Enhancing Vehicle Make and Model Recognition with 3D Attention Modules(https://arxiv.org/abs/2502.15398)
Keywords: transformer
Abstract: Vehicle make and model recognition (VMMR) is a crucial component of the Intelligent Transport System, garnering significant attention in recent years. VMMR has been widely utilized for detecting suspicious vehicles, monitoring urban traffic, and supporting autonomous driving systems. The complexity of VMMR arises from the subtle visual distinctions among vehicle models and the wide variety of classes produced by manufacturers. Convolutional Neural Networks (CNNs), a prominent type of deep learning model, have been extensively employed in various computer vision tasks, including VMMR, yielding remarkable results. As VMMR is a fine-grained classification problem, it primarily faces inter-class similarity and intra-class variation challenges. In this study, we implement an attention module to address these challenges and enhance the model's focus on critical areas containing distinguishing features. This module, which does not increase the parameters of the original model, generates three-dimensional (3-D) attention weights to refine the feature map. Our proposed model integrates the attention module into two different locations within the middle section of a convolutional model, where the feature maps from these sections offer sufficient information about the input frames without being overly detailed or overly coarse. The performance of our proposed model, along with state-of-the-art (SOTA) convolutional and transformer-based models, was evaluated using the Stanford Cars dataset. Our proposed model achieved the highest accuracy, 90.69%, among the compared models.

Title: Problem-Solving Logic Guided Curriculum In-Context Learning for LLMs Complex Reasoning

Authors: Xuetao Ma, Wenbin Jiang, Hua Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15401
Pdf URL: https://arxiv.org/pdf/2502.15401
Copy Paste: [[2502.15401]] Problem-Solving Logic Guided Curriculum In-Context Learning for LLMs Complex Reasoning(https://arxiv.org/abs/2502.15401)
Keywords: large language model
Abstract: In-context learning (ICL) can significantly enhance the complex reasoning capabilities of large language models (LLMs), with the key lying in the selection and ordering of demonstration examples. Previous methods typically relied on simple features to measure the relevance between examples. We argue that these features are not sufficient to reflect the intrinsic connections between examples. In this study, we propose a curriculum ICL strategy guided by problem-solving logic. We select demonstration examples by analyzing the problem-solving logic and order them based on curriculum learning. Specifically, we constructed a problem-solving logic instruction set based on the BREAK dataset and fine-tuned a language model to analyze the problem-solving logic of examples. Subsequently, we selected appropriate demonstration examples based on problem-solving logic and assessed their difficulty according to the number of problem-solving steps. In accordance with the principles of curriculum learning, we ordered the examples from easy to hard to serve as contextual prompts. Experimental results on multiple benchmarks indicate that our method outperforms previous ICL approaches in terms of performance and efficiency, effectively enhancing the complex reasoning capabilities of LLMs. Our project will be publicly available subsequently.
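
The easy-to-hard ordering described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Demo` class, its `logic_steps` field, and the prompt layout are hypothetical, and the step lists are assumed to come from some upstream analysis (the paper uses a fine-tuned language model for that). Only the ordering by number of problem-solving steps is taken from the abstract.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Demo:
    """A demonstration example (hypothetical structure for illustration)."""
    question: str
    answer: str
    # Assumed to be produced upstream by a problem-solving-logic analyzer.
    logic_steps: List[str] = field(default_factory=list)


def order_easy_to_hard(demos: List[Demo]) -> List[Demo]:
    # Difficulty proxy from the abstract: number of problem-solving steps.
    return sorted(demos, key=lambda d: len(d.logic_steps))


def build_prompt(demos: List[Demo], query: str) -> str:
    # Demonstrations go first (easy to hard), the new query goes last.
    blocks = [f"Q: {d.question}\nA: {d.answer}" for d in order_easy_to_hard(demos)]
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)


demos = [
    Demo("hard question", "hard answer", ["select", "filter", "aggregate"]),
    Demo("easy question", "easy answer", ["select"]),
    Demo("medium question", "medium answer", ["select", "filter"]),
]
prompt = build_prompt(demos, "new question")
```

Because `sorted` is stable, demonstrations with the same number of steps keep their original relative order.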

Title: Evaluate with the Inverse: Efficient Approximation of Latent Explanation Quality Distribution

Authors: Carlos Eiras-Franco, Anna Hedström, Marina M.-C. Höhne
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.15403
Pdf URL: https://arxiv.org/pdf/2502.15403
Copy Paste: [[2502.15403]] Evaluate with the Inverse: Efficient Approximation of Latent Explanation Quality Distribution(https://arxiv.org/abs/2502.15403)
Keywords: robust
Abstract: Obtaining high-quality explanations of a model's output enables developers to identify and correct biases, align the system's behavior with human values, and ensure ethical compliance. Explainable Artificial Intelligence (XAI) practitioners rely on specific measures to gauge the quality of such explanations. These measures assess key attributes, such as how closely an explanation aligns with a model's decision process (faithfulness), how accurately it pinpoints the relevant input features (localization), and its consistency across different cases (robustness). Despite providing valuable information, these measures do not fully address a critical concern of practitioners: how does the quality of a given explanation compare to other potential explanations? Traditionally, the quality of an explanation has been assessed by comparing it to a randomly generated counterpart. This paper introduces an alternative: the Quality Gap Estimate (QGE). The QGE method offers a direct comparison to what can be viewed as the 'inverse' explanation, one that conceptually represents the antithesis of the original explanation. Our extensive testing across multiple model architectures, datasets, and established quality metrics demonstrates that the QGE method is superior to the traditional approach. Furthermore, we show that QGE enhances the statistical reliability of these quality assessments. This advance represents a significant step toward a more insightful evaluation of explanations that enables a more effective inspection of a model's behavior.

Title: HiFi-KPI: A Dataset for Hierarchical KPI Extraction from Earnings Filings

Authors: Rasmus Aavang, Giovanni Rizzi, Rasmus Bøggild, Alexandre Iolov, Mike Zhang, Johannes Bjerva
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15411
Pdf URL: https://arxiv.org/pdf/2502.15411
Copy Paste: [[2502.15411]] HiFi-KPI: A Dataset for Hierarchical KPI Extraction from Earnings Filings(https://arxiv.org/abs/2502.15411)
Keywords: extraction, large language model
Abstract: The U.S. Securities and Exchange Commission (SEC) requires that public companies file financial reports tagging numbers with the machine-readable inline eXtensible Business Reporting Language (iXBRL) standard. However, the highly complex and highly granular taxonomy defined by iXBRL limits label transferability across domains. In this paper, we introduce the Hierarchical Financial Key Performance Indicator (HiFi-KPI) dataset, designed to facilitate numerical KPI extraction at specified levels of granularity from unstructured financial text. Our approach organizes a 218,126-label hierarchy using a taxonomy-based grouping method, investigating which taxonomy layer provides the most meaningful structure. HiFi-KPI comprises ~1.8M paragraphs and ~5M entities, each linked to a label in the iXBRL-specific calculation and presentation taxonomies. We provide baselines using encoder-based approaches and structured extraction using Large Language Models (LLMs). To simplify LLM inference and evaluation, we additionally release HiFi-KPI Lite, a manually curated subset with four expert-mapped labels. We publicly release all artifacts.

Title: MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models

Authors: Suraj Racha, Prashant Joshi, Anshika Raman, Nikita Jangid, Mridul Sharma, Ganesh Ramakrishnan, Nirmal Punjabi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15418
Pdf URL: https://arxiv.org/pdf/2502.15418
Copy Paste: [[2502.15418]] MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models(https://arxiv.org/abs/2502.15418)
Keywords: large language model
Abstract: Mental health remains a challenging problem all over the world, with issues like depression and anxiety becoming increasingly common. Large Language Models (LLMs) have seen wide application in healthcare, specifically in answering medical questions. However, there is a lack of standard benchmarking datasets for question answering (QA) in mental health. Our work presents a novel multiple-choice dataset, MHQA (Mental Health Question Answering), for benchmarking language models (LMs). Previous mental health datasets have focused primarily on text classification into specific labels or disorders. MHQA, on the other hand, presents question answering for mental health focused on four key domains: anxiety, depression, trauma, and obsessive/compulsive issues, with diverse question types, namely, factoid, diagnostic, prognostic, and preventive. We use PubMed abstracts as the primary source for QA. We develop a rigorous pipeline for LLM-based identification of information from abstracts based on various selection criteria and for converting it into QA pairs. Further, valid QA pairs are extracted based on post-hoc validation criteria. Overall, our MHQA dataset consists of 2,475 expert-verified gold-standard instances called MHQA-gold and ~56.1k pairs pseudo-labeled using external medical references. We report F1 scores on different LLMs along with few-shot and supervised fine-tuning experiments, and further discuss insights from the scores.

Title: Beyond Translation: LLM-Based Data Generation for Multilingual Fact-Checking

Authors: Yi-Ling Chung, Aurora Cobo, Pablo Serna
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2502.15419
Pdf URL: https://arxiv.org/pdf/2502.15419
Copy Paste: [[2502.15419]] Beyond Translation: LLM-Based Data Generation for Multilingual Fact-Checking(https://arxiv.org/abs/2502.15419)
Keywords: robust, large language model
Abstract: Robust automatic fact-checking systems have the potential to combat online misinformation at scale. However, most existing research primarily focuses on English. In this paper, we introduce MultiSynFact, the first large-scale multilingual fact-checking dataset containing 2.2M claim-source pairs designed to support Spanish, German, English, and other low-resource languages. Our dataset generation pipeline leverages Large Language Models (LLMs), integrating external knowledge from Wikipedia and incorporating rigorous claim validation steps to ensure data quality. We evaluate the effectiveness of MultiSynFact across multiple models and experimental settings. Additionally, we open-source a user-friendly framework to facilitate further research in multilingual fact-checking and dataset generation.

Title: Evaluating Multimodal Generative AI with Korean Educational Standards

Authors: Sanghee Park, Geewook Kim
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2502.15422
Pdf URL: https://arxiv.org/pdf/2502.15422
Copy Paste: [[2502.15422]] Evaluating Multimodal Generative AI with Korean Educational Standards(https://arxiv.org/abs/2502.15422)
Keywords: generative
Abstract: This paper presents the Korean National Educational Test Benchmark (KoNET), a new benchmark designed to evaluate Multimodal Generative AI Systems using Korean national educational tests. KoNET comprises four exams: the Korean Elementary General Educational Development Test (KoEGED), Middle (KoMGED), High (KoHGED), and College Scholastic Ability Test (KoCSAT). These exams are renowned for their rigorous standards and diverse questions, facilitating a comprehensive analysis of AI performance across different educational levels. By focusing on Korean, KoNET provides insights into model performance in less-explored languages. We assess a range of models - open-source, open-access, and closed APIs - by examining difficulties, subject diversity, and human error rates. The code and dataset builder will be made fully open-sourced at this https URL.

Title: Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs

Authors: Giulio Zizzo, Giandomenico Cornacchia, Kieran Fraser, Muhammad Zaid Hameed, Ambrish Rawat, Beat Buesser, Mark Purcell, Pin-Yu Chen, Prasanna Sattigeri, Kush Varshney
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2502.15427
Pdf URL: https://arxiv.org/pdf/2502.15427
Copy Paste: [[2502.15427]] Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs(https://arxiv.org/abs/2502.15427)
Keywords: security, attack, robust, large language model
Abstract: As large language models (LLMs) become integrated into everyday applications, ensuring their robustness and security is increasingly critical. In particular, LLMs can be manipulated into unsafe behaviour by prompts known as jailbreaks. The variety of jailbreak styles is growing, necessitating the use of external defences known as guardrails. While many jailbreak defences have been proposed, not all defences are able to handle new out-of-distribution attacks due to the narrow segment of jailbreaks used to align them. Moreover, the lack of systematisation around defences has created significant gaps in their practical application. In this work, we perform systematic benchmarking across 15 different defences, considering a broad swathe of malicious and benign datasets. We find that there is significant performance variation depending on the style of jailbreak a defence is subject to. Additionally, we show that based on current datasets available for evaluation, simple baselines can display competitive out-of-distribution performance compared to many state-of-the-art defences. Code is available at this https URL.

Title: Pub-Guard-LLM: Detecting Fraudulent Biomedical Articles with Reliable Explanations

Authors: Lihu Chen, Shuojie Fu, Gabriel Freedman, Cemre Zor, Guy Martin, James Kinross, Uddhav Vaghela, Ovidiu Serban, Francesca Toni
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15429
Pdf URL: https://arxiv.org/pdf/2502.15429
Copy Paste: [[2502.15429]] Pub-Guard-LLM: Detecting Fraudulent Biomedical Articles with Reliable Explanations(https://arxiv.org/abs/2502.15429)
Keywords: explainability, large language model
Abstract: A significant and growing number of published scientific articles is found to involve fraudulent practices, posing a serious threat to the credibility and safety of research in fields such as medicine. We propose Pub-Guard-LLM, the first large language model-based system tailored to fraud detection of biomedical scientific articles. We provide three application modes for deploying Pub-Guard-LLM: vanilla reasoning, retrieval-augmented generation, and multi-agent debate. Each mode allows for textual explanations of predictions. To assess the performance of our system, we introduce an open-source benchmark, PubMed Retraction, comprising over 11K real-world biomedical articles, including metadata and retraction labels. We show that, across all modes, Pub-Guard-LLM consistently surpasses the performance of various baselines and provides more reliable explanations, namely explanations which are deemed more relevant and coherent than those generated by the baselines when evaluated by multiple assessment methods. By enhancing both detection performance and explainability in scientific fraud detection, Pub-Guard-LLM contributes to safeguarding research integrity with a novel, effective, open-source tool.

Title: Mixup Model Merge: Enhancing Model Merging Performance through Randomized Linear Interpolation

Authors: Yue Zhou, Yi Chang, Yuan Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15434
Pdf URL: https://arxiv.org/pdf/2502.15434
Copy Paste: [[2502.15434]] Mixup Model Merge: Enhancing Model Merging Performance through Randomized Linear Interpolation(https://arxiv.org/abs/2502.15434)
Keywords: robust, large language model
Abstract: Model merging integrates the parameters of multiple models into a unified model, combining their diverse capabilities. Existing model merging methods are often constrained by fixed parameter merging ratios. In this study, we propose Mixup Model Merge (M$^3$), an innovative approach inspired by the Mixup data augmentation technique. This method merges the parameters of two large language models (LLMs) by randomly generating linear interpolation ratios, allowing for a more flexible and comprehensive exploration of the parameter space. Extensive experiments demonstrate the superiority of our proposed M$^3$ method in merging fine-tuned LLMs: (1) it significantly improves performance across multiple tasks, (2) it enhances LLMs' out-of-distribution (OOD) robustness and adversarial robustness, (3) it achieves superior results when combined with sparsification techniques such as DARE, and (4) it offers a simple yet efficient solution that does not require additional computational resources. In conclusion, M$^3$ is a simple yet effective model merging method that significantly enhances the performance of the merged model by randomly generating contribution ratios for two fine-tuned LLMs. The code is available at this https URL.
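
The core idea in the abstract above, merging two models with a randomly drawn linear interpolation ratio, can be sketched as follows. This is a toy illustration rather than the authors' code: parameters are plain Python lists instead of tensors, the function name `mixup_merge` is hypothetical, and drawing the ratio from a Beta(alpha, alpha) distribution is an assumption borrowed from the original Mixup augmentation technique.

```python
import random


def mixup_merge(params_a, params_b, alpha=1.0, rng=None):
    """Merge two parameter dicts using one randomly drawn interpolation ratio."""
    rng = rng or random.Random()
    # Assumption: the ratio is sampled from Beta(alpha, alpha), as in Mixup.
    lam = rng.betavariate(alpha, alpha)
    merged = {
        name: [lam * a + (1.0 - lam) * b
               for a, b in zip(params_a[name], params_b[name])]
        for name in params_a
    }
    return merged, lam


# Toy "models" with a single three-element weight vector each.
model_a = {"layer.weight": [1.0, 2.0, 3.0]}
model_b = {"layer.weight": [3.0, 2.0, 1.0]}
merged, lam = mixup_merge(model_a, model_b, alpha=1.0, rng=random.Random(0))
```

Each merged weight is a convex combination of the two source weights, so it always lies between them; a fixed-ratio merge corresponds to replacing the sampled `lam` with a constant.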

Title: Single-pass Detection of Jailbreaking Input in Large Language Models

Authors: Leyla Naz Candogan, Yongtao Wu, Elias Abad Rocamora, Grigorios G. Chrysos, Volkan Cevher
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2502.15435
Pdf URL: https://arxiv.org/pdf/2502.15435
Copy Paste: [[2502.15435]] Single-pass Detection of Jailbreaking Input in Large Language Models(https://arxiv.org/abs/2502.15435)
Keywords: attack, large language model
Abstract: Defending aligned Large Language Models (LLMs) against jailbreaking attacks is a challenging problem, with existing approaches requiring multiple requests or even queries to auxiliary LLMs, making them computationally heavy. Instead, we focus on detecting jailbreaking input in a single forward pass. Our method, called Single Pass Detection (SPD), leverages the information carried by the logits to predict whether the output sentence will be harmful. This allows us to defend in just one forward pass. SPD not only detects attacks effectively on open-source models, but also minimizes the misclassification of harmless inputs. Furthermore, we show that SPD remains effective even without complete logit access in GPT-3.5 and GPT-4. We believe that our proposed method offers a promising approach to efficiently safeguard LLMs against adversarial attacks.

Title: Fed-SB: A Silver Bullet for Extreme Communication Efficiency and Performance in (Private) Federated LoRA Fine-Tuning

Authors: Raghav Singhal, Kaustubh Ponkshe, Rohit Vartak, Lav R. Varshney, Praneeth Vepakomma
Subjects: cs.LG, cs.AI, cs.CL, cs.DC
Abstract URL: https://arxiv.org/abs/2502.15436
Pdf URL: https://arxiv.org/pdf/2502.15436
Copy Paste: [[2502.15436]] Fed-SB: A Silver Bullet for Extreme Communication Efficiency and Performance in (Private) Federated LoRA Fine-Tuning(https://arxiv.org/abs/2502.15436)
Keywords: privacy, federate
Abstract: Low-Rank Adaptation (LoRA) has become ubiquitous for efficiently fine-tuning foundation models. However, federated fine-tuning using LoRA is challenging due to suboptimal updates arising from traditional federated averaging of individual adapters. Existing solutions either incur prohibitively high communication cost that scales linearly with the number of clients or suffer from performance degradation due to limited expressivity. We introduce Federated Silver Bullet (Fed-SB), a novel approach for federated fine-tuning of LLMs using LoRA-SB, a recently proposed low-rank adaptation method. LoRA-SB optimally aligns the optimization trajectory with the ideal low-rank full fine-tuning projection by learning a small square matrix (R) between adapters B and A, keeping other components fixed. Direct averaging of R guarantees exact updates, substantially reducing communication cost, which remains independent of the number of clients, and enables scalability. Fed-SB achieves state-of-the-art performance across commonsense reasoning, arithmetic reasoning, and language inference tasks while reducing communication costs by up to 230x. In private settings, Fed-SB further improves performance by (1) reducing trainable parameters, thereby lowering the noise required for differential privacy and (2) avoiding noise amplification introduced by other methods. Overall, Fed-SB establishes a new Pareto frontier in the tradeoff between communication and performance, offering an efficient and scalable solution for both private and non-private federated fine-tuning. Our code is publicly available at this https URL.
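
The claim in the abstract above that directly averaging R gives exact updates follows from linearity: when B and A are fixed and shared, the mean of the client updates B R_i A equals B (mean R_i) A. The sketch below checks this numerically and contrasts it with averaging separate LoRA adapters B_i and A_i, where the product of the averages generally differs from the average of the products. All matrices are arbitrary toy values and the helper names are hypothetical; this is not the authors' implementation.

```python
def matmul(x, y):
    """Multiply matrices given as nested lists (x is m x k, y is k x n)."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def mean_matrices(ms):
    """Element-wise mean of equally shaped matrices."""
    n = len(ms)
    return [[sum(m[i][j] for m in ms) / n for j in range(len(ms[0][0]))]
            for i in range(len(ms[0]))]


def allclose(x, y, tol=1e-9):
    return all(abs(a - b) <= tol for rx, ry in zip(x, y) for a, b in zip(rx, ry))


# Shared, frozen adapters (toy values): B is 3x2, A is 2x4.
B = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
A = [[1.0, 2.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
# Each client trains only its own small 2x2 matrix R.
client_R = [[[1.0, 0.5], [0.0, 2.0]],
            [[3.0, -0.5], [1.0, 0.0]],
            [[2.0, 0.0], [-1.0, 1.0]]]

avg_of_updates = mean_matrices([matmul(matmul(B, R), A) for R in client_R])
update_from_avg_R = matmul(matmul(B, mean_matrices(client_R)), A)
r_averaging_is_exact = allclose(avg_of_updates, update_from_avg_R)

# Contrast: averaging per-client adapters B_i and A_i separately.
client_B = [[[1.0], [0.0]], [[0.0], [1.0]]]
client_A = [[[1.0, 0.0]], [[0.0, 1.0]]]
exact = mean_matrices([matmul(b, a) for b, a in zip(client_B, client_A)])
naive = matmul(mean_matrices(client_B), mean_matrices(client_A))
adapter_averaging_is_exact = allclose(exact, naive)
```

In this sketch each client would only need to send its r x r matrix R (here 2x2), and the server aggregates a single r x r matrix regardless of how many clients participate.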

Title: When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models

Authors: Weilan Wang, Yu Mao, Dongdong Tang, Hongchao Du, Nan Guan, Chun Jason Xue
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15443
Pdf URL: https://arxiv.org/pdf/2502.15443
Copy Paste: [[2502.15443]] When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models(https://arxiv.org/abs/2502.15443)
Keywords: large language model
Abstract: Large language models (LLMs) exhibit excellent performance in various tasks. However, the memory requirements of LLMs present a great challenge when deploying on memory-limited devices, even for quantized LLMs. This paper introduces a framework to further compress LLMs after quantization, achieving a compression ratio of about 2.2x. A compression-aware quantization is first proposed to enhance model weight compressibility by re-scaling the model parameters before quantization, followed by a pruning method to improve it further. Building on this, we observe that decompression can be a bottleneck in practical scenarios. We then give a detailed analysis of the trade-off between memory usage and latency brought by the proposed method. A speed-adaptive method is proposed to overcome this bottleneck. The experimental results show that inference with the compressed model can achieve a 40% reduction in memory size with negligible loss in accuracy and inference speed.

Title: MVIP -- A Dataset and Methods for Application Oriented Multi-View and Multi-Modal Industrial Part Recognition

Authors: Paul Koch, Marian Schlüter, Jörg Krüger
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15448
Pdf URL: https://arxiv.org/pdf/2502.15448
Copy Paste: [[2502.15448]] MVIP -- A Dataset and Methods for Application Oriented Multi-View and Multi-Modal Industrial Part Recognition(https://arxiv.org/abs/2502.15448)
Keywords: robust
Abstract: We present MVIP, a novel dataset for multi-modal and multi-view application-oriented industrial part recognition. Here we are the first to combine a calibrated RGBD multi-view dataset with additional object context such as physical properties, natural language, and super-classes. The current portfolio of available datasets offers a wide range of representations to design and benchmark related methods. In contrast to existing classification challenges, industrial recognition applications offer controlled multi-modal environments but at the same time have different problems than traditional 2D/3D classification challenges. Frequently, industrial applications must deal with small or growing amounts of training data, visually similar parts, and varying object sizes, while requiring a robust near 100% top 5 accuracy under cost and time constraints. Current methods tackle such challenges individually, but direct adoption of these methods within industrial applications is complex and requires further research. Our main goal with MVIP is to study and push transferability of various state-of-the-art methods within related downstream tasks towards an efficient deployment of industrial classifiers. Additionally, with MVIP we intend to advance research on several modality fusion topics, (automated) synthetic data generation, and complex data sampling, all combined in a single application-oriented benchmark.

Title: A fast convergence algorithm based on binary integer programming for expert load balancing in MoE LLMs

Authors: Yuan Sun
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2502.15451
Pdf URL: https://arxiv.org/pdf/2502.15451
Copy Paste: [[2502.15451]] A fast convergence algorithm based on binary integer programming for expert load balancing in MoE LLMs(https://arxiv.org/abs/2502.15451)
Keywords: large language model
Abstract: MoE (Mixture-of-Experts) architectures appear frequently in large language models, and recent models can have over one hundred experts. However, expert load imbalance commonly arises in MoE model pre-training and can cause routing collapse or increased computational overhead. In order to balance loads on experts, we propose BIP-Based Balancing, an expert load balancing algorithm based on binary integer programming (BIP). The algorithm maintains an additional vector q that can help change the top-K order of s by solving a binary integer program at very small time cost. In simulation experiments, we observe that BIP-Based Balancing makes the imbalance disappear very quickly, while the final sum of routing scores decreases very little. Our algorithm achieves a nearly perfect trade-off between expert load balance and pre-training efficiency in simulation.
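
To make the role of an auxiliary vector q concrete, here is a minimal sketch of top-K routing with a per-expert offset. It is not the paper's BIP-based algorithm: the abstract does not say how q is computed or applied, so the fixed q, the subtraction s - q, the reading of s as a token-by-expert routing score matrix, and the function names are all assumptions made for illustration.

```python
import numpy as np

def route_top_k(s, q, k):
    """Pick the k experts with the highest adjusted score s - q for each token."""
    adjusted = s - q  # q is broadcast over tokens (one offset per expert)
    return np.argsort(-adjusted, axis=1)[:, :k]

def expert_loads(choice, num_experts):
    """Count how many tokens were routed to each expert."""
    return np.bincount(choice.ravel(), minlength=num_experts)

# Hypothetical routing scores: 4 tokens, 2 experts; expert 0 always scores highest.
s = np.array([[0.90, 0.10],
              [0.80, 0.20],
              [0.60, 0.40],
              [0.55, 0.45]])

plain = route_top_k(s, q=np.zeros(2), k=1)
adjusted = route_top_k(s, q=np.array([0.3, 0.0]), k=1)  # hand-picked offset

loads_plain = expert_loads(plain, 2)
loads_adjusted = expert_loads(adjusted, 2)

# Sum of the original (unadjusted) scores of the selected experts.
score_plain = float(np.take_along_axis(s, plain, axis=1).sum())
score_adjusted = float(np.take_along_axis(s, adjusted, axis=1).sum())

print(loads_plain, loads_adjusted, score_plain, score_adjusted)
```

In this toy case the offset moves two tokens to the second expert, balancing the load at the cost of a smaller sum of selected routing scores, which is the trade-off the abstract refers to.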

Title: R-LoRA: Random Initialization of Multi-Head LoRA for Multi-Task Learning

Authors: Jinda Liu, Yi Chang, Yuan Wu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15455
Pdf URL: https://arxiv.org/pdf/2502.15455
Copy Paste: [[2502.15455]] R-LoRA: Random Initialization of Multi-Head LoRA for Multi-Task Learning(https://arxiv.org/abs/2502.15455)
Keywords: large language model
Abstract: Fine-tuning large language models (LLMs) is prohibitively expensive in terms of computational and memory costs. Low-rank Adaptation (LoRA), as one of the most popular parameter-efficient fine-tuning (PEFT) methods, offers a cost-effective alternative by approximating the model changes $\Delta W \in \mathbb{R}^{m \times n}$ through the product of down-projection matrix $A \in \mathbb{R}^{m \times r}$ and head matrix $B \in \mathbb{R}^{r \times n}$, where $r \ll \min(m, n)$. In real-world scenarios, LLMs are fine-tuned on data from multiple domains to perform tasks across various fields, embodying multi-task learning (MTL). LoRA often underperforms in such complex scenarios. To enhance LoRA's capability in multi-task learning, we propose R-LoRA, which incorporates Multi-Head Randomization. Multi-Head Randomization diversifies the head matrices through Multi-Head Random Initialization and Multi-Head Dropout, enabling more efficient learning of task-specific features while maintaining shared knowledge representation. Extensive experiments demonstrate that R-LoRA is better at capturing task-specific knowledge, thereby improving performance in multi-task scenarios. The code is available at this https URL.
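
The cost saving behind the low-rank factorization is simple arithmetic. The sketch below counts trainable parameters for a full update of shape m x n versus the factors A (m x r) and B (r x n) from the abstract. The layer size, the rank, and the multi-head layout (one shared A plus several head matrices) are assumptions chosen for illustration; the abstract does not specify them.

```python
def full_update_params(m: int, n: int) -> int:
    """Parameters in a dense update Delta W of shape m x n."""
    return m * n

def lora_params(m: int, n: int, r: int) -> int:
    """Parameters in A (m x r) plus B (r x n)."""
    return r * (m + n)

def multi_head_lora_params(m: int, n: int, r: int, heads: int) -> int:
    """Assumed layout: one shared A (m x r) and `heads` head matrices B_k (r x n)."""
    return m * r + heads * r * n

m, n, r = 4096, 4096, 8                       # hypothetical layer size and rank
full = full_update_params(m, n)               # 16_777_216
low_rank = lora_params(m, n, r)               # 65_536
ratio = full // low_rank                      # 256
three_heads = multi_head_lora_params(m, n, r, heads=3)  # 131_072

print(full, low_rank, ratio, three_heads)
```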

Title: Memory Helps, but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs

Authors: Gengyuan Zhang, Mingcong Ding, Tong Liu, Yao Zhang, Volker Tresp
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.15457
Pdf URL: https://arxiv.org/pdf/2502.15457
Copy Paste: [[2502.15457]] Memory Helps, but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs(https://arxiv.org/abs/2502.15457)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) have demonstrated strong performance in understanding videos holistically, yet their ability to process streaming videos, in which videos are treated as a sequence of visual events, remains underexplored. Intuitively, leveraging past events as memory can enrich contextual and temporal understanding of the current event. In this paper, we show that leveraging memories as contexts helps MLLMs better understand video events. However, because such memories rely on predictions of preceding events, they may contain misinformation, leading to confabulation and degraded performance. To address this, we propose a confabulation-aware memory modification method that mitigates confabulated memory for memory-enhanced event understanding.

Title: Decoding for Punctured Convolutional and Turbo Codes: A Deep Learning Solution for Protocols Compliance

Authors: Yongli Yan, Linglong Dai
Subjects: cs.LG, cs.IT
Abstract URL: https://arxiv.org/abs/2502.15475
Pdf URL: https://arxiv.org/pdf/2502.15475
Copy Paste: [[2502.15475]] Decoding for Punctured Convolutional and Turbo Codes: A Deep Learning Solution for Protocols Compliance(https://arxiv.org/abs/2502.15475)
Keywords: robust
Abstract: Neural network-based decoding methods have shown promise in enhancing error correction performance, but traditional approaches struggle with the challenges posed by punctured codes. In particular, these methods fail to address the complexities of variable code rates and the need for protocol compatibility. This paper presents a unified Long Short-Term Memory (LSTM)-based decoding architecture specifically designed to overcome these challenges. The proposed method unifies punctured convolutional and Turbo codes. A puncture embedding mechanism integrates puncturing patterns directly into the network, enabling seamless adaptation to varying code rates, while balanced bit error rate training ensures robustness across different code lengths, rates, and channels, maintaining protocol flexibility. Extensive simulations in Additive White Gaussian Noise and Rayleigh fading channels demonstrate that the proposed approach outperforms conventional decoding techniques, providing significant improvements in decoding accuracy and robustness. These results underscore the potential of LSTM-based decoding as a promising solution for next-generation artificial intelligence powered communication systems.

Title: Confidence-Based Annotation Of Brain Tumours In Ultrasound

Authors: Alistair Weld, Luke Dixon, Alfie Roddan, Giulio Anichini, Sophie Camp, Stamatia Giannarou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.15484
Pdf URL: https://arxiv.org/pdf/2502.15484
Copy Paste: [[2502.15484]] Confidence-Based Annotation Of Brain Tumours In Ultrasound(https://arxiv.org/abs/2502.15484)
Keywords: segmentation
Abstract: Purpose: An investigation of the challenge of annotating discrete segmentations of brain tumours in ultrasound, with a focus on the issue of aleatoric uncertainty along the tumour margin, particularly for diffuse tumours. A segmentation protocol and method are proposed that incorporate this margin-related uncertainty while minimising the interobserver variance through reduced subjectivity, thereby diminishing annotator epistemic uncertainty. Approach: A sparse confidence method for annotation is proposed, based on a protocol designed using computer vision and radiology theory. Results: Output annotations using the proposed method are compared with the corresponding professional discrete annotation variance between the observers. A linear relationship was measured within the tumour margin region, with a Pearson correlation of 0.8. The downstream application was explored, comparing training using confidence annotations as soft labels with using the best discrete annotations as hard labels. In all evaluation folds, the Brier score was superior for the soft-label trained network. Conclusion: A formal framework was constructed to demonstrate the infeasibility of discrete annotation of brain tumours in B-mode ultrasound. Subsequently, a method for sparse confidence-based annotation is proposed and evaluated. Keywords: Brain tumours, ultrasound, confidence, annotation.
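
For readers unfamiliar with the metric used above, the Brier score is the mean squared difference between predicted probabilities and target values, so lower is better. The sketch below uses made-up per-pixel numbers and an illustrative function name; how the paper aggregates the score over pixels and folds is not stated in the abstract.

```python
def brier_score(predicted, targets):
    """Mean squared difference between predicted probabilities and targets."""
    if len(predicted) != len(targets):
        raise ValueError("predicted and targets must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, targets)) / len(predicted)

# Hypothetical tumour probabilities for three pixels and their binary labels.
predicted = [0.9, 0.2, 0.6]
targets = [1, 0, 1]

score = brier_score(predicted, targets)  # (0.01 + 0.04 + 0.16) / 3 = 0.07
print(round(score, 4))
```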

Title: ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models

Authors: Martina Miliani, Serenna Auriemma, Alessandro Bondielli, Emmanuele Chersoni, Lucia Passaro, Irene Sucameli, Alessandro Lenci
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15487
Pdf URL: https://arxiv.org/pdf/2502.15487
Copy Paste: [[2502.15487]] ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models(https://arxiv.org/abs/2502.15487)
Keywords: large language model
Abstract: Large Language Models (LLMs) are increasingly used in tasks requiring interpretive and inferential accuracy. In this paper, we introduce ExpliCa, a new dataset for evaluating LLMs in explicit causal reasoning. ExpliCa uniquely integrates both causal and temporal relations presented in different linguistic orders and explicitly expressed by linguistic connectives. The dataset is enriched with crowdsourced human acceptability ratings. We tested LLMs on ExpliCa through prompting and perplexity-based metrics. We assessed seven commercial and open-source LLMs, revealing that even top models struggle to reach 0.80 accuracy. Interestingly, models tend to confound temporal relations with causal ones, and their performance is also strongly influenced by the linguistic order of the events. Finally, perplexity-based scores and prompting performance are differently affected by model size.

Title: Network Resource Optimization for ML-Based UAV Condition Monitoring with Vibration Analysis

Authors: Alexandre Gemayel, Dimitrios Michael Manias, Abdallah Shami
Subjects: cs.LG, cs.NI, eess.SP, eess.SY
Abstract URL: https://arxiv.org/abs/2502.15491
Pdf URL: https://arxiv.org/pdf/2502.15491
Copy Paste: [[2502.15491]] Network Resource Optimization for ML-Based UAV Condition Monitoring with Vibration Analysis(https://arxiv.org/abs/2502.15491)
Keywords: extraction
Abstract: As smart cities begin to materialize, the role of Unmanned Aerial Vehicles (UAVs) and their reliability becomes increasingly important. One aspect of reliability relates to Condition Monitoring (CM), where Machine Learning (ML) models are leveraged to identify abnormal and adverse conditions. Given the resource-constrained nature of next-generation edge networks, the utilization of precious network resources must be minimized. This work explores the optimization of network resources for ML-based UAV CM frameworks. The developed framework uses experimental data and varies the feature extraction aggregation interval to optimize ML model selection. Additionally, by leveraging dimensionality reduction techniques, there is a 99.9% reduction in network resource consumption.

Title: Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models

Authors: Ya Wang, Zhijian Zhuo, Yutao Zeng, Xun Zhou, Jian Yang, Xiaoqing Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15499
Pdf URL: https://arxiv.org/pdf/2502.15499
Copy Paste: [[2502.15499]] Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models(https://arxiv.org/abs/2502.15499)
Keywords: transformer, large language model
Abstract: Training stability is a persistent challenge in the pre-training of large language models (LLMs), particularly for architectures such as Post-Norm Transformers, which are prone to gradient explosion and dissipation. In this paper, we propose Scale-Distribution Decoupling (SDD), a novel approach that stabilizes training by explicitly decoupling the scale and distribution of the weight matrix in fully-connected layers. SDD applies a normalization mechanism to regulate activations and a learnable scaling vector to maintain well-conditioned gradients, effectively preventing gradient explosion and dissipation. This separation improves optimization efficiency, particularly in deep networks, by ensuring stable gradient propagation. Experimental results demonstrate that our method stabilizes training across various LLM architectures and outperforms existing techniques in different normalization configurations. Furthermore, the proposed method is lightweight and compatible with existing frameworks, making it a practical solution for stabilizing LLM training. Code is available at this https URL.
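
One possible reading of "a normalization mechanism plus a learnable scaling vector" is sketched below. This is not the paper's formulation: the use of RMS normalization, its placement after the matrix product, and every name here are assumptions. The sketch only illustrates the decoupling idea, in that rescaling the weight matrix leaves the output unchanged, so output scale is carried by the separate vector alpha.

```python
import numpy as np

def sdd_like_linear(x, W, alpha, eps=1e-6):
    """Linear layer whose output scale is set by alpha, not by the norm of W.

    Assumed form: y = alpha * rms_normalize(x @ W.T).
    """
    z = x @ W.T
    rms = np.sqrt(np.mean(z ** 2, axis=-1, keepdims=True) + eps)
    return alpha * (z / rms)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))    # batch of 4 inputs, 16 features
W = rng.normal(size=(32, 16))   # weights of a 16 -> 32 layer
alpha = np.ones(32)             # scaling vector (would be learned in training)

y = sdd_like_linear(x, W, alpha)
y_scaled = sdd_like_linear(x, 10.0 * W, alpha)

# Multiplying W by 10 does not change the output: scale lives in alpha only.
scale_invariant = np.allclose(y, y_scaled, atol=1e-4)
print(scale_invariant)
```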

Title: Construction and Evaluation of LLM-based agents for Semi-Autonomous penetration testing

Authors: Masaya Kobayashi, Masane Fuchi, Amar Zanashir, Tomonori Yoneda, Tomohiro Takagi
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2502.15506
Pdf URL: https://arxiv.org/pdf/2502.15506
Copy Paste: [[2502.15506]] Construction and Evaluation of LLM-based agents for Semi-Autonomous penetration testing(https://arxiv.org/abs/2502.15506)
Keywords: security, attack, large language model
Abstract: With the emergence of high-performance large language models (LLMs) such as GPT, Claude, and Gemini, the autonomous and semi-autonomous execution of tasks has significantly advanced across various domains. However, in highly specialized fields such as cybersecurity, full autonomy remains a challenge. This difficulty primarily stems from the limitations of LLMs in reasoning capabilities and domain-specific knowledge. We propose a system that semi-autonomously executes complex cybersecurity workflows by employing multiple LLM modules to formulate attack strategies, generate commands, and analyze results, thereby addressing the aforementioned challenges. In our experiments using Hack The Box virtual machines, we confirmed that our system can autonomously construct attack strategies, issue appropriate commands, and automate certain processes, thereby reducing the need for manual intervention.

Title: Activation Steering in Neural Theorem Provers

Authors: Shashank Kirtania
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2502.15507
Pdf URL: https://arxiv.org/pdf/2502.15507
Copy Paste: [[2502.15507]] Activation Steering in Neural Theorem Provers(https://arxiv.org/abs/2502.15507)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown promise in proving formal theorems using proof assistants like Lean. However, current state-of-the-art language models struggle to predict the next step in proofs, leading practitioners to use different sampling techniques to improve LLMs' capabilities. We observe that the LLM is capable of predicting the correct tactic; however, it faces challenges in ranking it appropriately within the set of candidate tactics, affecting the overall selection process. To overcome this hurdle, we use activation steering to guide LLM responses and improve generations at inference time. Our results suggest that activation steering offers a promising lightweight alternative to specialized fine-tuning for enhancing theorem proving capabilities in LLMs, particularly valuable in resource-constrained environments.

Title: SALSA-RL: Stability Analysis in the Latent Space of Actions for Reinforcement Learning

Authors: Xuyang Li, Romit Maulik
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.15512
Pdf URL: https://arxiv.org/pdf/2502.15512
Copy Paste: [[2502.15512]] SALSA-RL: Stability Analysis in the Latent Space of Actions for Reinforcement Learning(https://arxiv.org/abs/2502.15512)
Keywords: interpretability
Abstract: Modern deep reinforcement learning (DRL) methods have made significant advances in handling continuous action spaces. However, real-world control systems--especially those requiring precise and reliable performance--often demand formal stability, and existing DRL approaches typically lack explicit mechanisms to ensure or analyze stability. To address this limitation, we propose SALSA-RL (Stability Analysis in the Latent Space of Actions), a novel RL framework that models control actions as dynamic, time-dependent variables evolving within a latent space. By employing a pre-trained encoder-decoder and a state-dependent linear system, our approach enables both stability analysis and interpretability. We demonstrated that SALSA-RL can be deployed in a non-invasive manner for assessing the local stability of actions from pretrained RL agents without compromising on performance across diverse benchmark environments. By enabling a more interpretable analysis of action generation, SALSA-RL provides a powerful tool for advancing the design, analysis, and theoretical understanding of RL systems.

Title: Depth-aware Fusion Method based on Image and 4D Radar Spectrum for 3D Object Detection

Authors: Yue Sun, Yeqiang Qian, Chunxiang Wang, Ming Yang
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2502.15516
Pdf URL: https://arxiv.org/pdf/2502.15516
Copy Paste: [[2502.15516]] Depth-aware Fusion Method based on Image and 4D Radar Spectrum for 3D Object Detection(https://arxiv.org/abs/2502.15516)
Keywords: robust
Abstract: Safety and reliability are crucial for the public acceptance of autonomous driving. To ensure accurate and reliable environmental perception, intelligent vehicles must exhibit accuracy and robustness in various environments. Millimeter-wave radar, known for its high penetration capability, can operate effectively in adverse weather conditions such as rain, snow, and fog. Traditional 3D millimeter-wave radars can only provide range, Doppler, and azimuth information for objects. Although the recent emergence of 4D millimeter-wave radars has added elevation resolution, the radar point clouds remain sparse due to Constant False Alarm Rate (CFAR) operations. In contrast, cameras offer rich semantic details but are sensitive to lighting and weather conditions. Hence, this paper leverages these two highly complementary and cost-effective sensors, 4D millimeter-wave radar and camera. By integrating 4D radar spectra with depth-aware camera images and employing attention mechanisms, we fuse texture-rich images with depth-rich radar data in the Bird's Eye View (BEV) perspective, enhancing 3D object detection. Additionally, we propose using GAN-based networks to generate depth images from radar spectra in the absence of depth sensors, further improving detection accuracy.

Title: PIP-KAG: Mitigating Knowledge Conflicts in Knowledge-Augmented Generation via Parametric Pruning

Authors: Pengcheng Huang, Zhenghao Liu, Yukun Yan, Xiaoyuan Yi, Hao Chen, Zhiyuan Liu, Maosong Sun, Tong Xiao, Ge Yu, Chenyan Xiong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15543
Pdf URL: https://arxiv.org/pdf/2502.15543
Copy Paste: [[2502.15543]] PIP-KAG: Mitigating Knowledge Conflicts in Knowledge-Augmented Generation via Parametric Pruning(https://arxiv.org/abs/2502.15543)
Keywords: large language model
Abstract: Knowledge-Augmented Generation (KAG) has shown great promise in updating the internal memory of Large Language Models (LLMs) by integrating external knowledge. However, KAG inevitably faces knowledge conflicts when the internal memory contradicts external information. Current approaches to mitigating these conflicts mainly focus on improving external knowledge utilization. However, these methods have shown only limited effectiveness in mitigating the knowledge conflict problem, as internal knowledge continues to influence the generation process of LLMs. In this paper, we propose a ParametrIc Pruning-based Knowledge-Augmented Generation (PIP-KAG) approach, which prunes internal knowledge of LLMs and incorporates a plug-and-play adaptation module to help LLMs better leverage external sources. Additionally, we construct the CoConflictQA benchmark based on the hallucination of LLMs to better evaluate contextual faithfulness during question answering. Experimental results on CoConflictQA demonstrate that PIP-KAG significantly reduces knowledge conflicts and improves context fidelity. Notably, PIP-KAG reduces LLM parameters by 13%, enhancing parameter efficiency in LLMs within the KAG framework. All code is available at this https URL.

Title: Estimating Vehicle Speed on Roadways Using RNNs and Transformers: A Video-based Approach

Authors: Sai Krishna Reddy Mareddy, Dhanush Upplapati, Dhanush Kumar Antharam
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2502.15545
Pdf URL: https://arxiv.org/pdf/2502.15545
Copy Paste: [[2502.15545]] Estimating Vehicle Speed on Roadways Using RNNs and Transformers: A Video-based Approach(https://arxiv.org/abs/2502.15545)
Keywords: robust, transformer
Abstract: This project explores the application of advanced machine learning models, specifically Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), and Transformers, to the task of vehicle speed estimation using video data. Traditional methods of speed estimation, such as radar and manual systems, are often constrained by high costs, limited coverage, and potential disruptions. In contrast, leveraging existing surveillance infrastructure and cutting-edge neural network architectures presents a non-intrusive, scalable solution. Our approach utilizes LSTM and GRU to effectively manage long-term dependencies within the temporal sequence of video frames, while Transformers are employed to harness their self-attention mechanisms, enabling the processing of entire sequences in parallel and focusing on the most informative segments of the data. This study demonstrates that both LSTM and GRU outperform basic Recurrent Neural Networks (RNNs) due to their advanced gating mechanisms. Furthermore, increasing the sequence length of input data consistently improves model accuracy, highlighting the importance of contextual information in dynamic environments. Transformers, in particular, show exceptional adaptability and robustness across varied sequence lengths and complexities, making them highly suitable for real-time applications in diverse traffic conditions. The findings suggest that integrating these sophisticated neural network models can significantly enhance the accuracy and reliability of automated speed detection systems, thus promising to revolutionize traffic management and road safety.

Title: A Defensive Framework Against Adversarial Attacks on Machine Learning-Based Network Intrusion Detection Systems

Authors: Benyamin Tafreshian, Shengzhi Zhang
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2502.15561
Pdf URL: https://arxiv.org/pdf/2502.15561
Copy Paste: [[2502.15561]] A Defensive Framework Against Adversarial Attacks on Machine Learning-Based Network Intrusion Detection Systems(https://arxiv.org/abs/2502.15561)
Keywords: security, defense, attack, robust
Abstract: As cyberattacks become increasingly sophisticated, advanced Network Intrusion Detection Systems (NIDS) are critical for modern network security. Traditional signature-based NIDS are inadequate against zero-day and evolving attacks. In response, machine learning (ML)-based NIDS have emerged as promising solutions; however, they are vulnerable to adversarial evasion attacks that subtly manipulate network traffic to bypass detection. To address this vulnerability, we propose a novel defensive framework that enhances the robustness of ML-based NIDS by simultaneously integrating adversarial training, dataset balancing techniques, advanced feature engineering, ensemble learning, and extensive model fine-tuning. We validate our framework using the NSL-KDD and UNSW-NB15 datasets. Experimental results show, on average, a 35% increase in detection accuracy and a 12.5% reduction in false positives compared to baseline models, particularly under adversarial conditions. The proposed defense against adversarial attacks significantly advances the practical deployment of robust ML-based NIDS in real-world networks.

Title: Model Privacy: A Unified Framework to Understand Model Stealing Attacks and Defenses

Authors: Ganghua Wang, Yuhong Yang, Jie Ding
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2502.15567
Pdf URL: https://arxiv.org/pdf/2502.15567
Copy Paste: [[2502.15567]] Model Privacy: A Unified Framework to Understand Model Stealing Attacks and Defenses(https://arxiv.org/abs/2502.15567)
Keywords: security, privacy, defense, attack, steal
Abstract: The use of machine learning (ML) has become increasingly prevalent in various domains, highlighting the importance of understanding and ensuring its safety. One pressing concern is the vulnerability of ML applications to model stealing attacks. These attacks involve adversaries attempting to recover a learned model through limited query-response interactions, such as those found in cloud-based services or on-chip artificial intelligence interfaces. While existing literature proposes various attack and defense strategies, these often lack a theoretical foundation and standardized evaluation criteria. In response, this work presents a framework called "Model Privacy", providing a foundation for comprehensively analyzing model stealing attacks and defenses. We establish a rigorous formulation for the threat model and objectives, propose methods to quantify the goodness of attack and defense strategies, and analyze the fundamental tradeoffs between utility and privacy in ML models. Our developed theory offers valuable insights into enhancing the security of ML models, especially highlighting the importance of the attack-specific structure of perturbations for effective defenses. We demonstrate the application of model privacy from the defender's perspective through various learning scenarios. Extensive experiments corroborate the insights and the effectiveness of defense mechanisms developed under the proposed framework.

Title: A Cautionary Tale About "Neutrally" Informative AI Tools Ahead of the 2025 Federal Elections in Germany

Authors: Ina Dormuth, Sven Franke, Marlies Hafer, Tim Katzke, Alexander Marx, Emmanuel Müller, Daniel Neider, Markus Pauly, Jérôme Rutinowski
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15568
Pdf URL: https://arxiv.org/pdf/2502.15568
Copy Paste: [[2502.15568]] A Cautionary Tale About "Neutrally" Informative AI Tools Ahead of the 2025 Federal Elections in Germany(https://arxiv.org/abs/2502.15568)
Keywords: large language model
Abstract: In this study, we examine the reliability of AI-based Voting Advice Applications (VAAs) and large language models (LLMs) in providing objective political information. Our analysis is based upon a comparison with party responses to 38 statements of the Wahl-O-Mat, a well-established German online tool that helps inform voters by comparing their views with political party positions. For the LLMs, we identify significant biases. They exhibit a strong alignment (over 75% on average) with left-wing parties and a substantially lower alignment with center-right (below 50%) and right-wing parties (around 30%). Furthermore, for the VAAs, intended to objectively inform voters, we found substantial deviations from the parties' stated positions in Wahl-O-Mat: While one VAA deviated in 25% of cases, another VAA showed deviations in more than 50% of cases. For the latter, we even observed that simple prompt injections led to severe hallucinations, including false claims such as non-existent connections between political parties and right-wing extremist ties.

Title: DReSD: Dense Retrieval for Speculative Decoding

Authors: Milan Gritta, Huiyin Xue, Gerasimos Lampouras
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15572
Pdf URL: https://arxiv.org/pdf/2502.15572
Copy Paste: [[2502.15572]] DReSD: Dense Retrieval for Speculative Decoding(https://arxiv.org/abs/2502.15572)
Keywords: large language model
Abstract: Speculative decoding (SD) accelerates Large Language Model (LLM) generation by using an efficient draft model to propose the next few tokens, which are verified by the LLM in a single forward call, reducing latency while preserving its outputs. We focus on retrieval-based SD where the draft model retrieves the next tokens from a non-parametric datastore. Sparse retrieval (REST), which operates on the surface form of strings, is currently the dominant paradigm due to its simplicity and scalability. However, its effectiveness is limited due to the usage of short contexts and exact string matching. Instead, we introduce Dense Retrieval for Speculative Decoding (DReSD), a novel framework that uses approximate nearest neighbour search with contextualised token embeddings to retrieve the most semantically relevant token sequences for SD. Extensive experiments show that DReSD achieves (on average) 87% higher acceptance rates, 65% longer accepted tokens and 19% faster generation speeds compared to sparse retrieval (REST).
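
The propose-then-verify loop described in the first sentence can be sketched in a few lines. This is a generic greedy variant, not DReSD itself: the toy target model, the token values, and the function names are assumptions, and the retrieval step that would produce the draft is omitted. A real system scores all draft positions in one forward pass of the LLM; the Python loop below only simulates that check.

```python
from typing import Callable, List

def verify_draft(prefix: List[int],
                 draft: List[int],
                 target_next: Callable[[List[int]], int]) -> List[int]:
    """Return the tokens emitted in one speculative step (greedy variant).

    Draft tokens are accepted while they match the target model's own
    greedy choice; the first mismatch is replaced by the target's token.
    If every draft token matches, the target contributes one extra token.
    """
    context = list(prefix)
    emitted: List[int] = []
    for token in draft:
        expected = target_next(context)
        if token != expected:
            emitted.append(expected)   # correction from the target model
            return emitted
        emitted.append(token)          # draft token accepted
        context.append(token)
    emitted.append(target_next(context))  # bonus token after full acceptance
    return emitted

def toy_target(context: List[int]) -> int:
    """Hypothetical stand-in for the LLM: always predicts last token + 1 (mod 10)."""
    return (context[-1] + 1) % 10

partly_accepted = verify_draft([1, 2], [3, 4, 9], toy_target)
fully_accepted = verify_draft([1, 2], [3, 4, 5], toy_target)
print(partly_accepted, fully_accepted)
```

The output always matches what the target model would have generated on its own, which is why the abstract can say latency drops while outputs are preserved; the gain depends on how many draft tokens are accepted per step.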

Title: Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders

Authors: Xuansheng Wu, Jiayi Yuan, Wenlin Yao, Xiaoming Zhai, Ninghao Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15576
Pdf URL: https://arxiv.org/pdf/2502.15576
Copy Paste: [[2502.15576]] Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders(https://arxiv.org/abs/2502.15576)
Keywords: attack, large language model
Abstract: Large language models (LLMs) excel at handling human queries, but they can occasionally generate flawed or unexpected responses. Understanding their internal states is crucial for understanding their successes, diagnosing their failures, and refining their capabilities. Although sparse autoencoders (SAEs) have shown promise for interpreting LLM internal representations, limited research has explored how to better explain SAE features, i.e., understanding the semantic meaning of features learned by SAE. Our theoretical analysis reveals that existing explanation methods suffer from the frequency bias issue, where they emphasize linguistic patterns over semantic concepts, while the latter is more critical to steer LLM behaviors. To address this, we propose using a fixed vocabulary set for feature interpretations and designing a mutual information-based objective, aiming to better capture the semantic meaning behind these features. We further propose two runtime steering strategies that adjust the learned feature activations based on their corresponding explanations. Empirical results show that, compared to baselines, our method provides more discourse-level explanations and effectively steers LLM behaviors to defend against jailbreak attacks. These findings highlight the value of explanations for steering LLM behaviors in downstream applications. We will release our code and data once accepted.

Title: FLARE: Fault Attack Leveraging Address Reconfiguration Exploits in Multi-Tenant FPGAs

Authors: Jayeeta Chaudhuri, Hassan Nassar, Dennis R.E. Gnad, Jorg Henkel, Mehdi B. Tahoori, Krishnendu Chakrabarty
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2502.15578
Pdf URL: https://arxiv.org/pdf/2502.15578
Copy Paste: [[2502.15578]] FLARE: Fault Attack Leveraging Address Reconfiguration Exploits in Multi-Tenant FPGAs(https://arxiv.org/abs/2502.15578)
Keywords: security, attack, steal
Abstract: Modern FPGAs are increasingly supporting multi-tenancy to enable dynamic reconfiguration of user modules. While multi-tenant FPGAs improve utilization and flexibility, this paradigm introduces critical security threats. In this paper, we present FLARE, a fault attack that exploits vulnerabilities in the partial reconfiguration process, specifically while a user bitstream is being uploaded to the FPGA by a reconfiguration manager. Unlike traditional fault attacks that operate during module runtime, FLARE injects faults in the bitstream during its reconfiguration, altering the configuration address and redirecting it to unintended partial reconfigurable regions (PRRs). This enables the overwriting of pre-configured co-tenant modules, disrupting their functionality. FLARE leverages power-wasters that activate briefly during the reconfiguration process, making the attack stealthy and more challenging to detect with existing countermeasures. Experimental results on a Xilinx Pynq FPGA demonstrate the effectiveness of FLARE in compromising multiple user bitstreams during the reconfiguration process.

Title: Chats-Grid: An Iterative Retrieval Q&A Optimization Scheme Leveraging Large Model and Retrieval Enhancement Generation in smart grid

Authors: Yunfeng Li, Jiqun Zhang, Guofu Liao, Xue Shi, Junhong Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15583
Pdf URL: https://arxiv.org/pdf/2502.15583
Copy Paste: [[2502.15583]] Chats-Grid: An Iterative Retrieval Q&A Optimization Scheme Leveraging Large Model and Retrieval Enhancement Generation in smart grid(https://arxiv.org/abs/2502.15583)
Keywords: large language model
Abstract: With rapid advancements in artificial intelligence, question-answering (Q&A) systems have become essential in intelligent search engines, virtual assistants, and customer service platforms. However, in dynamic domains like smart grids, conventional retrieval-augmented generation (RAG) Q&A systems face challenges such as inadequate retrieval quality, irrelevant responses, and inefficiencies in handling large-scale, real-time data streams. This paper proposes an optimized iterative retrieval-based Q&A framework called Chats-Grid tailored for smart grid environments. In the pre-retrieval phase, Chats-Grid's advanced query expansion ensures comprehensive coverage of diverse data sources, including sensor readings, meter records, and control system parameters. During retrieval, Chats-Grid combines Best Matching 25 (BM25) sparse retrieval and BAAI General Embedding (BGE) dense retrieval to process vast, heterogeneous datasets effectively. Post-retrieval, a fine-tuned large language model uses prompt engineering to assess relevance, filter irrelevant results, and reorder documents based on contextual accuracy. The model further generates precise, context-aware answers, adhering to quality criteria and employing a self-checking mechanism for enhanced reliability. Experimental results demonstrate Chats-Grid's superiority over state-of-the-art methods in fidelity, contextual recall, relevance, and accuracy by 2.37%, 2.19%, and 3.58% respectively. This framework advances smart grid management by improving decision-making and user interactions, fostering resilient and adaptive smart grid infrastructures.
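
The abstract says sparse (BM25) and dense (BGE) retrieval are combined but does not say how. One common way to merge two ranked lists is reciprocal rank fusion, sketched below purely as an assumption; it is not claimed to be the fusion rule used by Chats-Grid. The function name, the constant `k = 60`, and the document identifiers are illustrative choices made here.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> List[str]:
    """Merge several ranked lists of document ids into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1); documents are returned by descending total score,
    with ties broken alphabetically so the output is deterministic.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    # Hypothetical result lists from a sparse and a dense retriever.
    sparse_hits = ["meter_log", "sensor_a", "control_doc"]
    dense_hits = ["sensor_a", "faq_entry", "meter_log"]
    print(reciprocal_rank_fusion([sparse_hits, dense_hits]))
```

Rank-based fusion avoids having to calibrate BM25 scores against embedding similarities, which live on different scales; a weighted score sum would be an equally plausible alternative.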

Title: LightThinker: Thinking Step-by-Step Compression

Authors: Jintian Zhang, Yuqi Zhu, Mengshu Sun, Yujie Luo, Shuofei Qiao, Lun Du, Da Zheng, Huajun Chen, Ningyu Zhang
Subjects: cs.CL, cs.AI, cs.IR, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2502.15589
Pdf URL: https://arxiv.org/pdf/2502.15589
Copy Paste: [[2502.15589]] LightThinker: Thinking Step-by-Step Compression(https://arxiv.org/abs/2502.15589)
Keywords: large language model
Abstract: Large language models (LLMs) have shown remarkable performance in complex reasoning tasks, but their efficiency is hindered by the substantial memory and computational costs associated with generating lengthy tokens. In this paper, we propose LightThinker, a novel method that enables LLMs to dynamically compress intermediate thoughts during reasoning. Inspired by human cognitive processes, LightThinker compresses verbose thought steps into compact representations and discards the original reasoning chains, thereby significantly reducing the number of tokens stored in the context window. This is achieved by training the model on when and how to perform compression through data construction, mapping hidden states to condensed gist tokens, and creating specialized attention masks. Additionally, we introduce the Dependency (Dep) metric to quantify the degree of compression by measuring the reliance on historical tokens during generation. Extensive experiments on four datasets and two models show that LightThinker reduces peak memory usage and inference time, while maintaining competitive accuracy. Our work provides a new direction for improving the efficiency of LLMs in complex reasoning tasks without sacrificing performance. Code will be released at this https URL.

Title: Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning

Authors: Wenhao Zhu, Pinzhen Chen, Hanxu Hu, Shujian Huang, Fei Yuan, Jiajun Chen, Alexandra Birch
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15592
Pdf URL: https://arxiv.org/pdf/2502.15592
Copy Paste: [[2502.15592]] Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning(https://arxiv.org/abs/2502.15592)
Keywords: large language model
Abstract: Long-context modelling for large language models (LLMs) has been a key area of recent research because many real world use cases require reasoning over longer inputs such as documents. The focus of research into modelling long context has been on how to model position and there has been little investigation into other important aspects of language modelling such as instruction tuning. Long context training examples are challenging and expensive to create and use. In this paper, we investigate how to design instruction data for the post-training phase of a long context pre-trained model: how much and what type of context is needed for optimal and efficient post-training. Our controlled study reveals that models instruction-tuned on short contexts can effectively generalize to longer ones, while also identifying other critical factors such as instruction difficulty and context composition. Based on these findings, we propose context synthesis, a novel data synthesis framework that leverages off-the-shelf LLMs to generate extended background contexts for high-quality instruction-answer pairs. Experiment results on the document-level benchmark (LongBench) demonstrate that our proposed approach outperforms previous instruction synthesis approaches and comes close to the performance of human-annotated long-context instruction data. The project will be available at: this https URL.

Title: SafeInt: Shielding Large Language Models from Jailbreak Attacks via Safety-Aware Representation Intervention

Authors: Jiaqi Wu, Chen Chen, Chunyan Hou, Xiaojie Yuan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15594
Pdf URL: https://arxiv.org/pdf/2502.15594
Copy Paste: [[2502.15594]] SafeInt: Shielding Large Language Models from Jailbreak Attacks via Safety-Aware Representation Intervention(https://arxiv.org/abs/2502.15594)
Keywords: defense, attack, large language model
Abstract: With the widespread real-world deployment of large language models (LLMs), ensuring their behavior complies with safety standards has become crucial. Jailbreak attacks exploit vulnerabilities in LLMs to induce undesirable behavior, posing a significant threat to LLM safety. Previous defenses often fail to achieve both effectiveness and efficiency simultaneously. Defenses from a representation perspective offer new insights, but existing interventions cannot dynamically adjust representations based on the harmfulness of the queries. To address this limitation while ensuring both effectiveness and efficiency, we propose SafeIntervention (SafeInt), a novel defense method that shields LLMs from jailbreak attacks through safety-aware representation intervention. SafeInt is built on our analysis of the representations of jailbreak samples. It adjusts representation distributions of jailbreak samples through intervention to align them with the representations of unsafe samples while minimizing unnecessary perturbations to jailbreak-irrelevant representations. We conduct comprehensive experiments covering six jailbreak attacks, two jailbreak datasets, and two utility benchmarks. Experimental results demonstrate that SafeInt outperforms all baselines in defending LLMs against jailbreak attacks while largely maintaining utility. Additionally, we evaluate SafeInt against adaptive attacks and verify its effectiveness in mitigating real-time attacks.

Title: Robust Bias Detection in MLMs and its Application to Human Trait Ratings

Authors: Ingroj Shrestha, Louis Tay, Padmini Srinivasan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15600
Pdf URL: https://arxiv.org/pdf/2502.15600
Copy Paste: [[2502.15600]] Robust Bias Detection in MLMs and its Application to Human Trait Ratings(https://arxiv.org/abs/2502.15600)
Keywords: robust
Abstract: There has been significant prior work using templates to study bias against demographic attributes in MLMs. However, these have limitations: they overlook random variability of templates and target concepts analyzed, assume equality amongst templates, and overlook bias quantification. Addressing these, we propose a systematic statistical approach to assess bias in MLMs, using mixed models to account for random effects, pseudo-perplexity weights for sentences derived from templates, and statistical effect sizes to quantify bias. Replicating prior studies, we match on bias scores in magnitude and direction with small to medium effect sizes. Next, we explore the novel problem of gender bias in the context of $\textit{personality}$ and $\textit{character}$ traits, across seven MLMs (base and large). We find that MLMs vary; ALBERT is unbiased for binary gender but the most biased for non-binary $\textit{neo}$, while RoBERTa-large is the most biased for binary gender but shows small to no bias for $\textit{neo}$. There is some alignment of MLM bias and findings in psychology (human perspective) - in $\textit{agreeableness}$ with RoBERTa-large and $\textit{emotional stability}$ with BERT-large. There is general agreement for the remaining 3 personality dimensions: both sides observe at most small differences across gender. For character traits, human studies on gender bias are limited, thus comparisons are not feasible.

Title: WorldCraft: Photo-Realistic 3D World Creation and Customization via LLM Agents

Authors: Xinhang Liu, Chi-Keung Tang, Yu-Wing Tai
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2502.15601
Pdf URL: https://arxiv.org/pdf/2502.15601
Copy Paste: [[2502.15601]] WorldCraft: Photo-Realistic 3D World Creation and Customization via LLM Agents(https://arxiv.org/abs/2502.15601)
Keywords: large language model
Abstract: Constructing photorealistic virtual worlds has applications across various fields, but it often requires the extensive labor of highly trained professionals to operate conventional 3D modeling software. To democratize this process, we introduce WorldCraft, a system where large language model (LLM) agents leverage procedural generation to create indoor and outdoor scenes populated with objects, allowing users to control individual object attributes and the scene layout using intuitive natural language commands. In our framework, a coordinator agent manages the overall process and works with two specialized LLM agents to complete the scene creation: ForgeIt, which integrates an ever-growing manual through auto-verification to enable precise customization of individual objects, and ArrangeIt, which formulates hierarchical optimization problems to achieve a layout that balances ergonomic and aesthetic considerations. Additionally, our pipeline incorporates a trajectory control agent, allowing users to animate the scene and operate the camera through natural language interactions. Our system is also compatible with off-the-shelf deep 3D generators to enrich scene assets. Through evaluations and comparisons with state-of-the-art methods, we demonstrate the versatility of WorldCraft, ranging from single-object customization to intricate, large-scale interior and exterior scene designs. This system empowers non-professionals to bring their creative visions to life.

Title: Do Multilingual LLMs Think In English?

Authors: Lisa Schut, Yarin Gal, Sebastian Farquhar
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.15603
Pdf URL: https://arxiv.org/pdf/2502.15603
Copy Paste: [[2502.15603]] Do Multilingual LLMs Think In English?(https://arxiv.org/abs/2502.15603)
Keywords: large language model
Abstract: Large language models (LLMs) have multilingual capabilities and can solve tasks across various languages. However, we show that current LLMs make key decisions in a representation space closest to English, regardless of their input and output languages. Exploring the internal representations with a logit lens for sentences in French, German, Dutch, and Mandarin, we show that the LLM first emits representations close to English for semantically-loaded words before translating them into the target language. We further show that activation steering in these LLMs is more effective when the steering vectors are computed in English rather than in the language of the inputs and outputs. This suggests that multilingual LLMs perform key reasoning steps in a representation that is heavily shaped by English in a way that is not transparent to system users.
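
The logit lens mentioned in this abstract decodes an intermediate hidden state by projecting it through the model's output (unembedding) matrix and reading off the most likely token. The sketch below shows that projection on made-up numbers. The function name, the two-dimensional vectors, and the three-word vocabulary are hypothetical; they are not data or results from the paper, and the final layer normalisation that is usually applied before unembedding is omitted for brevity.

```python
from typing import List, Sequence


def logit_lens(
    hidden_by_layer: Sequence[Sequence[float]],
    unembedding: Sequence[Sequence[float]],
    vocab: Sequence[str],
) -> List[str]:
    """Return the top-scoring vocabulary item for each layer's hidden state.

    `unembedding` holds one weight row per vocabulary item, so the logit of
    item i is the dot product of the hidden state with row i.
    """
    decoded: List[str] = []
    for hidden in hidden_by_layer:
        logits = [
            sum(h * w for h, w in zip(hidden, row)) for row in unembedding
        ]
        best = max(range(len(logits)), key=logits.__getitem__)
        decoded.append(vocab[best])
    return decoded


if __name__ == "__main__":
    # Toy setup: the middle-layer state points towards "cat", the last-layer
    # state towards "chat". All values are invented for illustration.
    vocab = ["cat", "chat", "Katze"]
    unembedding = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
    hidden_by_layer = [[0.9, 0.1], [0.2, 0.8]]
    print(logit_lens(hidden_by_layer, unembedding, vocab))
```

Reading the decoded token layer by layer is what lets one ask in which language an intermediate representation is closest to a word before the final output is produced.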

Title: On the Robustness of Transformers against Context Hijacking for Linear Classification

Authors: Tianle Li, Chenyang Zhang, Xingwu Chen, Yuan Cao, Difan Zou
Subjects: cs.CL, cs.AI, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2502.15609
Pdf URL: https://arxiv.org/pdf/2502.15609
Copy Paste: [[2502.15609]] On the Robustness of Transformers against Context Hijacking for Linear Classification(https://arxiv.org/abs/2502.15609)
Keywords: robust, transformer, large language model
Abstract: Transformer-based Large Language Models (LLMs) have demonstrated powerful in-context learning capabilities. However, their predictions can be disrupted by factually correct context, a phenomenon known as context hijacking, revealing a significant robustness issue. To understand this phenomenon theoretically, we explore an in-context linear classification problem based on recent advances in linear transformers. In our setup, context tokens are designed as factually correct query-answer pairs, where the queries are similar to the final query but have opposite labels. Then, we develop a general theoretical analysis on the robustness of the linear transformers, which is formulated as a function of the model depth, training context lengths, and number of hijacking context tokens. A key finding is that a well-trained deeper transformer can achieve higher robustness, which aligns with empirical observations. We show that this improvement arises because deeper layers enable more fine-grained optimization steps, effectively mitigating interference from context hijacking. This is also well supported by our numerical experiments. Our findings provide theoretical insights into the benefits of deeper architectures and contribute to enhancing the understanding of transformer architectures.

Title: PDeepPP:A Deep learning framework with Pretrained Protein language for peptide classification

Authors: Jixiu Zhai, Tianchi Lu, Haitian Zhong, Ziyang Xu, Yuhuan Liu, Xueying Wang, Dan Huang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15610
Pdf URL: https://arxiv.org/pdf/2502.15610
Copy Paste: [[2502.15610]] PDeepPP:A Deep learning framework with Pretrained Protein language for peptide classification(https://arxiv.org/abs/2502.15610)
Keywords: robust, extraction, transformer
Abstract: Protein post-translational modifications (PTMs) and bioactive peptides (BPs) play critical roles in various biological processes and have significant therapeutic potential. However, identifying PTM sites and bioactive peptides through experimental methods is often labor-intensive, costly, and time-consuming. As a result, computational tools, particularly those based on deep learning, have become effective solutions for predicting PTM sites and peptide bioactivity. Despite progress in this field, existing methods still struggle with the complexity of protein sequences and the challenge of requiring high-quality predictions across diverse datasets. To address these issues, we propose a deep learning framework that integrates pretrained protein language models with a neural network combining transformer and CNN for peptide classification. By leveraging the ability of pretrained models to capture complex relationships within protein sequences, combined with the predictive power of parallel networks, our approach improves feature extraction while enhancing prediction accuracy. This framework was applied to multiple tasks involving PTM site and bioactive peptide prediction, utilizing large-scale datasets to enhance the model's robustness. In the comparison across 33 tasks, the model achieved state-of-the-art (SOTA) performance in 25 of them, surpassing existing methods and demonstrating its versatility across different datasets. Our results suggest that this approach provides a scalable and effective solution for large-scale peptide discovery and PTM analysis, paving the way for more efficient peptide classification and functional annotation.

Title: LaTIM: Measuring Latent Token-to-Token Interactions in Mamba Models

Authors: Hugo Pitorro, Marcos Treviso
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15612
Pdf URL: https://arxiv.org/pdf/2502.15612
Copy Paste: [[2502.15612]] LaTIM: Measuring Latent Token-to-Token Interactions in Mamba Models(https://arxiv.org/abs/2502.15612)
Keywords: interpretability, transformer
Abstract: State space models (SSMs), such as Mamba, have emerged as an efficient alternative to transformers for long-context sequence modeling. However, despite their growing adoption, SSMs lack the interpretability tools that have been crucial for understanding and improving attention-based architectures. While recent efforts provide insights into Mamba's internal mechanisms, they do not explicitly decompose token-wise contributions, leaving gaps in understanding how Mamba selectively processes sequences across layers. In this work, we introduce LaTIM, a novel token-level decomposition method for both Mamba-1 and Mamba-2 that enables fine-grained interpretability. We extensively evaluate our method across diverse tasks, including machine translation, copying, and retrieval-based generation, demonstrating its effectiveness in revealing Mamba's token-to-token interaction patterns.

Title: Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing

Authors: Qi Le, Enmao Diao, Ziyan Wang, Xinran Wang, Jie Ding, Li Yang, Ali Anwar
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.15618
Pdf URL: https://arxiv.org/pdf/2502.15618
Copy Paste: [[2502.15618]] Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing(https://arxiv.org/abs/2502.15618)
Keywords: large language model
Abstract: We introduce Probe Pruning (PP), a novel framework for online, dynamic, structured pruning of Large Language Models (LLMs) applied in a batch-wise manner. PP leverages the insight that not all samples and tokens contribute equally to the model's output, and probing a small portion of each batch effectively identifies crucial weights, enabling tailored dynamic pruning for different batches. It comprises three main stages: probing, history-informed pruning, and full inference. In the probing stage, PP selects a small yet crucial set of hidden states, based on residual importance, to run a few model layers ahead. During the history-informed pruning stage, PP strategically integrates the probing states with historical states. Subsequently, it structurally prunes weights based on the integrated states and the PP importance score, a metric developed specifically to assess the importance of each weight channel in maintaining performance. In the final stage, full inference is conducted on the remaining weights. A major advantage of PP is its compatibility with existing models, as it operates without requiring additional neural network modules or fine-tuning. Comprehensive evaluations of PP on LLaMA-2/3 and OPT models reveal that even minimal probing, using just 1.5% of FLOPs, can substantially enhance the efficiency of structured pruning of LLMs. For instance, when evaluated on LLaMA-2-7B with WikiText2, PP achieves a 2.56 times lower ratio of performance degradation per unit of runtime reduction compared to the state-of-the-art method at a 40% pruning ratio. Our code is available at this https URL.

Title: Extraction multi-étiquettes de relations en utilisant des couches de Transformer

Authors: Ngoc Luyen Le, Gildas Tagny Ngompé
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15619
Pdf URL: https://arxiv.org/pdf/2502.15619
Copy Paste: [[2502.15619]] Extraction multi-étiquettes de relations en utilisant des couches de Transformer(https://arxiv.org/abs/2502.15619)
Keywords: extraction, transformer
Abstract: In this article, we present the BTransformer18 model, a deep learning architecture designed for multi-label relation extraction in French texts. Our approach combines the contextual representation capabilities of pre-trained language models from the BERT family - such as BERT, RoBERTa, and their French counterparts CamemBERT and FlauBERT - with the power of Transformer encoders to capture long-term dependencies between tokens. Experiments conducted on the dataset from the TextMine'25 challenge show that our model achieves superior performance, particularly when using CamemBERT-Large, with a macro F1 score of 0.654, surpassing the results obtained with FlauBERT-Large. These results demonstrate the effectiveness of our approach for the automatic extraction of complex relations in intelligence reports.

Title: Mildly Accurate Computationally Differentially Private Inner Product Protocols Imply Oblivious Transfer

Authors: Iftach Haitner, Noam Mazor, Jad Silbak, Eliad Tsfadia, Chao Yan
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2502.15629
Pdf URL: https://arxiv.org/pdf/2502.15629
Copy Paste: [[2502.15629]] Mildly Accurate Computationally Differentially Private Inner Product Protocols Imply Oblivious Transfer(https://arxiv.org/abs/2502.15629)
Keywords: secure, privacy, protect
Abstract: In distributed differential privacy, multiple parties collaboratively analyze their combined data while protecting the privacy of each party's data from the eyes of the others. Interestingly, for certain fundamental two-party functions like inner product and Hamming distance, the accuracy of distributed solutions significantly lags behind what can be achieved in the centralized model. However, under computational differential privacy, these limitations can be circumvented using oblivious transfer via secure multi-party computation. Yet, no results show that oblivious transfer is indeed necessary for accurately estimating a non-Boolean functionality. In particular, for the inner-product functionality, it was previously unknown whether oblivious transfer is necessary even for the best possible constant additive error. In this work, we prove that any computationally differentially private protocol that estimates the inner product over $\{-1,1\}^n \times \{-1,1\}^n$ up to an additive error of $O(n^{1/6})$, can be used to construct oblivious transfer. In particular, our result implies that protocols with sub-polynomial accuracy are equivalent to oblivious transfer. In this accuracy regime, our result improves upon Haitner, Mazor, Silbak, and Tsfadia [STOC '22] who showed that a key-agreement protocol is necessary.

Title: The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer

Authors: Marthe Ballon, Andres Algaba, Vincent Ginis
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15631
Pdf URL: https://arxiv.org/pdf/2502.15631
Copy Paste: [[2502.15631]] The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer(https://arxiv.org/abs/2502.15631)
Keywords: large language model
Abstract: Large language models have demonstrated remarkable progress in mathematical reasoning, leveraging chain-of-thought and test-time compute scaling. However, many open questions remain regarding the interplay between reasoning token usage and accuracy gains. In particular, when comparing models across generations, it is unclear whether improved performance results from longer reasoning chains or more efficient reasoning. We systematically analyze chain-of-thought length across o1-mini and o3-mini variants on the Omni-MATH benchmark, finding that o3-mini (m) achieves superior accuracy without requiring longer reasoning chains than o1-mini. Moreover, we show that accuracy generally declines as reasoning chains grow across all models and compute settings, even when controlling for difficulty of the questions. This accuracy drop is significantly smaller in more proficient models, suggesting that new generations of reasoning models use test-time compute more effectively. Finally, we highlight that while o3-mini (h) achieves a marginal accuracy gain over o3-mini (m), it does so by allocating substantially more reasoning tokens across all problems, even the ones that o3-mini (m) can already solve. These findings provide new insights into the relationship between model capability and reasoning length, with implications for efficiency, scaling, and evaluation methodologies.

Title: Continual Person Identification using Footstep-Induced Floor Vibrations on Heterogeneous Floor Structures

Authors: Yiwen Dong, Hae Young Noh
Subjects: cs.CV, eess.SP, physics.app-ph
Abstract URL: https://arxiv.org/abs/2502.15632
Pdf URL: https://arxiv.org/pdf/2502.15632
Copy Paste: [[2502.15632]] Continual Person Identification using Footstep-Induced Floor Vibrations on Heterogeneous Floor Structures(https://arxiv.org/abs/2502.15632)
Keywords: privacy
Abstract: Person identification is important for smart buildings to provide personalized services such as health monitoring, activity tracking, and personnel management. However, previous person identification relies on pre-collected data from everyone, which is impractical in many buildings and public facilities in which visitors are typically expected. This calls for a continual person identification system that gradually learns people's identities on the fly. Existing studies use cameras to achieve this goal, but they require direct line-of-sight and also have raised privacy concerns in public. Other modalities such as wearables and pressure mats are limited by the requirement of device-carrying or dense deployment. Thus, prior studies introduced footstep-induced structural vibration sensing, which is non-intrusive and perceived as more privacy-friendly. However, this approach has a significant challenge: the high variability of vibration data due to structural heterogeneity and human gait variations, which makes online person identification algorithms perform poorly. In this paper, we characterize the variability in footstep-induced structural vibration data for accurate online person identification. To achieve this, we quantify and decompose different sources of variability and then design a feature transformation function to reduce the variability within each person's data to make different people's data more separable. We evaluate our approach through field experiments with 20 people. The results show a 70% variability reduction and a 90% accuracy for online person identification.

Title: RGB-Only Gaussian Splatting SLAM for Unbounded Outdoor Scenes

Authors: Sicheng Yu, Chong Cheng, Yifan Zhou, Xiaojun Yang, Hao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.15633
Pdf URL: https://arxiv.org/pdf/2502.15633
Copy Paste: [[2502.15633]] RGB-Only Gaussian Splatting SLAM for Unbounded Outdoor Scenes(https://arxiv.org/abs/2502.15633)
Keywords: robust
Abstract: 3D Gaussian Splatting (3DGS) has become a popular solution in SLAM, as it can produce high-fidelity novel views. However, previous GS-based methods primarily target indoor scenes and rely on RGB-D sensors or pre-trained depth estimation models, hence underperforming in outdoor scenarios. To address this issue, we propose an RGB-only Gaussian splatting SLAM method for unbounded outdoor scenes, OpenGS-SLAM. Technically, we first employ a pointmap regression network to generate consistent pointmaps between frames for pose estimation. Compared to commonly used depth maps, pointmaps include spatial relationships and scene geometry across multiple views, enabling robust camera pose estimation. Then, we propose integrating the estimated camera poses with 3DGS rendering as an end-to-end differentiable pipeline. Our method achieves simultaneous optimization of camera poses and 3DGS scene parameters, significantly enhancing system tracking accuracy. Specifically, we also design an adaptive scale mapper for the pointmap regression network, which provides more accurate pointmap mapping to the 3DGS map representation. Our experiments on the Waymo dataset demonstrate that OpenGS-SLAM reduces tracking error to 9.8% of previous 3DGS methods, and achieves state-of-the-art results in novel view synthesis. Project Page: this https URL

Title: Mantis: Lightweight Calibrated Foundation Model for User-Friendly Time Series Classification

Authors: Vasilii Feofanov, Songkang Wen, Marius Alonso, Romain Ilbert, Hongbo Guo, Malik Tiomoko, Lujia Pan, Jianfeng Zhang, Ievgen Redko
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2502.15637
Pdf URL: https://arxiv.org/pdf/2502.15637
Copy Paste: [[2502.15637]] Mantis: Lightweight Calibrated Foundation Model for User-Friendly Time Series Classification(https://arxiv.org/abs/2502.15637)
Keywords: transformer
Abstract: In recent years, there has been increasing interest in developing foundation models for time series data that can generalize across diverse downstream tasks. While numerous forecasting-oriented foundation models have been introduced, there is a notable scarcity of models tailored for time series classification. To address this gap, we present Mantis, a new open-source foundation model for time series classification based on the Vision Transformer (ViT) architecture that has been pre-trained using a contrastive learning approach. Our experimental results show that Mantis outperforms existing foundation models both when the backbone is frozen and when fine-tuned, while achieving the lowest calibration error. In addition, we propose several adapters to handle the multivariate setting, reducing memory requirements and modeling channel interdependence.

Title: Steering into New Embedding Spaces: Analyzing Cross-Lingual Alignment Induced by Model Interventions in Multilingual Language Models

Authors: Anirudh Sundar, Sinead Williamson, Katherine Metcalf, Barry-John Theobald, Skyler Seto, Masha Fedzechkina
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.15639
Pdf URL: https://arxiv.org/pdf/2502.15639
Copy Paste: [[2502.15639]] Steering into New Embedding Spaces: Analyzing Cross-Lingual Alignment Induced by Model Interventions in Multilingual Language Models(https://arxiv.org/abs/2502.15639)
Keywords: large language model
Abstract: Aligned representations across languages are a desired property in multilingual large language models (mLLMs), as alignment can improve performance in cross-lingual tasks. Typically, alignment requires fine-tuning a model, which is computationally expensive, and sizable language data, which often may not be available. A data-efficient alternative to fine-tuning is model interventions -- a method for manipulating model activations to steer generation into the desired direction. We analyze the effect of a popular intervention (finding experts) on the alignment of cross-lingual representations in mLLMs. We identify the neurons to manipulate for a given language and introspect the embedding space of mLLMs pre- and post-manipulation. We show that modifying the mLLM's activations changes its embedding space such that cross-lingual alignment is enhanced. Further, we show that the changes to the embedding space translate into improved downstream performance on retrieval tasks, with up to 2x improvements in top-1 accuracy on cross-lingual retrieval.

Title: AutoTandemML: Active Learning Enhanced Tandem Neural Networks for Inverse Design Problems

Authors: Luka Grbcic, Juliane Müller, Wibe Albert de Jong
Subjects: cs.LG, cs.AI, cs.CE, cs.NE
Abstract URL: https://arxiv.org/abs/2502.15643
Pdf URL: https://arxiv.org/pdf/2502.15643
Copy Paste: [[2502.15643]] AutoTandemML: Active Learning Enhanced Tandem Neural Networks for Inverse Design Problems(https://arxiv.org/abs/2502.15643)
Keywords: diffusion
Abstract: Inverse design in science and engineering involves determining optimal design parameters that achieve desired performance outcomes, a process often hindered by the complexity and high dimensionality of design spaces, leading to significant computational costs. To tackle this challenge, we propose a novel hybrid approach that combines active learning with Tandem Neural Networks to enhance the efficiency and effectiveness of solving inverse design problems. Active learning selectively samples the most informative data points, reducing the required dataset size without compromising accuracy. We investigate this approach using three benchmark problems: airfoil inverse design, photonic surface inverse design, and scalar boundary condition reconstruction in diffusion partial differential equations. We demonstrate that integrating active learning with Tandem Neural Networks outperforms standard approaches across the benchmark suite, achieving better accuracy with fewer training samples.

Title: Predicting gene essentiality and drug response from perturbation screens in preclinical cancer models with LEAP: Layered Ensemble of Autoencoders and Predictors

Authors: Barbara Bodinier, Gaetan Dissez, Linus Bleistein, Antonin Dauvin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.15646
Pdf URL: https://arxiv.org/pdf/2502.15646
Copy Paste: [[2502.15646]] Predicting gene essentiality and drug response from perturbation screens in preclinical cancer models with LEAP: Layered Ensemble of Autoencoders and Predictors(https://arxiv.org/abs/2502.15646)
Keywords: robust
Abstract: Preclinical perturbation screens, where the effects of genetic, chemical, or environmental perturbations are systematically tested on disease models, hold significant promise for machine learning-enhanced drug discovery due to their scale and causal nature. Predictive models can infer perturbation responses for previously untested disease models based on molecular profiles. These in silico labels can expand databases and guide experimental prioritization. However, modelling perturbation-specific effects and generating robust prediction performances across diverse biological contexts remain elusive. We introduce LEAP (Layered Ensemble of Autoencoders and Predictors), a novel ensemble framework to improve robustness and generalization. LEAP leverages multiple DAMAE (Data Augmented Masked Autoencoder) representations and LASSO regressors. By combining diverse gene expression representation models learned from different random initializations, LEAP consistently outperforms state-of-the-art approaches in predicting gene essentiality or drug responses in unseen cell lines, tissues and disease models. Notably, our results show that ensembling representation models, rather than prediction models alone, yields superior predictive performance. Beyond its performance gains, LEAP is computationally efficient, requires minimal hyperparameter tuning and can therefore be readily incorporated into drug discovery pipelines to prioritize promising targets and support biomarker-driven stratification. The code and datasets used in this work are made publicly available.

Title: Blockchain-based Trust Management in Security Credential Management System for Vehicular Network

Authors: SangHyun Byun, Arijet Sarker, Sang-Yoon Chang, Jugal Kalita
Subjects: cs.CR, cs.DC, cs.NI
Abstract URL: https://arxiv.org/abs/2502.15653
Pdf URL: https://arxiv.org/pdf/2502.15653
Copy Paste: [[2502.15653]] Blockchain-based Trust Management in Security Credential Management System for Vehicular Network(https://arxiv.org/abs/2502.15653)
Keywords: security, privacy, protect
Abstract: Cellular networking is advancing as a wireless technology to support diverse applications in vehicular communication, enabling vehicles to interact with various applications to enhance the driving experience, even when managed by different authorities. The Security Credential Management System (SCMS) is the Public Key Infrastructure (PKI) for vehicular networking and the state-of-the-art distributed PKI; it uses multiple authorities to protect privacy-preserving vehicular networking against an honest-but-curious authority and to decentralize trust management. We build a Blockchain-Based Trust Management (BBTM) to provide even greater decentralization and security. Specifically, BBTM uses the blockchain to 1) replace the existing Policy Generator (PG), 2) manage the policy of each authority in SCMS, 3) aggregate the Global Certificate Chain File (GCCF), and 4) provide greater accountability and transparency on the aforementioned functionalities. We implement BBTM on Hyperledger Fabric using a smart contract for experimentation and analyses. Our experiments show that BBTM is lightweight in processing, manages the certificate chain and ledger size efficiently, supports a bandwidth of multiple transactions per second, and provides validated end-entities.

Title: Machine-generated text detection prevents language model collapse

Authors: George Drayson, Vasileios Lampos
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.15654
Pdf URL: https://arxiv.org/pdf/2502.15654
Copy Paste: [[2502.15654]] Machine-generated text detection prevents language model collapse(https://arxiv.org/abs/2502.15654)
Keywords: generative, large language model
Abstract: As Large Language Models (LLMs) become increasingly prevalent, their generated outputs are proliferating across the web, risking a future where machine-generated content dilutes human-authored text. Since web data is the primary resource for LLM pretraining, future models will be trained on an unknown portion of synthetic data. This will lead to model collapse, a degenerative process which causes models to reinforce their own errors and experience a drop in model performance. In this study, we investigate the impact of decoding strategy on model collapse, where we analyse the characteristics of the generated data during recursive training, its similarity to human references and the resulting model performance. Using the decoding strategies that lead to the most significant model degradation, we tackle the question: how to avoid model collapse when the origin (human or synthetic) of the training data is unknown. We design a novel methodology based on resampling the data distribution using importance weights from our machine-generated text detector. Our method is validated on two LLM variants (GPT-2 and SmolLM2) on the open-ended text generation task, demonstrating that we can successfully prevent model collapse and that, when there is enough human-authored data in the training dataset, our method improves model performance.
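
The resampling idea in this abstract can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so the weight used here (one minus a detector's machine-generated probability) is an assumption, and `importance_resample` and its parameters are hypothetical names chosen for illustration, not the authors' implementation.

```python
import random


def importance_resample(samples, p_machine, n_out, seed=0):
    """Resample a corpus with replacement, favouring likely human-written items.

    samples   : list of training documents
    p_machine : detector scores in [0, 1], one per document (hypothetical
                detector output; higher means "more likely machine-generated")
    n_out     : size of the resampled training set
    """
    if len(samples) != len(p_machine):
        raise ValueError("need one detector score per sample")
    # Assumed weighting: the detector's probability that a document is human-written.
    weights = [1.0 - p for p in p_machine]
    if sum(weights) <= 0:
        raise ValueError("all samples have zero weight")
    rng = random.Random(seed)  # local RNG so the sketch is reproducible
    return rng.choices(samples, weights=weights, k=n_out)
```

For example, `importance_resample(["h1", "h2", "m1"], [0.0, 0.0, 1.0], 50)` never draws `"m1"`, because its weight is zero under this assumed weighting.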

Title: Almost AI, Almost Human: The Challenge of Detecting AI-Polished Writing

Authors: Shoumik Saha, Soheil Feizi
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2502.15666
Pdf URL: https://arxiv.org/pdf/2502.15666
Copy Paste: [[2502.15666]] Almost AI, Almost Human: The Challenge of Detecting AI-Polished Writing(https://arxiv.org/abs/2502.15666)
Keywords: large language model
Abstract: The growing use of large language models (LLMs) for text generation has led to widespread concerns about AI-generated content detection. However, an overlooked challenge is AI-polished text, where human-written content undergoes subtle refinements using AI tools. This raises a critical question: should minimally polished text be classified as AI-generated? Misclassification can lead to false plagiarism accusations and misleading claims about AI prevalence in online content. In this study, we systematically evaluate eleven state-of-the-art AI-text detectors using our AI-Polished-Text Evaluation (APT-Eval) dataset, which contains $11.7K$ samples refined at varying AI-involvement levels. Our findings reveal that detectors frequently misclassify even minimally polished text as AI-generated, struggle to differentiate between degrees of AI involvement, and exhibit biases against older and smaller models. These limitations highlight the urgent need for more nuanced detection methodologies.

Title: VaViM and VaVAM: Autonomous Driving through Video Generative Modeling

Authors: Florent Bartoccioni, Elias Ramzi, Victor Besnier, Shashanka Venkataramanan, Tuan-Hung Vu, Yihong Xu, Loick Chambon, Spyros Gidaris, Serkan Odabas, David Hurych, Renaud Marlet, Alexandre Boulch, Mickael Chen, Éloi Zablocki, Andrei Bursuc, Eduardo Valle, Matthieu Cord
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2502.15672
Pdf URL: https://arxiv.org/pdf/2502.15672
Copy Paste: [[2502.15672]] VaViM and VaVAM: Autonomous Driving through Video Generative Modeling(https://arxiv.org/abs/2502.15672)
Keywords: generative
Abstract: We explore the potential of large-scale generative video models for autonomous driving, introducing an open-source auto-regressive video model (VaViM) and its companion video-action model (VaVAM) to investigate how video pre-training transfers to real-world driving. VaViM is a simple auto-regressive video model that predicts frames using spatio-temporal token sequences. We show that it captures the semantics and dynamics of driving scenes. VaVAM, the video-action model, leverages the learned representations of VaViM to generate driving trajectories through imitation learning. Together, the models form a complete perception-to-action pipeline. We evaluate our models in open- and closed-loop driving scenarios, revealing that video-based pre-training holds promise for autonomous driving. Key insights include the semantic richness of the learned representations, the benefits of scaling for video synthesis, and the complex relationship between model size, data, and safety metrics in closed-loop evaluations. We release code and model weights at this https URL

Title: FLEKE: Federated Locate-then-Edit Knowledge Editing

Authors: Zongkai Zhao, Guozeng Xu, Xiuhua Li, Kaiwen Wei, Jiang Zhong
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.15677
Pdf URL: https://arxiv.org/pdf/2502.15677
Copy Paste: [[2502.15677]] FLEKE: Federated Locate-then-Edit Knowledge Editing(https://arxiv.org/abs/2502.15677)
Keywords: privacy, federate, large language model
Abstract: Locate-then-Edit Knowledge Editing (LEKE) is a key technique for updating large language models (LLMs) without full retraining. However, existing methods assume a single-user setting and become inefficient in real-world multi-client scenarios, where decentralized organizations (e.g., hospitals, financial institutions) independently update overlapping knowledge, leading to redundant mediator knowledge vector (MKV) computations and privacy concerns. To address these challenges, we introduce Federated Locate-then-Edit Knowledge Editing (FLEKE), a novel task that enables multiple clients to collaboratively perform LEKE while preserving privacy and reducing computational overhead. To achieve this, we propose FedEdit, a two-stage framework that optimizes MKV selection and reuse. In the first stage, clients locally apply LEKE and upload the computed MKVs. In the second stage, rather than relying solely on server-based MKV sharing, FLEKE allows clients to retrieve relevant MKVs based on cosine similarity, enabling knowledge re-edit and minimizing redundant computations. Experimental results on two benchmark datasets demonstrate that FedEdit retains over 96% of the performance of non-federated LEKE while significantly outperforming a FedAvg-based baseline by approximately twofold. Besides, we find that MEMIT performs more consistently than PMET in the FLEKE task with our FedEdit framework. Our code is available at this https URL.
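
The second-stage retrieval step can be sketched as a nearest-neighbour lookup under cosine similarity. The abstract does not say which vectors are compared or how a match is accepted, so the key vectors, the `threshold` parameter, and the function names below are all assumptions made for illustration rather than the FedEdit implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_mkv(query_key, store, threshold=0.9):
    """Return the stored MKV whose key is most similar to query_key.

    store     : list of (key_vector, mkv) pairs shared by other clients
                (hypothetical layout)
    threshold : assumed minimum similarity for reuse; below it, return None
                so the client falls back to computing its own MKV
    """
    best_score, best_mkv = -1.0, None
    for key, mkv in store:
        score = cosine_similarity(query_key, key)
        if score > best_score:
            best_score, best_mkv = score, mkv
    return best_mkv if best_score >= threshold else None
```

A client would call `retrieve_mkv` before running a local edit and skip the MKV computation whenever a sufficiently similar entry is returned.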

Title: Testing the limits of fine-tuning to improve reasoning in vision language models

Authors: Luca M. Schulze Buschoff, Konstantinos Voudouris, Elif Akata, Matthias Bethge, Joshua B. Tenenbaum, Eric Schulz
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.15678
Pdf URL: https://arxiv.org/pdf/2502.15678
Copy Paste: [[2502.15678]] Testing the limits of fine-tuning to improve reasoning in vision language models(https://arxiv.org/abs/2502.15678)
Keywords: robust
Abstract: Pre-trained vision language models still fall short of human visual cognition. In an effort to improve visual cognition and align models with human behavior, we introduce visual stimuli and human judgments on visual cognition tasks, allowing us to systematically evaluate performance across cognitive domains under a consistent environment. We fine-tune models on ground truth data for intuitive physics and causal reasoning and find that this improves model performance in the respective fine-tuning domain. Furthermore, it can improve model alignment with human behavior. However, we find that fine-tuning does not contribute to robust human-like generalization to data with other visual characteristics or to tasks in other cognitive domains.

Title: Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training

Authors: Jaydeep Borkar, Matthew Jagielski, Katherine Lee, Niloofar Mireshghallah, David A. Smith, Christopher A. Choquette-Choo
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2502.15680
Pdf URL: https://arxiv.org/pdf/2502.15680
Copy Paste: [[2502.15680]] Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training(https://arxiv.org/abs/2502.15680)
Keywords: privacy
Abstract: Due to the sensitive nature of personally identifiable information (PII), its owners may have the authority to control its inclusion or request its removal from large language model (LLM) training. Beyond this, PII may be added or removed from training datasets due to evolving dataset curation techniques, because it was newly scraped for retraining, or because it was included in a new downstream fine-tuning stage. We find that the amount and ease of PII memorization is a dynamic property of a model that evolves throughout training pipelines and depends on commonly altered design choices. We characterize three such novel phenomena: (1) similar-appearing PII seen later in training can elicit memorization of earlier-seen sequences in what we call assisted memorization, and this is a significant factor (in our settings, up to 1/3); (2) adding PII can increase memorization of other PII significantly (in our settings, as much as $\approx\!7.5\times$); and (3) removing PII can lead to other PII being memorized. Model creators should consider these first- and second-order privacy risks when training models to avoid the risk of new PII regurgitation.

Title: One-step Diffusion Models with $f$-Divergence Distribution Matching

Authors: Yilun Xu, Weili Nie, Arash Vahdat
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2502.15681
Pdf URL: https://arxiv.org/pdf/2502.15681
Copy Paste: [[2502.15681]] One-step Diffusion Models with $f$-Divergence Distribution Matching(https://arxiv.org/abs/2502.15681)
Keywords: diffusion
Abstract: Sampling from diffusion models involves a slow iterative process that hinders their practical deployment, especially for interactive applications. To accelerate generation speed, recent approaches distill a multi-step diffusion model into a single-step student generator via variational score distillation, which matches the distribution of samples generated by the student to the teacher's distribution. However, these approaches use the reverse Kullback-Leibler (KL) divergence for distribution matching which is known to be mode seeking. In this paper, we generalize the distribution matching approach using a novel $f$-divergence minimization framework, termed $f$-distill, that covers different divergences with different trade-offs in terms of mode coverage and training variance. We derive the gradient of the $f$-divergence between the teacher and student distributions and show that it is expressed as the product of their score differences and a weighting function determined by their density ratio. This weighting function naturally emphasizes samples with higher density in the teacher distribution, when using a less mode-seeking divergence. We observe that the popular variational score distillation approach using the reverse-KL divergence is a special case within our framework. Empirically, we demonstrate that alternative $f$-divergences, such as forward-KL and Jensen-Shannon divergences, outperform the current best variational score distillation methods across image generation tasks. In particular, when using Jensen-Shannon divergence, $f$-distill achieves current state-of-the-art one-step generation performance on ImageNet64 and zero-shot text-to-image generation on MS-COCO. Project page: this https URL
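
For orientation, the standard definition of an $f$-divergence and the generator functions for the three divergences named in this abstract are given below. Taking $p$ as the teacher distribution and $q$ as the student distribution is an assumption made here for illustration; the abstract does not fix the notation, and the paper's gradient expression and weighting function are not reproduced.

```latex
% f-divergence for a convex f with f(1) = 0, writing r(x) = p(x) / q(x):
D_f(p \,\|\, q) = \mathbb{E}_{x \sim q}\!\left[ f\!\left( \frac{p(x)}{q(x)} \right) \right]

% Special cases (p = teacher, q = student, by assumption):
f(r) = -\log r
  \;\Rightarrow\; D_f = \mathrm{KL}(q \,\|\, p) \quad \text{(reverse KL, mode seeking)}

f(r) = r \log r
  \;\Rightarrow\; D_f = \mathrm{KL}(p \,\|\, q) \quad \text{(forward KL)}

f(r) = \tfrac{1}{2}\!\left[ r \log r - (r + 1) \log \frac{r + 1}{2} \right]
  \;\Rightarrow\; D_f = \mathrm{JS}(p, q) \quad \text{(Jensen-Shannon)}
```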