2025-03-11

Title: What I cannot execute, I do not understand: Training and Evaluating LLMs on Program Execution Traces

Authors: Jordi Armengol-Estapé, Quentin Carbonneaux, Tianjun Zhang, Aram H. Markosyan, Volker Seeker, Chris Cummins, Melanie Kambadur, Michael F.P. O'Boyle, Sida Wang, Gabriel Synnaeve, Hugh James Leather
Subjects: cs.LG, cs.AI, cs.PL
Abstract URL: https://arxiv.org/abs/2503.05703
Pdf URL: https://arxiv.org/pdf/2503.05703
Copy Paste: [[2503.05703]] What I cannot execute, I do not understand: Training and Evaluating LLMs on Program Execution Traces(https://arxiv.org/abs/2503.05703)
Keywords: large language model
Abstract: Code generation and understanding are critical capabilities for large language models (LLMs). Thus, most LLMs are pretrained and fine-tuned on code data. However, these datasets typically treat code as static strings and rarely exploit the dynamic information about their execution. Building upon previous work on trace modeling, we study Execution Tuning (E.T.), a training procedure in which we explicitly model real-world program execution traces without requiring manual test annotations. We train and evaluate models on different execution trace granularities (line and instruction-level) and strategies on the task of output prediction, obtaining around 80% accuracy on CruxEval and MBPP, and showing the advantages of dynamic scratchpads (i.e., self-contained intermediate computations updated by the model rather than accumulated as a history of past computations) on long executions (up to 14k steps). Finally, we discuss E.T.'s practical applications.
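Note: the contrast between an accumulated trace and a dynamic scratchpad can be illustrated with a minimal Python sketch. This is not the paper's pipeline; `trace_lines` and `total` are hypothetical names, and the sketch only records line-level states of a toy function through `sys.settrace`, assuming CPython.

```python
import sys


def trace_lines(func, *args):
    """Run func(*args) and record its local variables before each executed line.

    Hypothetical helper: returns the result and a list of
    (line offset within func, snapshot of locals) pairs.
    """
    history = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace other functions
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            history.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return result, history


def total(n):
    acc = 0
    for i in range(n):
        acc += i
    return acc


result, history = trace_lines(total, 3)

# Accumulated trace: every intermediate state is kept, so it grows with run length.
full_trace = history

# Dynamic scratchpad: a single state that is overwritten at each step.
scratchpad = None
for step in history:
    scratchpad = step
```

The accumulated trace grows with the number of executed steps, whereas the scratchpad stays the size of one state, which is the property the abstract associates with long executions.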

Title: What Are They Filtering Out? A Survey of Filtering Strategies for Harm Reduction in Pretraining Datasets

Authors: Marco Antonio Stranisci, Christian Hardmeier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.05721
Pdf URL: https://arxiv.org/pdf/2503.05721
Copy Paste: [[2503.05721]] What Are They Filtering Out? A Survey of Filtering Strategies for Harm Reduction in Pretraining Datasets(https://arxiv.org/abs/2503.05721)
Keywords: large language model
Abstract: Data filtering strategies are a crucial component in developing safe Large Language Models (LLMs), since they support the removal of harmful content from pretraining datasets. However, there is a lack of research on the actual impact of these strategies on groups vulnerable to discrimination, and their effectiveness has not yet been systematically addressed. In this paper we present a benchmark study of data filtering strategies for harm reduction aimed at providing a systematic overview of these approaches. We survey 55 technical reports of English LMs and LLMs to identify the existing filtering strategies in the literature and implement an experimental setting to test their impact on vulnerable groups. Our results show that the positive impact that strategies have in reducing harmful content in documents has the side effect of increasing the underrepresentation of groups vulnerable to discrimination in datasets.

Title: CSTRL: Context-Driven Sequential Transfer Learning for Abstractive Radiology Report Summarization

Authors: Mst. Fahmida Sultana Naznin, Adnan Ibney Faruq, Mostafa Rifat Tazwar, Md Jobayer, Md. Mehedi Hasan Shawon, Md Rakibul Hasan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.05750
Pdf URL: https://arxiv.org/pdf/2503.05750
Copy Paste: [[2503.05750]] CSTRL: Context-Driven Sequential Transfer Learning for Abstractive Radiology Report Summarization(https://arxiv.org/abs/2503.05750)
Keywords: extraction
Abstract: A radiology report comprises several sections, including the Findings and Impression of the diagnosis. Automatically generating the Impression from the Findings is crucial for reducing radiologists' workload and improving diagnostic accuracy. Pretrained models that excel in common abstractive summarization problems encounter challenges when applied to specialized medical domains, largely due to the complex terminology and the necessity for accurate clinical context. Such tasks in medical domains demand extracting core information, avoiding context shifts, and maintaining proper flow. Misuse of medical terms can lead to drastic clinical errors. To address these issues, we introduce a sequential transfer learning approach that ensures key content extraction and coherent summarization. Sequential transfer learning often faces challenges like initial parameter decay and knowledge loss, which we resolve with Fisher matrix regularization. Using the MIMIC-CXR and Open-I datasets, our model, CSTRL (Context-driven Sequential TRansfer Learning), achieved state-of-the-art performance, showing 56.2% improvement in BLEU-1, 40.5% in BLEU-2, 84.3% in BLEU-3, 28.9% in ROUGE-1, 41.0% in ROUGE-2 and 26.5% in ROUGE-3 score over benchmark studies. We also analyze factual consistency scores while preserving the medical context. Our code is publicly available at TBA.

Title: Uncertainty-Aware Fusion: An Ensemble Framework for Mitigating Hallucinations in Large Language Models

Authors: Prasenjit Dey, Srujana Merugu, Sivaramakrishnan Kaveri
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.05757
Pdf URL: https://arxiv.org/pdf/2503.05757
Copy Paste: [[2503.05757]] Uncertainty-Aware Fusion: An Ensemble Framework for Mitigating Hallucinations in Large Language Models(https://arxiv.org/abs/2503.05757)
Keywords: large language model
Abstract: Large Language Models (LLMs) are known to hallucinate and generate non-factual outputs, which can undermine user trust. Traditional methods to directly mitigate hallucinations, such as representation editing and contrastive decoding, often require additional training data and involve high implementation complexity. While ensemble-based approaches harness multiple LLMs to tap into the "wisdom of crowds", these methods overlook uncertainties in individual model responses. Recent studies reveal that uncertainty estimation can enable LLMs to self-assess the likelihood of generating hallucinations. In this work, we focus on factoid question answering (QA) and observe that LLMs' accuracy and self-assessment capabilities vary widely, with different models excelling in different scenarios. Leveraging this insight, we propose Uncertainty-Aware Fusion (UAF), an ensemble framework that reduces hallucinations by strategically combining multiple LLMs based on their accuracy and self-assessment abilities. Empirical results on several public benchmark datasets show that UAF outperforms state-of-the-art hallucination mitigation methods by 8% in factual accuracy, while either narrowing or surpassing the performance gap with GPT-4.

Title: Geometric Properties and Graph-Based Optimization of Neural Networks: Addressing Non-Linearity, Dimensionality, and Scalability

Authors: Michael Wienczkowski, Addisu Desta, Paschal Ugochukwu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.05761
Pdf URL: https://arxiv.org/pdf/2503.05761
Copy Paste: [[2503.05761]] Geometric Properties and Graph-Based Optimization of Neural Networks: Addressing Non-Linearity, Dimensionality, and Scalability(https://arxiv.org/abs/2503.05761)
Keywords: robust
Abstract: Deep learning models are often considered black boxes due to their complex hierarchical transformations. Identifying suitable architectures is crucial for maximizing predictive performance with limited data. Understanding the geometric properties of neural networks involves analyzing their structure, activation functions, and the transformations they perform in high-dimensional space. These properties influence learning, representation, and decision-making. This research explores neural networks through geometric metrics and graph structures, building upon foundational work in arXiv:2007.06559. It addresses the limited understanding of geometric structures governing neural networks, particularly the data manifolds they operate on, which impact classification, optimization, and representation. We identify three key challenges: (1) overcoming linear separability limitations, (2) managing the dimensionality-complexity trade-off, and (3) improving scalability through graph representations. To address these, we propose leveraging non-linear activation functions, optimizing network complexity via pruning and transfer learning, and developing efficient graph-based models. Our findings contribute to a deeper understanding of neural network geometry, supporting the development of more robust, scalable, and interpretable models.

Title: Graph Masked Language Models

Authors: Aarush Sinha, OM Kumar CU
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.05763
Pdf URL: https://arxiv.org/pdf/2503.05763
Copy Paste: [[2503.05763]] Graph Masked Language Models(https://arxiv.org/abs/2503.05763)
Keywords: robust
Abstract: Language Models (LMs) are integral to Natural Language Processing (NLP), yet their interaction with structured knowledge graphs (KGs) remains an open research challenge. While Graph Neural Networks (GNNs) excel at capturing graph structures, they struggle with textual feature representation compared to pretrained LMs. To bridge this gap, we propose Graph Masked Language Models (GMLM) for node classification tasks. Our approach introduces two key innovations: a semantic masking strategy that selectively masks nodes based on their structural importance, ensuring critical graph components contribute effectively to learning, and a soft masking mechanism that generates interpolated node representations, enabling smoother information retention and improved gradient flow. Our dual-branch model architecture fuses structural graph information with contextual embeddings via a multi-layer fusion network. Extensive experiments on six node classification benchmarks demonstrate that GMLM not only achieves state-of-the-art (SOTA) performance but also enhances robustness and stability across datasets.

Title: Evaluation of Missing Data Imputation for Time Series Without Ground Truth

Authors: Rania Farjallah, Bassant Selim, Brigitte Jaumard, Samr Ali, Georges Kaddoum
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.05775
Pdf URL: https://arxiv.org/pdf/2503.05775
Copy Paste: [[2503.05775]] Evaluation of Missing Data Imputation for Time Series Without Ground Truth(https://arxiv.org/abs/2503.05775)
Keywords: robust
Abstract: The challenge of handling missing data in time series is critical for maintaining the accuracy and reliability of machine learning (ML) models in applications like fifth generation mobile communication (5G) network management. Traditional methods for validating imputation rely on ground truth data, which is inherently unavailable. This paper addresses this limitation by introducing two statistical metrics, the Wasserstein distance (WD) and the Jensen-Shannon divergence (JSD), to evaluate imputation quality without requiring ground truth. These metrics assess the alignment between the distributions of imputed and original data, providing a robust method for evaluating imputation performance based on internal structure and data consistency. We apply and test these metrics across several imputation techniques. Results demonstrate that WD and JSD are effective metrics for assessing the quality of missing data imputation, particularly in scenarios where ground truth data is unavailable.
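Note: the two metrics named in the abstract can be sketched in plain Python as follows. This is a minimal illustration, not the authors' evaluation code; the function names, the histogram binning, the base-2 logarithm, and the toy data are all assumptions.

```python
import bisect
import math


def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two 1-D empirical samples.

    Computed as the integral of |CDF_a(x) - CDF_b(x)| over x.
    """
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a = bisect.bisect_right(a, left) / len(a)
        cdf_b = bisect.bisect_right(b, left) / len(b)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total


def histogram(values, lo, hi, bins):
    """Normalized histogram over [lo, hi]; out-of-range values are clamped."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        idx = int((v - lo) / width)
        counts[max(0, min(idx, bins - 1))] += 1
    return [c / len(values) for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(x, y):
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy data (hypothetical): observed values and the series after imputation.
observed = [1.0, 2.0, 2.0, 3.0, 4.0]
completed = observed + [2.5, 2.5]

wd = wasserstein_1d(observed, completed)
jsd = js_divergence(
    histogram(observed, 0.0, 5.0, 5),
    histogram(completed, 0.0, 5.0, 5),
)
```

Lower values of both quantities indicate that the imputed series stays closer to the distribution of the observed data; neither computation needs the true missing values.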

Title: FAA-CLIP: Federated Adversarial Adaptation of CLIP

Authors: Yihang Wu, Ahmad Chaddad, Christian Desrosiers, Tareef Daqqaq, Reem Kateb
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.05776
Pdf URL: https://arxiv.org/pdf/2503.05776
Copy Paste: [[2503.05776]] FAA-CLIP: Federated Adversarial Adaptation of CLIP(https://arxiv.org/abs/2503.05776)
Keywords: federate
Abstract: Despite the remarkable performance of vision language models (VLMs) such as Contrastive Language Image Pre-training (CLIP), the large size of these models is a considerable obstacle to their use in federated learning (FL) systems where the parameters of local client models need to be transferred to a global server for aggregation. Another challenge in FL is the heterogeneity of data from different clients, which affects the generalization performance of the solution. In addition, VLMs pre-trained on natural images exhibit poor generalization ability on medical datasets, suggesting that a domain gap exists. To solve these issues, we introduce a novel method for the Federated Adversarial Adaptation (FAA) of CLIP. Our method, named FAA-CLIP, handles the large communication costs of CLIP using a light-weight feature adaptation module (FAM) for aggregation, effectively adapting this VLM to each client's data while greatly reducing the number of parameters to transfer. By keeping CLIP frozen and only updating the FAM parameters, our method is also computationally efficient. Unlike existing approaches, our FAA-CLIP method directly addresses the problem of domain shifts across clients via a domain adaptation (DA) module. This module employs a domain classifier to predict if a given sample is from the local client or the global server, allowing the model to learn domain-invariant representations. Extensive experiments on six different datasets containing both natural and medical images demonstrate that FAA-CLIP can generalize well on both natural and medical datasets compared to recent FL approaches. Our code is available at this https URL.

Title: Medical Hallucinations in Foundation Models and Their Impact on Healthcare

Authors: Yubin Kim, Hyewon Jeong, Shan Chen, Shuyue Stella Li, Mingyu Lu, Kumail Alhamoud, Jimin Mun, Cristina Grau, Minseok Jung, Rodrigo Gameiro, Lizhou Fan, Eugene Park, Tristan Lin, Joonsik Yoon, Wonjin Yoon, Maarten Sap, Yulia Tsvetkov, Paul Liang, Xuhai Xu, Xin Liu, Daniel McDuff, Hyeonhoon Lee, Hae Won Park, Samir Tulebaev, Cynthia Breazeal
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2503.05777
Pdf URL: https://arxiv.org/pdf/2503.05777
Copy Paste: [[2503.05777]] Medical Hallucinations in Foundation Models and Their Impact on Healthcare(https://arxiv.org/abs/2503.05777)
Keywords: robust
Abstract: Foundation Models that are capable of processing and generating multi-modal data have transformed AI's role in medicine. However, a key limitation of their reliability is hallucination, where inaccurate or fabricated information can impact clinical decisions and patient safety. We define medical hallucination as any instance in which a model generates misleading medical content. This paper examines the unique characteristics, causes, and implications of medical hallucinations, with a particular focus on how these errors manifest themselves in real-world clinical scenarios. Our contributions include (1) a taxonomy for understanding and addressing medical hallucinations, (2) benchmarking models using a medical hallucination dataset and physician-annotated LLM responses to real medical cases, providing direct insight into the clinical impact of hallucinations, and (3) a multi-national clinician survey on their experiences with medical hallucinations. Our results reveal that inference techniques such as Chain-of-Thought (CoT) and Search Augmented Generation can effectively reduce hallucination rates. However, despite these improvements, non-trivial levels of hallucination persist. These findings underscore the ethical and practical imperative for robust detection and mitigation strategies, establishing a foundation for regulatory policies that prioritize patient safety and maintain clinical integrity as AI becomes more integrated into healthcare. The feedback from clinicians highlights the urgent need not only for technical advances but also for clearer ethical and regulatory guidelines to ensure patient safety. A repository organizing the paper resources, summaries, and additional information is available at this https URL.

Title: DreamNet: A Multimodal Framework for Semantic and Emotional Analysis of Sleep Narratives

Authors: Tapasvi Panchagnula
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.05778
Pdf URL: https://arxiv.org/pdf/2503.05778
Copy Paste: [[2503.05778]] DreamNet: A Multimodal Framework for Semantic and Emotional Analysis of Sleep Narratives(https://arxiv.org/abs/2503.05778)
Keywords: transformer
Abstract: Dream narratives provide a unique window into human cognition and emotion, yet their systematic analysis using artificial intelligence has been underexplored. We introduce DreamNet, a novel deep learning framework that decodes semantic themes and emotional states from textual dream reports, optionally enhanced with REM-stage EEG data. Leveraging a transformer-based architecture with multimodal attention, DreamNet achieves 92.1% accuracy and 88.4% F1-score in text-only mode (DNet-T) on a curated dataset of 1,500 anonymized dream narratives, improving to 99.0% accuracy and 95.2% F1-score with EEG integration (DNet-M). Strong dream-emotion correlations (e.g., falling-anxiety, r = 0.91, p < 0.01) highlight its potential for mental health diagnostics, cognitive science, and personalized therapy. This work provides a scalable tool, a publicly available enriched dataset, and a rigorous methodology, bridging AI and psychological research.

Title: FedMentalCare: Towards Privacy-Preserving Fine-Tuned LLMs to Analyze Mental Health Status Using Federated Learning Framework

Authors: S M Sarwar
Subjects: cs.CL, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2503.05786
Pdf URL: https://arxiv.org/pdf/2503.05786
Copy Paste: [[2503.05786]] FedMentalCare: Towards Privacy-Preserving Fine-Tuned LLMs to Analyze Mental Health Status Using Federated Learning Framework(https://arxiv.org/abs/2503.05786)
Keywords: security, privacy, federate, large language model
Abstract: With the increasing prevalence of mental health conditions worldwide, AI-powered chatbots and conversational agents have emerged as accessible tools to support mental health. However, deploying Large Language Models (LLMs) in mental healthcare applications raises significant privacy concerns, especially regarding regulations like HIPAA and GDPR. In this work, we propose FedMentalCare, a privacy-preserving framework that leverages Federated Learning (FL) combined with Low-Rank Adaptation (LoRA) to fine-tune LLMs for mental health analysis. We investigate the performance impact of varying client data volumes and model architectures (e.g., MobileBERT and MiniLM) in FL environments. Our framework demonstrates a scalable, privacy-aware approach for deploying LLMs in real-world mental healthcare scenarios, addressing data security and computational efficiency challenges.

Title: Emergent Abilities in Large Language Models: A Survey

Authors: Leonardo Berti, Flavio Giorgi, Gjergji Kasneci
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.05788
Pdf URL: https://arxiv.org/pdf/2503.05788
Copy Paste: [[2503.05788]] Emergent Abilities in Large Language Models: A Survey(https://arxiv.org/abs/2503.05788)
Keywords: large language model
Abstract: Large Language Models (LLMs) are leading a new technological revolution as one of the most promising research streams toward artificial general intelligence. The scaling of these models, accomplished by increasing the number of parameters and the magnitude of the training datasets, has been linked to various so-called emergent abilities that were previously unobserved. These emergent abilities, ranging from advanced reasoning and in-context learning to coding and problem-solving, have sparked an intense scientific debate: Are they truly emergent, or do they simply depend on external factors, such as training dynamics, the type of problems, or the chosen metric? What underlying mechanism causes them? Despite their transformative potential, emergent abilities remain poorly understood, leading to misconceptions about their definition, nature, predictability, and implications. In this work, we shed light on emergent abilities by conducting a comprehensive review of the phenomenon, addressing both its scientific underpinnings and real-world consequences. We first critically analyze existing definitions, exposing inconsistencies in conceptualizing emergent abilities. We then explore the conditions under which these abilities appear, evaluating the role of scaling laws, task complexity, pre-training loss, quantization, and prompting strategies. Our review extends beyond traditional LLMs and includes Large Reasoning Models (LRMs), which leverage reinforcement learning and inference-time search to amplify reasoning and self-reflection. However, emergence is not inherently positive. As AI systems gain autonomous reasoning capabilities, they also develop harmful behaviors, including deception, manipulation, and reward hacking. We highlight growing concerns about safety and governance, emphasizing the need for better evaluation frameworks and regulatory oversight.

Title: EXALT: EXplainable ALgorithmic Tools for Optimization Problems

Authors: Zuzanna Bączek, Michał Bizoń, Aneta Pawelec, Piotr Sankowski
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.05789
Pdf URL: https://arxiv.org/pdf/2503.05789
Copy Paste: [[2503.05789]] EXALT: EXplainable ALgorithmic Tools for Optimization Problems(https://arxiv.org/abs/2503.05789)
Keywords: robust
Abstract: Algorithmic solutions have significant potential to improve decision-making across various domains, from healthcare to e-commerce. However, the widespread adoption of these solutions is hindered by a critical challenge: the lack of human-interpretable explanations. Current approaches to Explainable AI (XAI) predominantly focus on complex machine learning models, often producing brittle and non-intuitive explanations. This project proposes a novel approach to developing explainable algorithms by starting with optimization problems, specifically the assignment problem. The developed software library enriches basic algorithms with human-understandable explanations through four key methodologies: generating meaningful alternative solutions, creating robust solutions through input perturbation, generating concise decision trees, and providing reports with a comprehensive explanation of the results. Currently developed tools are often designed with specific clustering algorithms in mind, which limits their adaptability and flexibility to incorporate alternative techniques. Additionally, many of these tools fail to integrate expert knowledge, which could enhance the clustering process by providing valuable insights and context. This lack of adaptability and integration can hinder the effectiveness and robustness of the clustering outcomes in various applications. This project represents a step towards making algorithmic solutions more transparent, trustworthy, and accessible. By collaborating with industry partners in sectors such as sales, we demonstrate the practical relevance and transformative potential of our approach.

Title: CBW: Towards Dataset Ownership Verification for Speaker Verification via Clustering-based Backdoor Watermarking

Authors: Yiming Li, Kaiying Yan, Shuo Shao, Tongqing Zhai, Shu-Tao Xia, Zhan Qin, Dacheng Tao
Subjects: cs.CR, cs.AI, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2503.05794
Pdf URL: https://arxiv.org/pdf/2503.05794
Copy Paste: [[2503.05794]] CBW: Towards Dataset Ownership Verification for Speaker Verification via Clustering-based Backdoor Watermarking(https://arxiv.org/abs/2503.05794)
Keywords: protect, attack, robust, watermark
Abstract: With the increasing adoption of deep learning in speaker verification, large-scale speech datasets have become valuable intellectual property. To audit and prevent the unauthorized usage of these valuable released datasets, especially in commercial or open-source scenarios, we propose a novel dataset ownership verification method. Our approach introduces a clustering-based backdoor watermark (CBW), enabling dataset owners to determine whether a suspicious third-party model has been trained on a protected dataset under a black-box setting. The CBW method consists of two key stages: dataset watermarking and ownership verification. During watermarking, we implant multiple trigger patterns in the dataset to make similar samples (measured by their feature similarities) close to the same trigger while dissimilar samples are near different triggers. This ensures that any model trained on the watermarked dataset exhibits specific misclassification behaviors when exposed to trigger-embedded inputs. To verify dataset ownership, we design a hypothesis-test-based framework that statistically evaluates whether a suspicious model exhibits the expected backdoor behavior. We conduct extensive experiments on benchmark datasets, verifying the effectiveness and robustness of our method against potential adaptive attacks. The code for reproducing main experiments is available at this https URL
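Note: the abstract does not state which hypothesis test the verification stage uses. As a minimal sketch of the general idea only, the block below assumes a one-sided exact binomial test: the null hypothesis is that a clean model produces the watermark's target behavior on a trigger-embedded input with probability `chance_rate`. The function names, the significance level, and the example counts are all hypothetical.

```python
from math import comb


def binomial_tail(successes, trials, p0):
    """P(X >= successes) for X ~ Binomial(trials, p0)."""
    return sum(
        comb(trials, k) * p0**k * (1 - p0) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def looks_watermarked(hits, trials, chance_rate, alpha=0.01):
    """Reject the 'clean model' hypothesis if the hit rate is improbably high.

    hits: trigger-embedded queries on which the suspicious model showed
          the expected backdoor behavior (black-box outputs only).
    """
    p_value = binomial_tail(hits, trials, chance_rate)
    return p_value < alpha, p_value


# Hypothetical example: 50 triggered queries, 10% chance rate under the null.
flagged, p_value = looks_watermarked(hits=45, trials=50, chance_rate=0.1)
not_flagged, p_clean = looks_watermarked(hits=5, trials=50, chance_rate=0.1)
```

Only query access to the suspicious model is needed for such a test, which matches the black-box setting described in the abstract.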

Title: How Do Consumers Really Choose: Exposing Hidden Preferences with the Mixture of Experts Model

Authors: Diego Vallarino
Subjects: cs.LG, econ.EM
Abstract URL: https://arxiv.org/abs/2503.05800
Pdf URL: https://arxiv.org/pdf/2503.05800
Copy Paste: [[2503.05800]] How Do Consumers Really Choose: Exposing Hidden Preferences with the Mixture of Experts Model(https://arxiv.org/abs/2503.05800)
Keywords: segmentation
Abstract: Understanding consumer choice is fundamental to marketing and management research, as firms increasingly seek to personalize offerings and optimize customer engagement. Traditional choice modeling frameworks, such as multinomial logit (MNL) and mixed logit models, impose rigid parametric assumptions that limit their ability to capture the complexity of consumer decision-making. This study introduces the Mixture of Experts (MoE) framework as a machine learning-driven alternative that dynamically segments consumers based on latent behavioral patterns. By leveraging probabilistic gating functions and specialized expert networks, MoE provides a flexible, nonparametric approach to modeling heterogeneous preferences. Empirical validation using large-scale retail data demonstrates that MoE significantly enhances predictive accuracy over traditional econometric models, capturing nonlinear consumer responses to price variations, brand preferences, and product attributes. The findings underscore MoE's potential to improve demand forecasting, optimize targeted marketing strategies, and refine segmentation practices. By offering a more granular and adaptive framework, this study bridges the gap between data-driven machine learning approaches and marketing theory, advocating for the integration of AI techniques in managerial decision-making and strategic consumer insights.
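Note: the gating-plus-experts structure can be sketched as follows. This is a simplified illustration, not the study's model: each expert here is assumed to be a linear logit over alternative attributes and the gate a softmax over consumer features, whereas the abstract describes expert networks. All names, shapes, and numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def moe_choice_probs(consumer, alternatives, gate_weights, expert_weights):
    """Mixture-of-experts choice probabilities for one consumer.

    consumer:       feature vector of the consumer (drives the gate)
    alternatives:   one attribute vector per product alternative
    gate_weights:   one weight vector per expert, applied to consumer features
    expert_weights: one taste vector per expert, applied to alternative attributes
    """
    gate = softmax([dot(w, consumer) for w in gate_weights])
    mixed = [0.0] * len(alternatives)
    for g, taste in zip(gate, expert_weights):
        expert_probs = softmax([dot(taste, alt) for alt in alternatives])
        for i, p in enumerate(expert_probs):
            mixed[i] += g * p  # soft segment membership times segment-level choice
    return mixed


# Hypothetical example: two latent segments, three alternatives described by
# (price, brand strength).
consumer = [1.0, 0.5]
alternatives = [[1.0, 0.2], [2.0, 0.9], [1.5, 0.5]]
gate_weights = [[2.0, -1.0], [-2.0, 1.0]]
expert_weights = [[-1.5, 0.5], [-0.2, 3.0]]  # price-sensitive vs brand-driven

probs = moe_choice_probs(consumer, alternatives, gate_weights, expert_weights)
```

The gate output plays the role of a soft segment assignment, so segmentation and choice prediction are learned jointly instead of fixing segments in advance.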

Title: Federated Learning Framework via Distributed Mutual Learning

Authors: Yash Gupta
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.05803
Pdf URL: https://arxiv.org/pdf/2503.05803
Copy Paste: [[2503.05803]] Federated Learning Framework via Distributed Mutual Learning(https://arxiv.org/abs/2503.05803)
Keywords: privacy, federate
Abstract: Federated Learning often relies on sharing full or partial model weights, which can burden network bandwidth and raise privacy risks. We present a loss-based alternative using distributed mutual learning. Instead of transmitting weights, clients periodically share their loss predictions on a public test set. Each client then refines its model by combining its local loss with the average Kullback-Leibler divergence over losses from other clients. This collaborative approach both reduces transmission overhead and preserves data privacy. Experiments on a face mask detection task demonstrate that our method outperforms weight-sharing baselines, achieving higher accuracy on unseen data while providing stronger generalization and privacy benefits.
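Note: the combined objective described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the clients are assumed to exchange per-sample class-probability vectors on the public set, the KL direction (peer relative to own) and the `weight` factor are assumptions, and all names are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def mutual_learning_loss(local_loss, own_public_probs, peer_public_probs, weight=1.0):
    """Local loss plus the average KL divergence to the other clients.

    local_loss:        scalar loss on the client's private data
    own_public_probs:  this client's predicted distributions on the public set
    peer_public_probs: one such list per peer, aligned sample by sample
    """
    if not peer_public_probs:
        return local_loss
    n = len(own_public_probs)
    per_peer = [
        sum(kl_divergence(pp, op) for pp, op in zip(peer, own_public_probs)) / n
        for peer in peer_public_probs
    ]
    return local_loss + weight * sum(per_peer) / len(per_peer)


# Hypothetical example: two public samples, two classes, two peers.
own = [[0.7, 0.3], [0.2, 0.8]]
peers = [
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.6, 0.4], [0.3, 0.7]],
]
loss = mutual_learning_loss(0.5, own, peers)
```

Only prediction vectors on the public set cross the network in this sketch, which is far smaller than a set of model weights.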

Title: Multi-agent Auto-Bidding with Latent Graph Diffusion Models

Authors: Dom Huh, Prasant Mohapatra
Subjects: cs.LG, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2503.05805
Pdf URL: https://arxiv.org/pdf/2503.05805
Copy Paste: [[2503.05805]] Multi-agent Auto-Bidding with Latent Graph Diffusion Models(https://arxiv.org/abs/2503.05805)
Keywords: diffusion
Abstract: This paper proposes a diffusion-based auto-bidding framework that leverages graph representations to model large-scale auction environments. In such settings, agents must dynamically optimize bidding strategies under constraints defined by key performance indicator (KPI) metrics, all while operating in competitive environments characterized by uncertain, sparse, and stochastic variables. To address these challenges, we introduce a novel approach combining learnable graph-based embeddings with a planning-based latent diffusion model (LDM). By capturing patterns and nuances underlying the interdependence of impression opportunities and the multi-agent dynamics of the auction environment, the graph representation enables expressive computations regarding auto-bidding outcomes. With reward alignment techniques, the LDM's posterior is fine-tuned to generate auto-bidding trajectories that maximize KPI metrics while satisfying constraint thresholds. Empirical evaluations on both real-world and synthetic auction environments demonstrate significant improvements in auto-bidding performance across multiple common KPI metrics, as well as accuracy in forecasting auction outcomes.

Title: A Transformer Model for Predicting Chemical Reaction Products from Generic Templates

Authors: Derin Ozer, Sylvain Lamprier, Thomas Cauchy, Nicolas Gutowski, Benoit Da Mota
Subjects: cs.LG, cs.AI, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2503.05810
Pdf URL: https://arxiv.org/pdf/2503.05810
Copy Paste: [[2503.05810]] A Transformer Model for Predicting Chemical Reaction Products from Generic Templates(https://arxiv.org/abs/2503.05810)
Keywords: transformer
Abstract: The accurate prediction of chemical reaction outcomes is a major challenge in computational chemistry. Current models rely heavily on either highly specific reaction templates or template-free methods, both of which present limitations. To address these limitations, this work proposes the Broad Reaction Set (BRS), a dataset featuring 20 generic reaction templates that allow for the efficient exploration of the chemical space. Additionally, ProPreT5 is introduced, a T5 model tailored to chemistry that achieves a balance between rigid templates and template-free methods. ProPreT5 demonstrates its capability to generate accurate, valid, and realistic reaction products, making it a promising solution that goes beyond the current state-of-the-art on the complex reaction product prediction task.

Title: Randomized based restricted kernel machine for hyperspectral image classification

Authors: A. Quadir, M. Tanveer
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.05837
Pdf URL: https://arxiv.org/pdf/2503.05837
Copy Paste: [[2503.05837]] Randomized based restricted kernel machine for hyperspectral image classification(https://arxiv.org/abs/2503.05837)
Keywords: robust, interpretability
Abstract: In recent years, the random vector functional link (RVFL) network has gained significant popularity in hyperspectral image (HSI) classification due to its simplicity, speed, and strong generalization performance. However, despite these advantages, RVFL models face several limitations, particularly in handling non-linear relationships and complex data structures. The random initialization of input-to-hidden weights can lead to instability, and the model struggles with determining the optimal number of hidden nodes, affecting its performance on more challenging datasets. To address these issues, we propose a novel randomized based restricted kernel machine ($R^2KM$) model that combines the strengths of RVFL and restricted kernel machines (RKM). $R^2KM$ introduces a layered structure that represents kernel methods using both visible and hidden variables, analogous to the energy function in restricted Boltzmann machines (RBM). This structure enables $R^2KM$ to capture complex data interactions and non-linear relationships more effectively, improving both interpretability and model robustness. A key contribution of $R^2KM$ is the introduction of a novel conjugate feature duality based on the Fenchel-Young inequality, which expresses the problem in terms of conjugate dual variables and provides an upper bound on the objective function. This duality enhances the model's flexibility and scalability, offering a more efficient and flexible solution for complex data analysis tasks. Extensive experiments on hyperspectral image datasets and real-world data from the UCI and KEEL repositories show that $R^2KM$ outperforms baseline models, demonstrating its effectiveness in classification and regression tasks.
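Note: for reference, the general Fenchel-Young inequality and its quadratic special case are written out below. The symbols $e$ (an error vector), $h$ (a conjugate or hidden variable), and $\lambda > 0$ are assumed notation; the abstract does not give the exact form used in $R^2KM$.

$$
f(e) + f^{*}(h) \;\ge\; e^{\top} h,
\qquad
f^{*}(h) = \sup_{e}\,\bigl(e^{\top} h - f(e)\bigr).
$$

For the quadratic choice $f(e) = \frac{1}{2\lambda} e^{\top} e$, the conjugate is $f^{*}(h) = \frac{\lambda}{2} h^{\top} h$, so

$$
\frac{1}{2\lambda}\, e^{\top} e \;\ge\; e^{\top} h - \frac{\lambda}{2}\, h^{\top} h,
$$

with equality exactly when $h = e/\lambda$. Replacing a quadratic term by the right-hand side is what introduces the conjugate variable $h$ into an objective.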

Title: Enhancing AUTOSAR-Based Firmware Over-the-Air Updates in the Automotive Industry with a Practical Implementation on a Steering System

Authors: Mostafa Ahmed Mostafa Ahmed, Mohamed Khaled Mohamed Elsayed, Radwa Waheed Ezzat Abdelmohsen
Subjects: cs.CR, cs.CV, eess.SY
Abstract URL: https://arxiv.org/abs/2503.05839
Pdf URL: https://arxiv.org/pdf/2503.05839
Copy Paste: [[2503.05839]] Enhancing AUTOSAR-Based Firmware Over-the-Air Updates in the Automotive Industry with a Practical Implementation on a Steering System(https://arxiv.org/abs/2503.05839)
Keywords: secure, security
Abstract: The automotive industry is increasingly reliant on software to manage complex vehicle functionalities, making efficient and secure firmware updates essential. Traditional firmware update methods, requiring physical connections through On-Board Diagnostics (OBD) ports, are inconvenient, costly, and time-consuming. Firmware Over-the-Air (FOTA) technology offers a revolutionary solution by enabling wireless updates, reducing operational costs, and enhancing the user experience. This project aims to design and implement an advanced FOTA system tailored for modern vehicles, incorporating the AUTOSAR architecture for scalability and standardization, and utilizing delta updating to minimize firmware update sizes, thereby improving bandwidth efficiency and reducing flashing times. To ensure security, the system integrates the UDS 0x27 protocol for authentication and data integrity during the update process. Communication between Electronic Control Units (ECUs) is achieved using the CAN protocol, while the ESP8266 module and the master ECU communicate via SPI for data transfer. The system's architecture includes key components such as a bootloader, boot manager, and bootloader updater to facilitate seamless firmware updates. The functionality of the system is demonstrated through two applications: a blinking LED and a Lane Keeping Assist (LKA) system, showcasing its versatility in handling critical automotive features. This project represents a significant step forward in automotive technology, offering a user-centric, efficient, and secure solution for automotive firmware management.

Title: Slim attention: cut your context memory in half without loss of accuracy -- K-cache is all you need for MHA

Authors: Nils Graef, Andrew Wasielewski
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.05840
Pdf URL: https://arxiv.org/pdf/2503.05840
Copy Paste: [[2503.05840]] Slim attention: cut your context memory in half without loss of accuracy -- K-cache is all you need for MHA(https://arxiv.org/abs/2503.05840)
Keywords: transformer
Abstract: Slim attention shrinks the context memory size by 2x for transformer models with MHA (multi-head attention), which can speed up inference by up to 2x for large context windows. Slim attention is an exact, mathematically identical implementation of the standard attention mechanism and therefore does not compromise model accuracy. In other words, slim attention losslessly compresses the context memory by a factor of 2. For encoder-decoder transformers, the context memory size can be reduced even further: For the Whisper models for example, slim attention reduces the context memory by 8x, which can speed up token generation by 5x for batch size 64 for example. And for rare cases where the MHA projection dimension is larger than the embedding dimension, the memory can be reduced by a factor of 32 for the T5-11B model for example. See this https URL for code and more transformer tricks, and this https URL for a video about this paper.
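
The abstract does not spell out the mechanism. One way a key-only cache can remain exact, assuming the key projection is square and invertible, is to rebuild the values from the cached keys with a precomputed matrix. The toy check below illustrates that idea in plain Python; the 2x2 matrices and the names `W_K`, `W_V`, and `W_KV` are made up for the example and are not the authors' code.

```python
# Toy check: values can be rebuilt from cached keys when W_K is invertible.
# All matrices are invented 2x2 examples, not taken from the paper.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inverse_2x2(m):
    """Invert a 2x2 matrix; raise if it is singular."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("W_K must be invertible for this trick")
    return [[d / det, -b / det], [-c / det, a / det]]

X = [[1, 0], [0, 1], [2, 3]]   # three token embeddings
W_K = [[2, 1], [1, 1]]         # key projection (square, invertible)
W_V = [[1, 2], [3, 4]]         # value projection

K = matmul(X, W_K)             # the only tensor kept in the cache
V_direct = matmul(X, W_V)      # what a standard KV-cache would also store

W_KV = matmul(inverse_2x2(W_K), W_V)   # computed once, ahead of time
V_from_K = matmul(K, W_KV)             # values recomputed from keys on demand
```

Because `V_from_K` matches `V_direct`, dropping the value cache loses nothing in this toy setting, at the cost of one extra matrix product when the values are needed.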

Title: Extracting and Emulsifying Cultural Explanation to Improve Multilingual Capability of LLMs

Authors: Hamin Koo, Jaehyung Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.05846
Pdf URL: https://arxiv.org/pdf/2503.05846
Copy Paste: [[2503.05846]] Extracting and Emulsifying Cultural Explanation to Improve Multilingual Capability of LLMs(https://arxiv.org/abs/2503.05846)
Keywords: large language model
Abstract: Large Language Models (LLMs) have achieved remarkable success, but their English-centric training data limits performance in non-English languages, highlighting the need for enhancements in their multilingual capabilities. While some work on multilingual prompting methods handles non-English queries by utilizing English translations or restructuring them to more closely align with LLM reasoning patterns, these works often overlook the importance of cultural context, limiting their effectiveness. To address this limitation, we propose EMCEI, a simple yet effective approach that improves LLMs' multilingual capabilities by incorporating cultural context for more accurate and appropriate responses. Specifically, EMCEI follows a two-step process that first extracts relevant cultural context from the LLM's parametric knowledge via prompting. Then, EMCEI employs an LLM-as-Judge mechanism to select the most appropriate response by balancing cultural relevance and reasoning ability. Experiments on diverse multilingual benchmarks show that EMCEI outperforms existing baselines, demonstrating its effectiveness in handling multilingual queries with LLMs.

Title: Encrypted Vector Similarity Computations Using Partially Homomorphic Encryption: Applications and Performance Analysis

Authors: Sefik Serengil, Alper Ozpinar
Subjects: cs.CR, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.05850
Pdf URL: https://arxiv.org/pdf/2503.05850
Copy Paste: [[2503.05850]] Encrypted Vector Similarity Computations Using Partially Homomorphic Encryption: Applications and Performance Analysis(https://arxiv.org/abs/2503.05850)
Keywords: secure, security, privacy, large language model
Abstract: This paper explores the use of partially homomorphic encryption (PHE) for encrypted vector similarity search, with a focus on facial recognition and broader applications like reverse image search, recommendation engines, and large language models (LLMs). While fully homomorphic encryption (FHE) exists, we demonstrate that encrypted cosine similarity can be computed using PHE, offering a more practical alternative. Since PHE does not directly support cosine similarity, we propose a method that normalizes vectors in advance, enabling dot product calculations as a proxy. We also apply min-max normalization to handle negative dimension values. Experiments on the Labeled Faces in the Wild (LFW) dataset use DeepFace's FaceNet128d, FaceNet512d, and VGG-Face (4096d) models in a two-tower setup. Pre-encrypted embeddings are stored in one tower, while an edge device captures images, computes embeddings, and performs encrypted-plaintext dot products via additively homomorphic encryption. We implement this with LightPHE, evaluating Paillier, Damgard-Jurik, and Okamoto-Uchiyama schemes, excluding others due to performance or decryption complexity. Tests at 80-bit and 112-bit security (NIST-secure until 2030) compare PHE against FHE (via TenSEAL), analyzing encryption, decryption, operation time, cosine similarity loss, key/ciphertext sizes. Results show PHE is less computationally intensive, faster, and produces smaller ciphertexts/keys, making it well-suited for memory-constrained environments and real-world privacy-preserving encrypted similarity search.
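
The encrypted-plaintext dot product described above can be illustrated with a toy additively homomorphic scheme. The sketch below is a from-scratch, textbook Paillier with tiny hard-coded primes. It is insecure, it is not the LightPHE API, and every function name in it is hypothetical. It also assumes non-negative inputs, so it sidesteps the min-max normalization step the abstract mentions. It needs Python 3.8 or later for the three-argument `pow` with exponent `-1`.

```python
import math
import random

def keygen(p=1009, q=1013):
    """Toy Paillier keys from two small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # inverse of lam mod n; valid for generator g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt a non-negative integer m < n."""
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (1 + m * n) % n2 * pow(r, n, n2) % n2

def decrypt(n, priv, c):
    lam, mu = priv
    x = pow(c, lam, n * n)
    return (x - 1) // n * mu % n

def encrypted_dot(n, enc_vec, plain_vec):
    """Dot product of an encrypted vector with a plaintext integer vector.

    Multiplying ciphertexts adds plaintexts, and raising a ciphertext to a
    power k multiplies its plaintext by k.
    """
    n2 = n * n
    acc = 1  # a trivial encryption of 0
    for c, y in zip(enc_vec, plain_vec):
        acc = acc * pow(c, y, n2) % n2
    return acc

def quantize(v, scale=100):
    """L2-normalize in advance, then scale to integers (introduces rounding loss)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [round(scale * x / norm) for x in v]

n, priv = keygen()
stored = quantize([1.0, 2.0, 2.0])   # embedding held encrypted on one side
probe = quantize([2.0, 1.0, 2.0])    # plaintext embedding from the edge device
enc_stored = [encrypt(n, x) for x in stored]
enc_score = encrypted_dot(n, enc_stored, probe)
score = decrypt(n, priv, enc_score) / (100 * 100)  # approximate cosine similarity
```

Since both vectors are normalized before encryption, the decrypted dot product approximates cosine similarity up to the rounding introduced by `quantize`.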

Title: This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs

Authors: Lorenz Wolf, Sangwoong Yoon, Ilija Bogunovic
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.05856
Pdf URL: https://arxiv.org/pdf/2503.05856
Copy Paste: [[2503.05856]] This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs(https://arxiv.org/abs/2503.05856)
Keywords: defense, robust, large language model
Abstract: Mixture of large language model (LLM) Agents (MoA) architectures achieve state-of-the-art performance on prominent benchmarks like AlpacaEval 2.0 by leveraging the collaboration of multiple LLMs at inference time. Despite these successes, an evaluation of the safety and reliability of MoA is missing. We present the first comprehensive study of MoA's robustness against deceptive LLM agents that deliberately provide misleading responses. We examine factors like the propagation of deceptive information, model size, and information availability, and uncover critical vulnerabilities. On AlpacaEval 2.0, the popular LLaMA 3.1-70B model achieves a length-controlled Win Rate (LC WR) of 49.2% when coupled with 3-layer MoA (6 LLM agents). However, we demonstrate that introducing only a $\textit{single}$ carefully-instructed deceptive agent into the MoA can reduce performance to 37.9%, effectively nullifying all MoA gains. On QuALITY, a multiple-choice comprehension task, the impact is also severe, with accuracy plummeting by a staggering 48.5%. Inspired in part by the historical Doge of Venice voting process, designed to minimize influence and deception, we propose a range of unsupervised defense mechanisms that recover most of the lost performance.

Title: QG-SMS: Enhancing Test Item Analysis via Student Modeling and Simulation

Authors: Bang Nguyen, Tingting Du, Mengxia Yu, Lawrence Angrave, Meng Jiang
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2503.05888
Pdf URL: https://arxiv.org/pdf/2503.05888
Copy Paste: [[2503.05888]] QG-SMS: Enhancing Test Item Analysis via Student Modeling and Simulation(https://arxiv.org/abs/2503.05888)
Keywords: robust, large language model
Abstract: While the Question Generation (QG) task has been increasingly adopted in educational assessments, its evaluation remains limited by approaches that lack a clear connection to the educational values of test items. In this work, we introduce test item analysis, a method frequently used by educators to assess test question quality, into QG evaluation. Specifically, we construct pairs of candidate questions that differ in quality across dimensions such as topic coverage, item difficulty, item discrimination, and distractor efficiency. We then examine whether existing QG evaluation approaches can effectively distinguish these differences. Our findings reveal significant shortcomings in these approaches with respect to accurately assessing test item quality in relation to student performance. To address this gap, we propose a novel QG evaluation framework, QG-SMS, which leverages Large Language Model for Student Modeling and Simulation to perform test item analysis. As demonstrated in our extensive experiments and human evaluation study, the additional perspectives introduced by the simulated student profiles lead to a more effective and robust assessment of test items.

Title: MastermindEval: A Simple But Scalable Reasoning Benchmark

Authors: Jonas Golde, Patrick Haller, Fabio Barth, Alan Akbik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.05891
Pdf URL: https://arxiv.org/pdf/2503.05891
Copy Paste: [[2503.05891]] MastermindEval: A Simple But Scalable Reasoning Benchmark(https://arxiv.org/abs/2503.05891)
Keywords: large language model
Abstract: Recent advancements in large language models (LLMs) have led to remarkable performance across a wide range of language understanding and mathematical tasks. As a result, increasing attention has been given to assessing the true reasoning capabilities of LLMs, driving research into commonsense, numerical, logical, and qualitative reasoning. However, with the rapid progress of reasoning-focused models such as OpenAI's o1 and DeepSeek's R1, there has been a growing demand for reasoning benchmarks that can keep pace with ongoing model developments. In this paper, we introduce MastermindEval, a simple, scalable, and interpretable deductive reasoning benchmark inspired by the board game Mastermind. Our benchmark supports two evaluation paradigms: (1) agentic evaluation, in which the model autonomously plays the game, and (2) deductive reasoning evaluation, in which the model is given a pre-played game state with only one possible valid code to infer. In our experimental results we (1) find that even easy Mastermind instances are difficult for current models and (2) demonstrate that the benchmark is scalable to possibly more advanced models in the future. Furthermore, we investigate possible reasons why models cannot deduce the final solution and find that current models become less able to deduce the concealed code as the number of statements whose information must be combined increases.
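
The deductive setting described above (a pre-played game state that admits exactly one valid code) can be sketched with the standard Mastermind feedback rule and a brute-force consistency check. This is an illustrative sketch, not the benchmark's code: the function names, the tuple-of-integers encoding, and the tiny two-color game are assumptions.

```python
from collections import Counter
from itertools import product

def feedback(secret, guess):
    """Return (exact, partial): right color in the right place, and
    right color in the wrong place, per the standard Mastermind rule."""
    exact = sum(s == g for s, g in zip(secret, guess))
    overlap = sum((Counter(secret) & Counter(guess)).values())
    return exact, overlap - exact

def consistent_codes(history, n_colors, length):
    """All codes that would have produced every (guess, feedback) pair seen so far."""
    return [
        code
        for code in product(range(n_colors), repeat=length)
        if all(feedback(code, guess) == fb for guess, fb in history)
    ]

# A pre-played state for a tiny game with 2 colors and code length 2.
history = [((0, 0), (1, 0)), ((1, 0), (0, 2))]
candidates = consistent_codes(history, n_colors=2, length=2)
```

Here `candidates` contains a single code, so the state has the "only one possible valid code" property; larger color counts and code lengths scale the search space as `n_colors ** length`.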

Title: Zero-shot Medical Event Prediction Using a Generative Pre-trained Transformer on Electronic Health Records

Authors: Ekaterina Redekop, Zichen Wang, Rushikesh Kulkarni, Mara Pleasure, Aaron Chin, Hamid Reza Hassanzadeh, Brian L. Hill, Melika Emami, William Speier, Corey W. Arnold
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.05893
Pdf URL: https://arxiv.org/pdf/2503.05893
Copy Paste: [[2503.05893]] Zero-shot Medical Event Prediction Using a Generative Pre-trained Transformer on Electronic Health Records(https://arxiv.org/abs/2503.05893)
Keywords: robust, transformer, generative
Abstract: Longitudinal data in electronic health records (EHRs) represent an individual's clinical history through a sequence of codified concepts, including diagnoses, procedures, medications, and laboratory tests. Foundational models, such as generative pre-trained transformers (GPT), can leverage this data to predict future events. While fine-tuning of these models enhances task-specific performance, it is costly, complex, and unsustainable for every target. We show that a foundation model trained on EHRs can perform predictive tasks in a zero-shot manner, eliminating the need for fine-tuning. This study presents the first comprehensive analysis of zero-shot forecasting with GPT-based foundational models in EHRs, introducing a novel pipeline that formulates medical concept prediction as a generative modeling task. Unlike supervised approaches requiring extensive labeled data, our method enables the model to forecast the next medical event purely from pretraining knowledge. We evaluate performance across multiple time horizons and clinical categories, demonstrating the model's ability to capture latent temporal dependencies and complex patient trajectories without task supervision. Model performance for predicting the next medical concept was evaluated using precision and recall metrics, achieving an average top-1 precision of 0.614 and recall of 0.524. For 12 major diagnostic conditions, the model demonstrated strong zero-shot performance, achieving high true positive rates while maintaining low false positives. We demonstrate the power of a foundational EHR GPT model in capturing diverse phenotypes and enabling robust, zero-shot forecasting of clinical outcomes. This capability enhances the versatility of predictive healthcare models and reduces the need for task-specific training, enabling more scalable applications in clinical settings.

Title: IDEA Prune: An Integrated Enlarge-and-Prune Pipeline in Generative Language Model Pretraining

Authors: Yixiao Li, Xianzhi Du, Ajay Jaiswal, Tao Lei, Tuo Zhao, Chong Wang, Jianyu Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.05920
Pdf URL: https://arxiv.org/pdf/2503.05920
Copy Paste: [[2503.05920]] IDEA Prune: An Integrated Enlarge-and-Prune Pipeline in Generative Language Model Pretraining(https://arxiv.org/abs/2503.05920)
Keywords: generative, large language model
Abstract: Recent advancements in large language models have intensified the need for efficient and deployable models within limited inference budgets. Structured pruning pipelines have shown promise in token efficiency compared to training target-size models from scratch. In this paper, we advocate incorporating enlarged model pretraining, which is often ignored in previous works, into pruning. We study the enlarge-and-prune pipeline as an integrated system to address two critical questions: whether it is worth pretraining an enlarged model even when the model is never deployed, and how to optimize the entire pipeline for better pruned models. We propose an integrated enlarge-and-prune pipeline, which combines enlarged model training, pruning, and recovery under a single cosine annealing learning rate schedule. This approach is further complemented by a novel iterative structured pruning method for gradual parameter removal. The proposed method helps to mitigate the knowledge loss caused by the rising learning rate in naive enlarge-and-prune pipelines and enables effective redistribution of model capacity among surviving neurons, facilitating smooth compression and enhanced performance. We conduct comprehensive experiments on compressing 2.8B models to 1.3B with up to 2T tokens in pretraining. The results demonstrate that the integrated approach not only provides insights into the token efficiency of enlarged model pretraining but also achieves superior performance of pruned models.
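
To make the "single cosine annealing learning rate schedule" concrete, the sketch below uses the common cosine annealing formula and reads the learning rate for three phases off one curve, so the rate never rises again between phases. The step counts, peak rate, and phase boundaries are hypothetical and are not taken from the paper.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate at `step` on one cosine curve from lr_max down to lr_min."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Hypothetical phase boundaries sharing a single schedule.
TOTAL_STEPS = 1000
PHASES = {"enlarged training": (0, 600), "pruning": (600, 800), "recovery": (800, 1000)}

for name, (start, end) in PHASES.items():
    lr_start = cosine_annealing_lr(start, TOTAL_STEPS, lr_max=3e-4)
    lr_end = cosine_annealing_lr(end, TOTAL_STEPS, lr_max=3e-4)
    print(f"{name}: lr {lr_start:.2e} -> {lr_end:.2e}")
```

A naive pipeline would instead restart a schedule for each phase, which is where the "rising learning rate" mentioned in the abstract would come from.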

Title: DETQUS: Decomposition-Enhanced Transformers for QUery-focused Summarization

Authors: Yasir Khan, Xinlei Wu, Sangpil Youm, Justin Ho, Aryaan Shaikh, Jairo Garciga, Rohan Sharma, Bonnie J. Dorr
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.05935
Pdf URL: https://arxiv.org/pdf/2503.05935
Copy Paste: [[2503.05935]] DETQUS: Decomposition-Enhanced Transformers for QUery-focused Summarization(https://arxiv.org/abs/2503.05935)
Keywords: transformer, large language model
Abstract: Query-focused tabular summarization is an emerging task in table-to-text generation that synthesizes a summary response from tabular data based on user queries. Traditional transformer-based approaches face challenges due to token limitations and the complexity of reasoning over large tables. To address these challenges, we introduce DETQUS (Decomposition-Enhanced Transformers for QUery-focused Summarization), a system designed to improve summarization accuracy by leveraging tabular decomposition alongside a fine-tuned encoder-decoder model. DETQUS employs a large language model to selectively reduce table size, retaining only query-relevant columns while preserving essential information. This strategy enables more efficient processing of large tables and enhances summary quality. Our approach, equipped with table-based QA model Omnitab, achieves a ROUGE-L score of 0.4437, outperforming the previous state-of-the-art REFACTOR model (ROUGE-L: 0.422). These results highlight DETQUS as a scalable and effective solution for query-focused tabular summarization, offering a structured alternative to more complex architectures.

Title: CASP: Compression of Large Multimodal Models Based on Attention Sparsity

Authors: Mohsen Gholami, Mohammad Akbari, Kevin Cannons, Yong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.05936
Pdf URL: https://arxiv.org/pdf/2503.05936
Copy Paste: [[2503.05936]] CASP: Compression of Large Multimodal Models Based on Attention Sparsity(https://arxiv.org/abs/2503.05936)
Keywords: large language model
Abstract: In this work, we propose an extreme compression technique for Large Multimodal Models (LMMs). While previous studies have explored quantization as an efficient post-training compression method for Large Language Models (LLMs), low-bit compression for multimodal models remains under-explored. The redundant nature of inputs in multimodal models results in a highly sparse attention matrix. We theoretically and experimentally demonstrate that the attention matrix's sparsity bounds the compression error of the Query and Key weight matrices. Based on this, we introduce CASP, a model compression technique for LMMs. Our approach performs a data-aware low-rank decomposition on the Query and Key weight matrix, followed by quantization across all layers based on an optimal bit allocation process. CASP is compatible with any quantization technique and enhances state-of-the-art 2-bit quantization methods (AQLM and QuIP#) by an average of 21% on image- and video-language benchmarks.

Title: Bayesian Fields: Task-driven Open-Set Semantic Gaussian Splatting

Authors: Dominic Maggio, Luca Carlone
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.05949
Pdf URL: https://arxiv.org/pdf/2503.05949
Copy Paste: [[2503.05949]] Bayesian Fields: Task-driven Open-Set Semantic Gaussian Splatting(https://arxiv.org/abs/2503.05949)
Keywords: extraction
Abstract: Open-set semantic mapping requires (i) determining the correct granularity to represent the scene (e.g., how should objects be defined), and (ii) fusing semantic knowledge across multiple 2D observations into an overall 3D reconstruction, ideally with a high-fidelity yet low-memory footprint. While most related works bypass the first issue by grouping together primitives with similar semantics (according to some manually tuned threshold), we recognize that the object granularity is task-dependent, and develop a task-driven semantic mapping approach. To address the second issue, current practice is to average visual embedding vectors over multiple views. Instead, we show the benefits of using a probabilistic approach based on the properties of the underlying visual-language foundation model, and leveraging Bayesian updating to aggregate multiple observations of the scene. The result is Bayesian Fields, a task-driven and probabilistic approach for open-set semantic mapping. To enable high-fidelity objects and a dense scene representation, Bayesian Fields uses 3D Gaussians which we cluster into task-relevant objects, allowing for both easy 3D object extraction and reduced memory usage. We release Bayesian Fields open-source at https://github.com/MIT-SPARK/Bayesian-Fields.
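
As a generic illustration of Bayesian updating over multiple views (as opposed to averaging embeddings), the sketch below fuses per-view likelihoods over a small set of candidate labels. It is textbook Bayes' rule under an assumption of independent observations, not the paper's model; the labels and likelihood values are invented.

```python
def bayes_update(prior, likelihood):
    """One Bayes step: posterior is proportional to prior times likelihood."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("observation has zero probability under every label")
    return [u / total for u in unnormalized]

def fuse_observations(prior, likelihoods):
    """Fold a sequence of per-view likelihood vectors into the prior."""
    posterior = list(prior)
    for likelihood in likelihoods:
        posterior = bayes_update(posterior, likelihood)
    return posterior

# Two candidate labels and two views, each mildly favoring the first label.
labels = ["mug", "bowl"]
prior = [0.5, 0.5]
views = [[0.8, 0.2], [0.6, 0.4]]
posterior = fuse_observations(prior, views)
```

Two weakly informative views compound into a more confident posterior than either view alone, which is the behavior a plain average of the two likelihood vectors would not reproduce.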

Title: A Survey on Tabular Data Generation: Utility, Alignment, Fidelity, Privacy, and Beyond

Authors: Mihaela Cătălina Stoian, Eleonora Giunchiglia, Thomas Lukasiewicz
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.05954
Pdf URL: https://arxiv.org/pdf/2503.05954
Copy Paste: [[2503.05954]] A Survey on Tabular Data Generation: Utility, Alignment, Fidelity, Privacy, and Beyond(https://arxiv.org/abs/2503.05954)
Keywords: privacy, generative
Abstract: Generative modelling has become the standard approach for synthesising tabular data. However, different use cases demand synthetic data to comply with different requirements to be useful in practice. In this survey, we review deep generative modelling approaches for tabular data from the perspective of four types of requirements: utility of the synthetic data, alignment of the synthetic data with domain-specific knowledge, statistical fidelity of the synthetic data distribution compared to the real data distribution, and privacy-preserving capabilities. We group the approaches along two levels of granularity: (i) based on the primary type of requirements they address and (ii) according to the underlying model they utilise. Additionally, we summarise the appropriate evaluation methods for each requirement and the specific characteristics of each model type. Finally, we discuss future directions for the field, along with opportunities to improve the current evaluation methods. Overall, this survey can be seen as a user guide to tabular data generation: helping readers navigate available models and evaluation methods to find those best suited to their needs.

Title: SANDWiCH: Semantical Analysis of Neighbours for Disambiguating Words in Context ad Hoc

Authors: Daniel Guzman-Olivares, Lara Quijano-Sanchez, Federico Liberatore
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.05958
Pdf URL: https://arxiv.org/pdf/2503.05958
Copy Paste: [[2503.05958]] SANDWiCH: Semantical Analysis of Neighbours for Disambiguating Words in Context ad Hoc(https://arxiv.org/abs/2503.05958)
Keywords: generative, large language model
Abstract: The rise of generative chat-based Large Language Models (LLMs) over the past two years has spurred a race to develop systems that promise near-human conversational and reasoning experiences. However, recent studies indicate that the language understanding offered by these models remains limited and far from human-like performance, particularly in grasping the contextual meanings of words, an essential aspect of reasoning. In this paper, we present a simple yet computationally efficient framework for multilingual Word Sense Disambiguation (WSD). Our approach reframes the WSD task as a cluster discrimination analysis over a semantic network refined from BabelNet using group algebra. We validate our methodology across multiple WSD benchmarks, achieving a new state of the art for all languages and tasks, as well as in individual assessments by part of speech. Notably, our model significantly surpasses the performance of current alternatives, even in low-resource languages, while reducing the parameter count by 72%.

Title: Validating LLM-as-a-Judge Systems in the Absence of Gold Labels

Authors: Luke Guerdan, Solon Barocas, Kenneth Holstein, Hanna Wallach, Zhiwei Steven Wu, Alexandra Chouldechova
Subjects: cs.LG, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2503.05965
Pdf URL: https://arxiv.org/pdf/2503.05965
Copy Paste: [[2503.05965]] Validating LLM-as-a-Judge Systems in the Absence of Gold Labels(https://arxiv.org/abs/2503.05965)
Keywords: generative
Abstract: The LLM-as-a-judge paradigm, in which a judge LLM system replaces human raters in rating the outputs of other generative AI (GenAI) systems, has come to play a critical role in scaling and standardizing GenAI evaluations. To validate judge systems, evaluators collect multiple human ratings for each item in a validation corpus, and then aggregate the ratings into a single, per-item gold label rating. High agreement rates between these gold labels and judge system ratings are then taken as a sign of good judge system performance. In many cases, however, items or rating criteria may be ambiguous, or there may be principled disagreement among human raters. In such settings, gold labels may not exist for many of the items. In this paper, we introduce a framework for LLM-as-a-judge validation in the absence of gold labels. We present a theoretical analysis drawing connections between different measures of judge system performance under different rating elicitation and aggregation schemes. We also demonstrate empirically that existing validation approaches can select judge systems that are highly suboptimal, performing as much as 34% worse than the systems selected by alternative approaches that we describe. Based on our findings, we provide concrete recommendations for developing more reliable approaches to LLM-as-a-judge validation.

Title: Generative Multi-Agent Q-Learning for Policy Optimization: Decentralized Wireless Networks

Authors: Talha Bozkus, Urbashi Mitra
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2503.05970
Pdf URL: https://arxiv.org/pdf/2503.05970
Copy Paste: [[2503.05970]] Generative Multi-Agent Q-Learning for Policy Optimization: Decentralized Wireless Networks(https://arxiv.org/abs/2503.05970)
Keywords: generative
Abstract: Q-learning is a widely used reinforcement learning (RL) algorithm for optimizing wireless networks, but faces challenges with large state-spaces. The recently proposed multi-environment mixed Q-learning (MEMQ) algorithm addresses these challenges by employing multiple Q-learning algorithms across multiple synthetically generated, distinct but structurally related environments, so-called digital cousins. In this paper, we propose a novel multi-agent MEMQ (M-MEMQ) for cooperative decentralized wireless networks with multiple networked transmitters (TXs) and base stations (BSs). TXs do not have access to global information (joint state and actions). The new concept of coordinated and uncoordinated states is introduced. In uncoordinated states, TXs act independently to minimize their individual costs and update local Q-functions. In coordinated states, TXs use a Bayesian approach to estimate the joint state and update the joint Q-functions. The cost of information-sharing scales linearly with the number of TXs and is independent of the joint state-action space size. Several theoretical guarantees, including deterministic and probabilistic convergence, bounds on estimation error variance, and the probability of misdetecting the joint states, are given. Numerical simulations show that M-MEMQ outperforms several decentralized and centralized training with decentralized execution (CTDE) multi-agent RL algorithms by achieving 55% lower average policy error (APE), 35% faster convergence, 50% reduced runtime complexity, and 45% less sample complexity. Furthermore, M-MEMQ achieves comparable APE with significantly lower complexity than centralized methods. Simulations validate the theoretical analyses.
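
For reference, the local update that a single transmitter might apply in an uncoordinated state can be sketched as plain tabular Q-learning with costs (so the best next action is the minimum, not the maximum). This is the standard single-agent rule, not M-MEMQ itself; the state and action names, step size, and discount factor are assumptions.

```python
def q_update(Q, state, action, cost, next_state, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step for a cost-minimizing agent.

    Q maps state -> {action: estimated discounted cost}.
    """
    best_next = min(Q[next_state].values())  # cheapest continuation from next_state
    target = cost + gamma * best_next
    Q[state][action] = (1 - alpha) * Q[state][action] + alpha * target
    return Q[state][action]

# Hypothetical two-state, two-action table for one transmitter.
Q = {
    "idle": {"transmit": 0.0, "wait": 0.0},
    "busy": {"transmit": 0.0, "wait": 0.0},
}
first = q_update(Q, "idle", "transmit", cost=1.0, next_state="busy")
second = q_update(Q, "idle", "transmit", cost=1.0, next_state="busy")
```

Each call moves the stored estimate part of the way toward the observed cost plus the discounted best continuation, with `alpha` controlling how far.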

Title: A Real-time Multimodal Transformer Neural Network-powered Wildfire Forecasting System

Authors: Qijun Chen, Shaofan Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.05971
Pdf URL: https://arxiv.org/pdf/2503.05971
Copy Paste: [[2503.05971]] A Real-time Multimodal Transformer Neural Network-powered Wildfire Forecasting System(https://arxiv.org/abs/2503.05971)
Keywords: transformer
Abstract: Due to climate change, extreme wildfires have become one of the most dangerous natural hazards to human civilization. Even though some wildfires may initially be caused by human activity, their spread is mainly determined by environmental factors, for example: (1) weather conditions such as temperature, wind direction and intensity, and moisture levels; (2) the amount and types of dry vegetation in a local area; and (3) topographic or local terrain conditions, which affect how much rain an area gets and how fire dynamics will be constrained or facilitated. Thus, accurately forecasting wildfire occurrence has become one of the most urgent and daunting environmental challenges on a global scale. In this work, we developed a real-time Multimodal Transformer Neural Network machine learning model that combines several advanced artificial intelligence techniques and statistical methods to forecast the occurrence of wildfire at a precise location in real time. The model not only utilizes large-scale data such as hourly weather forecasting data, but also takes into account small-scale topographical data, such as local terrain and vegetation conditions collected from Google Earth images, to determine the probabilities of wildfire occurrence at small scale as well as their timing, synchronized with weather forecast information. Trained on wildfire data in the United States from 1992 to 2015, the multimodal transformer neural network can predict the probability of wildfire occurrence in any small location ($100m^2$) up to 24 hours ahead, based on the real-time weather forecast and the synchronized Google Earth image data.

Title: Is Your Video Language Model a Reliable Judge?

Authors: Ming Liu, Wensheng Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.05977
Pdf URL: https://arxiv.org/pdf/2503.05977
Copy Paste: [[2503.05977]] Is Your Video Language Model a Reliable Judge?(https://arxiv.org/abs/2503.05977)
Keywords: robust
Abstract: As video language models (VLMs) gain more applications in various scenarios, the need for robust and scalable evaluation of their performance becomes increasingly critical. The traditional human expert-based evaluation of VLMs has limitations in consistency and scalability, which sparked interest in automatic methods such as employing VLMs to evaluate VLMs. However, the reliability of VLMs as judges remains underexplored. Existing methods often rely on a single VLM as the evaluator. However, this approach can be unreliable or biased because such a model may lack the ability to fully understand the content and may have inherent biases, ultimately compromising evaluation reliability. A remedy is to apply the principle of collective thoughts, aggregating evaluations from multiple VLMs to enhance reliability. This study investigates the efficacy of such approaches, particularly when the pool of judges includes both reliable and unreliable models. Our findings reveal that incorporating collective judgments from such a mixed pool does not necessarily improve the accuracy of the final evaluation. The inclusion of less reliable judges can introduce noise, undermining the overall reliability of the outcomes. To explore the factors that impact evaluation reliability, we fine-tune an underperforming VLM judge, Video-LLaVA, and observe that improved understanding ability alone is insufficient to make VLM judges more reliable. These findings stress the limitations of collective thought approaches and highlight the need for more advanced methods that can account for the reliability of individual models. Our study promotes the development of more reliable evaluation methods for VLMs.

Title: MagicInfinite: Generating Infinite Talking Videos with Your Words and Voice

Authors: Hongwei Yi, Tian Ye, Shitong Shao, Xuancheng Yang, Jiantong Zhao, Hanzhong Guo, Terrance Wang, Qingyu Yin, Zeke Xie, Lei Zhu, Wei Li, Michael Lingelbach, Daquan Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.05978
Pdf URL: https://arxiv.org/pdf/2503.05978
Copy Paste: [[2503.05978]] MagicInfinite: Generating Infinite Talking Videos with Your Words and Voice(https://arxiv.org/abs/2503.05978)
Keywords: diffusion, transformer
Abstract: We present MagicInfinite, a novel diffusion Transformer (DiT) framework that overcomes traditional portrait animation limitations, delivering high-fidelity results across diverse character types: realistic humans, full-body figures, and stylized anime characters. It supports varied facial poses, including back-facing views, and animates single or multiple characters with input masks for precise speaker designation in multi-character scenes. Our approach tackles key challenges with three innovations: (1) 3D full-attention mechanisms with a sliding window denoising strategy, enabling infinite video generation with temporal coherence and visual quality across diverse character styles; (2) a two-stage curriculum learning scheme, integrating audio for lip sync, text for expressive dynamics, and reference images for identity preservation, enabling flexible multi-modal control over long sequences; and (3) region-specific masks with adaptive loss functions to balance global textual control and local audio guidance, supporting speaker-specific animations. Efficiency is enhanced via our innovative unified step and cfg distillation techniques, achieving a 20x inference speed boost over the base model: generating a 10-second 540x540p video in 10 seconds or 720x720p in 30 seconds on 8 H100 GPUs, without quality loss. Evaluations on our new benchmark demonstrate MagicInfinite's superiority in audio-lip synchronization, identity preservation, and motion naturalness across diverse scenarios. It is publicly available at this https URL, with examples at this https URL.

Title: SINdex: Semantic INconsistency Index for Hallucination Detection in LLMs

Authors: Samir Abdaljalil, Hasan Kurban, Parichit Sharma, Erchin Serpedin, Rachad Atat
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.05980
Pdf URL: https://arxiv.org/pdf/2503.05980
Copy Paste: [[2503.05980]] SINdex: Semantic INconsistency Index for Hallucination Detection in LLMs(https://arxiv.org/abs/2503.05980)
Keywords: large language model
Abstract: Large language models (LLMs) are increasingly deployed across diverse domains, yet they are prone to generating factually incorrect outputs - commonly known as "hallucinations." Among existing mitigation strategies, uncertainty-based methods are particularly attractive due to their ease of implementation, independence from external data, and compatibility with standard LLMs. In this work, we introduce a novel and scalable uncertainty-based semantic clustering framework for automated hallucination detection. Our approach leverages sentence embeddings and hierarchical clustering alongside a newly proposed inconsistency measure, SINdex, to yield more homogeneous clusters and more accurate detection of hallucination phenomena across various LLMs. Evaluations on prominent open- and closed-book QA datasets demonstrate that our method achieves AUROC improvements of up to 9.3% over state-of-the-art techniques. Extensive ablation studies further validate the effectiveness of each component in our framework.
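
The general idea of scoring inconsistency over semantic clusters of sampled answers can be sketched as follows. This is a minimal illustration only: the abstract does not define SINdex, so the entropy score, the greedy threshold grouping (a simple stand-in for hierarchical clustering), the toy 2-D embeddings, and every function name here are assumptions, not the authors' method.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def group_by_similarity(embeddings, threshold=0.9):
    """Greedy threshold grouping (hypothetical stand-in for hierarchical
    clustering): an item joins the first cluster containing a member whose
    cosine similarity reaches the threshold, else it starts a new cluster."""
    clusters = []
    for idx, emb in enumerate(embeddings):
        for cluster in clusters:
            if any(cosine(emb, embeddings[j]) >= threshold for j in cluster):
                cluster.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters

def cluster_entropy(clusters, total):
    """Shannon entropy of cluster sizes; an assumed inconsistency score,
    where higher values mean the sampled answers disagree more."""
    probs = [len(c) / total for c in clusters]
    return -sum(p * math.log(p) for p in probs)

# Toy embeddings of three sampled answers to the same question.
consistent = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.1]]
scattered = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

consistent_score = cluster_entropy(group_by_similarity(consistent), len(consistent))
scattered_score = cluster_entropy(group_by_similarity(scattered), len(scattered))
```

In this toy setup the consistent answers collapse into one cluster and score zero, while the scattered answers form three clusters and score higher; a real pipeline would replace the toy vectors with sentence embeddings.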

Title: Black Box Causal Inference: Effect Estimation via Meta Prediction

Authors: Lucius E.J. Bynum, Aahlad Manas Puli, Diego Herrero-Quevedo, Nhi Nguyen, Carlos Fernandez-Granda, Kyunghyun Cho, Rajesh Ranganath
Subjects: cs.LG, cs.AI, stat.CO, stat.ME, stat.ML
Abstract URL: https://arxiv.org/abs/2503.05985
Pdf URL: https://arxiv.org/pdf/2503.05985
Copy Paste: [[2503.05985]] Black Box Causal Inference: Effect Estimation via Meta Prediction(https://arxiv.org/abs/2503.05985)
Keywords: robust
Abstract: Causal inference and the estimation of causal effects play a central role in decision-making across many areas, including healthcare and economics. Estimating causal effects typically requires an estimator that is tailored to each problem of interest. But developing estimators can take significant effort for even a single causal inference setting. For example, algorithms for regression-based estimators, propensity score methods, and doubly robust methods were designed across several decades to handle causal estimation with observed confounders. Similarly, several estimators have been developed to exploit instrumental variables (IVs), including two-stage least-squares (TSLS), control functions, and the method-of-moments. In this work, we instead frame causal inference as a dataset-level prediction problem, offloading algorithm design to the learning process. The approach we introduce, called black box causal inference (BBCI), builds estimators in a black-box manner by learning to predict causal effects from sampled dataset-effect pairs. We demonstrate accurate estimation of average treatment effects (ATEs) and conditional average treatment effects (CATEs) with BBCI across several causal inference problems with known identification, including problems with less developed estimators.

Title: Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models

Authors: Md Azim Khan, Aryya Gangopadhyay, Jianwu Wang, Robert F. Erbacher
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06003
Pdf URL: https://arxiv.org/pdf/2503.06003
Copy Paste: [[2503.06003]] Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models(https://arxiv.org/abs/2503.06003)
Keywords: robust, extraction
Abstract: Situational awareness applications rely heavily on real-time processing of visual and textual data to provide actionable insights. Vision language models (VLMs) have become essential tools for interpreting complex environments by connecting visual inputs with natural language descriptions. However, these models often face computational challenges, especially when required to perform efficiently in real environments. This research presents a novel vision language model (VLM) framework that leverages frequency domain transformations and low-rank adaptation (LoRA) to enhance feature extraction, scalability, and efficiency. Unlike traditional VLMs, which rely solely on spatial-domain representations, our approach incorporates Discrete Fourier Transform (DFT) based low-rank features while retaining pretrained spatial weights, enabling robust performance in noisy or low visibility scenarios. We evaluated the proposed model on caption generation and Visual Question Answering (VQA) tasks using benchmark datasets with varying levels of Gaussian noise. Quantitative results demonstrate that our model achieves evaluation metrics comparable to state-of-the-art VLMs, such as CLIP ViT-L/14 and SigLIP. Qualitative analysis further reveals that our model provides more detailed and contextually relevant responses, particularly for real-world images captured by a RealSense camera mounted on an Unmanned Ground Vehicle (UGV).

Title: Nearly Optimal Differentially Private ReLU Regression

Authors: Meng Ding, Mingxi Lei, Shaowei Wang, Tianhang Zheng, Di Wang, Jinhui Xu
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.06009
Pdf URL: https://arxiv.org/pdf/2503.06009
Copy Paste: [[2503.06009]] Nearly Optimal Differentially Private ReLU Regression(https://arxiv.org/abs/2503.06009)
Keywords: privacy, attack
Abstract: In this paper, we investigate one of the most fundamental nonconvex learning problems, ReLU regression, in the Differential Privacy (DP) model. Previous studies on private ReLU regression heavily rely on stringent assumptions, such as constant bounded norms for feature vectors and labels. We relax these assumptions to a more standard setting, where data can be i.i.d. sampled from $O(1)$-sub-Gaussian distributions. We first show that when $\varepsilon = \tilde{O}(\sqrt{\frac{1}{N}})$ and there is some public data, it is possible to achieve an upper bound of $\tilde{O}(\frac{d^2}{N^2 \varepsilon^2})$ for the excess population risk in $(\varepsilon, \delta)$-DP, where $d$ is the dimension and $N$ is the number of data samples. Moreover, we relax the requirement of $\varepsilon$ and public data by proposing and analyzing a one-pass mini-batch Generalized Linear Model Perceptron algorithm (DP-MBGLMtron). Additionally, using the tracing attack argument technique, we demonstrate that the minimax rate of the estimation error for $(\varepsilon, \delta)$-DP algorithms is lower bounded by $\Omega(\frac{d^2}{N^2 \varepsilon^2})$. This shows that DP-MBGLMtron achieves the optimal utility bound up to logarithmic factors. Experiments further support our theoretical results.

Title: Intent-Aware Self-Correction for Mitigating Social Biases in Large Language Models

Authors: Panatchakorn Anantaprayoon, Masahiro Kaneko, Naoaki Okazaki
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06011
Pdf URL: https://arxiv.org/pdf/2503.06011
Copy Paste: [[2503.06011]] Intent-Aware Self-Correction for Mitigating Social Biases in Large Language Models(https://arxiv.org/abs/2503.06011)
Keywords: robust, large language model
Abstract: Self-Correction based on feedback improves the output quality of Large Language Models (LLMs). Moreover, as Self-Correction functions like the slow and conscious System-2 thinking from cognitive psychology's perspective, it can potentially reduce LLMs' social biases. LLMs are sensitive to contextual ambiguities and inconsistencies; therefore, explicitly communicating their intentions during interactions when applying Self-Correction for debiasing is crucial. In this study, we demonstrate that clarifying intentions is essential for effectively reducing biases in LLMs through Self-Correction. We divide the components needed for Self-Correction into three parts: instruction, response, and feedback, and clarify intentions at each component. We incorporate an explicit debiasing prompt to convey the intention of bias mitigation from the instruction for response generation. In the response, we use Chain-of-Thought (CoT) to clarify the reasoning process. In the feedback, we define evaluation aspects necessary for debiasing and propose clear feedback through multi-aspect critiques and scoring. Through experiments, we demonstrate that self-correcting CoT responses obtained from a debiasing prompt based on multi-aspect feedback can reduce biased responses more robustly and consistently than the baselines. We also find variation in debiasing efficacy when using models with different bias levels or separating models for response and feedback generation.

Title: End-to-End HOI Reconstruction Transformer with Graph-based Encoding

Authors: Zhenrong Wang, Qi Zheng, Sihan Ma, Maosheng Ye, Yibing Zhan, Dongjiang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06012
Pdf URL: https://arxiv.org/pdf/2503.06012
Copy Paste: [[2503.06012]] End-to-End HOI Reconstruction Transformer with Graph-based Encoding(https://arxiv.org/abs/2503.06012)
Keywords: transformer
Abstract: With the diversification of human-object interaction (HOI) applications and the success of capturing human meshes, HOI reconstruction has gained widespread attention. Existing mainstream HOI reconstruction methods often rely on explicitly modeling interactions between humans and objects. However, this approach leads to a natural conflict between 3D mesh reconstruction, which emphasizes global structure, and fine-grained contact reconstruction, which focuses on local details. To address the limitations of explicit modeling, we propose the End-to-End HOI Reconstruction Transformer with Graph-based Encoding (HOI-TG). It implicitly learns the interaction between humans and objects by leveraging self-attention mechanisms. Within the transformer architecture, we devise graph residual blocks to aggregate the topology among vertices of different spatial structures. This dual focus effectively balances global and local representations. Without bells and whistles, HOI-TG achieves state-of-the-art performance on BEHAVE and InterCap datasets. Particularly on the challenging InterCap dataset, our method improves the reconstruction results for human and object meshes by 8.9% and 8.6%, respectively.

Title: Towards Ambiguity-Free Spatial Foundation Model: Rethinking and Decoupling Depth Ambiguity

Authors: Xiaohao Xu, Feng Xue, Xiang Li, Haowei Li, Shusheng Yang, Tianyi Zhang, Matthew Johnson-Roberson, Xiaonan Huang
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2503.06014
Pdf URL: https://arxiv.org/pdf/2503.06014
Copy Paste: [[2503.06014]] Towards Ambiguity-Free Spatial Foundation Model: Rethinking and Decoupling Depth Ambiguity(https://arxiv.org/abs/2503.06014)
Keywords: robust
Abstract: Depth ambiguity is a fundamental challenge in spatial scene understanding, especially in transparent scenes where single-depth estimates fail to capture full 3D structure. Existing models, limited to deterministic predictions, overlook real-world multi-layer depth. To address this, we introduce a paradigm shift from single-prediction to multi-hypothesis spatial foundation models. We first present MD-3k, a benchmark exposing depth biases in expert and foundational models through multi-layer spatial relationship labels and new metrics. To resolve depth ambiguity, we propose Laplacian Visual Prompting (LVP), a training-free spectral prompting technique that extracts hidden depth from pre-trained models via Laplacian-transformed RGB inputs. By integrating LVP-inferred depth with standard RGB-based estimates, our approach elicits multi-layer depth without model retraining. Extensive experiments validate the effectiveness of LVP in zero-shot multi-layer depth estimation, unlocking more robust and comprehensive geometry-conditioned visual generation, 3D-grounded spatial reasoning, and temporally consistent video-level depth inference. Our benchmark and code will be available at this https URL.

Title: GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices

Authors: Xudong Lu, Yinghao Chen, Renshou Wu, Haohao Gao, Xi Chen, Xue Yang, Xiangyu Zhao, Aojun Zhou, Fangyuan Li, Yafei Wen, Xiaoxin Chen, Shuai Ren, Hongsheng Li
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2503.06019
Pdf URL: https://arxiv.org/pdf/2503.06019
Copy Paste: [[2503.06019]] GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices(https://arxiv.org/abs/2503.06019)
Keywords: transformer, large language model
Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have enabled their deployment on mobile devices. However, challenges persist in maintaining strong language capabilities and ensuring hardware compatibility, both of which are crucial for user experience and practical deployment efficiency. In our deployment process, we observe that existing MLLMs often face performance degradation on pure language tasks, and the current NPU platforms on smartphones do not support the MoE architecture, which is commonly used to preserve pure language capabilities during multimodal training. To address these issues, we systematically analyze methods to maintain pure language capabilities during the training of MLLMs, focusing on both training data and model architecture aspects. Based on these analyses, we propose GenieBlue, an efficient MLLM structural design that integrates both linguistic and multimodal capabilities for LLMs on mobile devices. GenieBlue freezes the original LLM parameters during MLLM training to maintain pure language capabilities. It acquires multimodal capabilities by duplicating specific transformer blocks for full fine-tuning and integrating lightweight LoRA modules. This approach preserves language capabilities while achieving comparable multimodal performance through extensive training. Deployed on smartphone NPUs, GenieBlue demonstrates efficiency and practicality for applications on mobile devices.

Title: FedEM: A Privacy-Preserving Framework for Concurrent Utility Preservation in Federated Learning

Authors: Mingcong Xu, Xiaojin Zhang, Wei Chen, Hai Jin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06021
Pdf URL: https://arxiv.org/pdf/2503.06021
Copy Paste: [[2503.06021]] FedEM: A Privacy-Preserving Framework for Concurrent Utility Preservation in Federated Learning(https://arxiv.org/abs/2503.06021)
Keywords: privacy, protect, attack, robust, federate
Abstract: Federated Learning (FL) enables collaborative training of models across distributed clients without sharing local data, addressing privacy concerns in decentralized systems. However, the gradient-sharing process exposes private data to potential leakage, compromising FL's privacy guarantees in real-world applications. To address this issue, we propose Federated Error Minimization (FedEM), a novel algorithm that incorporates controlled perturbations through adaptive noise injection. This mechanism effectively mitigates gradient leakage attacks while maintaining model performance. Experimental results on benchmark datasets demonstrate that FedEM significantly reduces privacy risks and preserves model accuracy, achieving a robust balance between privacy protection and utility preservation.

Title: Data-Free Black-Box Federated Learning via Zeroth-Order Gradient Estimation

Authors: Xinge Ma, Jin Wang, Xuejie Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06028
Pdf URL: https://arxiv.org/pdf/2503.06028
Copy Paste: [[2503.06028]] Data-Free Black-Box Federated Learning via Zeroth-Order Gradient Estimation(https://arxiv.org/abs/2503.06028)
Keywords: privacy, protect, federate, data-free
Abstract: Federated learning (FL) enables decentralized clients to collaboratively train a global model under the orchestration of a central server without exposing their individual data. However, the iterative exchange of model parameters between the server and clients imposes heavy communication burdens, risks potential privacy leakage, and even precludes collaboration among heterogeneous clients. Distillation-based FL tackles these challenges by exchanging low-dimensional model outputs rather than model parameters, yet it highly relies on a task-relevant auxiliary dataset that is often not available in practice. Data-free FL attempts to overcome this limitation by training a server-side generator to directly synthesize task-specific data samples for knowledge transfer. However, the update rule of the generator requires clients to share on-device models for white-box access, which greatly compromises the advantages of distillation-based FL. This motivates us to explore a data-free and black-box FL framework via Zeroth-order Gradient Estimation (FedZGE), which estimates the gradients after flowing through on-device models in a black-box optimization manner to complete the training of the generator in terms of fidelity, transferability, diversity, and equilibrium, without involving any auxiliary data or sharing any model parameters, thus combining the advantages of both distillation-based FL and data-free FL. Experiments on large-scale image classification datasets and network architectures demonstrate the superiority of FedZGE in terms of data heterogeneity, model heterogeneity, communication efficiency, and privacy protection.
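
Zeroth-order gradient estimation in general can be sketched as below: the gradient of a function is approximated from function evaluations alone, with no access to its internals. This is a generic two-point random-direction estimator offered as a minimal sketch, not FedZGE's actual update rule; the function names, the quadratic stand-in for a black-box model, and the parameter values are all assumptions.

```python
import random

def zeroth_order_gradient(f, x, num_directions=20000, mu=1e-3, seed=0):
    """Estimate the gradient of a black-box scalar function f at x by
    averaging central finite differences along random Gaussian directions.
    Only evaluations of f are used; no analytic gradient is required."""
    rng = random.Random(seed)
    dim = len(x)
    grad = [0.0] * dim
    for _ in range(num_directions):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        # Directional slope of f along u, from two queries.
        slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
        for i in range(dim):
            grad[i] += slope * u[i]
    return [g / num_directions for g in grad]

def black_box(x):
    """Hypothetical stand-in for a model we can query but not inspect."""
    return sum(xi * xi for xi in x)

# The true gradient of sum(x_i^2) at this point is [2.0, -4.0, 6.0].
estimate = zeroth_order_gradient(black_box, [1.0, -2.0, 3.0])
```

The estimate is noisy and converges to the true gradient only as the number of sampled directions grows, which is the usual cost of querying a model as a black box.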

Title: SmartBench: Is Your LLM Truly a Good Chinese Smartphone Assistant?

Authors: Xudong Lu, Haohao Gao, Renshou Wu, Shuai Ren, Xiaoxin Chen, Hongsheng Li, Fangyuan Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06029
Pdf URL: https://arxiv.org/pdf/2503.06029
Copy Paste: [[2503.06029]] SmartBench: Is Your LLM Truly a Good Chinese Smartphone Assistant?(https://arxiv.org/abs/2503.06029)
Keywords: extraction, large language model
Abstract: Large Language Models (LLMs) have become integral to daily life, especially advancing as intelligent assistants through on-device deployment on smartphones. However, existing LLM evaluation benchmarks predominantly focus on objective tasks like mathematics and coding in English, which do not necessarily reflect the practical use cases of on-device LLMs in real-world mobile scenarios, especially for Chinese users. To address these gaps, we introduce SmartBench, the first benchmark designed to evaluate the capabilities of on-device LLMs in Chinese mobile contexts. We analyze functionalities provided by representative smartphone manufacturers and divide them into five categories: text summarization, text Q&A, information extraction, content creation, and notification management, further detailed into 20 specific tasks. For each task, we construct high-quality datasets comprising 50 to 200 question-answer pairs that reflect everyday mobile interactions, and we develop automated evaluation criteria tailored for these tasks. We conduct comprehensive evaluations of on-device LLMs and MLLMs using SmartBench and also assess their performance after quantized deployment on real smartphone NPUs. Our contributions provide a standardized framework for evaluating on-device LLMs in Chinese, promoting further development and optimization in this critical area. Code and data will be available at this https URL.

Title: Towards Universal Text-driven CT Image Segmentation

Authors: Yuheng Li, Yuxiang Lai, Maria Thor, Deborah Marshall, Zachary Buchwald, David S. Yu, Xiaofeng Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06030
Pdf URL: https://arxiv.org/pdf/2503.06030
Copy Paste: [[2503.06030]] Towards Universal Text-driven CT Image Segmentation(https://arxiv.org/abs/2503.06030)
Keywords: transformer, large language model, segmentation
Abstract: Computed tomography (CT) is extensively used for accurate visualization and segmentation of organs and lesions. While deep learning models such as convolutional neural networks (CNNs) and vision transformers (ViTs) have significantly improved CT image analysis, their performance often declines when applied to diverse, real-world clinical data. Although foundation models offer a broader and more adaptable solution, their potential is limited due to the challenge of obtaining large-scale, voxel-level annotations for medical images. In response to these challenges, prompting-based models using visual or text prompts have emerged. Visual-prompting methods, such as the Segment Anything Model (SAM), still require significant manual input and can introduce ambiguity when applied to clinical scenarios. Instead, foundation models that use text prompts offer a more versatile and clinically relevant approach. Notably, current text-prompt models, such as the CLIP-Driven Universal Model, are limited to text prompts already encountered during training and struggle to process the complex and diverse scenarios of real-world clinical applications. Instead of fine-tuning models trained from natural imaging, we propose OpenVocabCT, a vision-language model pretrained on large-scale 3D CT images for universal text-driven segmentation. Using the large-scale CT-RATE dataset, we decompose the diagnostic reports into fine-grained, organ-level descriptions using large language models for multi-granular contrastive learning. We evaluate our OpenVocabCT on downstream segmentation tasks across nine public datasets for organ and tumor segmentation, demonstrating the superior performance of our model compared to existing methods. All code, datasets, and models will be publicly released at this https URL.

Title: A Label-Free High-Precision Residual Moveout Picking Method for Travel Time Tomography based on Deep Learning

Authors: Hongtao Wang, Jiandong Liang, Lei Wang, Shuaizhe Liang, Jinping Zhu, Chunxia Zhang, Jiangshe Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06038
Pdf URL: https://arxiv.org/pdf/2503.06038
Copy Paste: [[2503.06038]] A Label-Free High-Precision Residual Moveout Picking Method for Travel Time Tomography based on Deep Learning(https://arxiv.org/abs/2503.06038)
Keywords: robust, segmentation
Abstract: Residual moveout (RMO) provides critical information for travel time tomography. The current industry-standard method for fitting RMO involves scanning high-order polynomial equations. However, this analytical approach does not accurately capture local saltation, leading to low iteration efficiency in tomographic inversion. Supervised learning-based image segmentation methods for picking can effectively capture local variations; however, they encounter challenges such as a scarcity of reliable training samples and the high complexity of post-processing. To address these issues, this study proposes a deep learning-based cascade picking method. It distinguishes accurate and robust RMOs using a segmentation network and a post-processing technique based on trend regression. Additionally, a data synthesis method is introduced, enabling the segmentation network to be trained on synthetic datasets for effective picking in field data. Furthermore, a set of metrics is proposed to quantify the quality of automatically picked RMOs. Experimental results based on both model and real data demonstrate that, compared to semblance-based methods, our approach achieves greater picking density and accuracy.

Title: Mitigating Memorization in LLMs using Activation Steering

Authors: Manan Suri, Nishit Anand, Amisha Bhaskar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06040
Pdf URL: https://arxiv.org/pdf/2503.06040
Copy Paste: [[2503.06040]] Mitigating Memorization in LLMs using Activation Steering(https://arxiv.org/abs/2503.06040)
Keywords: privacy, large language model
Abstract: The memorization of training data by Large Language Models (LLMs) poses significant risks, including privacy leaks and the regurgitation of copyrighted content. Activation steering, a technique that directly intervenes in model activations, has emerged as a promising approach for manipulating LLMs. In this work, we explore the effectiveness of activation steering in reducing memorization while preserving generalization capabilities. We conduct empirical evaluations using a controlled memorization benchmark of literary material and demonstrate that our method successfully suppresses memorized content with minimal degradation in model performance in Gemma. Additionally, we analyze the trade-offs between suppression effectiveness and linguistic fluency, highlighting the advantages and limitations of activation-based interventions. Our findings contribute to ongoing efforts in developing safer and more privacy-preserving LLMs by providing a practical and efficient mechanism to mitigate unintended memorization.
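
Activation steering in its simplest generic form shifts a hidden activation along a chosen direction. The sketch below is illustrative only: the abstract does not say how the steering direction is obtained or which layers are modified, so the function name, the vectors, and the strength value are assumptions.

```python
import math

def steer(hidden, direction, alpha):
    """Shift a hidden activation by alpha along the unit-normalized
    steering direction. A negative alpha moves the activation away from
    the direction, for example to suppress an associated behavior."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    return [h + alpha * u for h, u in zip(hidden, unit)]

# Toy 2-D activation and a hypothetical direction.
hidden_state = [1.0, 2.0]
steering_direction = [0.0, 2.0]
steered = steer(hidden_state, steering_direction, alpha=-0.5)
```

Normalizing the direction keeps alpha interpretable as the size of the shift. The trade-off the abstract mentions, between suppression and fluency, corresponds to how large that shift is made.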

Title: Improving SAM for Camouflaged Object Detection via Dual Stream Adapters

Authors: Jiaming Liu, Linghe Kong, Guihai Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06042
Pdf URL: https://arxiv.org/pdf/2503.06042
Copy Paste: [[2503.06042]] Improving SAM for Camouflaged Object Detection via Dual Stream Adapters(https://arxiv.org/abs/2503.06042)
Keywords: segmentation
Abstract: The Segment Anything Model (SAM) has shown impressive general-purpose segmentation performance on natural images, but its performance on camouflaged object detection (COD) is unsatisfactory. In this paper, we propose SAM-COD, which performs camouflaged object detection for RGB-D inputs. While keeping the SAM architecture intact, dual stream adapters are expanded on the image encoder to learn potential complementary information from RGB images and depth images, and the mask decoder and its depth replica are fine-tuned to perform dual-stream mask prediction. In practice, the dual stream adapters are embedded into the attention block of the image encoder in a parallel manner to facilitate the refinement and correction of the two types of image embeddings. To mitigate channel discrepancies arising from dual stream embeddings that do not directly interact with each other, we augment the association of dual stream embeddings using bidirectional knowledge distillation including a model distiller and a modal distiller. In addition, to predict the masks for RGB and depth attention maps, we hybridize the two types of image embeddings which are jointly learned with the prompt embeddings to update the initial prompt, and then feed them into the mask decoders to synchronize the consistency of image embeddings and prompt embeddings. Experimental results on four COD benchmarks show that our SAM-COD achieves excellent detection performance gains over SAM and achieves state-of-the-art results with a given fine-tuning paradigm.

Title: Constructions are Revealed in Word Distributions

Authors: Joshua Rozner, Leonie Weissweiler, Kyle Mahowald, Cory Shain
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06048
Pdf URL: https://arxiv.org/pdf/2503.06048
Copy Paste: [[2503.06048]] Constructions are Revealed in Word Distributions(https://arxiv.org/abs/2503.06048)
Keywords: robust
Abstract: Construction grammar posits that constructions (form-meaning pairings) are acquired through experience with language (the distributional learning hypothesis). But how much information about constructions does this distribution actually contain? Corpus-based analyses provide some answers, but text alone cannot answer counterfactual questions about what caused a particular word to occur. For that, we need computable models of the distribution over strings -- namely, pretrained language models (PLMs). Here we treat a RoBERTa model as a proxy for this distribution and hypothesize that constructions will be revealed within it as patterns of statistical affinity. We support this hypothesis experimentally: many constructions are robustly distinguished, including (i) hard cases where semantically distinct constructions are superficially similar, as well as (ii) schematic constructions, whose "slots" can be filled by abstract word classes. Despite this success, we also provide qualitative evidence that statistical affinity alone may be insufficient to identify all constructions from text. Thus, statistical affinity is likely an important, but partial, signal available to learners.

Title: Fine-Grained Bias Detection in LLM: Enhancing detection mechanisms for nuanced biases

Authors: Suvendu Mohanty
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06054
Pdf URL: https://arxiv.org/pdf/2503.06054
Copy Paste: [[2503.06054]] Fine-Grained Bias Detection in LLM: Enhancing detection mechanisms for nuanced biases(https://arxiv.org/abs/2503.06054)
Keywords: fair, interpretability, generative, large language model
Abstract: Recent advancements in Artificial Intelligence, particularly in Large Language Models (LLMs), have transformed natural language processing by improving generative capabilities. However, detecting biases embedded within these models remains a challenge. Subtle biases can propagate misinformation, influence decision-making, and reinforce stereotypes, raising ethical concerns. This study presents a detection framework to identify nuanced biases in LLMs. The approach integrates contextual analysis, interpretability via attention mechanisms, and counterfactual data augmentation to capture hidden biases across linguistic contexts. The methodology employs contrastive prompts and synthetic datasets to analyze model behaviour across cultural, ideological, and demographic scenarios. Quantitative analysis using benchmark datasets and qualitative assessments through expert reviews validate the effectiveness of the framework. Results show improvements in detecting subtle biases compared to conventional methods, which often fail to highlight disparities in model responses to race, gender, and socio-political contexts. The framework also identifies biases arising from imbalances in training data and model architectures. Continuous user feedback ensures adaptability and refinement. This research underscores the importance of proactive bias mitigation strategies and calls for collaboration between policymakers, AI developers, and regulators. The proposed detection mechanisms enhance model transparency and support responsible LLM deployment in sensitive applications such as education, legal systems, and healthcare. Future work will focus on real-time bias monitoring and cross-linguistic generalization to improve fairness and inclusivity in AI-driven communication tools.

Title: Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices

Authors: Junyan Lin, Haoran Chen, Yue Fan, Yingqi Fan, Xin Jin, Hui Su, Jinlan Fu, Xiaoyu Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06063
Pdf URL: https://arxiv.org/pdf/2503.06063
Copy Paste: [[2503.06063]] Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices(https://arxiv.org/abs/2503.06063)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) have made significant advancements in recent years, with visual features playing an increasingly critical role in enhancing model performance. However, the integration of multi-layer visual features in MLLMs remains underexplored, particularly with regard to optimal layer selection and fusion strategies. Existing methods often rely on arbitrary design choices, leading to suboptimal outcomes. In this paper, we systematically investigate two core aspects of multi-layer visual feature fusion: (1) selecting the most effective visual layers and (2) identifying the best fusion approach with the language model. Our experiments reveal that while combining visual features from multiple stages improves generalization, incorporating additional features from the same stage typically leads to diminished performance. Furthermore, we find that direct fusion of multi-layer visual features at the input stage consistently yields superior and more stable performance across various configurations. We make all our code publicly available: this https URL.

Title: TransParking: A Dual-Decoder Transformer Framework with Soft Localization for End-to-End Automatic Parking

Authors: Hangyu Du, Chee-Meng Chew
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06071
Pdf URL: https://arxiv.org/pdf/2503.06071
Copy Paste: [[2503.06071]] TransParking: A Dual-Decoder Transformer Framework with Soft Localization for End-to-End Automatic Parking(https://arxiv.org/abs/2503.06071)
Keywords: transformer
Abstract: In recent years, fully differentiable end-to-end autonomous driving systems have become a research hotspot in the field of intelligent transportation. Among various research directions, automatic parking is particularly critical as it aims to enable precise vehicle parking in complex environments. In this paper, we present a purely vision-based transformer model for end-to-end automatic parking, trained using expert trajectories. Given camera-captured data as input, the proposed model directly outputs future trajectory coordinates. Experimental results demonstrate that the various errors of our model have decreased by approximately 50% in comparison with the current state-of-the-art end-to-end trajectory prediction algorithm of the same type. Our approach thus provides an effective solution for fully differentiable automatic parking.

Title: A Survey on Post-training of Large Language Models

Authors: Guiyao Tie, Zeli Zhao, Dingjie Song, Fuyang Wei, Rong Zhou, Yurou Dai, Wen Yin, Zhejian Yang, Jiangyue Yan, Yao Su, Zhenhan Dai, Yifeng Xie, Yihan Cao, Lichao Sun, Pan Zhou, Lifang He, Hechang Chen, Yu Zhang, Qingsong Wen, Tianming Liu, Neil Zhenqiang Gong, Jiliang Tang, Caiming Xiong, Heng Ji, Philip S. Yu, Jianfeng Gao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06072
Pdf URL: https://arxiv.org/pdf/2503.06072
Copy Paste: [[2503.06072]] A Survey on Post-training of Large Language Models(https://arxiv.org/abs/2503.06072)
Keywords: robust, large language model
Abstract: The emergence of Large Language Models (LLMs) has fundamentally transformed natural language processing, making them indispensable across domains ranging from conversational systems to scientific exploration. However, their pre-trained architectures often reveal limitations in specialized contexts, including restricted reasoning capacities, ethical uncertainties, and suboptimal domain-specific performance. These challenges necessitate advanced post-training language models (PoLMs) to address these shortcomings, such as OpenAI-o1/o3 and DeepSeek-R1 (collectively known as Large Reasoning Models, or LRMs). This paper presents the first comprehensive survey of PoLMs, systematically tracing their evolution across five core paradigms: Fine-tuning, which enhances task-specific accuracy; Alignment, which ensures alignment with human preferences; Reasoning, which advances multi-step inference despite challenges in reward design; Efficiency, which optimizes resource utilization amidst increasing complexity; and Integration and Adaptation, which extend capabilities across diverse modalities while addressing coherence issues. Charting progress from ChatGPT's foundational alignment strategies to DeepSeek-R1's innovative reasoning advancements, we illustrate how PoLMs leverage datasets to mitigate biases, deepen reasoning capabilities, and enhance domain adaptability. Our contributions include a pioneering synthesis of PoLM evolution, a structured taxonomy categorizing techniques and datasets, and a strategic agenda emphasizing the role of LRMs in improving reasoning proficiency and domain flexibility. As the first survey of its scope, this work consolidates recent PoLM advancements and establishes a rigorous intellectual framework for future research, fostering the development of LLMs that excel in precision, ethical robustness, and versatility across scientific and societal applications.

Title: GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images

Authors: Xiang Lan, Feng Wu, Kai He, Qinghao Zhao, Shenda Hong, Mengling Feng
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.06073
Pdf URL: https://arxiv.org/pdf/2503.06073
Copy Paste: [[2503.06073]] GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images(https://arxiv.org/abs/2503.06073)
Keywords: explainability, large language model
Abstract: While recent multimodal large language models (MLLMs) have advanced automated ECG interpretation, they still face two key limitations: (1) insufficient multimodal synergy between time series signals and visual ECG representations, and (2) limited explainability in linking diagnoses to granular waveform evidence. We introduce GEM, the first MLLM unifying ECG time series, 12-lead ECG images and text for grounded and clinician-aligned ECG interpretation. GEM enables feature-grounded analysis, evidence-driven reasoning, and a clinician-like diagnostic process through three core innovations: a dual-encoder framework extracting complementary time series and image features, cross-modal alignment for effective multimodal understanding, and knowledge-guided instruction generation for generating high-granularity grounding data (ECG-Grounding) linking diagnoses to measurable parameters (e.g., QRS/PR Intervals). Additionally, we propose the Grounded ECG Understanding task, a clinically motivated benchmark designed to comprehensively assess the MLLM's capability in grounded ECG understanding. Experimental results on both existing and our proposed benchmarks show GEM significantly improves predictive performance (CSN 7.4% ↑), explainability (22.7% ↑), and grounding (24.8% ↑), making it more suitable for real-world clinical applications. GitHub repository: this https URL

Title: Towards Conversational AI for Disease Management

Authors: Anil Palepu, Valentin Liévin, Wei-Hung Weng, Khaled Saab, David Stutz, Yong Cheng, Kavita Kulkarni, S. Sara Mahdavi, Joëlle Barral, Dale R. Webster, Katherine Chou, Avinatan Hassidim, Yossi Matias, James Manyika, Ryutaro Tanno, Vivek Natarajan, Adam Rodman, Tao Tu, Alan Karthikesalingam, Mike Schaekermann
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06074
Pdf URL: https://arxiv.org/pdf/2503.06074
Copy Paste: [[2503.06074]] Towards Conversational AI for Disease Management(https://arxiv.org/abs/2503.06074)
Keywords: large language model
Abstract: While large language models (LLMs) have shown promise in diagnostic dialogue, their capabilities for effective management reasoning - including disease progression, therapeutic response, and safe medication prescription - remain under-explored. We advance the previously demonstrated diagnostic capabilities of the Articulate Medical Intelligence Explorer (AMIE) through a new LLM-based agentic system optimised for clinical management and dialogue, incorporating reasoning over the evolution of disease and multiple patient visit encounters, response to therapy, and professional competence in medication prescription. To ground its reasoning in authoritative clinical knowledge, AMIE leverages Gemini's long-context capabilities, combining in-context retrieval with structured reasoning to align its output with relevant and up-to-date clinical practice guidelines and drug formularies. In a randomized, blinded virtual Objective Structured Clinical Examination (OSCE) study, AMIE was compared to 21 primary care physicians (PCPs) across 100 multi-visit case scenarios designed to reflect UK NICE Guidance and BMJ Best Practice guidelines. AMIE was non-inferior to PCPs in management reasoning as assessed by specialist physicians and scored better in both preciseness of treatments and investigations, and in its alignment with and grounding of management plans in clinical guidelines. To benchmark medication reasoning, we developed RxQA, a multiple-choice question benchmark derived from two national drug formularies (US, UK) and validated by board-certified pharmacists. While AMIE and PCPs both benefited from the ability to access external drug information, AMIE outperformed PCPs on higher difficulty questions. While further research would be needed before real-world translation, AMIE's strong performance across evaluations marks a significant step towards conversational AI as a tool in disease management.

Title: An Empirical Study of Causal Relation Extraction Transfer: Design and Data

Authors: Sydney Anuyah, Jack Vanschaik, Palak Jain, Sawyer Lehman, Sunandan Chakraborty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06076
Pdf URL: https://arxiv.org/pdf/2503.06076
Copy Paste: [[2503.06076]] An Empirical Study of Causal Relation Extraction Transfer: Design and Data(https://arxiv.org/abs/2503.06076)
Keywords: extraction
Abstract: We conduct an empirical analysis of neural network architectures and data transfer strategies for causal relation extraction. By conducting experiments with various contextual embedding layers and architectural components, we show that a relatively straightforward BioBERT-BiGRU relation extraction model generalizes better than other architectures across varying web-based sources and annotation strategies. Furthermore, we introduce a metric for evaluating transfer performance, $F1_{phrase}$ that emphasizes noun phrase localization rather than directly matching target tags. Using this metric, we can conduct data transfer experiments, ultimately revealing that augmentation with data with varying domains and annotation styles can improve performance. Data augmentation is especially beneficial when an adequate proportion of implicitly and explicitly causal sentences are included.

Title: Biased Federated Learning under Wireless Heterogeneity

Authors: Muhammad Faraz Ul Abrar, Nicolò Michelusi
Subjects: cs.LG, cs.IT, eess.SP
Abstract URL: https://arxiv.org/abs/2503.06078
Pdf URL: https://arxiv.org/pdf/2503.06078
Copy Paste: [[2503.06078]] Biased Federated Learning under Wireless Heterogeneity(https://arxiv.org/abs/2503.06078)
Keywords: federate
Abstract: Federated learning (FL) has emerged as a promising framework for distributed learning, enabling collaborative model training without sharing private data. Existing wireless FL works primarily adopt two communication strategies: (1) over-the-air (OTA) computation, which exploits wireless signal superposition for simultaneous gradient aggregation, and (2) digital communication, which allocates orthogonal resources for gradient uploads. Prior works on both schemes typically assume homogeneous wireless conditions (equal path loss across devices) to enforce zero-bias updates or permit uncontrolled bias, resulting in suboptimal performance and high-variance model updates in heterogeneous environments, where devices with poor channel conditions slow down convergence. This paper addresses FL over heterogeneous wireless networks by proposing novel OTA and digital FL updates that allow a structured, time-invariant model bias, thereby reducing variance in FL updates. We analyze their convergence under a unified framework and derive an upper bound on the model "optimality error", which explicitly quantifies the effect of bias and variance in terms of design parameters. Next, to optimize this trade-off, we study a non-convex optimization problem and develop a successive convex approximation (SCA)-based framework to jointly optimize the design parameters. We perform extensive numerical evaluations with several related design variants and state-of-the-art OTA and digital FL schemes. Our results confirm that minimizing the bias-variance trade-off while allowing a structured bias provides better FL convergence performance than existing schemes.

Title: Exploring Interpretability for Visual Prompt Tuning with Hierarchical Concepts

Authors: Yubin Wang, Xinyang Jiang, De Cheng, Xiangqian Zhao, Zilong Wang, Dongsheng Li, Cairong Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06084
Pdf URL: https://arxiv.org/pdf/2503.06084
Copy Paste: [[2503.06084]] Exploring Interpretability for Visual Prompt Tuning with Hierarchical Concepts(https://arxiv.org/abs/2503.06084)
Keywords: interpretability
Abstract: Visual prompt tuning offers significant advantages for adapting pre-trained visual foundation models to specific tasks. However, current research provides limited insight into the interpretability of this approach, which is essential for enhancing AI reliability and enabling AI-driven knowledge discovery. In this paper, rather than learning abstract prompt embeddings, we propose the first framework, named Interpretable Visual Prompt Tuning (IVPT), to explore interpretability for visual prompts, by introducing hierarchical concept prototypes. Specifically, visual prompts are linked to human-understandable semantic concepts, represented as a set of category-agnostic prototypes, each corresponding to a specific region of the image. Then, IVPT aggregates features from these regions to generate interpretable prompts, which are structured hierarchically to explain visual prompts at different granularities. Comprehensive qualitative and quantitative evaluations on fine-grained classification benchmarks show its superior interpretability and performance over conventional visual prompt tuning methods and existing interpretable methods.

Title: Fish2Mesh Transformer: 3D Human Mesh Recovery from Egocentric Vision

Authors: David C. Jeong, Aditya Puranik, James Vong, Vrushabh Abhijit Deogirikar, Ryan Fell, Julianna Dietrich, Maria Kyrarini, Christopher Kitts
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.06089
Pdf URL: https://arxiv.org/pdf/2503.06089
Copy Paste: [[2503.06089]] Fish2Mesh Transformer: 3D Human Mesh Recovery from Egocentric Vision(https://arxiv.org/abs/2503.06089)
Keywords: transformer
Abstract: Egocentric human body estimation allows for the inference of user body pose and shape from a wearable camera's first-person perspective. Although research has used pose estimation techniques to overcome self-occlusions and image distortions caused by head-mounted fisheye images, similar advances in 3D human mesh recovery (HMR) techniques have been limited. We introduce Fish2Mesh, a fisheye-aware transformer-based model designed for 3D egocentric human mesh recovery. We propose an egocentric position embedding block to generate an ego-specific position table for the Swin Transformer to reduce fisheye image distortion. Our model utilizes multi-task heads for SMPL parametric regression and camera translations, estimating 3D and 2D joints as auxiliary loss to support model training. To address the scarcity of egocentric camera data, we create a training dataset by employing the pre-trained 4D-Human model and third-person cameras for weak supervision. Our experiments demonstrate that Fish2Mesh outperforms previous state-of-the-art 3D HMR models.

Title: Theta Theory: operads and coloring

Authors: Matilde Marcolli, Richard K. Larson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06091
Pdf URL: https://arxiv.org/pdf/2503.06091
Copy Paste: [[2503.06091]] Theta Theory: operads and coloring(https://arxiv.org/abs/2503.06091)
Keywords: generative
Abstract: We give an explicit construction of the generating set of a colored operad that implements theta theory in the mathematical model of Minimalism in generative linguistics, in the form of a coloring algorithm for syntactic objects. We show that the coproduct operation on workspaces allows for a recursive implementation of the theta criterion. We also show that this filtering by coloring rules on structures freely formed by Merge is equivalent to a process of structure formation by a colored version of Merge: the form of the generators of the colored operad then implies the dichotomy in semantics between External and Internal Merge, where Internal Merge only moves to non-theta positions.

Title: Clustering-based Meta Bayesian Optimization with Theoretical Guarantee

Authors: Khoa Nguyen, Viet Huynh, Binh Tran, Tri Pham, Tin Huynh, Thin Nguyen
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.06093
Pdf URL: https://arxiv.org/pdf/2503.06093
Copy Paste: [[2503.06093]] Clustering-based Meta Bayesian Optimization with Theoretical Guarantee(https://arxiv.org/abs/2503.06093)
Keywords: robust
Abstract: Bayesian Optimization (BO) is a well-established method for addressing black-box optimization problems. In many real-world scenarios, optimization often involves multiple functions, emphasizing the importance of leveraging data and learned functions from prior tasks to enhance efficiency in the current task. To expedite convergence to the global optimum, recent studies have introduced meta-learning strategies, collectively referred to as meta-BO, to incorporate knowledge from historical tasks. However, in practical settings, the underlying functions are often heterogeneous, which can adversely affect optimization performance for the current task. Additionally, when the number of historical tasks is large, meta-BO methods face significant scalability challenges. In this work, we propose a scalable and robust meta-BO method designed to address key challenges in heterogeneous and large-scale meta-tasks. Our approach (1) effectively partitions transferred meta-functions into highly homogeneous clusters, (2) learns the geometry-based surrogate prototype that captures the structural patterns within each cluster, and (3) adaptively synthesizes meta-priors during the online phase using statistical distance-based weighting policies. Experimental results on real-world hyperparameter optimization (HPO) tasks, combined with theoretical guarantees, demonstrate the robustness and effectiveness of our method in overcoming these challenges.

Title: PointDiffuse: A Dual-Conditional Diffusion Model for Enhanced Point Cloud Semantic Segmentation

Authors: Yong He, Hongshan Yu, Mingtao Feng, Tongjia Chen, Zechuan Li, Anwaar Ulhaq, Saeed Anwar, Ajmal Saeed Mian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06094
Pdf URL: https://arxiv.org/pdf/2503.06094
Copy Paste: [[2503.06094]] PointDiffuse: A Dual-Conditional Diffusion Model for Enhanced Point Cloud Semantic Segmentation(https://arxiv.org/abs/2503.06094)
Keywords: diffusion, transformer, segmentation
Abstract: Diffusion probabilistic models are traditionally used to generate colors at fixed pixel positions in 2D images. Building on this, we extend diffusion models to point cloud semantic segmentation, where point positions also remain fixed, and the diffusion model generates point labels instead of colors. To accelerate the denoising process in reverse diffusion, we introduce a noisy label embedding mechanism. This approach integrates semantic information into the noisy label, providing an initial semantic reference that improves the reverse diffusion efficiency. Additionally, we propose a point frequency transformer that enhances the adjustment of high-level context in point clouds. To reduce computational complexity, we introduce the position condition into MLP and propose denoising PointNet to process the high-resolution point cloud without sacrificing geometric details. Finally, we integrate the proposed noisy label embedding, point frequency transformer and denoising PointNet in our proposed dual conditional diffusion model-based network (PointDiffuse) to perform large-scale point cloud semantic segmentation. Extensive experiments on five benchmarks demonstrate the superiority of PointDiffuse, achieving the state-of-the-art mIoU of 74.2% on S3DIS Area 5, 81.2% on S3DIS 6-fold and 64.8% on SWAN dataset.

Title: Attention-Based Synthetic Data Generation for Calibration-Enhanced Survival Analysis: A Case Study for Chronic Kidney Disease Using Electronic Health Records

Authors: Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06096
Pdf URL: https://arxiv.org/pdf/2503.06096
Copy Paste: [[2503.06096]] Attention-Based Synthetic Data Generation for Calibration-Enhanced Survival Analysis: A Case Study for Chronic Kidney Disease Using Electronic Health Records(https://arxiv.org/abs/2503.06096)
Keywords: privacy, robust
Abstract: Access to real-world healthcare data is limited by stringent privacy regulations and data imbalances, hindering advancements in research and clinical applications. Synthetic data presents a promising solution, yet existing methods often fail to ensure the realism, utility, and calibration essential for robust survival analysis. Here, we introduce Masked Clinical Modelling (MCM), an attention-based framework capable of generating high-fidelity synthetic datasets that preserve critical clinical insights, such as hazard ratios, while enhancing survival model calibration. Unlike traditional statistical methods like SMOTE and machine learning models such as VAEs, MCM supports both standalone dataset synthesis for reproducibility and conditional simulation for targeted augmentation, addressing diverse research needs. Validated on a chronic kidney disease electronic health records dataset, MCM reduced the general calibration loss over the entire dataset by 15%; and MCM reduced a mean calibration loss by 9% across 10 clinically stratified subgroups, outperforming 15 alternative methods. By bridging data accessibility with translational utility, MCM advances the precision of healthcare models, promoting more efficient use of scarce healthcare resources.

Title: Patch-Depth Fusion: Dichotomous Image Segmentation via Fine-Grained Patch Strategy and Depth Integrity-Prior

Authors: Xianjie Liu, Keren Fu, Qijun Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06100
Pdf URL: https://arxiv.org/pdf/2503.06100
Copy Paste: [[2503.06100]] Patch-Depth Fusion: Dichotomous Image Segmentation via Fine-Grained Patch Strategy and Depth Integrity-Prior(https://arxiv.org/abs/2503.06100)
Keywords: diffusion, segmentation
Abstract: Dichotomous Image Segmentation (DIS) is a high-precision object segmentation task for high-resolution natural images. The current mainstream methods focus on the optimization of local details but overlook the fundamental challenge of modeling the integrity of objects. We have found that the depth integrity-prior implicit in the pseudo-depth maps generated by Depth Anything Model v2 and the local detail features of image patches can jointly address the above dilemmas. Based on the above findings, we have designed a novel Patch-Depth Fusion Network (PDFNet) for high-precision dichotomous image segmentation. The core of PDFNet consists of three aspects. Firstly, the object perception is enhanced through multi-modal input fusion. By utilizing the patch fine-grained strategy, coupled with patch selection and enhancement, the sensitivity to details is improved. Secondly, by leveraging the depth integrity-prior distributed in the depth maps, we propose an integrity-prior loss to enhance the uniformity of the segmentation results in the depth maps. Finally, we utilize the features of the shared encoder and, through a simple depth refinement decoder, improve the ability of the shared encoder to capture subtle depth-related information in the images. Experiments on the DIS-5K dataset show that PDFNet significantly outperforms state-of-the-art non-diffusion methods. Due to the incorporation of the depth integrity-prior, PDFNet matches or even surpasses the performance of the latest diffusion-based methods while using less than 11% of the parameters of diffusion-based methods. The source code is available at this https URL.

Title: ULTHO: Ultra-Lightweight yet Efficient Hyperparameter Optimization in Deep Reinforcement Learning

Authors: Mingqi Yuan, Bo Li, Xin Jin, Wenjun Zeng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06101
Pdf URL: https://arxiv.org/pdf/2503.06101
Copy Paste: [[2503.06101]] ULTHO: Ultra-Lightweight yet Efficient Hyperparameter Optimization in Deep Reinforcement Learning(https://arxiv.org/abs/2503.06101)
Keywords: robust
Abstract: Hyperparameter optimization (HPO) is a billion-dollar problem in machine learning, which significantly impacts the training efficiency and model performance. However, achieving efficient and robust HPO in deep reinforcement learning (RL) is consistently challenging due to its high non-stationarity and computational cost. To tackle this problem, existing approaches attempt to adapt common HPO techniques (e.g., population-based training or Bayesian optimization) to the RL scenario. However, they remain sample-inefficient and computationally expensive, which limits their use across a wide range of applications. In this paper, we propose ULTHO, an ultra-lightweight yet powerful framework for fast HPO in deep RL within single runs. Specifically, we formulate the HPO process as a multi-armed bandit with clustered arms (MABC) and link it directly to long-term return optimization. ULTHO also provides a quantified and statistical perspective to filter the HPs efficiently. We test ULTHO on benchmarks including ALE, Procgen, MiniGrid, and PyBullet. Extensive experiments demonstrate that ULTHO can achieve superior performance with a simple architecture, contributing to the development of advanced and automated RL systems.
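
To make the "multi-armed bandit with clustered arms" framing concrete, here is a minimal sketch. It is an illustration under assumptions, not ULTHO itself: the abstract does not give the selection rule, so the two-level UCB rule, the class name `ClusteredUCB`, the hyperparameter names, and the deterministic toy returns are all hypothetical. Each cluster stands for one hyperparameter and each arm for one candidate value; the reward plays the role of the observed return.

```python
import math


class ClusteredUCB:
    """Two-level UCB: pick a cluster first, then an arm inside that cluster."""

    def __init__(self, clusters, c=1.0):
        # clusters: dict mapping a cluster name to its list of candidate values.
        self.clusters = clusters
        self.c = c  # exploration coefficient (assumed, not from the abstract)
        self.counts = {k: [0] * len(v) for k, v in clusters.items()}
        self.means = {k: [0.0] * len(v) for k, v in clusters.items()}
        self.t = 0

    def _ucb(self, mean, n, total):
        if n == 0:
            return float("inf")  # untried options are selected first
        return mean + self.c * math.sqrt(math.log(total) / n)

    def _cluster_score(self, name):
        # Cluster statistics are the pull-weighted aggregate of its arms.
        n = sum(self.counts[name])
        if n == 0:
            return float("inf")
        total_reward = sum(m * k for m, k in zip(self.means[name], self.counts[name]))
        return self._ucb(total_reward / n, n, self.t)

    def select(self):
        self.t += 1
        name = max(self.clusters, key=self._cluster_score)
        pulls_in_cluster = sum(self.counts[name]) + 1
        idx = max(
            range(len(self.clusters[name])),
            key=lambda i: self._ucb(self.means[name][i], self.counts[name][i], pulls_in_cluster),
        )
        return name, idx

    def update(self, name, idx, reward):
        # Incremental running mean of the rewards observed for this arm.
        self.counts[name][idx] += 1
        self.means[name][idx] += (reward - self.means[name][idx]) / self.counts[name][idx]


def run_demo(rounds=2000):
    # Hypothetical hyperparameters and made-up, noise-free returns in [0, 1].
    clusters = {"learning_rate": [1e-4, 3e-4, 1e-3], "clip_range": [0.1, 0.2]}
    toy_return = {
        ("learning_rate", 0): 0.3,
        ("learning_rate", 1): 0.9,
        ("learning_rate", 2): 0.2,
        ("clip_range", 0): 0.5,
        ("clip_range", 1): 0.4,
    }
    bandit = ClusteredUCB(clusters)
    for _ in range(rounds):
        name, idx = bandit.select()
        bandit.update(name, idx, toy_return[(name, idx)])
    return bandit


if __name__ == "__main__":
    print(run_demo().counts)
```

In a real RL run the reward would be a noisy return estimate gathered during training rather than a fixed table, and the non-stationarity noted in the abstract would likely call for discounted or windowed statistics.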

Title: Handwritten Digit Recognition: An Ensemble-Based Approach for Superior Performance

Authors: Syed Sajid Ullah, Li Gang, Mudassir Riaz, Ahsan Ashfaq, Salman Khan, Sajawal Khan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06104
Pdf URL: https://arxiv.org/pdf/2503.06104
Copy Paste: [[2503.06104]] Handwritten Digit Recognition: An Ensemble-Based Approach for Superior Performance(https://arxiv.org/abs/2503.06104)
Keywords: robust, extraction
Abstract: Handwritten digit recognition remains a fundamental challenge in computer vision, with applications ranging from postal code reading to document digitization. This paper presents an ensemble-based approach that combines Convolutional Neural Networks (CNNs) with traditional machine learning techniques to improve recognition accuracy and robustness. We evaluate our method on the MNIST dataset, comprising 70,000 handwritten digit images. Our hybrid model, which uses CNNs for feature extraction and Support Vector Machines (SVMs) for classification, achieves an accuracy of 99.30%. We also explore the effectiveness of data augmentation and various ensemble techniques in enhancing model performance. Our results demonstrate that this approach not only achieves high accuracy but also shows improved generalization across diverse handwriting styles. The findings contribute to the development of more reliable handwritten digit recognition systems and highlight the potential of combining deep learning with traditional machine learning methods in pattern recognition tasks.

Title: AF-KAN: Activation Function-Based Kolmogorov-Arnold Networks for Efficient Representation Learning

Authors: Hoang-Thang Ta, Anh Tran
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.06112
Pdf URL: https://arxiv.org/pdf/2503.06112
Copy Paste: [[2503.06112]] AF-KAN: Activation Function-Based Kolmogorov-Arnold Networks for Efficient Representation Learning(https://arxiv.org/abs/2503.06112)
Keywords: extraction
Abstract: Kolmogorov-Arnold Networks (KANs) have inspired numerous works exploring their applications across a wide range of scientific problems, with the potential to replace Multilayer Perceptrons (MLPs). While many KANs are designed using basis and polynomial functions, such as B-splines, ReLU-KAN utilizes a combination of ReLU functions to mimic the structure of B-splines and take advantage of ReLU's speed. However, ReLU-KAN is not built for multiple inputs, and its limitations stem from ReLU's handling of negative values, which can restrict feature extraction. To address these issues, we introduce Activation Function-Based Kolmogorov-Arnold Networks (AF-KAN), expanding ReLU-KAN with various activations and their function combinations. This novel KAN also incorporates parameter reduction methods, primarily attention mechanisms and data normalization, to enhance performance on image classification datasets. We explore different activation functions, function combinations, grid sizes, and spline orders to validate the effectiveness of AF-KAN and determine its optimal configuration. In the experiments, AF-KAN significantly outperforms MLP, ReLU-KAN, and other KANs with the same parameter count. It also remains competitive even when using 6 to 10 times fewer parameters while maintaining the same network structure. However, AF-KAN requires a longer training time and consumes more FLOPs. The repository for this work is available at this https URL.
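
The remark that ReLU functions can mimic the shape of a B-spline can be shown with a small sketch. The formula below is the squared product of two ReLUs with a normalizing constant, as commonly given for ReLU-KAN; it is not stated in this abstract and should be read as an assumption, as should the names `relu_basis`, `s`, and `e`.

```python
def relu(z):
    """Rectified linear unit for a scalar input."""
    return z if z > 0.0 else 0.0


def relu_basis(x, s, e):
    """Bell-shaped bump supported on [s, e], built only from ReLUs.

    Assumed form: (ReLU(e - x) * ReLU(x - s))^2 * 16 / (e - s)^4.
    The constant scales the peak at the midpoint (s + e) / 2 to 1.
    """
    norm = 16.0 / (e - s) ** 4
    return (relu(e - x) * relu(x - s)) ** 2 * norm


if __name__ == "__main__":
    # Sample the bump on [0, 1]; it is zero outside the interval.
    for x in (-0.5, 0.0, 0.25, 0.5, 0.75, 1.0, 1.5):
        print(x, relu_basis(x, 0.0, 1.0))
```

Because one of the two ReLU factors is zero for any input outside [s, e], the bump is exactly zero there. AF-KAN, per the abstract, swaps in other activations and combinations of them.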

Title: SecureGS: Boosting the Security and Fidelity of 3D Gaussian Splatting Steganography

Authors: Xuanyu Zhang, Jiarui Meng, Zhipei Xu, Shuzhou Yang, Yanmin Wu, Ronggang Wang, Jian Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06118
Pdf URL: https://arxiv.org/pdf/2503.06118
Copy Paste: [[2503.06118]] SecureGS: Boosting the Security and Fidelity of 3D Gaussian Splatting Steganography(https://arxiv.org/abs/2503.06118)
Keywords: secure, security, privacy, protect
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a premier method for 3D representation due to its real-time rendering and high-quality outputs, underscoring the critical need to protect the privacy of 3D assets. Traditional NeRF steganography methods fail to address the explicit nature of 3DGS since its point cloud files are publicly accessible. Existing GS steganography solutions mitigate some issues but still struggle with reduced rendering fidelity, increased computational demands, and security flaws, especially in the security of the geometric structure of the visualized point cloud. To address these demands, we propose SecureGS, a secure and efficient 3DGS steganography framework inspired by Scaffold-GS's anchor point design and neural decoding. SecureGS uses a hybrid decoupled Gaussian encryption mechanism to embed offsets, scales, rotations, and RGB attributes of the hidden 3D Gaussian points in anchor point features, retrievable only by authorized users through privacy-preserving neural networks. To further enhance security, we propose a density region-aware anchor growing and pruning strategy that adaptively locates optimal hiding regions without exposing hidden information. Extensive experiments show that SecureGS significantly surpasses existing GS steganography methods in rendering fidelity, speed, and security.

Title: Unlocking Pretrained LLMs for Motion-Related Multimodal Generation: A Fine-Tuning Approach to Unify Diffusion and Next-Token Prediction

Authors: Shinichi Tanaka, Zhao Wang, Yoichi Kato, Jun Ohya
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06119
Pdf URL: https://arxiv.org/pdf/2503.06119
Copy Paste: [[2503.06119]] Unlocking Pretrained LLMs for Motion-Related Multimodal Generation: A Fine-Tuning Approach to Unify Diffusion and Next-Token Prediction(https://arxiv.org/abs/2503.06119)
Keywords: diffusion
Abstract: In this paper, we propose a unified framework that leverages a single pretrained LLM for Motion-related Multimodal Generation, referred to as MoMug. MoMug integrates diffusion-based continuous motion generation with the model's inherent autoregressive discrete text prediction capabilities by fine-tuning a pretrained LLM. This enables seamless switching between continuous motion output and discrete text token prediction within a single model architecture, effectively combining the strengths of both diffusion- and LLM-based approaches. Experimental results show that, compared to the most recent LLM-based baseline, MoMug improves FID by 38% and mean accuracy across seven metrics by 16.61% on the text-to-motion task. Additionally, it improves mean accuracy across eight metrics by 8.44% on the text-to-motion task. To the best of our knowledge, this is the first approach to integrate diffusion- and LLM-based generation within a single model for motion-related multimodal tasks while maintaining low training costs. This establishes a foundation for future advancements in motion-related generation, paving the way for high-quality yet cost-efficient motion synthesis.

Title: BlackGoose Rimer: Harnessing RWKV-7 as a Simple yet Superior Replacement for Transformers in Large-Scale Time Series Modeling

Authors: Li weile, Liu Xiao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06121
Pdf URL: https://arxiv.org/pdf/2503.06121
Copy Paste: [[2503.06121]] BlackGoose Rimer: Harnessing RWKV-7 as a Simple yet Superior Replacement for Transformers in Large-Scale Time Series Modeling(https://arxiv.org/abs/2503.06121)
Keywords: transformer, large language model
Abstract: Time series models face significant challenges in scaling to handle large and complex datasets, akin to the scaling achieved by large language models (LLMs). The unique characteristics of time series data and the computational demands of model scaling necessitate innovative approaches. While researchers have explored various architectures such as Transformers, LSTMs, and GRUs to address these challenges, we propose a novel solution using RWKV-7, which incorporates meta-learning into its state update mechanism. By integrating RWKV-7's time mix and channel mix components into the transformer-based time series model Timer, we achieve a substantial performance improvement of approximately 1.13x to 43.3x and a 4.5x reduction in training time with 1/23 of the parameters. Our code and model weights are publicly available for further research and development at this https URL.

Title: USP: Unified Self-Supervised Pretraining for Image Generation and Understanding

Authors: Xiangxiang Chu, Renda Li, Yong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06132
Pdf URL: https://arxiv.org/pdf/2503.06132
Copy Paste: [[2503.06132]] USP: Unified Self-Supervised Pretraining for Image Generation and Understanding(https://arxiv.org/abs/2503.06132)
Keywords: diffusion
Abstract: Recent studies have highlighted the interplay between diffusion models and representation learning. Intermediate representations from diffusion models can be leveraged for downstream visual tasks, while self-supervised vision models can enhance the convergence and generation quality of diffusion models. However, transferring pretrained weights from vision models to diffusion models is challenging due to input mismatches and the use of latent spaces. To address these challenges, we propose Unified Self-supervised Pretraining (USP), a framework that initializes diffusion models via masked latent modeling in a Variational Autoencoder (VAE) latent space. USP achieves comparable performance in understanding tasks while significantly improving the convergence speed and generation quality of diffusion models. Our code will be publicly available at this https URL.

Title: X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation

Authors: Jian Ma, Qirong Peng, Xu Guo, Chen Chen, Haonan Lu, Zhenyu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06134
Pdf URL: https://arxiv.org/pdf/2503.06134
Copy Paste: [[2503.06134]] X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation(https://arxiv.org/abs/2503.06134)
Keywords: diffusion, transformer, large language model
Abstract: Text-to-image (T2I) models are well known for their ability to produce highly realistic images, while multimodal large language models (MLLMs) are renowned for their proficiency in understanding and integrating multiple modalities. However, there is currently no straightforward and efficient framework to transfer the multimodal comprehension abilities of MLLMs to T2I models to enable them to understand multimodal inputs. In this paper, we propose the X2I framework, which endows Diffusion Transformer (DiT) models with the capability to comprehend various modalities, including multilingual text, screenshot documents, images, videos, and audio. X2I is trained using merely a 100K English corpus and 160 GPU hours. Building on the DiT teacher model, we adopt an innovative distillation method to extract the inference capabilities of the teacher model and design a lightweight AlignNet structure to serve as an intermediate bridge. Compared to the teacher model, X2I shows a performance degradation of less than 1% while gaining various multimodal understanding abilities, including multilingual to image, image to image, image-text to image, video to image, audio to image, and utilizing creative fusion to enhance imagery. Furthermore, it is applicable for LoRA training in the context of image-text to image generation, filling a void in the industry in this area. We further design a simple LightControl to enhance the fidelity of instructional image editing. Finally, extensive experiments demonstrate the effectiveness, efficiency, multifunctionality, and transferability of our X2I. The open-source code and checkpoints for X2I can be found at the following link: this https URL.

Title: GSV3D: Gaussian Splatting-based Geometric Distillation with Stable Video Diffusion for Single-Image 3D Object Generation

Authors: Ye Tao, Jiawei Zhang, Yahao Shi, Dongqing Zou, Bin Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06136
Pdf URL: https://arxiv.org/pdf/2503.06136
Copy Paste: [[2503.06136]] GSV3D: Gaussian Splatting-based Geometric Distillation with Stable Video Diffusion for Single-Image 3D Object Generation(https://arxiv.org/abs/2503.06136)
Keywords: robust, diffusion
Abstract: Image-based 3D generation has vast applications in robotics and gaming, where high-quality, diverse outputs and consistent 3D representations are crucial. However, existing methods have limitations: 3D diffusion models are limited by dataset scarcity and the absence of strong pre-trained priors, while 2D diffusion-based approaches struggle with geometric consistency. We propose a method that leverages 2D diffusion models' implicit 3D reasoning ability while ensuring 3D consistency via Gaussian-splatting-based geometric distillation. Specifically, the proposed Gaussian Splatting Decoder enforces 3D consistency by transforming SV3D latent outputs into an explicit 3D representation. Unlike SV3D, which only relies on implicit 2D representations for video generation, Gaussian Splatting explicitly encodes spatial and appearance attributes, enabling multi-view consistency through geometric constraints. These constraints correct view inconsistencies, ensuring robust geometric consistency. As a result, our approach simultaneously generates high-quality, multi-view-consistent images and accurate 3D models, providing a scalable solution for single-image-based 3D generation and bridging the gap between 2D Diffusion diversity and 3D structural coherence. Experimental results demonstrate state-of-the-art multi-view consistency and strong generalization across diverse datasets. The code will be made publicly available upon acceptance.

Title: GRP: Goal-Reversed Prompting for Zero-Shot Evaluation with LLMs

Authors: Mingyang Song, Mao Zheng, Xuan Luo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06139
Pdf URL: https://arxiv.org/pdf/2503.06139
Copy Paste: [[2503.06139]] GRP: Goal-Reversed Prompting for Zero-Shot Evaluation with LLMs(https://arxiv.org/abs/2503.06139)
Keywords: large language model
Abstract: Using Large Language Models (LLMs) to evaluate and compare two answers from different models typically involves having LLM-based judges select the better answer. However, humans often approach problem-solving from a reverse perspective, for instance, by choosing the worse option instead of the better one in a pairwise comparison. Generally, this kind of reverse thinking plays a crucial role in human reasoning and decision-making and can further test the difference between original and reverse thought processes simultaneously. To address the above issue, in this paper, we propose a Goal-Reversed Prompting (GRP) approach for pairwise evaluation that shifts the original task from selecting the better answer to choosing the worse one. We encourage LLMs to think in reverse by prompting LLMs to identify the worse response. Experiments on closed-source models demonstrate that GRP significantly enhances evaluation capabilities, outperforming the prompt template with the original goal.

Title: Boosting the Local Invariance for Better Adversarial Transferability

Authors: Bohan Liu, Xiaosen Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06140
Pdf URL: https://arxiv.org/pdf/2503.06140
Copy Paste: [[2503.06140]] Boosting the Local Invariance for Better Adversarial Transferability(https://arxiv.org/abs/2503.06140)
Keywords: defense, attack
Abstract: Transfer-based attacks pose a significant threat to real-world applications by directly targeting victim models with adversarial examples generated on surrogate models. While numerous approaches have been proposed to enhance adversarial transferability, existing works often overlook the intrinsic relationship between adversarial perturbations and input images. In this work, we find that adversarial perturbation often exhibits poor translation invariance for a given clean image and model, which is attributed to local invariance. Through empirical analysis, we demonstrate that there is a positive correlation between the local invariance of adversarial perturbations w.r.t. the input image and their transferability across different models. Based on this finding, we propose a general adversarial transferability boosting technique called Local Invariance Boosting approach (LI-Boost). Extensive experiments on the standard ImageNet dataset demonstrate that LI-Boost could significantly boost various types of transfer-based attacks (e.g., gradient-based, input transformation-based, model-related, advanced objective function, ensemble, etc.) on CNNs, ViTs, and defense mechanisms. Our approach presents a promising direction for future research in improving adversarial transferability across different models.

Title: Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model

Authors: Mingxing Li, Rui Wang, Lei Sun, Yancheng Bai, Xiangxiang Chu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06141
Pdf URL: https://arxiv.org/pdf/2503.06141
Copy Paste: [[2503.06141]] Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model(https://arxiv.org/abs/2503.06141)
Keywords: interpretability, large language model
Abstract: The rapid expansion of mobile internet has resulted in a substantial increase in user-generated content (UGC) images, thereby making the thorough assessment of UGC images both urgent and essential. Recently, multimodal large language models (MLLMs) have shown great potential in image quality assessment (IQA) and image aesthetic assessment (IAA). Despite this progress, effectively scoring the quality and aesthetics of UGC images still faces two main challenges: 1) A single score is inadequate to capture the hierarchical human perception. 2) How to use MLLMs to output numerical scores, such as mean opinion scores (MOS), remains an open question. To address these challenges, we introduce a novel dataset, named Realistic image Quality and Aesthetic (RealQA), including 14,715 UGC images, each of which is annotated with 10 fine-grained attributes. These attributes span three levels: low level (e.g., image clarity), middle level (e.g., subject integrity) and high level (e.g., composition). Besides, we conduct a series of in-depth and comprehensive investigations into how to effectively predict numerical scores using MLLMs. Surprisingly, by predicting just two extra significant digits, the next token paradigm can achieve SOTA performance. Furthermore, with the help of chain of thought (CoT) combined with the learnt fine-grained attributes, the proposed method can outperform SOTA methods on five public datasets for IQA and IAA with superior interpretability and show strong zero-shot generalization for video quality assessment (VQA). The code and dataset will be released.

Title: VLForgery Face Triad: Detection, Localization and Attribution via Multimodal Large Language Models

Authors: Xinan He, Yue Zhou, Bing Fan, Bin Li, Guopu Zhu, Feng Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06142
Pdf URL: https://arxiv.org/pdf/2503.06142
Copy Paste: [[2503.06142]] VLForgery Face Triad: Detection, Localization and Attribution via Multimodal Large Language Models(https://arxiv.org/abs/2503.06142)
Keywords: diffusion, large language model
Abstract: Faces synthesized by diffusion models (DMs) with high-quality and controllable attributes pose a significant challenge for Deepfake detection. Most state-of-the-art detectors only yield a binary decision and are incapable of forgery localization, attribution of forgery methods, or analysis of the cause of forgeries. In this work, we integrate Multimodal Large Language Models (MLLMs) within DM-based face forensics, and propose a fine-grained analysis triad framework called VLForgery, that can 1) predict falsified facial images; 2) locate the falsified face regions subjected to partial synthesis; and 3) attribute the synthesis to specific generators. To achieve the above goals, we introduce VLF (Visual Language Forensics), a novel and diverse synthesis face dataset designed to facilitate rich interactions between Visual and Language modalities in MLLMs. Additionally, we propose an extrinsic knowledge-guided description method, termed EkCot, which leverages knowledge from the image generation pipeline to enable MLLMs to quickly capture image content. Furthermore, we introduce a low-level vision comparison pipeline designed to identify differential features between real and fake that MLLMs can inherently understand. These features are then incorporated into EkCot, enhancing its ability to analyze forgeries in a structured manner, following the sequence of detection, localization, and attribution. Extensive experiments demonstrate that VLForgery outperforms other state-of-the-art forensic approaches in detection accuracy, with additional potential for falsified region localization and attribution analysis.

Title: Adaptive UAV-Assisted Hierarchical Federated Learning: Optimizing Energy, Latency, and Resilience for Dynamic Smart IoT Networks

Authors: Xiaohong Yang, Minghui Liwang, Liqun Fu, Yuhan Su, Seyyedali Hosseinalipour, Xianbin Wang, Yiguang Hong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06145
Pdf URL: https://arxiv.org/pdf/2503.06145
Copy Paste: [[2503.06145]] Adaptive UAV-Assisted Hierarchical Federated Learning: Optimizing Energy, Latency, and Resilience for Dynamic Smart IoT Networks(https://arxiv.org/abs/2503.06145)
Keywords: robust, federate
Abstract: Hierarchical Federated Learning (HFL) introduces intermediate aggregation layers, addressing the limitations of conventional Federated Learning (FL) in geographically dispersed environments with limited communication infrastructure. An application of HFL is in smart IoT systems, such as remote monitoring, disaster response, and battlefield operations, where cellular connectivity is often unreliable or unavailable. In these scenarios, UAVs serve as mobile aggregators, providing connectivity to the terrestrial IoT devices. This paper studies an HFL architecture for energy-constrained UAVs in smart IoT systems, pioneering a solution to minimize the increase in global training cost caused by UAV disconnection. In light of this, we formulate a joint optimization problem involving learning configuration, bandwidth allocation, and device-to-UAV association, and perform global aggregation in time before UAVs disconnect and are redeployed. The problem explicitly accounts for the dynamic nature of IoT devices and their interruptible communications and is shown to be NP-hard. To address this, we decompose it into three subproblems. First, we optimize the learning configuration and bandwidth allocation using an augmented Lagrangian function to reduce training costs. Second, we propose a device fitness score, integrating data heterogeneity (via Kullback-Leibler divergence), device-to-UAV distances, and IoT device resources, and develop a twin-delayed deep deterministic policy gradient (TD3)-based algorithm for dynamic device-to-UAV assignment. Third, we introduce a low-complexity two-stage greedy strategy for finding UAV redeployment locations and selecting the appropriate global aggregator UAV. Experiments on real-world datasets demonstrate significant cost reductions and robust performance under communication interruptions.

Title: Do Fairness Interventions Come at the Cost of Privacy: Evaluations for Binary Classifiers

Authors: Huan Tian, Guangsheng Zhang, Bo Liu, Tianqing Zhu, Ming Ding, Wanlei Zhou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06150
Pdf URL: https://arxiv.org/pdf/2503.06150
Copy Paste: [[2503.06150]] Do Fairness Interventions Come at the Cost of Privacy: Evaluations for Binary Classifiers(https://arxiv.org/abs/2503.06150)
Keywords: security, privacy, attack, membership infer, fair
Abstract: While in-processing fairness approaches show promise in mitigating biased predictions, their potential impact on privacy leakage remains under-explored. We aim to address this gap by assessing the privacy risks of fairness-enhanced binary classifiers via membership inference attacks (MIAs) and attribute inference attacks (AIAs). Surprisingly, our results reveal that enhancing fairness does not necessarily lead to privacy compromises. For example, these fairness interventions exhibit increased resilience against MIAs and AIAs. This is because fairness interventions tend to remove sensitive information among extracted features and reduce confidence scores for the majority of training data for fairer predictions. However, during the evaluations, we uncover a potential threat mechanism that exploits prediction discrepancies between fair and biased models, leading to advanced attack results for both MIAs and AIAs. This mechanism reveals potent vulnerabilities of fair models and poses significant privacy risks for current fairness methods. Extensive experiments across multiple datasets, attack methods, and representative fairness approaches confirm our findings and demonstrate the efficacy of the uncovered mechanism. Our study exposes the under-explored privacy threats in fairness studies, advocating for thorough evaluations of potential security vulnerabilities before model deployments.

Title: BioMoDiffuse: Physics-Guided Biomechanical Diffusion for Controllable and Authentic Human Motion Synthesis

Authors: Zixi Kang, Xinghan Wang, Yadong Mu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06151
Pdf URL: https://arxiv.org/pdf/2503.06151
Copy Paste: [[2503.06151]] BioMoDiffuse: Physics-Guided Biomechanical Diffusion for Controllable and Authentic Human Motion Synthesis(https://arxiv.org/abs/2503.06151)
Keywords: diffusion
Abstract: Human motion generation holds significant promise in fields such as animation, film production, and robotics. However, existing methods often fail to produce physically plausible movements that adhere to biomechanical principles. While recent autoregressive and diffusion models have improved visual quality, they frequently overlook essential biodynamic features, such as muscle activation patterns and joint coordination, leading to motions that either violate physical laws or lack controllability. This paper introduces BioMoDiffuse, a novel biomechanics-aware diffusion framework that addresses these limitations. It features three key innovations: (1) A lightweight biodynamic network that integrates muscle electromyography (EMG) signals and kinematic features with acceleration constraints, (2) A physics-guided diffusion process that incorporates real-time biomechanical verification via modified Euler-Lagrange equations, and (3) A decoupled control mechanism that allows independent regulation of motion speed and semantic context. We also propose a set of comprehensive evaluation protocols that combines traditional metrics (FID, R-precision, etc.) with new biomechanical criteria (smoothness, foot sliding, floating, etc.). Our approach bridges the gap between data-driven motion synthesis and biomechanical authenticity, establishing new benchmarks for physically accurate motion generation.

Title: UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces

Authors: Baining Zhao, Jianjie Fang, Zichao Dai, Ziyou Wang, Jirong Zha, Weichen Zhang, Chen Gao, Yue Wang, Jinqiang Cui, Xinlei Chen, Yong Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06157
Pdf URL: https://arxiv.org/pdf/2503.06157
Copy Paste: [[2503.06157]] UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces(https://arxiv.org/abs/2503.06157)
Keywords: large language model
Abstract: Large multimodal models exhibit remarkable intelligence, yet their embodied cognitive abilities during motion in open-ended urban 3D space remain to be explored. We introduce a benchmark to evaluate whether video-large language models (Video-LLMs) can naturally process continuous first-person visual observations like humans, enabling recall, perception, reasoning, and navigation. We manually control drones to collect 3D embodied motion video data from real-world cities and simulated environments, resulting in 1.5k video clips. Then we design a pipeline to generate 5.2k multiple-choice questions. Evaluations of 17 widely-used Video-LLMs reveal current limitations in urban embodied cognition. Correlation analysis provides insight into the relationships between different tasks, showing that causal reasoning has a strong correlation with recall, perception, and navigation, while the abilities for counterfactual and associative reasoning exhibit lower correlation with other tasks. We also validate the potential for Sim-to-Real transfer in urban embodiment through fine-tuning.

Title: Invariant Federated Learning: A Novel Approach to Addressing Challenges in Federated Learning for Edge Intelligence

Authors: Ziruo Hao, Zhenhua Cui, Tao Yang, Bo Hu, Xiaofeng Wu, Hui Feng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06158
Pdf URL: https://arxiv.org/pdf/2503.06158
Copy Paste: [[2503.06158]] Invariant Federated Learning: A Novel Approach to Addressing Challenges in Federated Learning for Edge Intelligence(https://arxiv.org/abs/2503.06158)
Keywords: privacy, protect, robust, federate
Abstract: Federated learning (FL) has become a crucial solution for distributed learning in edge intelligence, addressing communication constraints and privacy protection. However, challenges such as heterogeneous and asynchronous clients significantly impact model performance. This paper analyzes the harm of abnormal clients through a novel use of parameter orthogonal decomposition and shows that the exit of abnormal clients can guarantee the model's performance on most clients. To ensure the model's performance on exited abnormal clients and those that lack training resources, we also introduce Federated Learning with Invariant Penalty for Generalization (FedIPG). With the assistance of the invariant penalty term, the model can achieve robust generalization capability. This approach indirectly mitigates the effects of data heterogeneity and asynchrony without additional communication overhead, making it ideal for edge intelligence systems. Our theoretical and empirical results demonstrate that FedIPG, combined with an exit strategy, enhances both in-distribution performance and out-of-distribution generalization capabilities while maintaining model convergence. This approach provides a robust framework for federated learning in resource-constrained environments while offering preliminary causal insights.

Title: Feature-EndoGaussian: Feature Distilled Gaussian Splatting in Surgical Deformable Scene Reconstruction

Authors: Kai Li, Junhao Wang, William Han, Ding Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06161
Pdf URL: https://arxiv.org/pdf/2503.06161
Copy Paste: [[2503.06161]] Feature-EndoGaussian: Feature Distilled Gaussian Splatting in Surgical Deformable Scene Reconstruction(https://arxiv.org/abs/2503.06161)
Keywords: segmentation
Abstract: Minimally invasive surgery (MIS) has transformed clinical practice by reducing recovery times, minimizing complications, and enhancing precision. Nonetheless, MIS inherently relies on indirect visualization and precise instrument control, posing unique challenges. Recent advances in artificial intelligence have enabled real-time surgical scene understanding through techniques such as image classification, object detection, and segmentation, with scene reconstruction emerging as a key element for enhanced intraoperative guidance. Although neural radiance fields (NeRFs) have been explored for this purpose, their substantial data requirements and slow rendering inhibit real-time performance. In contrast, 3D Gaussian Splatting (3DGS) offers a more efficient alternative, achieving state-of-the-art performance in dynamic surgical scene reconstruction. In this work, we introduce Feature-EndoGaussian (FEG), an extension of 3DGS that integrates 2D segmentation cues into 3D rendering to enable real-time semantic and scene reconstruction. By leveraging pretrained segmentation foundation models, FEG incorporates semantic feature distillation within the Gaussian deformation framework, thereby enhancing both reconstruction fidelity and segmentation accuracy. On the EndoNeRF dataset, FEG achieves superior performance (SSIM of 0.97, PSNR of 39.08, and LPIPS of 0.03) compared to leading methods. Additionally, on the EndoVis18 dataset, FEG demonstrates competitive class-wise segmentation metrics while balancing model size and real-time performance.

Title: Secure On-Device Video OOD Detection Without Backpropagation

Authors: Li Li, Peilin Cai, Yuxiao Zhou, Zhiyu Ni, Renjie Liang, You Qin, Yi Nian, Zhengzhong Tu, Xiyang Hu, Yue Zhao
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06166
Pdf URL: https://arxiv.org/pdf/2503.06166
Copy Paste: [[2503.06166]] Secure On-Device Video OOD Detection Without Backpropagation(https://arxiv.org/abs/2503.06166)
Keywords: secure, privacy, federate
Abstract: Out-of-Distribution (OOD) detection is critical for ensuring the reliability of machine learning models in safety-critical applications such as autonomous driving and medical diagnosis. While deploying personalized OOD detection directly on edge devices is desirable, it remains challenging due to large model sizes and the computational infeasibility of on-device training. Federated learning partially addresses this but still requires gradient computation and backpropagation, exceeding the capabilities of many edge devices. To overcome these challenges, we propose SecDOOD, a secure cloud-device collaboration framework for efficient on-device OOD detection without requiring device-side backpropagation. SecDOOD utilizes cloud resources for model training while ensuring user data privacy by retaining sensitive information on-device. Central to SecDOOD is a HyperNetwork-based personalized parameter generation module, which adapts cloud-trained models to device-specific distributions by dynamically generating local weight adjustments, effectively combining central and local information without local fine-tuning. Additionally, our dynamic feature sampling and encryption strategy selectively encrypts only the most informative feature channels, largely reducing encryption overhead without compromising detection performance. Extensive experiments across multiple datasets and OOD scenarios demonstrate that SecDOOD achieves performance comparable to fully fine-tuned models, enabling secure, efficient, and personalized OOD detection on resource-limited edge devices. To enhance accessibility and reproducibility, our code is publicly available at this https URL.

Title: Treble Counterfactual VLMs: A Causal Approach to Hallucination

Authors: Li Li, Jiashu Qu, Yuxiao Zhou, Yuehan Qin, Tiankai Yang, Yue Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06169
Pdf URL: https://arxiv.org/pdf/2503.06169
Copy Paste: [[2503.06169]] Treble Counterfactual VLMs: A Causal Approach to Hallucination(https://arxiv.org/abs/2503.06169)
Keywords: robust
Abstract: Vision-Language Models (VLMs) have advanced multi-modal tasks like image captioning, visual question answering, and reasoning. However, they often generate hallucinated outputs inconsistent with the visual context or prompt, limiting reliability in critical applications like autonomous driving and medical imaging. Existing studies link hallucination to statistical biases, language priors, and biased feature learning but lack a structured causal understanding. In this work, we introduce a causal perspective to analyze and mitigate hallucination in VLMs. We hypothesize that hallucination arises from unintended direct influences of either the vision or text modality, bypassing proper multi-modal fusion. To address this, we construct a causal graph for VLMs and employ counterfactual analysis to estimate the Natural Direct Effect (NDE) of vision, text, and their cross-modal interaction on the output. We systematically identify and mitigate these unintended direct effects to ensure that responses are primarily driven by genuine multi-modal fusion. Our approach consists of three steps: (1) designing structural causal graphs to distinguish correct fusion pathways from spurious modality shortcuts, (2) estimating modality-specific and cross-modal NDE using perturbed image representations, hallucinated text embeddings, and degraded visual inputs, and (3) implementing a test-time intervention module to dynamically adjust the model's dependence on each modality. Experimental results demonstrate that our method significantly reduces hallucination while preserving task performance, providing a robust and interpretable framework for improving VLM reliability. To enhance accessibility and reproducibility, our code is publicly available at this https URL.

Title: ROCM: RLHF on consistency models

Authors: Shivanshu Shekhar, Tong Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06171
Pdf URL: https://arxiv.org/pdf/2503.06171
Copy Paste: [[2503.06171]] ROCM: RLHF on consistency models(https://arxiv.org/abs/2503.06171)
Keywords: diffusion, generative
Abstract: Diffusion models have revolutionized generative modeling in continuous domains like image, audio, and video synthesis. However, their iterative sampling process leads to slow generation and inefficient training, challenges that are further exacerbated when incorporating Reinforcement Learning from Human Feedback (RLHF) due to sparse rewards and long time horizons. Consistency models address these issues by enabling single-step or efficient multi-step generation, significantly reducing computational costs. In this work, we propose a direct reward optimization framework for applying RLHF to consistency models, incorporating distributional regularization to enhance training stability and prevent reward hacking. We investigate various $f$-divergences as regularization strategies, striking a balance between reward maximization and model consistency. Unlike policy gradient methods, our approach leverages first-order gradients, making it more efficient and less sensitive to hyperparameter tuning. Empirical results show that our method achieves competitive or superior performance compared to policy gradient based RLHF methods, across various automatic metrics and human evaluation. Additionally, our analysis demonstrates the impact of different regularization techniques in improving model generalization and preventing overfitting.
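
Note: the $f$-divergence regularizers mentioned in this abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes discrete distributions with strictly positive reference probabilities, and the names `f_divergence`, `kl_generator`, `chi2_generator`, and `regularized_objective` are hypothetical. It uses the standard definition D_f(P || Q) = sum_i q_i f(p_i / q_i).

```python
import math


def f_divergence(p, q, f):
    """D_f(P || Q) = sum_i q_i * f(p_i / q_i); assumes every q_i > 0."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))


def kl_generator(t):
    # f(t) = t log t yields the KL divergence; 0 log 0 is taken as 0.
    return t * math.log(t) if t > 0 else 0.0


def chi2_generator(t):
    # f(t) = (t - 1)^2 yields the Pearson chi-squared divergence.
    return (t - 1.0) ** 2


def regularized_objective(reward, p, q, beta, f):
    """Hypothetical objective: reward minus beta times the divergence
    of the tuned distribution p from the reference distribution q."""
    return reward - beta * f_divergence(p, q, f)
```

Swapping the generator function changes the regularizer without touching the rest of the objective, which is the sense in which several $f$-divergences can be compared as regularization strategies.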

Title: FORESCENE: FOREcasting human activity via latent SCENE graphs diffusion

Authors: Antonio Alliegro, Francesca Pistilli, Tatiana Tommasi, Giuseppe Averta
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06182
Pdf URL: https://arxiv.org/pdf/2503.06182
Copy Paste: [[2503.06182]] FORESCENE: FOREcasting human activity via latent SCENE graphs diffusion(https://arxiv.org/abs/2503.06182)
Keywords: diffusion
Abstract: Forecasting human-environment interactions in daily activities is challenging due to the high variability of human behavior. While predicting directly from videos is possible, it is limited by confounding factors like irrelevant objects or background noise that do not contribute to the interaction. A promising alternative is using Scene Graphs (SGs) to track only the relevant elements. However, current methods for forecasting future SGs face significant challenges and often rely on unrealistic assumptions, such as fixed objects over time, limiting their applicability to long-term activities where interacted objects may appear or disappear. In this paper, we introduce FORESCENE, a novel framework for Scene Graph Anticipation (SGA) that predicts both object and relationship evolution over time. FORESCENE encodes observed video segments into a latent representation using a tailored Graph Auto-Encoder and forecasts future SGs using a Latent Diffusion Model (LDM). Our approach enables continuous prediction of interaction dynamics without making assumptions on the graph's content or structure. We evaluate FORESCENE on the Action Genome dataset, where it outperforms existing SGA methods while solving a significantly more complex task.

Title: Lightweight Software Kernels and Hardware Extensions for Efficient Sparse Deep Neural Networks on Microcontrollers

Authors: Francesco Daghero, Daniele Jahier Pagliari, Francesco Conti, Luca Benini, Massimo Poncino, Alessio Burrello
Subjects: cs.LG, cs.AI, cs.DC, cs.PF
Abstract URL: https://arxiv.org/abs/2503.06183
Pdf URL: https://arxiv.org/pdf/2503.06183
Copy Paste: [[2503.06183]] Lightweight Software Kernels and Hardware Extensions for Efficient Sparse Deep Neural Networks on Microcontrollers(https://arxiv.org/abs/2503.06183)
Keywords: transformer
Abstract: The acceleration of pruned Deep Neural Networks (DNNs) on edge devices such as Microcontrollers (MCUs) is a challenging task, given the tight area- and power-constraints of these devices. In this work, we propose a three-fold contribution to address this problem. First, we design a set of optimized software kernels for N:M pruned layers, targeting ultra-low-power, multicore RISC-V MCUs, which are up to 2.1x and 3.4x faster than their dense counterparts at 1:8 and 1:16 sparsity, respectively. Then, we implement a lightweight Instruction-Set Architecture (ISA) extension to accelerate the indirect load and non-zero indices decompression operations required by our kernels, obtaining up to 1.9x extra speedup, at the cost of a 5% area overhead. Lastly, we extend an open-source DNN compiler to utilize our sparse kernels for complete networks, showing speedups of 3.21x and 1.81x on a ResNet18 and a Vision Transformer (ViT), with less than 1.5% accuracy drop compared to a dense baseline.
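
Note: a minimal sketch of N:M pruning, where N weights are kept in every group of M, may help make the sparsity figures concrete. It is plain Python rather than the paper's RISC-V kernels. The function names and the storage layout (one kept value plus its in-group offset) are assumptions chosen for illustration.

```python
def prune_n_m(weights, n, m):
    """Keep the n largest-magnitude weights in every group of m.

    Returns the kept values and their offsets inside each group,
    which is a compressed form of the pruned weight vector.
    """
    assert len(weights) % m == 0, "length must be a multiple of m"
    values, indices = [], []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Offsets of the n largest-magnitude entries, in ascending order.
        by_magnitude = sorted(range(m), key=lambda i: -abs(group[i]))
        for off in sorted(by_magnitude[:n]):
            values.append(group[off])
            indices.append(off)
    return values, indices


def sparse_dot(values, indices, activations, n, m):
    """Dot product that touches only the kept weights.

    Each kept weight fetches its activation through an indirect load,
    using the stored in-group offset.
    """
    total = 0.0
    for k, (w, off) in enumerate(zip(values, indices)):
        group = k // n  # group of m that this kept weight belongs to
        total += w * activations[group * m + off]
    return total
```

At 1:8 sparsity this loop performs one multiply-accumulate per eight dense weights, at the price of one index lookup per kept weight. That lookup is the kind of indirect load the abstract says the ISA extension accelerates.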

Title: Sample-aware Adaptive Structured Pruning for Large Language Models

Authors: Jun Kong, Xinge Ma, Jin Wang, Xuejie Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06184
Pdf URL: https://arxiv.org/pdf/2503.06184
Copy Paste: [[2503.06184]] Sample-aware Adaptive Structured Pruning for Large Language Models(https://arxiv.org/abs/2503.06184)
Keywords: robust, large language model
Abstract: Large language models (LLMs) have achieved outstanding performance in natural language processing, but enormous model sizes and high computational costs limit their practical deployment. Structured pruning can effectively reduce the resource demands for deployment by removing redundant model parameters. However, the randomly selected calibration data and fixed single importance estimation metrics in existing structured pruning methods lead to degraded performance of pruned models. This study introduces AdaPruner, a sample-aware adaptive structured pruning framework for LLMs, aiming to optimize the calibration data and importance estimation metrics in the structured pruning process. Specifically, AdaPruner effectively removes redundant parameters from LLMs by constructing a structured pruning solution space and then employing Bayesian optimization to adaptively search for the optimal calibration data and importance estimation metrics. Experimental results show that AdaPruner outperforms existing structured pruning methods on a family of LLMs with varying pruning ratios, demonstrating its applicability and robustness. Remarkably, at a 20% pruning ratio, the model pruned with AdaPruner maintains 97% of the performance of the unpruned model.

Title: PTDiffusion: Free Lunch for Generating Optical Illusion Hidden Pictures with Phase-Transferred Diffusion Model

Authors: Xiang Gao, Shuai Yang, Jiaying Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06186
Pdf URL: https://arxiv.org/pdf/2503.06186
Copy Paste: [[2503.06186]] PTDiffusion: Free Lunch for Generating Optical Illusion Hidden Pictures with Phase-Transferred Diffusion Model(https://arxiv.org/abs/2503.06186)
Keywords: diffusion
Abstract: An optical illusion hidden picture is an interesting visual perceptual phenomenon in which an image is cleverly integrated into another picture in a way that is not immediately obvious to the viewer. Building on an off-the-shelf text-to-image (T2I) diffusion model, we propose a novel training-free text-guided image-to-image (I2I) translation framework dubbed Phase-Transferred Diffusion Model (PTDiffusion) for hidden art synthesis. PTDiffusion embeds an input reference image into arbitrary scenes as described by the text prompts, while exhibiting hidden visual cues of the reference image. At the heart of our method is a plug-and-play phase transfer mechanism that dynamically and progressively transplants the phase spectrum of diffusion features from the denoising process that reconstructs the reference image into the one that samples the generated illusion image, realizing harmonious fusion of the reference structural information and the textual semantic information. Furthermore, we propose asynchronous phase transfer to enable flexible control over the degree of hidden content discernibility. Our method bypasses any model training and fine-tuning, all while substantially outperforming related methods in image quality, text fidelity, visual discernibility, and contextual naturalness for illusion picture synthesis, as demonstrated by extensive qualitative and quantitative experiments.

Title: Attackers Can Do Better: Over- and Understated Factors of Model Stealing Attacks

Authors: Daryna Oliynyk, Rudolf Mayer, Andreas Rauber
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06188
Pdf URL: https://arxiv.org/pdf/2503.06188
Copy Paste: [[2503.06188]] Attackers Can Do Better: Over- and Understated Factors of Model Stealing Attacks(https://arxiv.org/abs/2503.06188)
Keywords: attack, steal, data-free
Abstract: Machine learning models were shown to be vulnerable to model stealing attacks, which lead to intellectual property infringement. Among other methods, substitute model training is an all-encompassing attack applicable to any machine learning model whose behaviour can be approximated from input-output queries. Whereas prior works mainly focused on improving the performance of substitute models by, e.g. developing a new substitute training method, there have been only limited ablation studies on the impact the attacker's strength has on the substitute model's performance. As a result, different authors came to diverse, sometimes contradicting, conclusions. In this work, we exhaustively examine the ambivalent influence of different factors resulting from varying the attacker's capabilities and knowledge on a substitute training attack. Our findings suggest that some of the factors that have been considered important in the past are, in fact, not that influential; instead, we discover new correlations between attack conditions and success rate. In particular, we demonstrate that better-performing target models enable higher-fidelity attacks and explain the intuition behind this phenomenon. Further, we propose to shift the focus from the complexity of target models toward the complexity of their learning tasks. Therefore, for the substitute model, rather than aiming for a higher architecture complexity, we suggest focusing on getting data of higher complexity and an appropriate architecture. Finally, we demonstrate that even in the most limited data-free scenario, there is no need to overcompensate weak knowledge with millions of queries. Our results often exceed or match the performance of previous attacks that assume a stronger attacker, suggesting that these stronger attacks are likely endangering a model owner's intellectual property to a significantly higher degree than shown until now.

Title: NeuroADDA: Active Discriminative Domain Adaptation in Connectomic

Authors: Shashata Sawmya, Thomas L. Athey, Gwyneth Liu, Nir Shavit
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06196
Pdf URL: https://arxiv.org/pdf/2503.06196
Copy Paste: [[2503.06196]] NeuroADDA: Active Discriminative Domain Adaptation in Connectomic(https://arxiv.org/abs/2503.06196)
Keywords: segmentation
Abstract: Training segmentation models from scratch has been the standard approach for new electron microscopy connectomics datasets. However, leveraging pretrained models from existing datasets could improve efficiency and performance under a constrained annotation budget. In this study, we investigate domain adaptation in connectomics by analyzing six major datasets spanning different organisms. We show that Maximum Mean Discrepancy (MMD) between neuron image distributions serves as a reliable indicator of transferability and identifies the optimal source domain for transfer learning. Building on this, we introduce NeuroADDA, a method that combines optimal domain selection with source-free active learning to effectively adapt pretrained backbones to a new dataset. NeuroADDA consistently outperforms training from scratch across diverse datasets and fine-tuning sample sizes, with the largest gain observed at $n=4$ samples, where it achieves a 25-67% reduction in Variation of Information. Finally, we show that our analysis of distributional differences among neuron images from multiple species in a learned feature space reveals that these domain "distances" correlate with phylogenetic distance among those species.
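
Note: the Maximum Mean Discrepancy used above as a transferability indicator can be sketched as follows. This is not the authors' code. It assumes samples are plain feature vectors, an RBF kernel with a hand-picked bandwidth `sigma`, and the biased (V-statistic) estimator. All names are hypothetical.

```python
import math


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between samples xs and ys:
    mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    def mean_k(a, b):
        total = sum(rbf_kernel(u, v, sigma) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)
```

Under these assumptions, a candidate source dataset whose features give a smaller `mmd_squared` against the target would be ranked as the closer domain.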

Title: Removing Multiple Hybrid Adverse Weather in Video via a Unified Model

Authors: Yecong Wan, Mingwen Shao, Yuanshuo Cheng, Jun Shu, Shuigen Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06200
Pdf URL: https://arxiv.org/pdf/2503.06200
Copy Paste: [[2503.06200]] Removing Multiple Hybrid Adverse Weather in Video via a Unified Model(https://arxiv.org/abs/2503.06200)
Keywords: robust
Abstract: Videos captured under real-world adverse weather conditions typically suffer from uncertain hybrid weather artifacts with heterogeneous degradation distributions. However, existing algorithms excel only at specific single degradation distributions due to limited adaptation capacity and must handle different weather degradations with separately trained models, so they may fail to handle real-world stochastic weather scenarios. Moreover, model training is infeasible due to the lack of paired video data that characterizes the coexistence of multiple weather conditions. To address these issues, we propose a novel unified model, dubbed UniWRV, to remove multiple heterogeneous video weather degradations in an all-in-one fashion. Specifically, to tackle degenerate spatial feature heterogeneity, we propose a tailored weather prior guided module that queries exclusive priors for different instances as prompts to steer spatial feature characterization. To tackle degenerate temporal feature heterogeneity, we propose a dynamic routing aggregation module that can automatically select optimal fusion paths for different instances to dynamically integrate temporal features. Additionally, we construct a new synthetic video dataset, termed HWVideo, for learning and benchmarking multiple hybrid adverse weather removal, which contains 15 hybrid weather conditions with a total of 1500 adverse-weather/clean paired video clips. Real-world hybrid weather videos are also collected for evaluating model generalizability. Comprehensive experiments demonstrate that our UniWRV exhibits robust and superior adaptation capability in multiple heterogeneous degradation learning scenarios, including various generic video restoration tasks beyond weather removal.

Title: Explainable Synthetic Image Detection through Diffusion Timestep Ensembling

Authors: Yixin Wu, Feiran Zhang, Tianyuan Shi, Ruicheng Yin, Zhenghua Wang, Zhenliang Gan, Xiaohua Wang, Changze Lv, Xiaoqing Zheng, Xuanjing Huang
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.06201
Pdf URL: https://arxiv.org/pdf/2503.06201
Copy Paste: [[2503.06201]] Explainable Synthetic Image Detection through Diffusion Timestep Ensembling(https://arxiv.org/abs/2503.06201)
Keywords: security, diffusion
Abstract: Recent advances in diffusion models have enabled the creation of deceptively real images, posing significant security risks when misused. In this study, we reveal that natural and synthetic images exhibit distinct differences in the high-frequency domains of their Fourier power spectra after undergoing iterative noise perturbations through an inverse multi-step denoising process, suggesting that such noise can provide additional discriminative information for identifying synthetic images. Based on this observation, we propose a novel detection method that amplifies these differences by progressively adding noise to the original images across multiple timesteps, and train an ensemble of classifiers on these noised images. To enhance human comprehension, we introduce an explanation generation and refinement module to identify flaws located in AI-generated images. Additionally, we construct two new datasets, GenHard and GenExplain, derived from the GenImage benchmark, providing detection samples of greater difficulty and high-quality rationales for fake images. Extensive experiments show that our method achieves state-of-the-art performance with 98.91% and 95.89% detection accuracy on regular and harder samples, respectively, improving on baselines by at least 2.51% and 3.46%. Furthermore, our method also generalizes effectively to images generated by other diffusion models. Our code and datasets will be made publicly available.

Title: CUPCase: Clinically Uncommon Patient Cases and Diagnoses Dataset

Authors: Oriel Perets, Ofir Ben Shoham, Nir Grinberg, Nadav Rappoport
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06204
Pdf URL: https://arxiv.org/pdf/2503.06204
Copy Paste: [[2503.06204]] CUPCase: Clinically Uncommon Patient Cases and Diagnoses Dataset(https://arxiv.org/abs/2503.06204)
Keywords: extraction, large language model
Abstract: Medical benchmark datasets significantly contribute to developing Large Language Models (LLMs) for medical knowledge extraction, diagnosis, summarization, and other uses. Yet, current benchmarks are mainly derived from exam questions given to medical students or cases described in the medical literature, lacking the complexity of real-world patient cases that deviate from classic textbook abstractions. These include rare diseases, uncommon presentations of common diseases, and unexpected treatment responses. Here, we construct Clinically Uncommon Patient Cases and Diagnosis Dataset (CUPCase) based on 3,562 real-world case reports from BMC, including diagnoses in open-ended textual format and as multiple-choice options with distractors. Using this dataset, we evaluate the ability of state-of-the-art LLMs, including both general-purpose and Clinical LLMs, to identify and correctly diagnose a patient case, and test models' performance when only partial information about cases is available. Our findings show that general-purpose GPT-4o attains the best performance in both the multiple-choice task (average accuracy of 87.9%) and the open-ended task (BERTScore F1 of 0.764), outperforming several LLMs with a focus on the medical domain such as Meditron-70B and MedLM-Large. Moreover, GPT-4o was able to maintain 87% and 88% of its performance with only the first 20% of tokens of the case presentation in multiple-choice and free text, respectively, highlighting the potential of LLMs to aid in early diagnosis in real-world cases. CUPCase expands our ability to evaluate LLMs for clinical decision support in an open and reproducible manner.

Title: Lifelong Learning with Task-Specific Adaptation: Addressing the Stability-Plasticity Dilemma

Authors: Ruiyu Wang, Sen Wang, Xinxin Zuo, Qiang Sun
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06213
Pdf URL: https://arxiv.org/pdf/2503.06213
Copy Paste: [[2503.06213]] Lifelong Learning with Task-Specific Adaptation: Addressing the Stability-Plasticity Dilemma(https://arxiv.org/abs/2503.06213)
Keywords: large language model
Abstract: Lifelong learning (LL) aims to continuously acquire new knowledge while retaining previously learned knowledge. A central challenge in LL is the stability-plasticity dilemma, which requires models to balance the preservation of previous knowledge (stability) with the ability to learn new tasks (plasticity). While parameter-efficient fine-tuning (PEFT) has been widely adopted in large language models, its application to lifelong learning remains underexplored. To bridge this gap, this paper proposes AdaLL, an adapter-based framework designed to address the dilemma through a simple, universal, and effective strategy. AdaLL co-trains the backbone network and adapters under regularization constraints, enabling the backbone to capture task-invariant features while allowing the adapters to specialize in task-specific information. Unlike methods that freeze the backbone network, AdaLL incrementally enhances the backbone's capabilities across tasks while minimizing interference through backbone regularization. This architectural design significantly improves both stability and plasticity, effectively eliminating the stability-plasticity dilemma. Extensive experiments demonstrate that AdaLL consistently outperforms existing methods across various configurations, including dataset choices, task sequences, and task scales.

Title: StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition

Authors: Xin Ding, Hao Wu, Yifan Yang, Shiqi Jiang, Donglin Bai, Zhibo Chen, Ting Cao
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06220
Pdf URL: https://arxiv.org/pdf/2503.06220
Copy Paste: [[2503.06220]] StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition(https://arxiv.org/abs/2503.06220)
Keywords: extraction, transformer
Abstract: With the rise of real-world human-AI interaction applications, such as AI assistants, the need for Streaming Video Dialogue is critical. To address this need, we introduce StreamMind, a video LLM framework that achieves ultra-FPS streaming video processing (100 fps on a single A100) and enables proactive, always-on responses in real time, without explicit user intervention. To resolve the key contradiction between linear video streaming speed and quadratic transformer computation cost, we propose a novel perception-cognition interleaving paradigm named "event-gated LLM invocation", in contrast to the existing per-time-step LLM invocation. By introducing a Cognition Gate network between the video encoder and the LLM, the LLM is invoked only when relevant events occur. To realize event feature extraction at constant cost, we propose an Event-Preserving Feature Extractor (EPFE) based on a state-space method, generating a single perception token for spatiotemporal features. These techniques give the video LLM full-FPS perception and real-time cognition response. Experiments on Ego4D and SoccerNet streaming tasks, as well as standard offline benchmarks, demonstrate state-of-the-art performance in both model capability and real-time efficiency, paving the way for ultra-high-FPS applications, such as Game AI Copilot and interactive media.

Title: Vision-based 3D Semantic Scene Completion via Capture Dynamic Representations

Authors: Meng Wang, Fan Wu, Yunchuan Qin, Ruihui Li, Zhuo Tang, Kenli Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06222
Pdf URL: https://arxiv.org/pdf/2503.06222
Copy Paste: [[2503.06222]] Vision-based 3D Semantic Scene Completion via Capture Dynamic Representations(https://arxiv.org/abs/2503.06222)
Keywords: robust
Abstract: The vision-based semantic scene completion task aims to predict dense geometric and semantic 3D scene representations from 2D images. However, the presence of dynamic objects in the scene seriously affects the accuracy of the model inferring 3D structures from 2D images. Existing methods simply stack multiple frames of image input to increase dense scene semantic information, but ignore the fact that dynamic objects and non-texture areas violate multi-view consistency and matching reliability. To address these issues, we propose a novel method, CDScene: Vision-based Robust Semantic Scene Completion via Capturing Dynamic Representations. First, we leverage a multimodal large-scale model to extract 2D explicit semantics and align them into 3D space. Second, we exploit the characteristics of monocular and stereo depth to decouple scene information into dynamic and static features. The dynamic features contain structural relationships around dynamic objects, and the static features contain dense contextual spatial information. Finally, we design a dynamic-static adaptive fusion module to effectively extract and aggregate complementary features, achieving robust and accurate semantic scene completion in autonomous driving scenarios. Extensive experimental results on the SemanticKITTI, SSCBench-KITTI360, and SemanticKITTI-C datasets demonstrate the superiority and robustness of CDScene over existing state-of-the-art methods.

Title: Reinforced Diffuser for Red Teaming Large Vision-Language Models

Authors: Ruofan Wang, Xiang Zheng, Xiaosen Wang, Cong Wang, Xingjun Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06223
Pdf URL: https://arxiv.org/pdf/2503.06223
Copy Paste: [[2503.06223]] Reinforced Diffuser for Red Teaming Large Vision-Language Models(https://arxiv.org/abs/2503.06223)
Keywords: attack, robust, diffusion, large language model
Abstract: The rapid advancement of large Vision-Language Models (VLMs) has raised significant safety concerns, particularly regarding their vulnerability to jailbreak attacks. While existing research primarily focuses on VLMs' susceptibility to harmful instructions, this work identifies a critical yet overlooked vulnerability: current alignment mechanisms often fail to address the risks posed by toxic text continuation tasks. To investigate this issue, we propose a novel Red Team Diffuser (RTD) framework, which leverages reinforcement learning to generate red team images that effectively induce highly toxic continuations from target black-box VLMs. The RTD pipeline begins with a greedy search for high-quality image prompts that maximize the toxicity of VLM-generated sentence continuations, guided by a Large Language Model (LLM). These prompts are then used as input for the reinforcement fine-tuning of a diffusion model, which employs toxicity and alignment rewards to further amplify harmful outputs. Experimental results demonstrate the effectiveness of RTD, increasing the toxicity rate of LLaVA outputs by 10.69% on the original attack set and 8.91% on a hold-out set. Moreover, RTD exhibits strong cross-model transferability, raising the toxicity rate by 5.1% on Gemini and 26.83% on LLaMA. These findings reveal significant deficiencies in existing alignment strategies, particularly their inability to prevent harmful continuations. Our work underscores the urgent need for more robust and adaptive alignment mechanisms to ensure the safe deployment of VLMs in real-world applications.

Title: WaveStitch: Flexible and Fast Conditional Time Series Generation with Diffusion Models

Authors: Aditya Shankar, Lydia Y. Chen, Arie van Deursen, Rihan Hai
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06231
Pdf URL: https://arxiv.org/pdf/2503.06231
Copy Paste: [[2503.06231]] WaveStitch: Flexible and Fast Conditional Time Series Generation with Diffusion Models(https://arxiv.org/abs/2503.06231)
Keywords: robust, diffusion
Abstract: Generating temporal data under constraints is critical for forecasting, imputation, and synthesis. These datasets often include auxiliary conditions that influence the values within the time series signal. Existing methods face three key challenges: (1) they fail to adapt to conditions at inference time; (2) they rely on sequential generation, which slows the generation speed; and (3) they inefficiently encode categorical features, leading to increased sparsity and input sizes. We propose WaveStitch, a novel method that addresses these challenges by leveraging denoising diffusion probabilistic models to efficiently generate accurate temporal data under given auxiliary constraints. WaveStitch overcomes these limitations by: (1) modeling interactions between constraints and signals to generalize to new, unseen conditions; (2) enabling the parallel synthesis of sequential segments with a novel "stitching" mechanism to enforce coherence across segments; and (3) encoding categorical features as compact periodic signals while preserving temporal patterns. Extensive evaluations across diverse datasets highlight WaveStitch's ability to generalize to unseen conditions during inference, achieving up to a 10x lower mean-squared-error compared to the state-of-the-art methods. Moreover, WaveStitch generates data up to 460x faster than autoregressive methods while maintaining comparable accuracy. By efficiently encoding categorical features, WaveStitch provides a robust and efficient solution for temporal data generation. Our code is open-sourced: this https URL

Title: Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning

Authors: Yanjun Chen, Yirong Sun, Xinghao Chen, Jian Wang, Xiaoyu Shen, Wenjie Li, Wei Zhang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2503.06232
Pdf URL: https://arxiv.org/pdf/2503.06232
Copy Paste: [[2503.06232]] Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning(https://arxiv.org/abs/2503.06232)
Keywords: large language model
Abstract: Chain-of-Thought (CoT) reasoning has proven effective in natural language tasks but remains underexplored in multimodal alignment. This study investigates its integration into 3D vision-language learning by embedding structured reasoning into alignment training. We introduce the 3D-CoT Benchmark, a dataset with hierarchical CoT annotations covering shape recognition, functional inference, and causal reasoning. Through controlled experiments, we compare CoT-structured and standard textual annotations across large reasoning models (LRMs) and large language models (LLMs). Our evaluation employs a dual-layer framework assessing both intermediate reasoning and final inference quality. Extensive experiments demonstrate that CoT significantly improves 3D semantic grounding, with LRMs leveraging CoT more effectively than LLMs. Furthermore, we highlight that annotation structure influences performance: explicit reasoning markers aid LLMs, while unmarked CoT better aligns with LRM inference patterns. Our analyses suggest that CoT is crucial for enhancing multimodal reasoning, with implications beyond 3D tasks.

Title: Dynamically evolving segment anything model with continuous learning for medical image segmentation

Authors: Zhaori Liu, Mengyang Li, Hu Han, Enli Zhang, Shiguang Shan, Zhiming Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06236
Pdf URL: https://arxiv.org/pdf/2503.06236
Copy Paste: [[2503.06236]] Dynamically evolving segment anything model with continuous learning for medical image segmentation(https://arxiv.org/abs/2503.06236)
Keywords: segmentation
Abstract: Medical image segmentation is essential for clinical diagnosis, surgical planning, and treatment monitoring. Traditional approaches typically strive to tackle all medical image segmentation scenarios via one-time learning. However, in practical applications, the diversity of scenarios and tasks in medical image segmentation continues to expand, necessitating models that can dynamically evolve to meet the demands of various segmentation tasks. Here, we introduce EvoSAM, a dynamically evolving medical image segmentation model that continuously accumulates new knowledge from an ever-expanding array of scenarios and tasks, enhancing its segmentation capabilities. Extensive evaluations on surgical image blood vessel segmentation and multi-site prostate MRI segmentation demonstrate that EvoSAM not only improves segmentation accuracy but also mitigates catastrophic forgetting. Further experiments conducted by surgical clinicians on blood vessel segmentation confirm that EvoSAM enhances segmentation efficiency based on user prompts, highlighting its potential as a promising tool for clinical applications.

Title: Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models?

Authors: Kun Xiang, Zhili Liu, Zihao Jiang, Yunshuang Nie, Kaixin Cai, Yiyang Yin, Runhui Huang, Haoxiang Fan, Hanhui Li, Weiran Huang, Yihan Zeng, Yu-Jie Yuan, Jianhua Han, Lanqing Hong, Hang Xu, Xiaodan Liang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06252
Pdf URL: https://arxiv.org/pdf/2503.06252
Copy Paste: [[2503.06252]] Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models?(https://arxiv.org/abs/2503.06252)
Keywords: large language model
Abstract: In this paper, we address the challenging task of multimodal mathematical reasoning by incorporating the ability of "slow thinking" into multimodal large language models (MLLMs). Our core idea is that different levels of reasoning abilities can be combined dynamically to tackle questions with different complexity. To this end, we propose a paradigm of Self-structured Chain of Thought (SCoT), which is composed of minimal semantic atomic steps. Different from existing methods that rely on structured templates or free-form paradigms, our method can not only generate cognitive CoT structures for various complex tasks but also mitigate the phenomenon of overthinking. To introduce structured reasoning capabilities into visual understanding models, we further design a novel AtomThink framework with four key modules, including (i) a data engine to generate high-quality multimodal reasoning paths; (ii) a supervised fine-tuning process with serialized inference data; (iii) a policy-guided multi-turn inference method; and (iv) an atomic capability metric to evaluate the single step utilization rate. We conduct extensive experiments to show that the proposed AtomThink significantly improves the performance of baseline MLLMs, achieving more than 10% average accuracy gains on MathVista and MathVerse. Compared to state-of-the-art structured CoT approaches, our method not only achieves higher accuracy but also improves data utilization by 5 times and boosts inference efficiency by 85.3%. Our code is now publicly available at this https URL.

Title: MAD-MAX: Modular And Diverse Malicious Attack MiXtures for Automated LLM Red Teaming

Authors: Stefan Schoepf, Muhammad Zaid Hameed, Ambrish Rawat, Kieran Fraser, Giulio Zizzo, Giandomenico Cornacchia, Mark Purcell
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06253
Pdf URL: https://arxiv.org/pdf/2503.06253
Copy Paste: [[2503.06253]] MAD-MAX: Modular And Diverse Malicious Attack MiXtures for Automated LLM Red Teaming(https://arxiv.org/abs/2503.06253)
Keywords: security, attack
Abstract: With LLM usage rapidly increasing, their vulnerability to jailbreaks that create harmful outputs is a major security risk. As new jailbreaking strategies emerge and models are changed by fine-tuning, continuous testing for security vulnerabilities is necessary. Existing Red Teaming methods fall short in cost efficiency, attack success rate, attack diversity, or extensibility as new attack types emerge. We address these challenges with Modular And Diverse Malicious Attack MiXtures (MAD-MAX) for Automated LLM Red Teaming. MAD-MAX uses automatic assignment of attack strategies into relevant attack clusters, chooses the most relevant clusters for a malicious goal, and then combines strategies from the selected clusters to achieve diverse novel attacks with high attack success rates. MAD-MAX further merges promising attacks together at each iteration of Red Teaming to boost performance and introduces a similarity filter to prune out similar attacks for increased cost efficiency. The MAD-MAX approach is designed to be easily extensible with newly discovered attack strategies and outperforms the prominent Red Teaming method Tree of Attacks with Pruning (TAP) significantly in terms of Attack Success Rate (ASR) and queries needed to achieve jailbreaks. MAD-MAX jailbreaks 97% of malicious goals in our benchmarks on GPT-4o and Gemini-Pro compared to TAP with 66%. MAD-MAX does so with only 10.9 average queries to the target LLM compared to TAP with 23.3. WARNING: This paper contains contents which are offensive in nature.

Title: Poisoned-MRAG: Knowledge Poisoning Attacks to Multimodal Retrieval Augmented Generation

Authors: Yinuo Liu, Zenghui Yuan, Guiyao Tie, Jiawen Shi, Lichao Sun, Neil Zhenqiang Gong
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06254
Pdf URL: https://arxiv.org/pdf/2503.06254
Copy Paste: [[2503.06254]] Poisoned-MRAG: Knowledge Poisoning Attacks to Multimodal Retrieval Augmented Generation(https://arxiv.org/abs/2503.06254)
Keywords: defense, attack
Abstract: Multimodal retrieval-augmented generation (RAG) enhances the visual reasoning capability of vision-language models (VLMs) by dynamically accessing information from external knowledge bases. In this work, we introduce Poisoned-MRAG, the first knowledge poisoning attack on multimodal RAG systems. Poisoned-MRAG injects a few carefully crafted image-text pairs into the multimodal knowledge database, manipulating VLMs to generate the attacker-desired response to a target query. Specifically, we formalize the attack as an optimization problem and propose two cross-modal attack strategies, dirty-label and clean-label, tailored to the attacker's knowledge and goals. Our extensive experiments across multiple knowledge databases and VLMs show that Poisoned-MRAG outperforms existing methods, achieving up to 98% attack success rate with just five malicious image-text pairs injected into the InfoSeek database (481,782 pairs). Additionally, we evaluate 4 different defense strategies, including paraphrasing, duplicate removal, structure-driven mitigation, and purification, demonstrating their limited effectiveness and trade-offs against Poisoned-MRAG. Our results highlight the effectiveness and scalability of Poisoned-MRAG, underscoring its potential as a significant threat to multimodal RAG systems.

Title: From Captions to Rewards (CAREVL): Leveraging Large Language Model Experts for Enhanced Reward Modeling in Large Vision-Language Models

Authors: Muzhi Dai, Jiashuo Sun, Zhiyuan Zhao, Shixuan Liu, Rui Li, Junyu Gao, Xuelong Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06260
Pdf URL: https://arxiv.org/pdf/2503.06260
Copy Paste: [[2503.06260]] From Captions to Rewards (CAREVL): Leveraging Large Language Model Experts for Enhanced Reward Modeling in Large Vision-Language Models(https://arxiv.org/abs/2503.06260)
Keywords: large language model
Abstract: Aligning large vision-language models (LVLMs) with human preferences is challenging due to the scarcity of fine-grained, high-quality, and multimodal preference data without human annotations. Existing methods relying on direct distillation often struggle with low-confidence data, leading to suboptimal performance. To address this, we propose CAREVL, a novel method for preference reward modeling by reliably using both high- and low-confidence data. First, a cluster of auxiliary expert models (textual reward models) innovatively leverages image captions as weak supervision signals to filter high-confidence data. The high-confidence data are then used to fine-tune the LVLM. Second, low-confidence data are used to generate diverse preference samples using the fine-tuned LVLM. These samples are then scored and selected to construct reliable chosen-rejected pairs for further training. CAREVL achieves performance improvements over traditional distillation-based methods on VL-RewardBench and MLLM-as-a-Judge benchmark, demonstrating its effectiveness. The code will be released soon.

Title: Segment Anything, Even Occluded

Authors: Wei-En Tai, Yu-Lin Shih, Cheng Sun, Yu-Chiang Frank Wang, Hwann-Tzong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06261
Pdf URL: https://arxiv.org/pdf/2503.06261
Copy Paste: [[2503.06261]] Segment Anything, Even Occluded(https://arxiv.org/abs/2503.06261)
Keywords: segmentation
Abstract: Amodal instance segmentation, which aims to detect and segment both visible and invisible parts of objects in images, plays a crucial role in various applications including autonomous driving, robotic manipulation, and scene understanding. While existing methods require training both front-end detectors and mask decoders jointly, this approach lacks flexibility and fails to leverage the strengths of pre-existing modal detectors. To address this limitation, we propose SAMEO, a novel framework that adapts the Segment Anything Model (SAM) as a versatile mask decoder capable of interfacing with various front-end detectors to enable mask prediction even for partially occluded objects. Acknowledging the constraints of limited amodal segmentation datasets, we introduce Amodal-LVIS, a large-scale synthetic dataset comprising 300K images derived from the modal LVIS and LVVIS datasets. This dataset significantly expands the training data available for amodal segmentation research. Our experimental results demonstrate that our approach, when trained on the newly extended dataset, including Amodal-LVIS, achieves remarkable zero-shot performance on both COCOA-cls and D2SA benchmarks, highlighting its potential for generalization to unseen scenarios.

Title: Get In Video: Add Anything You Want to the Video

Authors: Shaobin Zhuang, Zhipeng Huang, Binxin Yang, Ying Zhang, Fangyikang Wang, Canmiao Fu, Chong Sun, Zheng-Jun Zha, Chen Li, Yali Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06268
Pdf URL: https://arxiv.org/pdf/2503.06268
Copy Paste: [[2503.06268]] Get In Video: Add Anything You Want to the Video(https://arxiv.org/abs/2503.06268)
Keywords: diffusion, transformer
Abstract: Video editing increasingly demands the ability to incorporate specific real-world instances into existing footage, yet current approaches fundamentally fail to capture the unique visual characteristics of particular subjects and ensure natural instance/scene interactions. We formalize this overlooked yet critical editing paradigm as "Get-In-Video Editing", where users provide reference images to precisely specify visual elements they wish to incorporate into videos. Addressing this task's dual challenges, severe training data scarcity and technical challenges in maintaining spatiotemporal coherence, we introduce three key contributions. First, we develop the GetIn-1M dataset, created through our automated Recognize-Track-Erase pipeline, which sequentially performs video captioning, salient instance identification, object detection, temporal tracking, and instance removal to generate high-quality video editing pairs with comprehensive annotations (reference image, tracking mask, instance prompt). Second, we present GetInVideo, a novel end-to-end framework that leverages a diffusion transformer architecture with 3D full attention to process reference images, condition videos, and masks simultaneously, maintaining temporal coherence, preserving visual identity, and ensuring natural scene interactions when integrating reference objects into videos. Finally, we establish GetInBench, the first comprehensive benchmark for the Get-In-Video Editing scenario, demonstrating our approach's superior performance through extensive evaluations. Our work enables accessible, high-quality incorporation of specific real-world subjects into videos, significantly advancing personalized video editing capabilities.

Title: Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models

Authors: Thomas Winninger, Boussad Addad, Katarzyna Kapusta
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06269
Pdf URL: https://arxiv.org/pdf/2503.06269
Copy Paste: [[2503.06269]] Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models(https://arxiv.org/abs/2503.06269)
Keywords: defense, attack, interpretability, large language model
Abstract: Traditional white-box methods for creating adversarial perturbations against LLMs typically rely only on gradient computation from the targeted model, ignoring the internal mechanisms responsible for attack success or failure. Conversely, interpretability studies that analyze these internal mechanisms lack practical applications beyond runtime interventions. We bridge this gap by introducing a novel white-box approach that leverages mechanistic interpretability techniques to craft practical adversarial inputs. Specifically, we first identify acceptance subspaces - sets of feature vectors that do not trigger the model's refusal mechanisms - then use gradient-based optimization to reroute embeddings from refusal subspaces to acceptance subspaces, effectively achieving jailbreaks. This targeted approach significantly reduces computation cost, achieving attack success rates of 80-95% on state-of-the-art models including Gemma2, Llama3.2, and Qwen2.5 within minutes or even seconds, compared to existing techniques that often fail or require hours of computation. We believe this approach opens a new direction for both attack research and defense development. Furthermore, it showcases a practical application of mechanistic interpretability where other methods are less efficient, which highlights its utility. The code and generated datasets are available at this https URL.

Title: Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations

Authors: Jeong Hun Yeo, Minsu Kim, Chae Won Kim, Stavros Petridis, Yong Man Ro
Subjects: cs.CV, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2503.06273
Pdf URL: https://arxiv.org/pdf/2503.06273
Copy Paste: [[2503.06273]] Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations(https://arxiv.org/abs/2503.06273)
Keywords: large language model
Abstract: We explore a novel zero-shot Audio-Visual Speech Recognition (AVSR) framework, dubbed Zero-AVSR, which enables speech recognition in target languages without requiring any audio-visual speech data in those languages. Specifically, we introduce the Audio-Visual Speech Romanizer (AV-Romanizer), which learns language-agnostic speech representations by predicting Roman text. Then, by leveraging the strong multilingual modeling capabilities of Large Language Models (LLMs), we propose converting the predicted Roman text into language-specific graphemes, forming the proposed Cascaded Zero-AVSR. Taking it a step further, we explore a unified Zero-AVSR approach by directly integrating the audio-visual speech representations encoded by the AV-Romanizer into the LLM. This is achieved through finetuning the adapter and the LLM using our proposed multi-task learning scheme. To capture the wide spectrum of phonetic and linguistic diversity, we also introduce a Multilingual Audio-Visual Romanized Corpus (MARC) consisting of 2,916 hours of audio-visual speech data across 82 languages, along with transcriptions in both language-specific graphemes and Roman text. Extensive analysis and experiments confirm that the proposed Zero-AVSR framework has the potential to expand language support beyond the languages seen during the training of the AV-Romanizer.

Title: Exploring Adversarial Transferability between Kolmogorov-arnold Networks

Authors: Songping Wang, Xinquan Yue, Yueming Lyu, Caifeng Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06276
Pdf URL: https://arxiv.org/pdf/2503.06276
Copy Paste: [[2503.06276]] Exploring Adversarial Transferability between Kolmogorov-arnold Networks(https://arxiv.org/abs/2503.06276)
Keywords: defense, attack, robust
Abstract: Kolmogorov-Arnold Networks (KANs) have emerged as a transformative model paradigm, significantly impacting various fields. However, their adversarial robustness remains underexplored, especially across different KAN architectures. To explore this critical safety issue, we conduct an analysis and find that, due to overfitting to the specific basis functions of KANs, adversarial examples possess poor transferability among different KANs. To tackle this challenge, we propose AdvKAN, the first transfer attack method for KANs. AdvKAN integrates two key components: (1) a Breakthrough-Defense Surrogate Model (BDSM), which employs a breakthrough-defense training strategy to mitigate overfitting to the specific structures of KANs; and (2) a Global-Local Interaction (GLI) technique, which promotes sufficient interaction between adversarial gradients of hierarchical levels, further smoothing out loss surfaces of KANs. The two components work together to enhance the strength of transfer attacks among different KANs. Extensive experimental results on various KANs and datasets demonstrate the effectiveness of AdvKAN, which possesses notably superior attack capabilities and deeply reveals the vulnerabilities of KANs. Code will be released upon acceptance.

Title: Mitigating Blockchain extractable value (BEV) threats by Distributed Transaction Sequencing in Blockchains

Authors: Xiongfei Zhao, Hou-Wan Long, Zhengzhe Li, Jiangchuan Liu, Yain-Whar Si
Subjects: cs.CR, cs.CE, cs.DC
Abstract URL: https://arxiv.org/abs/2503.06279
Pdf URL: https://arxiv.org/pdf/2503.06279
Copy Paste: [[2503.06279]] Mitigating Blockchain extractable value (BEV) threats by Distributed Transaction Sequencing in Blockchains(https://arxiv.org/abs/2503.06279)
Keywords: security, protect, attack, fair
Abstract: The rapid growth of Blockchain and Decentralized Finance (DeFi) has introduced new challenges and vulnerabilities that threaten the integrity and efficiency of the ecosystem. This study identifies critical issues such as Transaction Order Dependence (TOD), Blockchain Extractable Value (BEV), and Transaction Importance Diversity (TID), which collectively undermine the fairness and security of DeFi systems. BEV-related activities, including Sandwich attacks, Liquidations, and Transaction Replay, have emerged as significant threats, collectively generating $540.54 million in losses over 32 months across 11,289 addresses, involving 49,691 cryptocurrencies and 60,830 on-chain markets. These attacks exploit transaction mechanics to manipulate asset prices and extract value at the expense of other participants, with Sandwich attacks being particularly impactful. Additionally, the growing adoption of Blockchain in traditional finance highlights the challenge of TID, where high transaction volumes can strain systems and compromise time-sensitive operations. To address these pressing issues, we propose a novel Distributed Transaction Sequencing Strategy (DTSS), which combines forking mechanisms and the Analytic Hierarchy Process (AHP) to enforce fair and transparent transaction ordering in a decentralized manner. Our approach is further enhanced by an optimization framework and the introduction of the Normalized Allocation Disparity Metric (NADM), which ensures optimal parameter selection for transaction prioritization. Experimental evaluations demonstrate that DTSS effectively mitigates BEV risks, enhances transaction fairness, and significantly improves the security and transparency of DeFi ecosystems. This work is essential for protecting the future of decentralized finance and promoting its integration into global financial systems.

Title: Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding

Authors: Seil Kang, Jinyeong Kim, Junhyeok Kim, Seong Jae Hwang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06287
Pdf URL: https://arxiv.org/pdf/2503.06287
Copy Paste: [[2503.06287]] Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding(https://arxiv.org/abs/2503.06287)
Keywords: segmentation
Abstract: Visual grounding seeks to localize the image region corresponding to a free-form text description. Recently, the strong multimodal capabilities of Large Vision-Language Models (LVLMs) have driven substantial improvements in visual grounding, though they inevitably require fine-tuning and additional model components to explicitly generate bounding boxes or segmentation masks. However, we discover that a few attention heads in frozen LVLMs demonstrate strong visual grounding capabilities. We refer to these heads, which consistently capture object locations related to text semantics, as localization heads. Using localization heads, we introduce a straightforward and effective training-free visual grounding framework that utilizes text-to-image attention maps from localization heads to identify the target objects. Surprisingly, only three out of thousands of attention heads are sufficient to achieve competitive localization performance compared to existing LVLM-based visual grounding methods that require fine-tuning. Our findings suggest that LVLMs can innately ground objects based on a deep comprehension of the text-image relationship, as they implicitly focus on relevant image regions to generate informative text outputs. All source code will be made available to the public.

Title: IteRABRe: Iterative Recovery-Aided Block Reduction

Authors: Haryo Akbarianto Wibowo, Haiyue Song, Hideki Tanaka, Masao Utiyama, Alham Fikri Aji, Raj Dabre
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06291
Pdf URL: https://arxiv.org/pdf/2503.06291
Copy Paste: [[2503.06291]] IteRABRe: Iterative Recovery-Aided Block Reduction(https://arxiv.org/abs/2503.06291)
Keywords: large language model
Abstract: Large Language Models (LLMs) have grown increasingly expensive to deploy, driving the need for effective model compression techniques. While block pruning offers a straightforward approach to reducing model size, existing methods often struggle to maintain performance or require substantial computational resources for recovery. We present IteRABRe, a simple yet effective iterative pruning method that achieves superior compression results while requiring minimal computational resources. Using only 2.5M tokens for recovery, our method outperforms baseline approaches by ~3% on average when compressing the Llama3.1-8B and Qwen2.5-7B models. IteRABRe demonstrates particular strength in the preservation of linguistic capabilities, showing an improvement of 5% over the baselines in language-related tasks. Our analysis reveals distinct pruning characteristics between these models, while also demonstrating preservation of multilingual capabilities.

Title: MoEMoE: Question Guided Dense and Scalable Sparse Mixture-of-Expert for Multi-source Multi-modal Answering

Authors: Vinay Kumar Verma, Shreyas Sunil Kulkarni, Happy Mittal, Deepak Gupta
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06296
Pdf URL: https://arxiv.org/pdf/2503.06296
Copy Paste: [[2503.06296]] MoEMoE: Question Guided Dense and Scalable Sparse Mixture-of-Expert for Multi-source Multi-modal Answering(https://arxiv.org/abs/2503.06296)
Keywords: robust
Abstract: Question Answering (QA) and Visual Question Answering (VQA) are well-studied problems in the language and vision domain. One challenging scenario involves multiple sources of information, each of a different modality, where the answer to the question may exist in one or more sources. This scenario contains richer information but is highly complex to handle. In this work, we formulate a novel question-answer generation (QAG) framework in an environment containing multi-source, multimodal information. The answer may belong to any or all sources; therefore, selecting the most prominent answer source or an optimal combination of all sources for a given question is challenging. To address this issue, we propose a question-guided attention mechanism that learns attention across multiple sources and decodes this information for robust and unbiased answer generation. To learn attention within each source, we introduce an explicit alignment between questions and various information sources, which facilitates identifying the most pertinent parts of the source information relative to the question. Scalability in handling diverse questions poses a challenge. We address this by extending our model to a sparse mixture-of-experts (sparse-MoE) framework, enabling it to handle thousands of question types. Experiments on T5 and Flan-T5 using three datasets demonstrate the model's efficacy, supported by ablation studies.

Title: ACAM-KD: Adaptive and Cooperative Attention Masking for Knowledge Distillation

Authors: Qizhen Lan, Qing Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06307
Pdf URL: https://arxiv.org/pdf/2503.06307
Copy Paste: [[2503.06307]] ACAM-KD: Adaptive and Cooperative Attention Masking for Knowledge Distillation(https://arxiv.org/abs/2503.06307)
Keywords: segmentation
Abstract: Dense visual prediction tasks, such as detection and segmentation, are crucial for time-critical applications (e.g., autonomous driving and video surveillance). While deep models achieve strong performance, their efficiency remains a challenge. Knowledge distillation (KD) is an effective model compression technique, but existing feature-based KD methods rely on static, teacher-driven feature selection, failing to adapt to the student's evolving learning state or leverage dynamic student-teacher interactions. To address these limitations, we propose Adaptive student-teacher Cooperative Attention Masking for Knowledge Distillation (ACAM-KD), which introduces two key components: (1) Student-Teacher Cross-Attention Feature Fusion (STCA-FF), which adaptively integrates features from both models for a more interactive distillation process, and (2) Adaptive Spatial-Channel Masking (ASCM), which dynamically generates importance masks to enhance both spatial and channel-wise feature selection. Unlike conventional KD methods, ACAM-KD adapts to the student's evolving needs throughout the entire distillation process. Extensive experiments on multiple benchmarks validate its effectiveness. For instance, on COCO2017, ACAM-KD improves object detection performance by up to 1.4 mAP over the state-of-the-art when distilling a ResNet-50 student from a ResNet-101 teacher. For semantic segmentation on Cityscapes, it boosts mIoU by 3.09 over the baseline with DeepLabV3-MobileNetV2 as the student model.

Title: Text2Story: Advancing Video Storytelling with Text Guidance

Authors: Taewon Kang, Divya Kothandaraman, Ming C. Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06310
Pdf URL: https://arxiv.org/pdf/2503.06310
Copy Paste: [[2503.06310]] Text2Story: Advancing Video Storytelling with Text Guidance(https://arxiv.org/abs/2503.06310)
Keywords: diffusion
Abstract: Generating coherent long-form video sequences from discrete input using only text prompts is a critical task in content creation. While diffusion-based models excel at short video synthesis, long-form storytelling from text remains largely unexplored and difficult because of challenges in maintaining temporal coherence and preserving semantic meaning and action continuity across the video. We introduce a novel storytelling approach to enable seamless video generation with natural action transitions and structured narratives. We present a bidirectional time-weighted latent blending strategy to ensure temporal consistency between segments of the long-form video being generated. Further, our method extends the Black-Scholes algorithm from prompt mixing for image generation to video generation, enabling controlled motion evolution through structured text conditioning. To further enhance motion continuity, we propose a semantic action representation framework to encode high-level action semantics into the blending process, dynamically adjusting transitions based on action similarity, ensuring smooth yet adaptable motion changes. Latent space blending maintains spatial coherence between objects in a scene, while time-weighted blending enforces bidirectional constraints for temporal consistency. This integrative approach prevents abrupt transitions while ensuring fluid storytelling. Extensive experiments demonstrate significant improvements over baselines, achieving temporally consistent and visually compelling video narratives without any additional training. Our approach bridges the gap between short clips and extended video to establish a new paradigm in GenAI-driven video synthesis from text.

Title: GeoLangBind: Unifying Earth Observation with Agglomerative Vision-Language Foundation Models

Authors: Zhitong Xiong, Yi Wang, Weikang Yu, Adam J Stewart, Jie Zhao, Nils Lehmann, Thomas Dujardin, Zhenghang Yuan, Pedram Ghamisi, Xiao Xiang Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06312
Pdf URL: https://arxiv.org/pdf/2503.06312
Copy Paste: [[2503.06312]] GeoLangBind: Unifying Earth Observation with Agglomerative Vision-Language Foundation Models(https://arxiv.org/abs/2503.06312)
Keywords: robust
Abstract: Earth observation (EO) data, collected from diverse sensors with varying imaging principles, present significant challenges in creating unified analytical frameworks. We present GeoLangBind, a novel agglomerative vision-language foundation model that bridges the gap between heterogeneous EO data modalities using language as a unifying medium. Our approach aligns different EO data types into a shared language embedding space, enabling seamless integration and complementary feature learning from diverse sensor data. To achieve this, we construct a large-scale multimodal image-text dataset, GeoLangBind-2M, encompassing six data modalities. GeoLangBind leverages this dataset to develop a zero-shot foundation model capable of processing arbitrary numbers of EO data channels as input. Through our designed Modality-aware Knowledge Agglomeration (MaKA) module and progressive multimodal weight merging strategy, we create a powerful agglomerative foundation model that excels in both zero-shot vision-language comprehension and fine-grained visual understanding. Extensive evaluation across 23 datasets covering multiple tasks demonstrates GeoLangBind's superior performance and versatility in EO applications, offering a robust framework for various environmental monitoring and analysis tasks. The dataset and pretrained models will be publicly available.

Title: Advancing Autonomous Vehicle Intelligence: Deep Learning and Multimodal LLM for Traffic Sign Recognition and Robust Lane Detection

Authors: Chandan Kumar Sah, Ankit Kumar Shaw, Xiaoli Lian, Arsalan Shahid Baig, Tuopu Wen, Kun Jiang, Mengmeng Yang, Diange Yang
Subjects: cs.CV, cs.AI, cs.CL, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2503.06313
Pdf URL: https://arxiv.org/pdf/2503.06313
Copy Paste: [[2503.06313]] Advancing Autonomous Vehicle Intelligence: Deep Learning and Multimodal LLM for Traffic Sign Recognition and Robust Lane Detection(https://arxiv.org/abs/2503.06313)
Keywords: robust, large language model, segmentation
Abstract: Autonomous vehicles (AVs) require reliable traffic sign recognition and robust lane detection capabilities to ensure safe navigation in complex and dynamic environments. This paper introduces an integrated approach combining advanced deep learning techniques and Multimodal Large Language Models (MLLMs) for comprehensive road perception. For traffic sign recognition, we systematically evaluate ResNet-50, YOLOv8, and RT-DETR, achieving state-of-the-art performance of 99.8% with ResNet-50, 98.0% accuracy with YOLOv8, and 96.6% accuracy with RT-DETR despite its higher computational complexity. For lane detection, we propose a CNN-based segmentation method enhanced by polynomial curve fitting, which delivers high accuracy under favorable conditions. Furthermore, we introduce a lightweight, multimodal, LLM-based framework that directly undergoes instruction tuning using small yet diverse datasets, eliminating the need for initial pretraining. This framework effectively handles various lane types, complex intersections, and merging zones, significantly enhancing lane detection reliability by reasoning under adverse conditions. Despite constraints in available training resources, our multimodal approach demonstrates advanced reasoning capabilities, achieving a Frame Overall Accuracy (FRM) of 53.87%, a Question Overall Accuracy (QNS) of 82.83%, lane detection accuracies of 99.6% in clear conditions and 93.0% at night, and robust performance in reasoning about lane invisibility due to rain (88.4%) or road degradation (95.6%). The proposed comprehensive framework markedly enhances AV perception reliability, thus contributing significantly to safer autonomous driving across diverse and challenging road scenarios.

Title: End-to-End Action Segmentation Transformer

Authors: Tieqiao Wang, Sinisa Todorovic
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.06316
Pdf URL: https://arxiv.org/pdf/2503.06316
Copy Paste: [[2503.06316]] End-to-End Action Segmentation Transformer(https://arxiv.org/abs/2503.06316)
Keywords: robust, transformer, segmentation
Abstract: Existing approaches to action segmentation use pre-computed frame features extracted by methods which have been trained on tasks that are different from action segmentation. Also, recent approaches typically use deep framewise representations that lack explicit modeling of action segments. To address these shortcomings, we introduce the first end-to-end solution to action segmentation -- End-to-End Action Segmentation Transformer (EAST). Our key contributions include: (1) a simple and efficient adapter design for effective backbone fine-tuning; (2) a segmentation-by-detection framework for leveraging action proposals initially predicted over a coarsely downsampled video toward labeling of all frames; and (3) a new action-proposal based data augmentation for robust training. EAST achieves state-of-the-art performance on standard benchmarks, including GTEA, 50Salads, Breakfast, and Assembly-101. The model and corresponding code will be released.

Title: Accurate and Efficient Two-Stage Gun Detection in Video

Authors: Badhan Chandra Das, M. Hadi Amini, Yanzhao Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06317
Pdf URL: https://arxiv.org/pdf/2503.06317
Copy Paste: [[2503.06317]] Accurate and Efficient Two-Stage Gun Detection in Video(https://arxiv.org/abs/2503.06317)
Keywords: transformer
Abstract: Object detection in videos plays a crucial role in advancing applications such as public safety and anomaly detection. Existing methods have explored different techniques, including CNN, deep learning, and Transformers, for object detection and video classification. However, detecting tiny objects, e.g., guns, in videos remains challenging due to their small scale and varying appearances in complex scenes. Moreover, existing video analysis models for classification or detection often perform poorly in real-world gun detection scenarios due to limited labeled video datasets for training. Thus, developing efficient methods for effectively capturing tiny object features and designing models capable of accurate gun detection in real-world videos is imperative. To address these challenges, we make three original contributions in this paper. First, we conduct an empirical study of several existing video classification and object detection methods to identify guns in videos. Our extensive analysis shows that these methods may not accurately detect guns in videos. Second, we propose a novel two-stage gun detection method. In stage 1, we train an image-augmented model to effectively classify "Gun" videos. To make the detection more precise and efficient, stage 2 employs an object detection model to locate the exact region of the gun within video frames for videos classified as "Gun" by stage 1. Third, our experimental results demonstrate that the proposed domain-specific method achieves significant performance improvements and enhances efficiency compared with existing techniques. We also discuss challenges and future research directions in gun detection tasks in computer vision.

Title: Pretraining Generative Flow Networks with Inexpensive Rewards for Molecular Graph Generation

Authors: Mohit Pandey, Gopeshh Subbaraj, Artem Cherkasov, Emmanuel Bengio
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06337
Pdf URL: https://arxiv.org/pdf/2503.06337
Copy Paste: [[2503.06337]] Pretraining Generative Flow Networks with Inexpensive Rewards for Molecular Graph Generation(https://arxiv.org/abs/2503.06337)
Keywords: robust, generative
Abstract: Generative Flow Networks (GFlowNets) have recently emerged as a suitable framework for generating diverse and high-quality molecular structures by learning from rewards treated as unnormalized distributions. Previous works in this framework often restrict exploration by using predefined molecular fragments as building blocks, limiting the chemical space that can be accessed. In this work, we introduce Atomic GFlowNets (A-GFNs), a foundational generative model leveraging individual atoms as building blocks to explore drug-like chemical space more comprehensively. We propose an unsupervised pre-training approach using drug-like molecule datasets, which teaches A-GFNs about inexpensive yet informative molecular descriptors such as drug-likeliness, topological polar surface area, and synthetic accessibility scores. These properties serve as proxy rewards, guiding A-GFNs towards regions of chemical space that exhibit desirable pharmacological properties. We further implement a goal-conditioned finetuning process, which adapts A-GFNs to optimize for specific target properties. We pretrain A-GFN on a subset of the ZINC dataset and, using robust evaluation metrics, show the effectiveness of our approach compared to other relevant baseline methods on a wide range of drug design tasks.

Title: Learning to Unlearn while Retaining: Combating Gradient Conflicts in Machine Unlearning

Authors: Gaurav Patel, Qiang Qiu
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.06339
Pdf URL: https://arxiv.org/pdf/2503.06339
Copy Paste: [[2503.06339]] Learning to Unlearn while Retaining: Combating Gradient Conflicts in Machine Unlearning(https://arxiv.org/abs/2503.06339)
Keywords: generative
Abstract: Machine Unlearning has recently garnered significant attention, aiming to selectively remove knowledge associated with specific data while preserving the model's performance on the remaining data. A fundamental challenge in this process is balancing effective unlearning with knowledge retention, as naive optimization of these competing objectives can lead to conflicting gradients, hindering convergence and degrading overall performance. To address this issue, we propose Learning to Unlearn while Retaining, which aims to mitigate gradient conflicts between unlearning and retention objectives. Our approach strategically avoids conflicts through an implicit gradient regularization mechanism that emerges naturally within the proposed framework. This prevents conflicting gradients between unlearning and retention, leading to effective unlearning while preserving the model's utility. We validate our approach across both discriminative and generative tasks, demonstrating its effectiveness in achieving unlearning without compromising performance on remaining data. Our results highlight the advantages of avoiding such gradient conflicts, outperforming existing methods that fail to account for these interactions.

Title: Backdoor Attacks on Discrete Graph Diffusion Models

Authors: Jiawen Wang, Samin Karim, Yuan Hong, Binghui Wang
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06340
Pdf URL: https://arxiv.org/pdf/2503.06340
Copy Paste: [[2503.06340]] Backdoor Attacks on Discrete Graph Diffusion Models(https://arxiv.org/abs/2503.06340)
Keywords: security, defense, attack, steal, diffusion, generative
Abstract: Diffusion models are powerful generative models in continuous data domains such as image and video data. Discrete graph diffusion models (DGDMs) have recently extended them for graph generation, which are crucial in fields like molecule and protein modeling, and obtained the SOTA performance. However, it is risky to deploy DGDMs for safety-critical applications (e.g., drug discovery) without understanding their security vulnerabilities. In this work, we perform the first study on graph diffusion models against backdoor attacks, a severe attack that manipulates both the training and inference/generation phases in graph diffusion models. We first define the threat model, under which we design the attack such that the backdoored graph diffusion model can generate 1) high-quality graphs without backdoor activation, 2) effective, stealthy, and persistent backdoored graphs with backdoor activation, and 3) graphs that are permutation invariant and exchangeable--two core properties in graph generative models. 1) and 2) are validated via empirical evaluations without and with backdoor defenses, while 3) is validated via theoretical results.

Title: GIN-Graph: A Generative Interpretation Network for Model-Level Explanation of Graph Neural Networks

Authors: Xiao Yue, Guangzhi Qu, Lige Gan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06352
Pdf URL: https://arxiv.org/pdf/2503.06352
Copy Paste: [[2503.06352]] GIN-Graph: A Generative Interpretation Network for Model-Level Explanation of Graph Neural Networks(https://arxiv.org/abs/2503.06352)
Keywords: interpretability, generative
Abstract: One significant challenge of exploiting Graph neural networks (GNNs) in real-life scenarios is that they are always treated as black boxes, which creates a need for interpretability. Model-level interpretations explain what patterns maximize the probability of predicting a certain class. However, existing model-level interpretation methods have several limitations, such as generating invalid explanation graphs and requiring extensive manual fine-tuning of hyperparameters. In this paper, we propose a new Generative Interpretation Network for Model-Level Explanation of Graph Neural Networks (GIN-Graph), to generate reliable model-level explanation graphs. Implicit, likelihood-free generative adversarial networks are exploited to construct explanation graphs similar to the original graphs while maximizing the prediction probability for a certain class by adopting a novel objective function. Experimental results indicate that GIN-Graph can be easily applied to GNN models trained on a variety of graph datasets to create meaningful explanation graphs without requiring extensive fine-tuning of hyperparameters.

Title: Language Model Personalization via Reward Factorization

Authors: Idan Shenfeld, Felix Faltings, Pulkit Agrawal, Aldo Pacchiano
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06358
Pdf URL: https://arxiv.org/pdf/2503.06358
Copy Paste: [[2503.06358]] Language Model Personalization via Reward Factorization(https://arxiv.org/abs/2503.06358)
Keywords: large language model
Abstract: Modern large language models (LLMs) are optimized for human-aligned responses using Reinforcement Learning from Human Feedback (RLHF). However, existing RLHF approaches assume a universal preference model and fail to account for individual user preferences, limiting their effectiveness in personalized applications. We introduce a framework that extends RLHF to enable user personalization by leveraging the assumption that user preferences lie in a low-dimensional space. Instead of training a separate model per user, we represent user-specific rewards as a linear combination of base reward functions. Using only ~10 user responses, our method can infer user-specific rewards and align LLM outputs accordingly. We validate our approach through experiments with both synthetic and real users, demonstrating significant personalization achieved by our method. In human evaluations, our method achieves a 67% win rate over default GPT-4o responses.
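
Illustration (not from the paper): the abstract describes user-specific rewards as a linear combination of base reward functions, inferred from a handful of user responses. The sketch below shows one way such weights could be fitted, assuming the user feedback takes the form of pairwise preferences and using a Bradley-Terry (logistic) likelihood with plain gradient ascent. The pairwise format, the fitting procedure, and all names and numbers are assumptions for this example; the authors' actual inference method may differ.

```python
import math


def user_reward(weights, base_rewards):
    """User-specific reward: a linear combination of base reward values."""
    return sum(w * r for w, r in zip(weights, base_rewards))


def fit_user_weights(preferences, dim, lr=0.5, steps=500):
    """Fit combination weights from (chosen, rejected) base-reward vectors.

    Maximizes the Bradley-Terry log-likelihood
    sum log sigmoid(w . (chosen - rejected)) by gradient ascent.
    """
    weights = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for chosen, rejected in preferences:
            diff = [c - r for c, r in zip(chosen, rejected)]
            margin = sum(w * d for w, d in zip(weights, diff))
            # d/dw log sigmoid(margin) = (1 - sigmoid(margin)) * diff
            scale = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            for i, d in enumerate(diff):
                grad[i] += scale * d
        weights = [w + lr * g / len(preferences) for w, g in zip(weights, grad)]
    return weights


# Hypothetical data: each response is scored by two base reward functions,
# and this user consistently prefers responses that score high on the first.
preferences = [
    ([0.9, 0.1], [0.2, 0.8]),
    ([0.7, 0.3], [0.1, 0.6]),
    ([0.8, 0.5], [0.3, 0.9]),
]
weights = fit_user_weights(preferences, dim=2)
```

Because only a low-dimensional weight vector is estimated, a few comparisons can be enough to separate users, which is the intuition behind the low-dimensional preference assumption stated in the abstract.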

Title: Adversarial Robustness of Discriminative Self-Supervised Learning in Vision

Authors: Ömer Veysel Çağatan, Ömer Faruk Tal, M. Emre Gürsoy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06361
Pdf URL: https://arxiv.org/pdf/2503.06361
Copy Paste: [[2503.06361]] Adversarial Robustness of Discriminative Self-Supervised Learning in Vision(https://arxiv.org/abs/2503.06361)
Keywords: attack, robust, segmentation
Abstract: Self-supervised learning (SSL) has advanced significantly in visual representation learning, yet comprehensive evaluations of its adversarial robustness remain limited. In this study, we evaluate the adversarial robustness of seven discriminative self-supervised models and one supervised model across diverse tasks, including ImageNet classification, transfer learning, segmentation, and detection. Our findings suggest that discriminative SSL models generally exhibit better robustness to adversarial attacks compared to their supervised counterpart on ImageNet, with this advantage extending to transfer learning when using linear evaluation. However, when fine-tuning is applied, the robustness gap between SSL and supervised models narrows considerably. Similarly, this robustness advantage diminishes in segmentation and detection tasks. We also investigate how various factors might influence adversarial robustness, including architectural choices, training duration, data augmentations, and batch sizes. Our analysis contributes to the ongoing exploration of adversarial robustness in visual self-supervised representation systems.

Title: Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs

Authors: Umberto Cappellazzo, Minsu Kim, Stavros Petridis
Subjects: cs.CV, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2503.06362
Pdf URL: https://arxiv.org/pdf/2503.06362
Copy Paste: [[2503.06362]] Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs(https://arxiv.org/abs/2503.06362)
Keywords: robust, large language model
Abstract: Audio-Visual Speech Recognition (AVSR) leverages both audio and visual modalities to enhance speech recognition robustness, particularly in noisy environments. Recent advancements in Large Language Models (LLMs) have demonstrated their effectiveness in speech recognition, including AVSR. However, due to the significant length of speech representations, direct integration with LLMs imposes substantial computational costs. Prior approaches address this by compressing speech representations before feeding them into LLMs. However, higher compression ratios often lead to performance degradation, necessitating a trade-off between computational efficiency and recognition accuracy. To address this challenge, we propose Llama-MTSK, the first Matryoshka-based Multimodal LLM for AVSR, which enables flexible adaptation of the audio-visual token allocation based on specific computational constraints while preserving high performance. Our approach, inspired by Matryoshka Representation Learning, encodes audio-visual representations at multiple granularities within a single model, eliminating the need to train separate models for different compression levels. Moreover, to efficiently fine-tune the LLM, we introduce three LoRA-based Matryoshka strategies using global and scale-specific LoRA modules. Extensive evaluations on the two largest AVSR datasets demonstrate that Llama-MTSK achieves state-of-the-art results, matching or surpassing models trained independently at fixed compression levels.

Title: Generative Video Bi-flow

Authors: Chen Liu, Tobias Ritschel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06364
Pdf URL: https://arxiv.org/pdf/2503.06364
Copy Paste: [[2503.06364]] Generative Video Bi-flow(https://arxiv.org/abs/2503.06364)
Keywords: robust, diffusion, generative
Abstract: We propose a novel generative video model by robustly learning temporal change as a neural Ordinary Differential Equation (ODE) flow with a bilinear objective of combining two aspects: The first is to map from the past into future video frames directly. Previous work has mapped the noise to new frames, a more computationally expensive process. Unfortunately, starting from the previous frame, instead of noise, is more prone to drifting errors. Hence, second, we additionally learn how to remove the accumulated errors as the joint objective by adding noise during training. We demonstrate unconditional video generation in a streaming manner for various video datasets, all at competitive quality compared to a baseline conditional diffusion but with higher speed, i.e., fewer ODE solver steps.

Title: Machine Learning meets Algebraic Combinatorics: A Suite of Datasets Capturing Research-level Conjecturing Ability in Pure Mathematics

Authors: Herman Chau, Helen Jenne, Davis Brown, Jesse He, Mark Raugas, Sara Billey, Henry Kvinge
Subjects: cs.LG, cs.AI, math.CO, math.RT
Abstract URL: https://arxiv.org/abs/2503.06366
Pdf URL: https://arxiv.org/pdf/2503.06366
Copy Paste: [[2503.06366]] Machine Learning meets Algebraic Combinatorics: A Suite of Datasets Capturing Research-level Conjecturing Ability in Pure Mathematics(https://arxiv.org/abs/2503.06366)
Keywords: interpretability
Abstract: With recent dramatic increases in AI system capabilities, there has been growing interest in utilizing machine learning for reasoning-heavy, quantitative tasks, particularly mathematics. While there are many resources capturing mathematics at the high-school, undergraduate, and graduate level, there are far fewer resources available that align with the level of difficulty and open endedness encountered by professional mathematicians working on open problems. To address this, we introduce a new collection of datasets, the Algebraic Combinatorics Dataset Repository (ACD Repo), representing either foundational results or open problems in algebraic combinatorics, a subfield of mathematics that studies discrete structures arising from abstract algebra. Further differentiating our dataset collection is the fact that it aims at the conjecturing process. Each dataset includes an open-ended research-level question and a large collection of examples (up to 10M in some cases) from which conjectures should be generated. We describe all nine datasets, the different ways machine learning models can be applied to them (e.g., training with narrow models followed by interpretability analysis or program synthesis with LLMs), and discuss some of the challenges involved in designing datasets like these.

Title: VORTEX: Challenging CNNs at Texture Recognition by using Vision Transformers with Orderless and Randomized Token Encodings

Authors: Leonardo Scabini, Kallil M. Zielinski, Emir Konuk, Ricardo T. Fares, Lucas C. Ribas, Kevin Smith, Odemir M. Bruno
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06368
Pdf URL: https://arxiv.org/pdf/2503.06368
Copy Paste: [[2503.06368]] VORTEX: Challenging CNNs at Texture Recognition by using Vision Transformers with Orderless and Randomized Token Encodings(https://arxiv.org/abs/2503.06368)
Keywords: robust, transformer
Abstract: Texture recognition has recently been dominated by ImageNet-pre-trained deep Convolutional Neural Networks (CNNs), with specialized modifications and feature engineering required to achieve state-of-the-art (SOTA) performance. However, although Vision Transformers (ViTs) were introduced a few years ago, little is known about their texture recognition ability. Therefore, in this work, we introduce VORTEX (ViTs with Orderless and Randomized Token Encodings for Texture Recognition), a novel method that enables the effective use of ViTs for texture analysis. VORTEX extracts multi-depth token embeddings from pre-trained ViT backbones and employs a lightweight module to aggregate hierarchical features and perform orderless encoding, obtaining a better image representation for texture recognition tasks. This approach allows seamless integration with any ViT with the common transformer architecture. Moreover, no fine-tuning of the backbone is performed, since they are used only as frozen feature extractors, and the features are fed to a linear SVM. We evaluate VORTEX on nine diverse texture datasets, demonstrating its ability to achieve or surpass SOTA performance in a variety of texture analysis scenarios. By bridging the gap between texture recognition with CNNs and transformer-based architectures, VORTEX paves the way for adopting emerging transformer foundation models. Furthermore, VORTEX demonstrates robust computational efficiency when coupled with ViT backbones compared to CNNs with similar costs. The method implementation and experimental scripts are publicly available in our online repository.

Title: Spectral State Space Model for Rotation-Invariant Visual Representation Learning

Authors: Sahar Dastani, Ali Bahri, Moslem Yazdanpanah, Mehrdad Noori, David Osowiechi, Gustavo Adolfo Vargas Hakim, Farzad Beizaee, Milad Cheraghalikhani, Arnab Kumar Mondal, Herve Lombaert, Christian Desrosiers
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06369
Pdf URL: https://arxiv.org/pdf/2503.06369
Copy Paste: [[2503.06369]] Spectral State Space Model for Rotation-Invariant Visual Representation Learning(https://arxiv.org/abs/2503.06369)
Keywords: transformer
Abstract: State Space Models (SSMs) have recently emerged as an alternative to Vision Transformers (ViTs) due to their unique ability of modeling global relationships with linear complexity. SSMs are specifically designed to capture spatially proximate relationships of image patches. However, they fail to identify relationships between conceptually related yet not adjacent patches. This limitation arises from the non-causal nature of image data, which lacks inherent directional relationships. Additionally, current vision-based SSMs are highly sensitive to transformations such as rotation. Their predefined scanning directions depend on the original image orientation, which can cause the model to produce inconsistent patch-processing sequences after rotation. To address these limitations, we introduce Spectral VMamba, a novel approach that effectively captures the global structure within an image by leveraging spectral information derived from the graph Laplacian of image patches. Through spectral decomposition, our approach encodes patch relationships independently of image orientation, achieving rotation invariance with the aid of our Rotational Feature Normalizer (RFN) module. Our experiments on classification tasks show that Spectral VMamba outperforms the leading SSM models in vision, such as VMamba, while maintaining invariance to rotations and providing similar runtime efficiency.

Title: EPR-GAIL: An EPR-Enhanced Hierarchical Imitation Learning Framework to Simulate Complex User Consumption Behaviors

Authors: Tao Feng, Yunke Zhang, Huandong Wang, Yong Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06392
Pdf URL: https://arxiv.org/pdf/2503.06392
Copy Paste: [[2503.06392]] EPR-GAIL: An EPR-Enhanced Hierarchical Imitation Learning Framework to Simulate Complex User Consumption Behaviors(https://arxiv.org/abs/2503.06392)
Keywords: generative
Abstract: User consumption behavior data, which records individuals' online spending history at various types of stores, has been widely used in various applications, such as store recommendation, site selection, and sale forecasting. However, its value is limited by deficiencies in data comprehensiveness and changes in application scenarios. Thus, generating high-quality sequential consumption data by simulating complex user consumption behaviors is of great importance to real-world applications. Two branches of existing sequence generation methods are both limited in quality. Model-based methods with simplified assumptions fail to model the complex decision process of user consumption, while data-driven methods that emulate real-world data are prone to noise, unobserved behaviors, and dynamic decision space. In this work, we propose to enhance the fidelity and trustworthiness of the data-driven Generative Adversarial Imitation Learning (GAIL) method by blending it with the Exploration and Preferential Return (EPR) model. The core idea of our EPR-GAIL framework is to model user consumption behaviors as a complex EPR decision process, which consists of purchase, exploration, and preference decisions. Specifically, we design the hierarchical policy function in the generator as a realization of the EPR decision process and employ the probability distributions of the EPR model to guide the reward function in the discriminator. Extensive experiments on two real-world datasets of user consumption behaviors on an online platform demonstrate that the EPR-GAIL framework outperforms the best state-of-the-art baseline by over 19% in terms of data fidelity. Furthermore, the generated consumption behavior data can improve the performance of sale prediction and location recommendation by up to 35.29% and 11.19%, respectively, validating its advantage for practical applications.

Title: How LLMs Learn: Tracing Internal Representations with Sparse Autoencoders

Authors: Tatsuro Inaba, Kentaro Inui, Yusuke Miyao, Yohei Oseki, Benjamin Heinzerling, Yu Takagi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06394
Pdf URL: https://arxiv.org/pdf/2503.06394
Copy Paste: [[2503.06394]] How LLMs Learn: Tracing Internal Representations with Sparse Autoencoders(https://arxiv.org/abs/2503.06394)
Keywords: large language model
Abstract: Large Language Models (LLMs) demonstrate remarkable multilingual capabilities and broad knowledge. However, the internal mechanisms underlying the development of these capabilities remain poorly understood. To investigate this, we analyze how the information encoded in LLMs' internal representations evolves during the training process. Specifically, we train sparse autoencoders at multiple checkpoints of the model and systematically compare the interpretative results across these stages. Our findings suggest that LLMs initially acquire language-specific knowledge independently, followed by cross-linguistic correspondences. Moreover, we observe that after mastering token-level knowledge, the model transitions to learning higher-level, abstract concepts, indicating the development of more conceptual understanding.

Title: Removing Averaging: Personalized Lip-Sync Driven Characters Based on Identity Adapter

Authors: Yanyu Zhu, Licheng Bai, Jintao Xu, Jiwei Tang, Hai-tao Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06397
Pdf URL: https://arxiv.org/pdf/2503.06397
Copy Paste: [[2503.06397]] Removing Averaging: Personalized Lip-Sync Driven Characters Based on Identity Adapter(https://arxiv.org/abs/2503.06397)
Keywords: diffusion, generative
Abstract: Recent advances in diffusion-based lip-syncing generative models have demonstrated their ability to produce highly synchronized talking face videos for visual dubbing. Although these models excel at lip synchronization, they often struggle to maintain fine-grained control over facial details in generated images. In this work, we identify a "lip averaging" phenomenon where the model fails to preserve subtle facial details when dubbing unseen in-the-wild videos. This issue arises because the commonly used UNet backbone primarily integrates audio features into visual representations in the latent space via cross-attention mechanisms and multi-scale fusion, but it struggles to retain fine-grained lip details in the generated faces. To address this issue, we propose UnAvgLip, which extracts identity embeddings from reference videos to generate highly faithful facial sequences while maintaining accurate lip synchronization. Specifically, our method comprises two primary components: (1) an Identity Perceiver module that encodes facial embeddings to align with conditioned audio features; and (2) an ID-CrossAttn module that injects facial embeddings into the generation process, enhancing the model's capability of identity retention. Extensive experiments demonstrate that, at a modest training and inference cost, UnAvgLip effectively mitigates the "averaging" phenomenon in lip inpainting, significantly preserving unique facial characteristics while maintaining precise lip synchronization. Compared with the original approach, our method demonstrates significant improvements of 5% on the identity consistency metric and 2% on the SSIM metric across two benchmark datasets (HDTF and LRW).

Title: FEDS: Feature and Entropy-Based Distillation Strategy for Efficient Learned Image Compression

Authors: Haisheng Fu, Jie Liang, Zhenman Fang, Jingning Han
Subjects: cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2503.06399
Pdf URL: https://arxiv.org/pdf/2503.06399
Copy Paste: [[2503.06399]] FEDS: Feature and Entropy-Based Distillation Strategy for Efficient Learned Image Compression(https://arxiv.org/abs/2503.06399)
Keywords: transformer
Abstract: Learned image compression (LIC) methods have recently outperformed traditional codecs such as VVC in rate-distortion performance. However, their large models and high computational costs have limited their practical adoption. In this paper, we first construct a high-capacity teacher model by integrating Swin-Transformer V2-based attention modules, additional residual blocks, and expanded latent channels, thus achieving enhanced compression performance. Building on this foundation, we propose a Feature and Entropy-based Distillation Strategy (FEDS) that transfers key knowledge from the teacher to a lightweight student model. Specifically, we align intermediate feature representations and emphasize the most informative latent channels through an entropy-based loss. A staged training scheme refines this transfer in three phases: feature alignment, channel-level distillation, and final fine-tuning. Our student model nearly matches the teacher across Kodak (1.24% BD-Rate increase), Tecnick (1.17%), and CLIC (0.55%) while cutting parameters by about 63% and accelerating encoding/decoding by around 73%. Moreover, ablation studies indicate that FEDS generalizes effectively to transformer-based networks. The experimental results demonstrate our approach strikes a compelling balance among compression performance, speed, and model parameters, making it well-suited for real-time or resource-limited scenarios.

Title: Consistent Image Layout Editing with Diffusion Models

Authors: Tao Xia, Yudi Zhang, Ting Liu, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06419
Pdf URL: https://arxiv.org/pdf/2503.06419
Copy Paste: [[2503.06419]] Consistent Image Layout Editing with Diffusion Models(https://arxiv.org/abs/2503.06419)
Keywords: diffusion
Abstract: Despite the great success of large-scale text-to-image diffusion models in image generation and image editing, existing methods still struggle to edit the layout of real images. Although a few works have been proposed to tackle this problem, they either fail to adjust the layout of images or have difficulty preserving the visual appearance of objects after the layout adjustment. To bridge this gap, this paper proposes a novel image layout editing method that can not only re-arrange a real image to a specified layout, but also keep the visual appearance of the objects consistent with their appearance before editing. Concretely, the proposed method consists of two key components. Firstly, a multi-concept learning scheme is used to learn the concepts of different objects from a single image, which is crucial for keeping visual consistency in the layout editing. Secondly, it leverages the semantic consistency within intermediate features of diffusion models to project the appearance information of objects to the desired regions directly. Besides, a novel initialization noise design is adopted to facilitate the process of re-arranging the layout. Extensive experiments demonstrate that the proposed method outperforms previous works in both layout alignment and visual consistency for the task of image layout editing.

Title: Training LLM-based Tutors to Improve Student Learning Outcomes in Dialogues

Authors: Alexander Scarlatos, Naiming Liu, Jaewook Lee, Richard Baraniuk, Andrew Lan
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2503.06424
Pdf URL: https://arxiv.org/pdf/2503.06424
Copy Paste: [[2503.06424]] Training LLM-based Tutors to Improve Student Learning Outcomes in Dialogues(https://arxiv.org/abs/2503.06424)
Keywords: generative, large language model
Abstract: Generative artificial intelligence (AI) has the potential to scale up personalized tutoring through large language models (LLMs). Recent AI tutors are adapted for the tutoring task by training or prompting LLMs to follow effective pedagogical principles, though they are not trained to maximize student learning throughout the course of a dialogue. Therefore, they may engage with students in a suboptimal way. We address this limitation by introducing an approach to train LLMs to generate tutor utterances that maximize the likelihood of student correctness, while still encouraging the model to follow good pedagogical practice. Specifically, we generate a set of candidate tutor utterances and score them using (1) an LLM-based student model to predict the chance of correct student responses and (2) a pedagogical rubric evaluated by GPT-4o. We then use the resulting data to train an open-source LLM, Llama 3.1 8B, using direct preference optimization. We show that tutor utterances generated by our model lead to significantly higher chances of correct student responses while maintaining the pedagogical quality of GPT-4o. We also conduct qualitative analyses and a human evaluation to demonstrate that our model generates high quality tutor utterances.
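
Illustration (not from the paper): the candidate-scoring step described above could feed direct preference optimization by turning scored utterances into (chosen, rejected) pairs. The sketch below is a minimal version under stated assumptions: the field names `p_correct` and `rubric`, the linear `weight` used to combine the two scores, and the `margin` filter are all hypothetical, since the abstract does not say how the scores are combined.

```python
from itertools import combinations


def build_preference_pairs(candidates, weight=0.5, margin=0.1):
    """Turn scored tutor utterances into (chosen, rejected) pairs.

    candidates: list of dicts with keys
        'text'      - the candidate tutor utterance
        'p_correct' - student-model probability of a correct reply, in [0, 1]
        'rubric'    - pedagogical rubric score, rescaled to [0, 1]
    weight and margin are assumed hyperparameters, not taken from the paper.
    """
    # Combine both signals into one scalar score per candidate (assumed form).
    scored = [
        (weight * c["p_correct"] + (1 - weight) * c["rubric"], c["text"])
        for c in candidates
    ]
    pairs = []
    for (s_a, t_a), (s_b, t_b) in combinations(scored, 2):
        # Skip pairs whose scores are too close to give a clear preference.
        if abs(s_a - s_b) < margin:
            continue
        chosen, rejected = (t_a, t_b) if s_a > s_b else (t_b, t_a)
        pairs.append({"chosen": chosen, "rejected": rejected})
    return pairs


example = [
    {"text": "A", "p_correct": 0.9, "rubric": 0.8},
    {"text": "B", "p_correct": 0.4, "rubric": 0.9},
    {"text": "C", "p_correct": 0.8, "rubric": 0.8},
]
pairs = build_preference_pairs(example)
```

Pairs in this chosen/rejected shape are what preference-optimization trainers typically consume; the margin filter is one simple way to avoid training on near-ties.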

Title: Federated Learning for Diffusion Models

Authors: Zihao Peng, Xijun Wang, Shengbo Chen, Hong Rao, Cong Shen
Subjects: cs.LG, cs.CV, cs.DC
Abstract URL: https://arxiv.org/abs/2503.06426
Pdf URL: https://arxiv.org/pdf/2503.06426
Copy Paste: [[2503.06426]] Federated Learning for Diffusion Models(https://arxiv.org/abs/2503.06426)
Keywords: federate, diffusion, generative
Abstract: Diffusion models are powerful generative models that can produce highly realistic samples for various tasks. Typically, these models are constructed using centralized, independently and identically distributed (IID) training data. However, in practical scenarios, data is often distributed across multiple clients and frequently manifests non-IID characteristics. Federated Learning (FL) can leverage this distributed data to train diffusion models, but the performance of existing FL methods is unsatisfactory in non-IID scenarios. To address this, we propose FedDDPM (Federated Learning with Denoising Diffusion Probabilistic Models), which leverages the data generative capability of diffusion models to facilitate model training. In particular, the server uses well-trained local diffusion models uploaded by each client before FL training to generate auxiliary data that can approximately represent the global data distribution. Following each round of model aggregation, the server further optimizes the global model using the auxiliary dataset to alleviate the impact of heterogeneous data on model performance. We provide a rigorous convergence analysis of FedDDPM and propose an enhanced algorithm, FedDDPM+, to reduce training overheads. FedDDPM+ detects instances of slow model learning and performs a one-shot correction using the auxiliary dataset. Experimental results validate that our proposed algorithms outperform the state-of-the-art FL algorithms on the MNIST, CIFAR10 and CIFAR100 datasets.
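
Illustration (not from the paper): the round structure described above, aggregation followed by a server-side refinement on auxiliary data, can be sketched with a toy model. This is only a sketch under loud assumptions: a two-parameter linear regressor stands in for the diffusion model, plain size-weighted averaging stands in for the aggregation rule, and the function names, learning rate, and step count are hypothetical.

```python
def fedavg(client_weights, client_sizes):
    """Size-weighted average of client parameter vectors (standard FedAvg)."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def server_refine(weights, aux_data, lr=0.1, steps=50):
    """Refine the aggregated model on server-side auxiliary data.

    Toy stand-in: gradient descent on the mean squared error of
    y ~ w0 + w1 * x. In the setting above, aux_data would instead be
    samples generated by the uploaded local diffusion models.
    """
    w0, w1 = weights
    n = len(aux_data)
    for _ in range(steps):
        g0 = sum(2 * (w0 + w1 * x - y) for x, y in aux_data) / n
        g1 = sum(2 * (w0 + w1 * x - y) * x for x, y in aux_data) / n
        w0 -= lr * g0
        w1 -= lr * g1
    return [w0, w1]


# One communication round: aggregate, then correct with auxiliary data.
aggregated = fedavg([[0.0, 1.0], [2.0, 3.0]], client_sizes=[1, 3])
aux = [(-1.0, -1.0), (0.0, 1.0), (1.0, 3.0)]  # drawn from y = 1 + 2x
refined = server_refine(aggregated, aux)
```

The point of the sketch is the ordering: the auxiliary-data step runs after aggregation, so it can pull a global model skewed by heterogeneous clients back toward the (approximate) global distribution.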

Title: Pre-Training Meta-Rule Selection Policy for Visual Generative Abductive Learning

Authors: Yu Jin, Jingming Liu, Zhexu Luo, Yifei Peng, Ziang Qin, Wang-Zhou Dai, Yao-Xiang Ding, Kun Zhou
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.06427
Pdf URL: https://arxiv.org/pdf/2503.06427
Copy Paste: [[2503.06427]] Pre-Training Meta-Rule Selection Policy for Visual Generative Abductive Learning(https://arxiv.org/abs/2503.06427)
Keywords: generative
Abstract: Visual generative abductive learning studies jointly training a symbol-grounded neural visual generator and inducing logic rules from data, such that after learning, the visual generation process is guided by the induced logic rules. A major challenge for this task is to reduce the time cost of logic abduction during learning, an essential step when the logic symbol set is large and the logic rule to induce is complicated. To address this challenge, we propose a pre-training method for obtaining a meta-rule selection policy for the recently proposed visual generative learning approach AbdGen [Peng et al., 2023], aiming at significantly reducing the candidate meta-rule set and pruning the search space. The selection model is built on the embedding representation of both the symbol grounding of cases and the meta-rules, which can be effectively integrated with both the neural model and the logic reasoning system. The pre-training process is done on pure symbol data, not involving symbol grounding learning of raw visual inputs, making the entire learning process low-cost. An additional interesting observation is that the selection policy can rectify symbol grounding errors unseen during pre-training, which results from the memorization ability of the attention mechanism and the relative stability of symbolic patterns. Experimental results show that our method is able to effectively address the meta-rule selection problem for visual abduction, boosting the efficiency of visual generative abductive learning. Code is available at this https URL.

Title: Graph Retrieval-Augmented LLM for Conversational Recommendation Systems

Authors: Zhangchi Qiu, Linhao Luo, Zicheng Zhao, Shirui Pan, Alan Wee-Chung Liew
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2503.06430
Pdf URL: https://arxiv.org/pdf/2503.06430
Copy Paste: [[2503.06430]] Graph Retrieval-Augmented LLM for Conversational Recommendation Systems(https://arxiv.org/abs/2503.06430)
Keywords: large language model
Abstract: Conversational Recommender Systems (CRSs) have emerged as a transformative paradigm for offering personalized recommendations through natural language dialogue. However, they face challenges with knowledge sparsity, as users often provide brief, incomplete preference statements. While recent methods have integrated external knowledge sources to mitigate this, they still struggle with semantic understanding and complex preference reasoning. Recent Large Language Models (LLMs) demonstrate promising capabilities in natural language understanding and reasoning, showing significant potential for CRSs. Nevertheless, due to the lack of domain knowledge, existing LLM-based CRSs either produce hallucinated recommendations or demand expensive domain-specific training, which largely limits their applicability. In this work, we present G-CRS (Graph Retrieval-Augmented Large Language Model for Conversational Recommender Systems), a novel training-free framework that combines graph retrieval-augmented generation and in-context learning to enhance LLMs' recommendation capabilities. Specifically, G-CRS employs a two-stage retrieve-and-recommend architecture, where a GNN-based graph reasoner first identifies candidate items, followed by Personalized PageRank exploration to jointly discover potential items and similar user interactions. These retrieved contexts are then transformed into structured prompts for LLM reasoning, enabling contextually grounded recommendations without task-specific training. Extensive experiments on two public datasets show that G-CRS achieves superior recommendation performance compared to existing methods without requiring task-specific training.
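
Illustration (not from the paper): Personalized PageRank, named above as the exploration step, is a random walk that restarts at a set of seed nodes, so scores concentrate on nodes close to those seeds. The sketch below is a generic power-iteration version; the graph layout, the restart probability `alpha`, and the toy node names are assumptions, not G-CRS details.

```python
def personalized_pagerank(graph, seeds, alpha=0.15, iters=50):
    """Power-iteration Personalized PageRank.

    graph: dict mapping each node to a list of out-neighbours
           (every node must appear as a key).
    seeds: nodes the random walk restarts from, e.g. candidate items.
    alpha: restart probability (assumed value).
    """
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in graph}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: alpha * restart[n] for n in graph}
        for node, nbrs in graph.items():
            if not nbrs:
                # Dangling node: send its remaining mass back to the seeds.
                for n in graph:
                    nxt[n] += (1 - alpha) * rank[node] * restart[n]
                continue
            share = (1 - alpha) * rank[node] / len(nbrs)
            for nb in nbrs:
                nxt[nb] += share
        rank = nxt
    return rank


toy_graph = {
    "u": ["a", "b"],   # a user linked to two items
    "a": ["u", "c"],
    "b": ["u"],
    "c": ["a"],
    "d": [],           # unrelated, unreachable node
}
scores = personalized_pagerank(toy_graph, seeds=["u"])
```

Ranking nodes by `scores` gives the items and neighbouring users most reachable from the seed, which is the kind of context a retrieval stage could pass into a prompt.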

Title: OV-SCAN: Semantically Consistent Alignment for Novel Object Discovery in Open-Vocabulary 3D Object Detection

Authors: Adrian Chow, Evelien Riddell, Yimu Wang, Sean Sedwards, Krzysztof Czarnecki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06435
Pdf URL: https://arxiv.org/pdf/2503.06435
Copy Paste: [[2503.06435]] OV-SCAN: Semantically Consistent Alignment for Novel Object Discovery in Open-Vocabulary 3D Object Detection(https://arxiv.org/abs/2503.06435)
Keywords: robust
Abstract: Open-vocabulary 3D object detection for autonomous driving aims to detect novel objects beyond the predefined training label sets in point cloud scenes. Existing approaches achieve this by connecting traditional 3D object detectors with vision-language models (VLMs) to regress 3D bounding boxes for novel objects and perform open-vocabulary classification through cross-modal alignment between 3D and 2D features. However, achieving robust cross-modal alignment remains a challenge due to semantic inconsistencies when generating corresponding 3D and 2D feature pairs. To overcome this challenge, we present OV-SCAN, an Open-Vocabulary 3D framework that enforces Semantically Consistent Alignment for Novel object discovery. OV-SCAN employs two core strategies: discovering precise 3D annotations and filtering out low-quality or corrupted alignment pairs (arising from 3D annotation, occlusion-induced, or resolution-induced noise). Extensive experiments on the nuScenes dataset demonstrate that OV-SCAN achieves state-of-the-art performance.

Title: OT-DETECTOR: Delving into Optimal Transport for Zero-shot Out-of-Distribution Detection

Authors: Yu Liu, Hao Tang, Haiqi Zhang, Jing Qin, Zechao Li
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2503.06442
Pdf URL: https://arxiv.org/pdf/2503.06442
Copy Paste: [[2503.06442]] OT-DETECTOR: Delving into Optimal Transport for Zero-shot Out-of-Distribution Detection(https://arxiv.org/abs/2503.06442)
Keywords: robust
Abstract: Out-of-distribution (OOD) detection is crucial for ensuring the reliability and safety of machine learning models in real-world applications. While zero-shot OOD detection, which requires no training on in-distribution (ID) data, has become feasible with the emergence of vision-language models like CLIP, existing methods primarily focus on semantic matching and fail to fully capture distributional discrepancies. To address these limitations, we propose OT-DETECTOR, a novel framework that employs Optimal Transport (OT) to quantify both semantic and distributional discrepancies between test samples and ID labels. Specifically, we introduce cross-modal transport mass and transport cost as semantic-wise and distribution-wise OOD scores, respectively, enabling more robust detection of OOD samples. Additionally, we present a semantic-aware content refinement (SaCR) module, which utilizes semantic cues from ID labels to amplify the distributional discrepancy between ID and hard OOD samples. Extensive experiments on several benchmarks demonstrate that OT-DETECTOR achieves state-of-the-art performance across various OOD detection tasks, particularly in challenging hard-OOD scenarios.
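
Illustration (not from the paper): transport mass and transport cost, as used above, are quantities read off an optimal transport plan. The sketch below uses the standard Sinkhorn iteration for entropy-regularized OT as a stand-in; the regularization strength `eps`, the iteration count, and the toy cost matrix are assumptions, and how OT-DETECTOR actually builds its costs and scores is not specified here.

```python
import math


def sinkhorn(cost, a, b, eps=0.1, iters=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    cost: n x m matrix of pairwise costs (e.g. 1 - cosine similarity
          between image features and label text features).
    a, b: source and target marginals, each summing to 1.
    Returns the transport plan and its total transport cost.
    """
    n, m = len(a), len(b)
    # Gibbs kernel derived from the cost matrix.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return plan, total


toy_cost = [[0.0, 1.0], [1.0, 0.0]]
plan, total_cost = sinkhorn(toy_cost, a=[0.5, 0.5], b=[0.5, 0.5])
```

In this toy case almost all mass stays on the zero-cost diagonal, so the total cost is near zero; a sample that matches no label well would force mass onto expensive entries and raise the cost.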

Title: CtrTab: Tabular Data Synthesis with High-Dimensional and Limited Data

Authors: Zuqing Li, Jianzhong Qi, Junhao Gan
Subjects: cs.LG, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2503.06444
Pdf URL: https://arxiv.org/pdf/2503.06444
Copy Paste: [[2503.06444]] CtrTab: Tabular Data Synthesis with High-Dimensional and Limited Data(https://arxiv.org/abs/2503.06444)
Keywords: robust, diffusion, generative
Abstract: Diffusion-based tabular data synthesis models have yielded promising results. However, we observe that when the data dimensionality increases, existing models tend to degenerate and may perform even worse than simpler, non-diffusion-based models. This is because limited training samples in high-dimensional space often hinder generative models from capturing the distribution accurately. To address this issue, we propose CtrTab, a condition-controlled diffusion model for tabular data synthesis, to improve the performance of diffusion-based generative models in high-dimensional, low-data scenarios. Through CtrTab, we inject samples with added Laplace noise as control signals to improve data diversity and show its resemblance to L2 regularization, which enhances model robustness. Experimental results across multiple datasets show that CtrTab outperforms state-of-the-art models, with a performance gap in accuracy of over 80% on average. Our source code will be released upon paper publication.

Title: A Quantitative Evaluation of the Expressivity of BMI, Pose and Gender in Body Embeddings for Recognition and Identification

Authors: Basudha Pal, Siyuan (Cyan) Huang, Rama Chellappa
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06451
Pdf URL: https://arxiv.org/pdf/2503.06451
Copy Paste: [[2503.06451]] A Quantitative Evaluation of the Expressivity of BMI, Pose and Gender in Body Embeddings for Recognition and Identification(https://arxiv.org/abs/2503.06451)
Keywords: robust, extraction, fair
Abstract: Person Re-identification (ReID) systems identify individuals across images or video frames and play a critical role in various real-world applications. However, many ReID methods are influenced by sensitive attributes such as gender, pose, and body mass index (BMI), which vary in uncontrolled environments, leading to biases and reduced generalization. To address this, we extend the concept of expressivity to the body recognition domain to better understand how ReID models encode these attributes. Expressivity, defined as the mutual information between feature vector representations and specific attributes, is computed using a secondary neural network that takes feature and attribute vectors as inputs. This provides a quantitative framework for analyzing the extent to which sensitive attributes are embedded in the model's representations. We apply expressivity analysis to SemReID, a state-of-the-art self-supervised ReID model, and find that BMI consistently exhibits the highest expressivity scores in the model's final layers, underscoring its dominant role in feature encoding. In the final attention layer of the trained network, the expressivity order for body attributes is BMI > Pitch > Yaw > Gender, highlighting their relative importance in learned representations. Additionally, expressivity values evolve progressively across network layers and training epochs, reflecting a dynamic encoding of attributes during feature extraction. These insights emphasize the influence of body-related attributes on ReID models and provide a systematic methodology for identifying and mitigating attribute-driven biases. By leveraging expressivity analysis, we offer valuable tools to enhance the fairness, robustness, and generalization of ReID systems in diverse real-world settings.

Title: NaviDet: Efficient Input-level Backdoor Detection on Text-to-Image Synthesis via Neuron Activation Variation

Authors: Shengfang Zhai, Jiajun Li, Yue Liu, Huanran Chen, Zhihua Tian, Wenjie Qu, Qingni Shen, Ruoxi Jia, Yinpeng Dong, Jiaheng Zhang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.06453
Pdf URL: https://arxiv.org/pdf/2503.06453
Copy Paste: [[2503.06453]] NaviDet: Efficient Input-level Backdoor Detection on Text-to-Image Synthesis via Neuron Activation Variation(https://arxiv.org/abs/2503.06453)
Keywords: defense, attack, diffusion
Abstract: In recent years, text-to-image (T2I) diffusion models have garnered significant attention for their ability to generate high-quality images reflecting text prompts. However, their growing popularity has also led to the emergence of backdoor threats, posing substantial risks. Currently, effective defense strategies against such threats are lacking due to the diversity of backdoor targets in T2I synthesis. In this paper, we propose NaviDet, the first general input-level backdoor detection framework for identifying backdoor inputs across various backdoor targets. Our approach is based on the new observation that trigger tokens tend to induce significant neuron activation variation in the early stage of the diffusion generation process, a phenomenon we term Early-step Activation Variation. Leveraging this insight, NaviDet detects malicious samples by analyzing neuron activation variations caused by input tokens. Through extensive experiments, we demonstrate the effectiveness and efficiency of our method against various T2I backdoor attacks, surpassing existing baselines with significantly lower computational overhead. Furthermore, we rigorously demonstrate that our method remains effective against potential adaptive attacks.

Title: Privacy Protection in Prosumer Energy Management Based on Federated Learning

Authors: Yunfeng Li, Xiaolin Li, Zhitao Li, Gangqiang Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06455
Pdf URL: https://arxiv.org/pdf/2503.06455
Copy Paste: [[2503.06455]] Privacy Protection in Prosumer Energy Management Based on Federated Learning(https://arxiv.org/abs/2503.06455)
Keywords: privacy, protect, federate
Abstract: With the booming development of prosumers, there is an urgent need for a prosumer energy management system to take full advantage of the flexibility of prosumers and take into account the interests of other parties. However, building such a system will inevitably expose users' private data. In this paper, by solving the non-independent and identically distributed (Non-IID) data problem in federated learning with the federated cluster average (FedClusAvg) algorithm, prosumers' information can efficiently participate in the system's intelligent decision making without revealing private data. In the proposed FedClusAvg algorithm, each client performs cluster stratified sampling and multiple iterations. Then, the average weight of the parameters of the sub-server is determined according to the degree of deviation of the parameter from the average parameter. Finally, the sub-server performs multiple local iterations and updates, and then uploads to the main server. The FedClusAvg algorithm has two advantages. First, the accuracy of the model in the Non-IID case is improved through clustering and parameter weighted averaging. Second, multiple local iterations and the three-tier framework effectively reduce communication rounds.

Title: DynCIM: Dynamic Curriculum for Imbalanced Multimodal Learning

Authors: Chengxuan Qian, Kai Han, Jingchao Wang, Zhenlong Yuan, Rui Qian, Chongwen Lyu, Jun Chen, Zhe Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06456
Pdf URL: https://arxiv.org/pdf/2503.06456
Copy Paste: [[2503.06456]] DynCIM: Dynamic Curriculum for Imbalanced Multimodal Learning(https://arxiv.org/abs/2503.06456)
Keywords: robust
Abstract: Multimodal learning integrates complementary information from diverse modalities to enhance the decision-making process. However, the potential of multimodal collaboration remains under-exploited due to disparities in data quality and modality representation capabilities. To address this, we introduce DynCIM, a novel dynamic curriculum learning framework designed to quantify the inherent imbalances from both sample and modality perspectives. DynCIM employs a sample-level curriculum to dynamically assess each sample's difficulty according to prediction deviation, consistency, and stability, while a modality-level curriculum measures modality contributions from global and local perspectives. Furthermore, a gating-based dynamic fusion mechanism is introduced to adaptively adjust modality contributions, minimizing redundancy and optimizing fusion effectiveness. Extensive experiments on six multimodal benchmarking datasets, spanning both bimodal and trimodal scenarios, demonstrate that DynCIM consistently outperforms state-of-the-art methods. Our approach effectively mitigates modality and sample imbalances while enhancing adaptability and robustness in multimodal learning tasks. Our code is available at this https URL.

Title: Geometric Knowledge-Guided Localized Global Distribution Alignment for Federated Learning

Authors: Yanbiao Ma, Wei Dai, Wenke Huang, Jiayi Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06457
Pdf URL: https://arxiv.org/pdf/2503.06457
Copy Paste: [[2503.06457]] Geometric Knowledge-Guided Localized Global Distribution Alignment for Federated Learning(https://arxiv.org/abs/2503.06457)
Keywords: privacy, federate
Abstract: Data heterogeneity in federated learning, characterized by a significant misalignment between local and global distributions, leads to divergent local optimization directions and hinders global model training. Existing studies mainly focus on optimizing local updates or global aggregation, but these indirect approaches demonstrate instability when handling highly heterogeneous data distributions, especially in scenarios where label skew and domain skew coexist. To address this, we propose a geometry-guided data generation method that centers on simulating the global embedding distribution locally. We first introduce the concept of the geometric shape of an embedding distribution and then address the challenge of obtaining global geometric shapes under privacy constraints. Subsequently, we propose GGEUR, which leverages global geometric shapes to guide the generation of new samples, enabling a closer approximation to the ideal global distribution. In single-domain scenarios, we augment samples based on global geometric shapes to enhance model generalization; in multi-domain scenarios, we further employ class prototypes to simulate the global distribution across domains. Extensive experimental results demonstrate that our method significantly enhances the performance of existing approaches in handling highly heterogeneous data, including scenarios with label skew, domain skew, and their coexistence. Code published at: this https URL

Title: Reconstructing Depth Images of Moving Objects from Wi-Fi CSI Data

Authors: Guanyu Cao, Takuya Maekawa, Kazuya Ohara, Yasue Kishino
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2503.06458
Pdf URL: https://arxiv.org/pdf/2503.06458
Copy Paste: [[2503.06458]] Reconstructing Depth Images of Moving Objects from Wi-Fi CSI Data(https://arxiv.org/abs/2503.06458)
Keywords: security
Abstract: This study proposes a new deep learning method for reconstructing depth images of moving objects within a specific area using Wi-Fi channel state information (CSI). The Wi-Fi-based depth imaging technique has novel applications in domains such as security and elder care. However, reconstructing depth images from CSI is challenging because learning the mapping function between CSI and depth images, both of which are high-dimensional data, is particularly difficult. To address the challenge, we propose a new approach called Wi-Depth. The main idea behind the design of Wi-Depth is that a depth image of a moving object can be decomposed into three core components: the shape, depth, and position of the target. Therefore, in the depth-image reconstruction task, Wi-Depth simultaneously estimates the three core pieces of information as auxiliary tasks in our proposed VAE-based teacher-student architecture, enabling it to output images with the consistency of a correct shape, depth, and position. In addition, the design of Wi-Depth is based on our idea that this decomposition efficiently takes advantage of the fact that shape, depth, and position relate to primitive information inferred from CSI such as angle-of-arrival, time-of-flight, and Doppler frequency shift.

Title: Long-tailed Adversarial Training with Self-Distillation

Authors: Seungju Cho, Hongsin Lee, Changick Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06461
Pdf URL: https://arxiv.org/pdf/2503.06461
Copy Paste: [[2503.06461]] Long-tailed Adversarial Training with Self-Distillation(https://arxiv.org/abs/2503.06461)
Keywords: attack, robust
Abstract: Adversarial training significantly enhances adversarial robustness, yet superior performance is predominantly achieved on balanced datasets. Addressing adversarial robustness in the context of unbalanced or long-tailed distributions is considerably more challenging, mainly due to the scarcity of tail data instances. Previous research on adversarial robustness within long-tailed distributions has primarily focused on combining traditional long-tailed natural training with existing adversarial robustness methods. In this study, we provide an in-depth analysis for the challenge that adversarial training struggles to achieve high performance on tail classes in long-tailed distributions. Furthermore, we propose a simple yet effective solution to advance adversarial robustness on long-tailed distributions through a novel self-distillation technique. Specifically, this approach leverages a balanced self-teacher model, which is trained using a balanced dataset sampled from the original long-tailed dataset. Our extensive experiments demonstrate state-of-the-art performance in both clean and robust accuracy for long-tailed adversarial robustness, with significant improvements in tail class performance on various datasets. We improve the accuracy against PGD attacks for tail classes by 20.3, 7.1, and 3.8 percentage points on CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively, while achieving the highest robust accuracy.

Title: SP3D: Boosting Sparsely-Supervised 3D Object Detection via Accurate Cross-Modal Semantic Prompts

Authors: Shijia Zhao, Qiming Xia, Xusheng Guo, Pufan Zou, Maoji Zheng, Hai Wu, Chenglu Wen, Cheng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06467
Pdf URL: https://arxiv.org/pdf/2503.06467
Copy Paste: [[2503.06467]] SP3D: Boosting Sparsely-Supervised 3D Object Detection via Accurate Cross-Modal Semantic Prompts(https://arxiv.org/abs/2503.06467)
Keywords: robust
Abstract: Recently, sparsely-supervised 3D object detection has gained great attention, achieving performance close to fully-supervised 3D detectors while requiring only a few annotated instances. Nevertheless, these methods struggle when accurate labels are extremely scarce. In this paper, we propose a boosting strategy, termed SP3D, explicitly utilizing the cross-modal semantic prompts generated from Large Multimodal Models (LMMs) to boost the 3D detector with robust feature discrimination capability under sparse annotation settings. Specifically, we first develop a Confident Points Semantic Transfer (CPST) module that generates accurate cross-modal semantic prompts through boundary-constrained center cluster selection. Based on these accurate semantic prompts, which we treat as seed points, we introduce a Dynamic Cluster Pseudo-label Generation (DCPG) module to yield pseudo-supervision signals from the geometric shape of multi-scale neighbor points. Additionally, we design a Distribution Shape score (DS score) that chooses high-quality supervision signals for the initial training of the 3D detector. Experiments on the KITTI dataset and Waymo Open Dataset (WOD) have validated that SP3D can enhance the performance of sparsely supervised detectors by a large margin under meager labeling conditions. Moreover, we verified SP3D in the zero-shot setting, where its performance exceeded that of the state-of-the-art methods. The code is available at this https URL.

Title: CalliReader: Contextualizing Chinese Calligraphy via an Embedding-Aligned Vision-Language Model

Authors: Yuxuan Luo, Jiaqi Tang, Chenyi Huang, Feiyang Hao, Zhouhui Lian
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2503.06472
Pdf URL: https://arxiv.org/pdf/2503.06472
Copy Paste: [[2503.06472]] CalliReader: Contextualizing Chinese Calligraphy via an Embedding-Aligned Vision-Language Model(https://arxiv.org/abs/2503.06472)
Keywords: robust, extraction
Abstract: Chinese calligraphy, a UNESCO Heritage, remains computationally challenging due to visual ambiguity and cultural complexity. Existing AI systems fail to contextualize their intricate scripts, because of limited annotated data and poor visual-semantic alignment. We propose CalliReader, a vision-language model (VLM) that solves the Chinese Calligraphy Contextualization (CC$^2$) problem through three innovations: (1) character-wise slicing for precise character extraction and sorting, (2) CalliAlign for visual-text token compression and alignment, (3) embedding instruction tuning (e-IT) for improving alignment and addressing data scarcity. We also build CalliBench, the first benchmark for full-page calligraphic contextualization, addressing three critical issues in previous OCR and VQA approaches: fragmented context, shallow reasoning, and hallucination. Extensive experiments including user studies have been conducted to verify our CalliReader's superiority to other state-of-the-art methods and even human professionals in page-level calligraphy recognition and interpretation, achieving higher accuracy while reducing hallucination. Comparisons with reasoning models highlight the importance of accurate recognition as a prerequisite for reliable comprehension. Quantitative analyses validate CalliReader's efficiency; evaluations on document and real-world benchmarks confirm its robust generalization ability.

Title: A Mesh Is Worth 512 Numbers: Spectral-domain Diffusion Modeling for High-dimension Shape Generation

Authors: Jiajie Fan, Amal Trigui, Andrea Bonfanti, Felix Dietrich, Thomas Bäck, Hao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06485
Pdf URL: https://arxiv.org/pdf/2503.06485
Copy Paste: [[2503.06485]] A Mesh Is Worth 512 Numbers: Spectral-domain Diffusion Modeling for High-dimension Shape Generation(https://arxiv.org/abs/2503.06485)
Keywords: diffusion, generative
Abstract: Recent advancements in learning latent codes derived from high-dimensional shapes have demonstrated impressive outcomes in 3D generative modeling. Traditionally, these approaches employ a trained autoencoder to acquire a continuous implicit representation of source shapes, which can be computationally expensive. This paper introduces SpoDify, a novel spectral-domain diffusion framework for high-quality shape generation that utilizes singular value decomposition (SVD) for shape encoding. The resulting eigenvectors can be stored for subsequent decoding, while generative modeling is performed on the eigenfeatures. This approach efficiently encodes complex meshes into continuous implicit representations, such as encoding a 15k-vertex mesh to a 512-dimensional latent code without learning. Our method exhibits significant advantages in scenarios with limited samples or GPU resources. In mesh generation tasks, our approach produces high-quality shapes that are comparable to state-of-the-art methods.

Title: PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training

Authors: Cong Chen, Mingyu Liu, Chenchen Jing, Yizhou Zhou, Fengyun Rao, Hao Chen, Bo Zhang, Chunhua Shen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06486
Pdf URL: https://arxiv.org/pdf/2503.06486
Copy Paste: [[2503.06486]] PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training(https://arxiv.org/abs/2503.06486)
Keywords: large language model
Abstract: This paper aims to address the challenge of hallucinations in Multimodal Large Language Models (MLLMs), particularly for dense image captioning tasks. To tackle the challenge, we identify the current lack of a metric that finely measures caption quality at the concept level. We hereby introduce HalFscore, a novel metric built upon the language graph and designed to evaluate both the accuracy and completeness of dense captions at a granular level. Additionally, we identify the root cause of hallucination as the model's over-reliance on its language prior. To address this, we propose PerturboLLaVA, which reduces the model's reliance on the language prior by incorporating adversarially perturbed text during training. This method enhances the model's focus on visual inputs, effectively reducing hallucinations and producing accurate, image-grounded descriptions without incurring additional computational overhead. PerturboLLaVA significantly improves the fidelity of generated captions, outperforming existing approaches in handling multimodal hallucinations and achieving improved performance across general multimodal benchmarks.

Title: A Study of Effectiveness of Brand Domain Identification Features for Phishing Detection in 2025

Authors: Rina Mishra, Gaurav Varshney
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.06487
Pdf URL: https://arxiv.org/pdf/2503.06487
Copy Paste: [[2503.06487]] A Study of Effectiveness of Brand Domain Identification Features for Phishing Detection in 2025(https://arxiv.org/abs/2503.06487)
Keywords: security, robust
Abstract: Phishing websites continue to pose a significant security challenge, making the development of robust detection mechanisms essential. Brand Domain Identification (BDI) serves as a crucial step in many phishing detection approaches. This study systematically evaluates the effectiveness of features employed over the past decade for BDI, focusing on their weighted importance in phishing detection as of 2025. The primary objective is to determine whether the identified brand domain matches the claimed domain, utilizing popular features for phishing detection. To validate feature importance and evaluate performance, we conducted two experiments on a dataset comprising 4,667 legitimate sites and 4,561 phishing sites. In Experiment 1, we used the Weka tool, through its four attribute ranking evaluators, to identify optimized and important feature sets out of five features: CN Information (CN), Logo Domain (LD), Form Action Domain (FAD), Most Common Link in Domain (MCLD), and Cookie Domain. The results revealed that none of the features were redundant, and Random Forest emerged as the best classifier, achieving an impressive accuracy of 99.7% with an average response time of 0.08 seconds. In Experiment 2, we trained five machine learning models, including Random Forest, Decision Tree, Support Vector Machine, Multilayer Perceptron, and XGBoost, to assess the performance of individual BDI features and their combinations. The results demonstrated an accuracy of 99.8%, achieved with combinations of only three features, (Most Common Link Domain, Logo Domain, Form Action) and (Most Common Link Domain, CN Info, Logo Domain), using Random Forest as the best classifier. This study underscores the importance of leveraging key domain features for efficient phishing detection and paves the way for the development of real-time, scalable detection systems.

Title: VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering

Authors: Yanling Wang, Yihan Zhao, Xiaodong Chen, Shasha Guo, Lixin Liu, Haoyang Li, Yong Xiao, Jing Zhang, Qi Li, Ke Xu
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2503.06492
Pdf URL: https://arxiv.org/pdf/2503.06492
Copy Paste: [[2503.06492]] VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering(https://arxiv.org/abs/2503.06492)
Keywords: extraction
Abstract: Large vision-language models (LVLMs) have demonstrated remarkable achievements, yet the generation of non-factual responses remains prevalent in fact-seeking question answering (QA). Current multimodal fact-seeking benchmarks primarily focus on comparing model outputs to ground truth answers, providing limited insights into the performance of modality-specific modules. To bridge this gap, we introduce VisualSimpleQA, a multimodal fact-seeking benchmark with two key features. First, it enables streamlined and decoupled evaluation of LVLMs in visual and linguistic modalities. Second, it incorporates well-defined difficulty criteria to guide human annotation and facilitates the extraction of a challenging subset, VisualSimpleQA-hard. Experiments on 15 LVLMs show that even state-of-the-art models such as GPT-4o achieve merely 60%+ correctness in multimodal fact-seeking QA on VisualSimpleQA and 30%+ on VisualSimpleQA-hard. Furthermore, the decoupled evaluation across these models highlights substantial opportunities for improvement in both visual and linguistic modules. The dataset is available at this https URL.

Title: Enhancing Malware Fingerprinting through Analysis of Evasive Techniques

Authors: Alsharif Abuadbba, Sean Lamont, Ejaz Ahmed, Cody Christopher, Muhammad Ikram, Uday Tupakula, Daniel Coscia, Mohamed Ali Kaafar, Surya Nepal
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.06495
Pdf URL: https://arxiv.org/pdf/2503.06495
Copy Paste: [[2503.06495]] Enhancing Malware Fingerprinting through Analysis of Evasive Techniques(https://arxiv.org/abs/2503.06495)
Keywords: attack
Abstract: As malware detection evolves, attackers adopt sophisticated evasion tactics. Traditional file-level fingerprinting, such as cryptographic and fuzzy hashes, is often overlooked as a target for evasion. Malware variants exploit minor binary modifications to bypass detection, as seen in Microsoft's discovery of GoldMax variations (2020-2021). However, no large-scale empirical studies have assessed the limitations of traditional fingerprinting methods on real-world malware samples or explored improvements. This paper fills this gap by addressing three key questions: (a) How prevalent are file variants in malware samples? Analyzing 4 million Windows Portable Executable (PE) files, 21 million sections, and 48 million resources, we find up to 80% deep structural similarities, including common APIs and executable sections. (b) What evasion techniques are used? We identify resilient fingerprints (clusters of malware variants with high similarity) validated via VirusTotal. Our analysis reveals non-functional mutations, such as altered section numbers, virtual sizes, and section names, as primary evasion tactics. We also classify two key section types: malicious sections (high entropy >5) and camouflage sections (entropy = 0). (c) How can fingerprinting be improved? We propose two novel approaches that enhance detection, improving identification rates from 20% (traditional methods) to over 50% using our refined fingerprinting techniques. Our findings highlight the limitations of existing methods and propose new strategies to strengthen malware fingerprinting against evolving threats.
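The entropy thresholds quoted in the abstract above (high entropy >5 for malicious sections, entropy = 0 for camouflage sections) can be made concrete with a small sketch. The abstract does not say how entropy is computed; the code below assumes Shannon entropy over byte values, in bits per byte (range 0 to 8), which is a common convention for PE sections. The function names and the "other" label are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)  # frequency of each byte value
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def classify_section(data: bytes, high: float = 5.0) -> str:
    """Label a section using the two thresholds quoted in the abstract.

    Assumption: 'other' is a placeholder for everything in between.
    """
    h = byte_entropy(data)
    if h == 0.0:
        return "camouflage"  # constant or empty content
    if h > high:
        return "malicious"   # high-entropy content, e.g. packed or encrypted
    return "other"


if __name__ == "__main__":
    print(classify_section(b"\x00" * 64))       # zero-filled padding
    print(classify_section(bytes(range(256))))  # every byte value once
```

A zero-filled section has a single byte value and therefore zero entropy, while a section in which all 256 byte values are equally frequent reaches the maximum of 8 bits per byte.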

Title: Evaluation of Safety Cognition Capability in Vision-Language Models for Autonomous Driving

Authors: Enming Zhang, Peizhe Gong, Xingyuan Dai, Yisheng Lv, Qinghai Miao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06497
Pdf URL: https://arxiv.org/pdf/2503.06497
Copy Paste: [[2503.06497]] Evaluation of Safety Cognition Capability in Vision-Language Models for Autonomous Driving(https://arxiv.org/abs/2503.06497)
Keywords: large language model
Abstract: Assessing the safety of vision-language models (VLMs) in autonomous driving is particularly important; however, existing work mainly focuses on traditional benchmark evaluations. As interactive components within autonomous driving systems, VLMs must maintain strong safety cognition during interactions. From this perspective, we propose a novel evaluation method: Safety Cognitive Driving Benchmark (SCD-Bench). To address the large-scale annotation challenge for SCD-Bench, we develop the Autonomous Driving Image-Text Annotation System (ADA). Additionally, to ensure data quality in SCD-Bench, our dataset undergoes manual refinement by experts with professional knowledge in autonomous driving. We further develop an automated evaluation method based on large language models (LLMs). To verify its effectiveness, we compare its evaluation results with those of expert human evaluations, achieving a consistency rate of 99.74%. Preliminary experimental results indicate that existing open-source models still lack sufficient safety cognition, showing a significant gap compared to GPT-4o. Notably, lightweight models (1B-4B) demonstrate minimal safety cognition. However, since lightweight models are crucial for autonomous driving systems, this presents a significant challenge for integrating VLMs into the field.

Title: ExGes: Expressive Human Motion Retrieval and Modulation for Audio-Driven Gesture Synthesis

Authors: Xukun Zhou, Fengxin Li, Ming Chen, Yan Zhou, Pengfei Wan, Di Zhang, Hongyan Liu, Jun He, Zhaoxin Fan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06499
Pdf URL: https://arxiv.org/pdf/2503.06499
Copy Paste: [[2503.06499]] ExGes: Expressive Human Motion Retrieval and Modulation for Audio-Driven Gesture Synthesis(https://arxiv.org/abs/2503.06499)
Keywords: diffusion
Abstract: Audio-driven human gesture synthesis is a crucial task with broad applications in virtual avatars, human-computer interaction, and creative content generation. Despite notable progress, existing methods often produce gestures that are coarse, lack expressiveness, and fail to fully align with audio semantics. To address these challenges, we propose ExGes, a novel retrieval-enhanced diffusion framework with three key designs: (1) a Motion Base Construction, which builds a gesture library using the training dataset; (2) a Motion Retrieval Module, employing contrastive learning and momentum distillation for fine-grained reference pose retrieval; and (3) a Precision Control Module, integrating partial masking and stochastic masking to enable flexible and fine-grained control. Experimental evaluations on BEAT2 demonstrate that ExGes reduces Fréchet Gesture Distance by 6.2% and improves motion diversity by 5.3% over EMAGE, with user studies revealing a 71.3% preference for its naturalness and semantic relevance. Code will be released upon acceptance.

Title: Fine-Grained Alignment and Noise Refinement for Compositional Text-to-Image Generation

Authors: Amir Mohammad Izadi, Seyed Mohammad Hadi Hosseini, Soroush Vafaie Tabar, Ali Abdollahi, Armin Saghafian, Mahdieh Soleymani Baghshah
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06506
Pdf URL: https://arxiv.org/pdf/2503.06506
Copy Paste: [[2503.06506]] Fine-Grained Alignment and Noise Refinement for Compositional Text-to-Image Generation(https://arxiv.org/abs/2503.06506)
Keywords: generative
Abstract: Text-to-image generative models have made significant advancements in recent years; however, accurately capturing intricate details in textual prompts remains a formidable challenge, with failures such as missing entities, attribute binding errors, and incorrect relationships. In response, we present an innovative, training-free method that directly addresses these challenges by incorporating tailored objectives to account for textual constraints. Unlike layout-based approaches that enforce rigid structures and limit diversity, our proposed approach offers a more flexible arrangement of the scene by imposing just the extracted constraints from the text, without any unnecessary additions. These constraints are formulated as losses (entity missing, entity mixing, attribute binding, and spatial relationships) that are integrated into a unified loss applied in the first generation stage. Furthermore, we introduce a feedback-driven system for fine-grained initial noise refinement. This system integrates a verifier that evaluates the generated image, identifies inconsistencies, and provides corrective feedback. Leveraging this feedback, our refinement method first targets the unmet constraints by refining the faulty attention maps caused by initial noise, through the optimization of selective losses associated with these constraints. Subsequently, our unified loss function is reapplied to carry out the second generation phase. Experimental results demonstrate that our method, relying solely on our proposed objective functions, significantly enhances compositionality, achieving a 24% improvement in human evaluation and a 25% gain in spatial relationships. Furthermore, our fine-grained noise refinement proves effective, boosting performance by up to 5%. Code is available at this https URL.

Title: HFedCKD: Toward Robust Heterogeneous Federated Learning via Data-free Knowledge Distillation and Two-way Contrast

Authors: Yiting Zheng, Bohan Lin, Jinqian Chen, Jihua Zhu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06511
Pdf URL: https://arxiv.org/pdf/2503.06511
Copy Paste: [[2503.06511]] HFedCKD: Toward Robust Heterogeneous Federated Learning via Data-free Knowledge Distillation and Two-way Contrast(https://arxiv.org/abs/2503.06511)
Keywords: robust, federate, fair, data-free
Abstract: Most current federated learning frameworks are modeled as static processes, ignoring the dynamic characteristics of the learning system. Under the limited communication budget of the central server, the flexible model architecture of a large number of clients participating in knowledge transfer requires a lower participation rate, active clients have uneven contributions, and the client scale seriously hinders the performance of FL. We consider a more general and practical federation scenario and propose a system heterogeneous federation method based on data-free knowledge distillation and two-way contrast (HFedCKD). We apply the Inverse Probability Weighted Distillation (IPWD) strategy to the data-free knowledge transfer framework. The generator completes the data features of the nonparticipating clients. IPWD implements a dynamic evaluation of the prediction contribution of each client under different data distributions. Based on the antibiased weighting of its prediction loss, the weight distribution of each client is effectively adjusted to fairly integrate the knowledge of participating clients. At the same time, the local model is split into a feature extractor and a classifier. Through differential contrast learning, the feature extractor is aligned with the global model in the feature space, while the classifier maintains personalized decision-making capabilities. HFedCKD effectively alleviates the knowledge offset caused by a low participation rate under data-free knowledge distillation and improves the performance and stability of the model. We conduct extensive experiments on image and IoT datasets to comprehensively evaluate and verify the generalization and robustness of the proposed HFedCKD framework.

Title: GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks

Authors: Haoqiang Kang, Enna Sachdeva, Piyush Gupta, Sangjae Bae, Kwonjoon Lee
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06514
Pdf URL: https://arxiv.org/pdf/2503.06514
Copy Paste: [[2503.06514]] GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks(https://arxiv.org/abs/2503.06514)
Keywords: generative
Abstract: Vision-Language Models (VLMs) have recently shown promising advancements in sequential decision-making tasks through task-specific fine-tuning. However, common fine-tuning methods, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) techniques like Proximal Policy Optimization (PPO), present notable limitations: SFT assumes Independent and Identically Distributed (IID) data, while PPO focuses on maximizing cumulative rewards. These limitations often restrict solution diversity and hinder generalization in multi-step reasoning tasks. To address these challenges, we introduce GFlowVLM, a novel framework that fine-tunes VLMs using Generative Flow Networks (GFlowNets) to promote generation of diverse solutions for complex reasoning tasks. GFlowVLM models the environment as a non-Markovian decision process, allowing it to capture long-term dependencies essential for real-world applications. It takes observations and task descriptions as inputs to prompt chain-of-thought (CoT) reasoning, which subsequently guides action selection. We use task-based rewards to fine-tune the VLM with GFlowNets. This approach enables VLMs to outperform prior fine-tuning methods, including SFT and RL. Empirical results demonstrate the effectiveness of GFlowVLM on complex tasks such as card games (NumberLine, BlackJack) and embodied planning tasks (ALFWorld), showing enhanced training efficiency, solution diversity, and stronger generalization capabilities across both in-distribution and out-of-distribution scenarios.

Title: SAQ-SAM: Semantically-Aligned Quantization for Segment Anything Model

Authors: Jing Zhang, Zhikai Li, Qingyi Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06515
Pdf URL: https://arxiv.org/pdf/2503.06515
Copy Paste: [[2503.06515]] SAQ-SAM: Semantically-Aligned Quantization for Segment Anything Model(https://arxiv.org/abs/2503.06515)
Keywords: segmentation
Abstract: Segment Anything Model (SAM) exhibits remarkable zero-shot segmentation capability; however, its prohibitive computational costs make edge deployment challenging. Although post-training quantization (PTQ) offers a promising compression solution, existing methods yield unsatisfactory results when applied to SAM, owing to its specialized model components and promptable workflow: (i) The mask decoder's attention exhibits extreme outliers, and we find that aggressive clipping (ranging down to even 100×), instead of smoothing or isolation, is effective in suppressing outliers while maintaining semantic capabilities. Unfortunately, traditional metrics (e.g., MSE) fail to provide such large-scale clipping. (ii) Existing reconstruction methods potentially neglect prompts' intention, resulting in distorted visual encodings during prompt interactions. To address the above issues, we propose SAQ-SAM in this paper, which boosts PTQ of SAM with semantic alignment. Specifically, we propose Perceptual-Consistency Clipping, which exploits attention focus overlap as the clipping metric, to significantly suppress outliers. Furthermore, we propose Prompt-Aware Reconstruction, which incorporates visual-prompt interactions by leveraging cross-attention responses in the mask decoder, thus facilitating alignment in both distribution and semantics. To ensure interaction efficiency, we also introduce a layer-skipping strategy for visual tokens. Extensive experiments are conducted on different segmentation tasks and SAMs of various sizes, and the results show that the proposed SAQ-SAM consistently outperforms baselines. For example, when quantizing SAM-B to 4-bit, our method achieves 11.7% higher mAP than the baseline in the instance segmentation task.
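To illustrate why aggressive clipping can matter at low bit widths, here is a minimal sketch of plain symmetric uniform quantization with a clipping threshold. It is a generic textbook quantizer, not the paper's Perceptual-Consistency Clipping; the function name `quantize`, the toy values, and the thresholds are all assumptions chosen for illustration.

```python
def quantize(values, clip, bits=4):
    """Symmetric uniform quantization after clipping to [-clip, clip].

    Returns the de-quantized values so the rounding error is visible.
    """
    levels = 2 ** (bits - 1) - 1  # 7 positive levels for 4-bit
    scale = clip / levels         # step size between adjacent levels
    out = []
    for v in values:
        v = max(-clip, min(clip, v))  # clip outliers to the threshold
        q = round(v / scale)          # nearest integer level
        out.append(q * scale)
    return out


if __name__ == "__main__":
    # Toy data: three small values plus one extreme outlier.
    vals = [0.1, -0.2, 0.3, 100.0]
    print(quantize(vals, clip=100.0))  # range set by the outlier
    print(quantize(vals, clip=1.0))    # threshold 100x smaller
```

With the range set by the outlier, the 4-bit step size is about 14.3, so every small value rounds to zero. Clipping at a threshold 100 times smaller keeps the small values distinguishable at the cost of saturating the outlier.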

Title: Can Small Language Models Reliably Resist Jailbreak Attacks? A Comprehensive Evaluation

Authors: Wenhui Zhang, Huiyu Xu, Zhibo Wang, Zeqing He, Ziqi Zhu, Kui Ren
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06519
Pdf URL: https://arxiv.org/pdf/2503.06519
Copy Paste: [[2503.06519]] Can Small Language Models Reliably Resist Jailbreak Attacks? A Comprehensive Evaluation(https://arxiv.org/abs/2503.06519)
Keywords: security, privacy, defense, attack, large language model
Abstract: Small language models (SLMs) have emerged as promising alternatives to large language models (LLMs) due to their low computational demands, enhanced privacy guarantees and comparable performance in specific domains through light-weight fine-tuning. Deploying SLMs on edge devices, such as smartphones and smart vehicles, has become a growing trend. However, the security implications of SLMs have received less attention than those of LLMs, particularly regarding jailbreak attacks, which are recognized by OWASP as one of the top threats to LLMs. In this paper, we conduct the first large-scale empirical study of SLMs' vulnerabilities to jailbreak attacks. Through systematic evaluation of 63 SLMs from 15 mainstream SLM families against 8 state-of-the-art jailbreak methods, we demonstrate that 47.6% of evaluated SLMs show high susceptibility to jailbreak attacks (ASR > 40%) and 38.1% of them cannot even resist direct harmful queries (ASR > 50%). We further analyze the reasons behind the vulnerabilities and identify four key factors: model size, model architecture, training datasets and training techniques. Moreover, we assess the effectiveness of three prompt-level defense methods and find that none of them achieve perfect performance, with detection accuracy varying across different SLMs and attack methods. Notably, we point out that inherent security awareness plays a critical role in SLM security, and models with strong security awareness can promptly terminate unsafe responses with little reminder. Building upon the findings, we highlight the urgent need for security-by-design approaches in SLM development and provide valuable insights for building a more trustworthy SLM ecosystem.

Title: Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

Authors: Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, Jiaya Jia
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2503.06520
Pdf URL: https://arxiv.org/pdf/2503.06520
Copy Paste: [[2503.06520]] Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement(https://arxiv.org/abs/2503.06520)
Keywords: robust, segmentation
Abstract: Traditional methods for reasoning segmentation rely on supervised fine-tuning with categorical labels and simple descriptions, limiting their out-of-domain generalization and lacking explicit reasoning processes. To address these limitations, we propose Seg-Zero, a novel framework that demonstrates remarkable generalizability and derives explicit chain-of-thought reasoning through cognitive reinforcement. Seg-Zero introduces a decoupled architecture consisting of a reasoning model and a segmentation model. The reasoning model interprets user intentions, generates explicit reasoning chains, and produces positional prompts, which are subsequently used by the segmentation model to generate precise pixel-level masks. We design a sophisticated reward mechanism that integrates both format and accuracy rewards to effectively guide optimization directions. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Seg-Zero achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Experiments show that Seg-Zero-7B achieves a zero-shot performance of 57.5 on the ReasonSeg benchmark, surpassing the prior LISA-7B by 18%. This significant improvement highlights Seg-Zero's ability to generalize across domains while presenting an explicit reasoning process. Code is available at this https URL.

Title: SGA-INTERACT: A 3D Skeleton-based Benchmark for Group Activity Understanding in Modern Basketball Tactic

Authors: Yuchen Yang, Wei Wang, Yifei Liu, Linfeng Dong, Hao Wu, Mingxin Zhang, Zhihang Zhong, Xiao Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06522
Pdf URL: https://arxiv.org/pdf/2503.06522
Copy Paste: [[2503.06522]] SGA-INTERACT: A 3D Skeleton-based Benchmark for Group Activity Understanding in Modern Basketball Tactic(https://arxiv.org/abs/2503.06522)
Keywords: extraction
Abstract: Group Activity Understanding is predominantly studied as the Group Activity Recognition (GAR) task. However, existing GAR benchmarks suffer from coarse-grained activity vocabularies and single-view-only data, which hinder the evaluation of state-of-the-art algorithms. To address these limitations, we introduce SGA-INTERACT, the first 3D skeleton-based benchmark for group activity understanding. It features complex activities inspired by basketball tactics, emphasizing rich spatial interactions and long-term dependencies. SGA-INTERACT introduces the Temporal Group Activity Localization (TGAL) task, extending group activity understanding to untrimmed sequences and filling the gap left by GAR as a standalone task. In addition to the benchmark, we propose One2Many, a novel framework that employs a pretrained 3D skeleton backbone for unified individual feature extraction. This framework aligns with the feature extraction paradigm in RGB-based methods, enabling direct evaluation of RGB-based models on skeleton-based benchmarks. We conduct extensive evaluations on SGA-INTERACT using two skeleton-based methods, three RGB-based methods, and a proposed baseline within the One2Many framework. The generally low performance of the baselines highlights the benchmark's challenges, motivating advancements in group activity understanding.

Title: AnywhereDoor: Multi-Target Backdoor Attacks on Object Detection

Authors: Jialin Lu, Junjie Shan, Ziqi Zhao, Ka-Ho Chow
Subjects: cs.CR, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.06529
Pdf URL: https://arxiv.org/pdf/2503.06529
Copy Paste: [[2503.06529]] AnywhereDoor: Multi-Target Backdoor Attacks on Object Detection(https://arxiv.org/abs/2503.06529)
Keywords: attack, robust
Abstract: As object detection becomes integral to many safety-critical applications, understanding its vulnerabilities is essential. Backdoor attacks, in particular, pose a serious threat by implanting hidden triggers in victim models, which adversaries can later exploit to induce malicious behaviors during inference. However, current understanding is limited to single-target attacks, where adversaries must define a fixed malicious behavior (target) before training, making inference-time adaptability impossible. Given the large output space of object detection (including object existence prediction, bounding box estimation, and classification), the feasibility of flexible, inference-time model control remains unexplored. This paper introduces AnywhereDoor, a multi-target backdoor attack for object detection. Once implanted, AnywhereDoor allows adversaries to make objects disappear, fabricate new ones, or mislabel them, either across all object classes or specific ones, offering an unprecedented degree of control. This flexibility is enabled by three key innovations: (i) objective disentanglement to scale the number of supported targets; (ii) trigger mosaicking to ensure robustness even against region-based detectors; and (iii) strategic batching to address object-level data imbalances that hinder manipulation. Extensive experiments demonstrate that AnywhereDoor grants attackers a high degree of control, improving attack success rates by 26% compared to adaptations of existing methods for such flexible control.

Title: SafeSpeech: A Comprehensive and Interactive Tool for Analysing Sexist and Abusive Language in Conversations

Authors: Xingwei Tan, Chen Lyu, Hafiz Muhammad Umer, Sahrish Khan, Mahathi Parvatham, Lois Arthurs, Simon Cullen, Shelley Wilson, Arshad Jhumka, Gabriele Pergola
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06534
Pdf URL: https://arxiv.org/pdf/2503.06534
Copy Paste: [[2503.06534]] SafeSpeech: A Comprehensive and Interactive Tool for Analysing Sexist and Abusive Language in Conversations(https://arxiv.org/abs/2503.06534)
Keywords: explainability, large language model
Abstract: Detecting toxic language, including sexism, harassment and abusive behaviour, remains a critical challenge, particularly in its subtle and context-dependent forms. Existing approaches largely focus on isolated message-level classification, overlooking toxicity that emerges across conversational contexts. To promote and enable future research in this direction, we introduce SafeSpeech, a comprehensive platform for toxic content detection and analysis that bridges message-level and conversation-level insights. The platform integrates fine-tuned classifiers and large language models (LLMs) to enable multi-granularity detection, toxic-aware conversation summarization, and persona profiling. SafeSpeech also incorporates explainability mechanisms, such as perplexity gain analysis, to highlight the linguistic elements driving predictions. Evaluations on benchmark datasets, including EDOS, OffensEval, and HatEval, demonstrate the reproduction of state-of-the-art performance across multiple tasks, including fine-grained sexism detection.

Title: One-Step Diffusion Model for Image Motion-Deblurring

Authors: Xiaoyang Liu, Yuquan Wang, Zheng Chen, Jiezhang Cao, He Zhang, Yulun Zhang, Xiaokang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06537
Pdf URL: https://arxiv.org/pdf/2503.06537
Copy Paste: [[2503.06537]] One-Step Diffusion Model for Image Motion-Deblurring(https://arxiv.org/abs/2503.06537)
Keywords: diffusion, transformer
Abstract: Currently, methods for single-image deblurring based on CNNs and transformers have demonstrated promising performance. However, these methods often suffer from perceptual limitations and poor generalization ability, and they struggle with heavy or complex blur. While diffusion-based methods can partially address these shortcomings, their multi-step denoising process limits their practical usage. In this paper, we conduct an in-depth exploration of diffusion models in deblurring and propose a one-step diffusion model for deblurring (OSDD), a novel framework that reduces the denoising process to a single step, significantly improving inference efficiency while maintaining high fidelity. To tackle fidelity loss in diffusion models, we introduce an enhanced variational autoencoder (eVAE), which improves structural restoration. Additionally, we construct a high-quality synthetic deblurring dataset to mitigate perceptual collapse and design a dynamic dual-adapter (DDA) to enhance perceptual quality while preserving fidelity. Extensive experiments demonstrate that our method achieves strong performance on both full-reference and no-reference metrics. Our code and pre-trained model will be publicly available at this https URL.

Title: ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model with Interleaved Multimodal Generation via Asymmetric Synergy

Authors: Jianwen Sun, Yukang Feng, Chuanhao Li, Fanrui Zhang, Zizhen Li, Jiaxin Ai, Sizhuo Zhou, Yu Dai, Shenglin Zhang, Kaipeng Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06542
Pdf URL: https://arxiv.org/pdf/2503.06542
Copy Paste: [[2503.06542]] ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model with Interleaved Multimodal Generation via Asymmetric Synergy(https://arxiv.org/abs/2503.06542)
Keywords: large language model
Abstract: Unified models (UniMs) for multimodal understanding and generation have recently received much attention in the area of vision and language. Existing UniMs are designed to simultaneously learn both multimodal understanding and generation capabilities, demanding substantial computational resources, and often struggle to generate interleaved text-image. We present ARMOR, a resource-efficient and pure autoregressive framework that achieves both understanding and generation by fine-tuning existing multimodal large language models (MLLMs). Specifically, ARMOR extends existing MLLMs from three perspectives: (1) For model architecture, an asymmetric encoder-decoder architecture with a forward-switching mechanism is introduced to unify embedding space integrating textual and visual modalities for enabling natural text-image interleaved generation with minimal computational overhead. (2) For training data, a meticulously curated, high-quality interleaved dataset is collected for fine-tuning MLLMs. (3) For the training algorithm, we propose a "what or how to generate" algorithm to empower existing MLLMs with multimodal generation capabilities while preserving their multimodal understanding capabilities, through three progressive training stages based on the collected dataset. Experimental results demonstrate that ARMOR upgrades existing MLLMs to UniMs with promising image generation capabilities, using limited training resources. Our code will be released soon at this https URL.

Title: QuantCache: Adaptive Importance-Guided Quantization with Hierarchical Latent and Layer Caching for Video Generation

Authors: Junyi Wu, Zhiteng Li, Zheng Hui, Yulun Zhang, Linghe Kong, Xiaokang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06545
Pdf URL: https://arxiv.org/pdf/2503.06545
Copy Paste: [[2503.06545]] QuantCache: Adaptive Importance-Guided Quantization with Hierarchical Latent and Layer Caching for Video Generation(https://arxiv.org/abs/2503.06545)
Keywords: diffusion, transformer
Abstract: Recently, Diffusion Transformers (DiTs) have emerged as a dominant architecture in video generation, surpassing U-Net-based models in terms of performance. However, the enhanced capabilities of DiTs come with significant drawbacks, including increased computational and memory costs, which hinder their deployment on resource-constrained devices. Current acceleration techniques, such as quantization and caching mechanisms, offer limited speedup and are often applied in isolation, failing to fully address the complexities of DiT architectures. In this paper, we propose QuantCache, a novel training-free inference acceleration framework that jointly optimizes hierarchical latent caching, adaptive importance-guided quantization, and structural redundancy-aware pruning. QuantCache achieves an end-to-end latency speedup of 6.72× on Open-Sora with minimal loss in generation quality. Extensive experiments across multiple video generation benchmarks demonstrate the effectiveness of our method, setting a new standard for efficient DiT inference. The code and models will be available at this https URL.

Title: BingoGuard: LLM Content Moderation Tools with Risk Levels

Authors: Fan Yin, Philippe Laban, Xiangyu Peng, Yilun Zhou, Yixin Mao, Vaibhav Vats, Linnea Ross, Divyansh Agarwal, Caiming Xiong, Chien-Sheng Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06550
Pdf URL: https://arxiv.org/pdf/2503.06550
Copy Paste: [[2503.06550]] BingoGuard: LLM Content Moderation Tools with Risk Levels(https://arxiv.org/abs/2503.06550)
Keywords: large language model
Abstract: Malicious content generated by large language models (LLMs) can pose varying degrees of harm. Although existing LLM-based moderators can detect harmful content, they struggle to assess risk levels and may miss lower-risk outputs. Accurate risk assessment allows platforms with different safety thresholds to tailor content filtering and rejection. In this paper, we introduce per-topic severity rubrics for 11 harmful topics and build BingoGuard, an LLM-based moderation system designed to predict both binary safety labels and severity levels. To address the lack of annotations on levels of severity, we propose a scalable generate-then-filter framework that first generates responses across different severity levels and then filters out low-quality responses. Using this framework, we create BingoGuardTrain, a training dataset with 54,897 examples covering a variety of topics, response severities, and styles, and BingoGuardTest, a test set with 988 examples explicitly labeled based on our severity rubrics that enables fine-grained analysis of model behavior at different severity levels. Our BingoGuard-8B, trained on BingoGuardTrain, achieves state-of-the-art performance on several moderation benchmarks, including WildGuardTest and HarmBench, as well as BingoGuardTest, outperforming the best public model, WildGuard, by 4.3%. Our analysis demonstrates that incorporating severity levels into training significantly enhances detection performance and enables the model to effectively gauge the severity of harmful responses.

Title: BDPFL: Backdoor Defense for Personalized Federated Learning via Explainable Distillation

Authors: Chengcheng Zhu, Jiale Zhang, Di Wu, Guodong Long
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.06554
Pdf URL: https://arxiv.org/pdf/2503.06554
Copy Paste: [[2503.06554]] BDPFL: Backdoor Defense for Personalized Federated Learning via Explainable Distillation(https://arxiv.org/abs/2503.06554)
Keywords: security, privacy, defense, attack, robust, federate
Abstract: Federated learning (FL) is a distributed learning paradigm that facilitates the collaborative training of a global model across multiple clients while preserving the privacy of local datasets. To address inherent challenges related to data heterogeneity and satisfy personalized needs, a new direction within FL, known as personalized Federated Learning (pFL), has gradually evolved. Extensive attention has been directed toward developing novel frameworks and methods to enhance the performance of pFL. Regrettably, the aspect of security in pFL has been largely overlooked. Our objective is to fill this gap. Similar to FL, pFL is susceptible to backdoor attacks. However, existing backdoor defense strategies are primarily tailored to general FL frameworks, and pFL lacks robustness against backdoor attacks. We propose a novel, backdoor-robust pFL framework named BDPFL to address these challenges. First, BDPFL introduces layer-wise mutual distillation that enables clients to learn their personalized local models while mitigating potential backdoors. Then, BDPFL employs explanation heatmaps to learn high-quality intermediate representations and to better eliminate deeper and more entrenched backdoors. Moreover, we perform empirical evaluations of BDPFL's performance on three datasets and compare BDPFL with four backdoor defense methods. The experiments demonstrate that BDPFL outperforms baseline methods and is effective under various settings.

Title: Generative modelling with jump-diffusions

Authors: Adrian Baule
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.06558
Pdf URL: https://arxiv.org/pdf/2503.06558
Copy Paste: [[2503.06558]] Generative modelling with jump-diffusions(https://arxiv.org/abs/2503.06558)
Keywords: diffusion, generative
Abstract: Score-based diffusion models generate samples from an unknown target distribution using a time-reversed diffusion process. While such models represent state-of-the-art approaches in industrial applications such as artificial image generation, it has recently been noted that their performance can be further improved by considering injection noise with heavy-tailed characteristics. Here, I present a generalization of generative diffusion processes to a wide class of non-Gaussian noise processes. I consider forward processes driven by standard Gaussian noise with superimposed Poisson jumps representing a finite-activity Lévy process. The generative process is shown to be governed by a generalized score function that depends on the jump amplitude distribution. Both probability flow ODE and SDE formulations are derived using basic technical effort, and are implemented for jump amplitudes drawn from a multivariate Laplace distribution. Remarkably, for the problem of capturing a heavy-tailed target distribution, the jump-diffusion Laplace model outperforms models driven by alpha-stable noise despite not containing any heavy-tailed characteristics. The framework can be readily applied to other jump statistics that could further improve on the performance of standard diffusion models.
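
As a rough illustration of the forward process described above (Gaussian noise with superimposed Poisson jumps and Laplace-distributed amplitudes), here is a minimal one-dimensional sketch using only the Python standard library. It is not the paper's implementation: the Euler-style discretization, the function names, and every parameter value are assumptions chosen for illustration, and the abstract's multivariate Laplace distribution is reduced to a scalar one.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw a Poisson(rate) count with Knuth's method (adequate for small rates)."""
    limit = math.exp(-rate)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def sample_laplace(scale, rng):
    """The difference of two unit exponentials is Laplace(0, 1); rescale it."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def forward_path(x0, n_steps, dt, sigma, jump_rate, jump_scale, rng):
    """Simulate one noising trajectory: Gaussian increments plus Poisson jumps.

    All parameters are hypothetical; no drift term is modelled here.
    """
    path = [x0]
    x = x0
    for _ in range(n_steps):
        # Continuous part: Brownian increment with standard deviation sigma * sqrt(dt).
        x += sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        # Jump part: the number of jumps in a step of length dt is Poisson(jump_rate * dt).
        for _ in range(sample_poisson(jump_rate * dt, rng)):
            x += sample_laplace(jump_scale, rng)
        path.append(x)
    return path


rng = random.Random(0)
path = forward_path(x0=0.0, n_steps=1000, dt=0.01, sigma=1.0,
                    jump_rate=2.0, jump_scale=0.5, rng=rng)
```

Setting `jump_rate=0.0` recovers a purely Gaussian path, which gives a convenient baseline for comparing the two noise models.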

Title: MMARD: Improving the Min-Max Optimization Process in Adversarial Robustness Distillation

Authors: Yuzheng Wang, Zhaoyu Chen, Dingkang Yang, Yuanhang Wang, Lizhe Qi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06559
Pdf URL: https://arxiv.org/pdf/2503.06559
Copy Paste: [[2503.06559]] MMARD: Improving the Min-Max Optimization Process in Adversarial Robustness Distillation(https://arxiv.org/abs/2503.06559)
Keywords: robust
Abstract: Adversarial Robustness Distillation (ARD) is a promising task to boost the robustness of small-capacity models with the guidance of a pre-trained robust teacher. ARD can be summarized as a min-max optimization process, i.e., synthesizing adversarial examples (inner) and training the student (outer). Although they achieve competitive robustness, existing ARD methods still have issues. In the inner process, the synthetic training examples are far from the teacher's decision boundary, so important robustness information is missing. In the outer process, the student model is decoupled from learning natural and robust scenarios, leading to robustness saturation, i.e., student performance is highly susceptible to customized teacher selection. To tackle these issues, this paper proposes a general Min-Max optimization Adversarial Robustness Distillation (MMARD) method. For the inner process, we introduce the teacher's robust predictions, which drive the training examples closer to the teacher's decision boundary to explore more robust knowledge. For the outer process, we propose a structured information modeling method based on triangular relationships to measure the mutual information of the model in natural and robust scenarios and enhance the model's ability to understand multi-scenario mapping relationships. Experiments show that MMARD achieves state-of-the-art performance on multiple benchmarks. In addition, MMARD is plug-and-play and convenient to combine with existing methods.

Title: TR-DQ: Time-Rotation Diffusion Quantization

Authors: Yihua Shao, Deyang Lin, Fanhu Zeng, Minxi Yan, Muyang Zhang, Siyu Chen, Yuxuan Fan, Ziyang Yan, Haozhe Wang, Jingcai Guo, Yan Wang, Haotong Qin, Hao Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06564
Pdf URL: https://arxiv.org/pdf/2503.06564
Copy Paste: [[2503.06564]] TR-DQ: Time-Rotation Diffusion Quantization(https://arxiv.org/abs/2503.06564)
Keywords: diffusion
Abstract: Diffusion models have been widely adopted in image and video generation. However, their complex network architecture leads to high inference overhead for its generation process. Existing diffusion quantization methods primarily focus on the quantization of the model structure while ignoring the impact of time-steps variation during sampling. At the same time, most current approaches fail to account for significant activations that cannot be eliminated, resulting in substantial performance degradation after quantization. To address these issues, we propose Time-Rotation Diffusion Quantization (TR-DQ), a novel quantization method incorporating time-step and rotation-based optimization. TR-DQ first divides the sampling process based on time-steps and applies a rotation matrix to smooth activations and weights dynamically. For different time-steps, a dedicated hyperparameter is introduced for adaptive timing modeling, which enables dynamic quantization across different time steps. Additionally, we also explore the compression potential of Classifier-Free Guidance (CFG-wise) to establish a foundation for subsequent work. TR-DQ achieves state-of-the-art (SOTA) performance on image generation and video generation tasks and a 1.38-1.89x speedup and 1.97-2.58x memory reduction in inference compared to existing quantization methods.

Title: Future-Aware Interaction Network For Motion Forecasting

Authors: Shijie Li, Xun Xu, Si Yong Yeo, Xulei Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06565
Pdf URL: https://arxiv.org/pdf/2503.06565
Copy Paste: [[2503.06565]] Future-Aware Interaction Network For Motion Forecasting(https://arxiv.org/abs/2503.06565)
Keywords: transformer
Abstract: Motion forecasting is a crucial component of autonomous driving systems, enabling the generation of accurate and smooth future trajectories to ensure safe navigation to the destination. In previous methods, potential future trajectories are often absent in the scene encoding stage, which may lead to suboptimal outcomes. Additionally, prior approaches typically employ transformer architectures for spatiotemporal modeling of trajectories and map information, which suffer from the quadratic scaling complexity of the transformer architecture. In this work, we propose an interaction-based method, named Future-Aware Interaction Network, that introduces potential future trajectories into scene encoding for a comprehensive traffic representation. Furthermore, a State Space Model (SSM), specifically Mamba, is introduced for both spatial and temporal modeling. To adapt Mamba for spatial interaction modeling, we propose an adaptive reordering strategy that transforms unordered data into a structured sequence. Additionally, Mamba is employed to refine generated future trajectories temporally, ensuring more consistent predictions. These enhancements not only improve model efficiency but also enhance the accuracy and diversity of predictions. We conduct comprehensive experiments on the widely used Argoverse 1 and Argoverse 2 datasets, demonstrating that the proposed method achieves superior performance compared to previous approaches in a more efficient way. The code will be released upon acceptance.

Title: Human Cognition Inspired RAG with Knowledge Graph for Complex Problem Solving

Authors: Yao Cheng, Yibo Zhao, Jiapeng Zhu, Yao Liu, Xing Sun, Xiang Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06567
Pdf URL: https://arxiv.org/pdf/2503.06567
Copy Paste: [[2503.06567]] Human Cognition Inspired RAG with Knowledge Graph for Complex Problem Solving(https://arxiv.org/abs/2503.06567)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated transformative potential across various domains, yet they face significant challenges in knowledge integration and complex problem reasoning, often leading to hallucinations and unreliable outputs. Retrieval-Augmented Generation (RAG) has emerged as a promising solution to enhance LLMs' accuracy by incorporating external knowledge. However, traditional RAG systems struggle with processing complex relational information and multi-step reasoning, limiting their effectiveness in advanced problem-solving tasks. To address these limitations, we propose CogGRAG, a cognition-inspired graph-based RAG framework, designed to improve LLMs' performance in Knowledge Graph Question Answering (KGQA). Inspired by the human cognitive process of decomposing complex problems and performing self-verification, our framework introduces a three-stage methodology: decomposition, retrieval, and reasoning with self-verification. By integrating these components, CogGRAG enhances the accuracy of LLMs in complex problem solving. We conduct systematic experiments with three LLM backbones on four benchmark datasets, where CogGRAG outperforms the baselines.

Title: Conceptrol: Concept Control of Zero-shot Personalized Image Generation

Authors: Qiyuan He, Angela Yao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06568
Pdf URL: https://arxiv.org/pdf/2503.06568
Copy Paste: [[2503.06568]] Conceptrol: Concept Control of Zero-shot Personalized Image Generation(https://arxiv.org/abs/2503.06568)
Keywords: diffusion
Abstract: Personalized image generation with text-to-image diffusion models generates unseen images based on reference image content. Zero-shot adapter methods such as IP-Adapter and OminiControl are especially interesting because they do not require test-time fine-tuning. However, they struggle to balance preserving personalized content and adherence to the text prompt. We identify a critical design flaw resulting in this performance gap: current adapters inadequately integrate personalization images with the textual descriptions. The generated images, therefore, replicate the personalized content rather than adhere to the text prompt instructions. Yet the base text-to-image model has strong conceptual understanding capabilities that can be leveraged. We propose Conceptrol, a simple yet effective framework that enhances zero-shot adapters without adding computational overhead. Conceptrol constrains the attention of visual specification with a textual concept mask that improves subject-driven generation capabilities. It achieves as much as 89% improvement on personalization benchmarks over the vanilla IP-Adapter and can even outperform fine-tuning approaches such as Dreambooth LoRA. The source code is available at this https URL.

Title: Global-Aware Monocular Semantic Scene Completion with State Space Models

Authors: Shijie Li, Zhongyao Cheng, Rong Li, Shuai Li, Juergen Gall, Xun Xu, Xulei Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06569
Pdf URL: https://arxiv.org/pdf/2503.06569
Copy Paste: [[2503.06569]] Global-Aware Monocular Semantic Scene Completion with State Space Models(https://arxiv.org/abs/2503.06569)
Keywords: extraction, transformer
Abstract: Monocular Semantic Scene Completion (MonoSSC) reconstructs and interprets 3D environments from a single image, enabling diverse real-world applications. However, existing methods are often constrained by the local receptive field of Convolutional Neural Networks (CNNs), making it challenging to handle the non-uniform distribution of projected points and effectively reconstruct missing information caused by the 3D-to-2D projection. In this work, we introduce GA-MonoSSC, a hybrid architecture for MonoSSC that effectively captures global context in both the 2D image domain and 3D space. Specifically, we propose a Dual-Head Multi-Modality Encoder, which leverages a Transformer architecture to capture spatial relationships across all features in the 2D image domain, enabling more comprehensive 2D feature extraction. Additionally, we introduce the Frustum Mamba Decoder, built on the State Space Model (SSM), to efficiently capture long-range dependencies in 3D space. Furthermore, we propose a frustum reordering strategy within the Frustum Mamba Decoder to mitigate feature discontinuities in the reordered voxel sequence, ensuring better alignment with the scan mechanism of the State Space Model (SSM) for improved 3D representation learning. We conduct extensive experiments on the widely used Occ-ScanNet and NYUv2 datasets, demonstrating that our proposed method achieves state-of-the-art performance, validating its effectiveness. The code will be released upon acceptance.

Title: SHIP: A Shapelet-based Approach for Interpretable Patient-Ventilator Asynchrony Detection

Authors: Xuan-May Le, Ling Luo, Uwe Aickelin, Minh-Tuan Tran, David Berlowitz, Mark Howard
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06571
Pdf URL: https://arxiv.org/pdf/2503.06571
Copy Paste: [[2503.06571]] SHIP: A Shapelet-based Approach for Interpretable Patient-Ventilator Asynchrony Detection(https://arxiv.org/abs/2503.06571)
Keywords: interpretability
Abstract: Patient-ventilator asynchrony (PVA) is a common and critical issue during mechanical ventilation, affecting up to 85% of patients. PVA can result in clinical complications such as discomfort, sleep disruption, and potentially more severe conditions like ventilator-induced lung injury and diaphragm dysfunction. Traditional PVA management, which relies on manual adjustments by healthcare providers, is often inadequate due to delays and errors. While various computational methods, including rule-based, statistical, and deep learning approaches, have been developed to detect PVA events, they face challenges related to dataset imbalances and lack of interpretability. In this work, we propose a shapelet-based approach SHIP for PVA detection, utilizing shapelets - discriminative subsequences in time-series data - to enhance detection accuracy and interpretability. Our method addresses dataset imbalances through shapelet-based data augmentation and constructs a shapelet pool to transform the dataset for more effective classification. The combined shapelet and statistical features are then used in a classifier to identify PVA events. Experimental results on medical datasets show that SHIP significantly improves PVA detection while providing interpretable insights into model decisions.

Title: Beyond Decoder-only: Large Language Models Can be Good Encoders for Machine Translation

Authors: Yingfeng Luo, Tong Zheng, Yongyu Mu, Bei Li, Qinghong Zhang, Yongqi Gao, Ziqiang Xu, Peinan Feng, Xiaoqian Liu, Tong Xiao, Jingbo Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06594
Pdf URL: https://arxiv.org/pdf/2503.06594
Copy Paste: [[2503.06594]] Beyond Decoder-only: Large Language Models Can be Good Encoders for Machine Translation(https://arxiv.org/abs/2503.06594)
Keywords: transformer, large language model
Abstract: The field of neural machine translation (NMT) has changed with the advent of large language models (LLMs). Much of the recent emphasis in natural language processing (NLP) has been on modeling machine translation and many other problems using a single pre-trained Transformer decoder, while encoder-decoder architectures, which were the standard in earlier NMT models, have received relatively less attention. In this paper, we explore translation models that are universal, efficient, and easy to optimize, by marrying the world of LLMs with the world of NMT. We apply LLMs to NMT encoding and leave the NMT decoder unchanged. We also develop methods for adapting LLMs to work better with the NMT decoder. Furthermore, we construct a new dataset involving multiple tasks to assess how well the machine translation system generalizes across various tasks. Evaluations on the WMT and our datasets show that results using our method match or surpass a range of baselines in terms of translation quality, but achieve 2.4× to 6.5× inference speedups and a 75% reduction in the memory footprint of the KV cache. It also demonstrates strong generalization across a variety of translation-related tasks.

Title: MultiCo3D: Multi-Label Voxel Contrast for One-Shot Incremental Segmentation of 3D Neuroimages

Authors: Hao Xu, Tengfei Xue, Dongnan Liu, Yuqian Chen, Fan Zhang, Carl-Fredrik Westin, Ron Kikinis, Lauren J. O'Donnell, Weidong Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06598
Pdf URL: https://arxiv.org/pdf/2503.06598
Copy Paste: [[2503.06598]] MultiCo3D: Multi-Label Voxel Contrast for One-Shot Incremental Segmentation of 3D Neuroimages(https://arxiv.org/abs/2503.06598)
Keywords: segmentation
Abstract: 3D neuroimages provide a comprehensive view of brain structure and function, aiding in precise localization and functional connectivity analysis. Segmentation of white matter (WM) tracts using 3D neuroimages is vital for understanding the brain's structural connectivity in both healthy and diseased states. One-shot Class Incremental Semantic Segmentation (OCIS) refers to effectively segmenting new (novel) classes using only a single sample while retaining knowledge of old (base) classes without forgetting. Voxel-contrastive OCIS methods adjust the feature space to alleviate the feature overlap problem between the base and novel classes. However, since WM tract segmentation is a multi-label segmentation task, existing single-label voxel contrastive-based methods may cause inherent contradictions. To address this, we propose a new multi-label voxel contrast framework called MultiCo3D for one-shot class incremental tract segmentation. Our method utilizes uncertainty distillation to preserve base tract segmentation knowledge, adjusts the feature space with multi-label voxel contrast to alleviate feature overlap when learning novel tracts, and dynamically weights multiple losses to balance the overall loss. We compare our method against several state-of-the-art (SOTA) approaches. The experimental results show that our method significantly enhances one-shot class incremental tract segmentation accuracy across five different experimental setups on HCP and Preto datasets.

Title: StructVPR++: Distill Structural and Semantic Knowledge with Weighting Samples for Visual Place Recognition

Authors: Yanqing Shen, Sanping Zhou, Jingwen Fu, Ruotong Wang, Shitao Chen, Nanning Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06601
Pdf URL: https://arxiv.org/pdf/2503.06601
Copy Paste: [[2503.06601]] StructVPR++: Distill Structural and Semantic Knowledge with Weighting Samples for Visual Place Recognition(https://arxiv.org/abs/2503.06601)
Keywords: segmentation
Abstract: Visual place recognition is a challenging task for autonomous driving and robotics, and is usually treated as an image retrieval problem. A commonly used two-stage strategy involves global retrieval followed by re-ranking using patch-level descriptors. Most end-to-end deep learning-based methods cannot extract global features with sufficient semantic information from RGB images. In contrast, re-ranking can utilize more explicit structural and semantic information in a one-to-one matching process, but it is time-consuming. To bridge the gap between global retrieval and re-ranking and achieve a good trade-off between accuracy and efficiency, we propose StructVPR++, a framework that embeds structural and semantic knowledge into RGB global representations via segmentation-guided distillation. Our key innovation lies in decoupling label-specific features from global descriptors, enabling explicit semantic alignment between image pairs without requiring segmentation during deployment. Furthermore, we introduce a sample-wise weighted distillation strategy that prioritizes reliable training pairs while suppressing noisy ones. Experiments on four benchmarks demonstrate that StructVPR++ surpasses state-of-the-art global methods by 5-23% in Recall@1 and even outperforms many two-stage approaches, achieving real-time efficiency with a single RGB input.

Title: FW-Shapley: Real-time Estimation of Weighted Shapley Values

Authors: Pranoy Panda, Siddharth Tandon, Vineeth N Balasubramanian
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06602
Pdf URL: https://arxiv.org/pdf/2503.06602
Copy Paste: [[2503.06602]] FW-Shapley: Real-time Estimation of Weighted Shapley Values(https://arxiv.org/abs/2503.06602)
Keywords: fair
Abstract: Fair credit assignment is essential in various machine learning (ML) applications, and Shapley values have emerged as a valuable tool for this purpose. However, in critical ML applications such as data valuation and feature attribution, the uniform weighting of Shapley values across subset cardinalities leads to unintuitive credit assignments. To address this, weighted Shapley values were proposed as a generalization, allowing different weights for subsets with different cardinalities. Despite their advantages, weighted Shapley values, like Shapley values, suffer from exponential compute costs, making them impractical for high-dimensional datasets. To tackle this issue, we present two key contributions. First, we provide a weighted least squares characterization of weighted Shapley values. Next, using this characterization, we propose Fast Weighted Shapley (FW-Shapley), an amortized framework for efficiently computing weighted Shapley values using a learned estimator. We further show that our estimator's training procedure is theoretically valid even though we do not use ground-truth weighted Shapley values during training. On the feature attribution task, we outperform the learned estimator FastSHAP by 27% (on average) in terms of Inclusion AUC. For data valuation, we are much faster (14 times) while being comparable to the state-of-the-art KNN Shapley.
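
To make the exponential cost mentioned above concrete, the following is a small brute-force sketch of cardinality-weighted values. It assumes a semivalue-style definition, in which each marginal contribution v(S ∪ {i}) - v(S) is weighted by a function of |S|; the abstract does not spell out the paper's exact definition, so that form, the function names, and the toy game are all illustrative assumptions. This is the exhaustive baseline, not the FW-Shapley estimator.

```python
from itertools import combinations
from math import factorial


def shapley_weight(size, n):
    """Classic Shapley weight for a coalition of `size` players out of `n`."""
    return factorial(size) * factorial(n - size - 1) / factorial(n)


def weighted_values(players, value_fn, weight_fn=shapley_weight):
    """Exhaustively sum weight_fn(|S|, n) * [v(S | {i}) - v(S)] over all S without i.

    Each player requires 2 ** (n - 1) coalition evaluations, which is why
    exact computation becomes impractical as n grows.
    """
    n = len(players)
    result = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                marginal = value_fn(coalition | {i}) - value_fn(coalition)
                total += weight_fn(size, n) * marginal
        result[i] = total
    return result


# Hypothetical additive toy game: a coalition is worth the sum of its members.
contrib = {"a": 1.0, "b": 2.0, "c": 3.0}


def toy_value(coalition):
    return sum(contrib[p] for p in coalition)


phi = weighted_values(list(contrib), toy_value)
```

Passing a different `weight_fn` changes how much each coalition size counts, which is the degree of freedom the weighted generalization introduces. In an additive game such as the toy one here, each player's classic Shapley value equals its own contribution, which makes a convenient sanity check.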

Title: Steerable Pyramid Weighted Loss: Multi-Scale Adaptive Weighting for Semantic Segmentation

Authors: Renhao Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06604
Pdf URL: https://arxiv.org/pdf/2503.06604
Copy Paste: [[2503.06604]] Steerable Pyramid Weighted Loss: Multi-Scale Adaptive Weighting for Semantic Segmentation(https://arxiv.org/abs/2503.06604)
Keywords: segmentation
Abstract: Semantic segmentation is a core task in computer vision with applications in biomedical imaging, remote sensing, and autonomous driving. While standard loss functions such as cross-entropy and Dice loss perform well in general cases, they often struggle with fine structures, particularly in tasks involving thin structures or closely packed objects. Various weight map-based loss functions have been proposed to address this issue by assigning higher loss weights to pixels prone to misclassification. However, these methods typically rely on precomputed or runtime-generated weight maps based on distance transforms, which impose significant computational costs and fail to adapt to evolving network predictions. In this paper, we propose a novel steerable pyramid-based weighted (SPW) loss function that efficiently generates adaptive weight maps. Unlike traditional boundary-aware losses that depend on static or iteratively updated distance maps, our method leverages steerable pyramids to dynamically emphasize regions across multiple frequency bands (capturing features at different scales) while maintaining computational efficiency. Additionally, by incorporating network predictions into the weight computation, our approach enables adaptive refinement during training. We evaluate our method on the SNEMI3D, GlaS, and DRIVE datasets, benchmarking it against 11 state-of-the-art loss functions. Our results demonstrate that the proposed SPW loss function achieves superior pixel precision and segmentation accuracy with minimal computational overhead. This work provides an effective and efficient solution for improving semantic segmentation, particularly for applications requiring multiscale feature representation. The code is available at this https URL.

Title: Interpretable Model Drift Detection

Authors: Pranoy Panda, Kancheti Sai Srinivas, Vineeth N Balasubramanian, Gaurav Sinha
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06606
Pdf URL: https://arxiv.org/pdf/2503.06606
Copy Paste: [[2503.06606]] Interpretable Model Drift Detection(https://arxiv.org/abs/2503.06606)
Keywords: interpretability
Abstract: Data in the real world often has an evolving distribution. Thus, machine learning models trained on such data become outdated over time. This phenomenon is called model drift. Knowledge of this drift serves two purposes: (i) retaining an accurate model and (ii) discovering knowledge or insights about changes in the relationship between input features and the output variable w.r.t. the model. Most existing works focus only on detecting model drift but offer no interpretability. In this work, we take a principled approach to study the problem of interpretable model drift detection from a risk perspective using a feature-interaction aware hypothesis testing framework, which enjoys guarantees on test power. The proposed framework is generic, i.e., it can be adapted to both classification and regression tasks. Experiments on several standard drift detection datasets show that our method is superior to existing interpretable methods (especially on real-world datasets) and on par with state-of-the-art black-box drift detection methods. We also quantitatively and qualitatively study the interpretability aspect, including a case study on the USENET2 dataset. We find our method focuses on model- and drift-sensitive features compared to baseline interpretable drift detectors.

Title: GroMo: Plant Growth Modeling with Multiview Images

Authors: Ruchi Bhatt, Shreya Bansal, Amanpreet Chander, Rupinder Kaur, Malya Singh, Mohan Kankanhalli, Abdulmotaleb El Saddik, Mukesh Kumar Saini
Subjects: cs.CV, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2503.06608
Pdf URL: https://arxiv.org/pdf/2503.06608
Copy Paste: [[2503.06608]] GroMo: Plant Growth Modeling with Multiview Images(https://arxiv.org/abs/2503.06608)
Keywords: transformer
Abstract: Understanding plant growth dynamics is essential for applications in agriculture and plant phenotyping. We present the Growth Modelling (GroMo) challenge, which is designed for two primary tasks: (1) plant age prediction and (2) leaf count estimation, both essential for crop monitoring and precision agriculture. For this challenge, we introduce GroMo25, a dataset with images of four crops: radish, okra, wheat, and mustard. Each crop consists of multiple plants (p1, p2, ..., pn) captured over different days (d1, d2, ..., dm) and categorized into five levels (L1, L2, L3, L4, L5). Each plant is captured from 24 different angles with a 15-degree gap between images. Participants are required to perform both tasks for all four crops with these multiview images. We propose a Multiview Vision Transformer (MVVT) model for the GroMo challenge and evaluate its crop-wise performance on GroMo25. MVVT reports an average MAE of 7.74 for age prediction and an MAE of 5.52 for leaf count. The GroMo Challenge aims to advance plant phenotyping research by encouraging innovative solutions for tracking and predicting plant growth. The GitHub repository is publicly available at this https URL.
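
For readers unfamiliar with the setup, the two quantities quoted above can be reproduced with a few lines of Python: 24 views at a 15-degree spacing cover a full 360-degree rotation, and MAE is the mean absolute difference between predictions and ground truth. The function names, the assumption that the first view sits at 0 degrees, and the numbers in the usage example are hypothetical and are not taken from GroMo25.

```python
def view_angles(n_views=24, step_deg=15):
    """Camera angles in degrees, assuming the first view sits at 0 degrees."""
    return [k * step_deg for k in range(n_views)]


def mean_absolute_error(predicted, actual):
    """Mean absolute error between two equal-length, non-empty sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


angles = view_angles()

# Made-up leaf counts, purely to show the call.
example_mae = mean_absolute_error([10, 12, 20], [11, 12, 17])
```

Here `angles` runs from 0 to 345 in steps of 15, so the 24 views tile the circle without overlap.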

Title: Synthetic Data Generation for Minimum-Exposure Navigation in a Time-Varying Environment using Generative AI Models

Authors: Nachiket U. Bapat, Randy C. Paffenroth, Raghvendra V. Cowlagi
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2503.06619
Pdf URL: https://arxiv.org/pdf/2503.06619
Copy Paste: [[2503.06619]] Synthetic Data Generation for Minimum-Exposure Navigation in a Time-Varying Environment using Generative AI Models(https://arxiv.org/abs/2503.06619)
Keywords: generative
Abstract: We study the problem of synthetic generation of samples of environmental features for autonomous vehicle navigation. These features are described by a spatiotemporally varying scalar field that we refer to as a threat field. The threat field is known to have some underlying dynamics subject to process noise. Some "real-world" data of observations of various threat fields are also available. The assumption is that the volume of "real-world" data is relatively small. The objective is to synthesize samples that are statistically similar to the data. The proposed solution is a generative artificial intelligence model that we refer to as a split variational recurrent neural network (S-VRNN). The S-VRNN merges the capabilities of a variational autoencoder, which is a widely used generative model, and a recurrent neural network, which is used to learn temporal dependencies in data. The main innovation in this work is that we split the latent space of the S-VRNN into two subspaces. The latent variables in one subspace are learned using the "real-world" data, whereas those in the other subspace are learned using the data as well as the known underlying system dynamics. Through numerical experiments we demonstrate that the proposed S-VRNN can synthesize data that are statistically similar to the training data even in the case of a very small volume of "real-world" training data.

Title: Dynamic Updates for Language Adaptation in Visual-Language Tracking

Authors: Xiaohai Li, Bineng Zhong, Qihua Liang, Zhiyi Mo, Jian Nong, Shuxiang Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06621
Pdf URL: https://arxiv.org/pdf/2503.06621
Copy Paste: [[2503.06621]] Dynamic Updates for Language Adaptation in Visual-Language Tracking(https://arxiv.org/abs/2503.06621)
Keywords: robust, large language model
Abstract: The consistency between the semantic information provided by the multi-modal reference and the tracked object is crucial for visual-language (VL) tracking. However, existing VL tracking frameworks rely on static multi-modal references to locate dynamic objects, which can lead to semantic discrepancies and reduce the robustness of the tracker. To address this issue, we propose a novel vision-language tracking framework, named DUTrack, which captures the latest state of the target by dynamically updating multi-modal references to maintain consistency. Specifically, we introduce a Dynamic Language Update Module, which leverages a large language model to generate dynamic language descriptions for the object based on visual features and object category information. Then, we design a Dynamic Template Capture Module, which captures the regions in the image that highly match the dynamic language descriptions. Furthermore, to ensure the efficiency of description generation, we design an update strategy that assesses changes in target displacement, scale, and other factors to decide on updates. Finally, the dynamic template and language descriptions that record the latest state of the target are used to update the multi-modal references, providing more accurate reference information for subsequent inference and enhancing the robustness of the tracker. DUTrack achieves new state-of-the-art performance on four mainstream vision-language and two vision-only tracking benchmarks, including LaSOT, LaSOT$_{\rm{ext}}$, TNL2K, OTB99-Lang, GOT-10K, and UAV123. Code and models are available at this https URL.

Title: Similarity-Guided Layer-Adaptive Vision Transformer for UAV Tracking

Authors: Chaocan Xue, Bineng Zhong, Qihua Liang, Yaozong Zheng, Ning Li, Yuanliang Xue, Shuxiang Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06625
Pdf URL: https://arxiv.org/pdf/2503.06625
Copy Paste: [[2503.06625]] Similarity-Guided Layer-Adaptive Vision Transformer for UAV Tracking(https://arxiv.org/abs/2503.06625)
Keywords: transformer
Abstract: Vision transformers (ViTs) have emerged as a popular backbone for visual tracking. However, complete ViT architectures are too cumbersome to deploy for unmanned aerial vehicle (UAV) tracking, where efficiency is paramount. In this study, we discover that many layers within lightweight ViT-based trackers tend to learn relatively redundant and repetitive target representations. Based on this observation, we propose a similarity-guided layer adaptation approach to optimize the structure of ViTs. Our approach dynamically disables a large number of representation-similar layers and selectively retains only a single optimal layer among them, aiming to achieve a better accuracy-speed trade-off. By incorporating this approach into existing ViTs, we tailor previously complete ViT architectures into an efficient similarity-guided layer-adaptive framework, namely SGLATrack, for real-time UAV tracking. Extensive experiments on six tracking benchmarks verify the effectiveness of the proposed approach, and show that our SGLATrack achieves a state-of-the-art real-time speed while maintaining competitive tracking precision. Code and models are available at this https URL.

Title: DiffCLIP: Differential Attention Meets CLIP

Authors: Hasan Abed Al Kader Hammoud, Bernard Ghanem
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06626
Pdf URL: https://arxiv.org/pdf/2503.06626
Copy Paste: [[2503.06626]] DiffCLIP: Differential Attention Meets CLIP(https://arxiv.org/abs/2503.06626)
Keywords: robust, large language model
Abstract: We propose DiffCLIP, a novel vision-language model that extends the differential attention mechanism to CLIP architectures. Differential attention was originally developed for large language models to amplify relevant context while canceling out noisy information. In this work, we integrate this mechanism into CLIP's dual encoder (image and text) framework. With minimal additional parameters, DiffCLIP achieves superior performance on image-text understanding tasks. Across zero-shot classification, retrieval, and robustness benchmarks, DiffCLIP consistently outperforms baseline CLIP models. Notably, these gains come with negligible computational overhead, demonstrating that differential attention can significantly enhance multi-modal representations without sacrificing efficiency. Code can be found at this https URL.

Title: Revisiting Early Detection of Sexual Predators via Turn-level Optimization

Authors: Jinmyeong An, Sangwon Ryu, Heejin Do, Yunsu Kim, Jungseul Ok, Gary Geunbae Lee
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.06627
Pdf URL: https://arxiv.org/pdf/2503.06627
Copy Paste: [[2503.06627]] Revisiting Early Detection of Sexual Predators via Turn-level Optimization(https://arxiv.org/abs/2503.06627)
Keywords: protect
Abstract: Online grooming is a severe social threat in which sexual predators gradually entrap child victims through subtle manipulation. Therefore, timely intervention for online grooming is critical for proactive protection. However, previous methods fail to determine the optimal intervention points (i.e., they jump to conclusions) because they rely on chat-level risk labels, which provide only weak supervision of risky utterances. For timely detection, we propose speed control reinforcement learning (SCoRL) (the code and supplementary materials are available at this https URL), incorporating a practical strategy derived from luring communication theory (LCT). To capture the predator's turn-level entrapment, we use a turn-level risk label based on the LCT. Then, we design a novel speed control reward function that balances the trade-off between speed and accuracy based on the turn-level risk label; thus, SCoRL can identify the optimal intervention moment. In addition, we introduce a turn-level metric for precise evaluation, identifying limitations in previously used chat-level metrics. Experimental results show that SCoRL effectively preempts online grooming, offering a more proactive and timely solution. Further analysis reveals that our method enhances performance while intuitively identifying optimal early intervention points.

Title: BTFL: A Bayesian-based Test-Time Generalization Method for Internal and External Data Distributions in Federated learning

Authors: Yu Zhou, Bingyan Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06633
Pdf URL: https://arxiv.org/pdf/2503.06633
Copy Paste: [[2503.06633]] BTFL: A Bayesian-based Test-Time Generalization Method for Internal and External Data Distributions in Federated learning(https://arxiv.org/abs/2503.06633)
Keywords: privacy, federate
Abstract: Federated Learning (FL) enables multiple clients to collaboratively develop a global model while maintaining data privacy. However, online FL deployment faces challenges due to distribution shifts and evolving test samples. Personalized Federated Learning (PFL) tailors the global model to individual client distributions, but struggles with Out-Of-Distribution (OOD) samples during testing, leading to performance degradation. In real-world scenarios, balancing personalization and generalization during online testing is crucial, yet existing methods primarily focus on training-phase generalization. To address the test-time trade-off, we introduce a new scenario: Test-time Generalization for Internal and External Distributions in Federated Learning (TGFL), which evaluates adaptability under Internal Distribution (IND) and External Distribution (EXD). We propose BTFL, a Bayesian-based test-time generalization method for TGFL, which balances generalization and personalization at the sample level during testing. BTFL employs a two-head architecture to store local and global knowledge, interpolating predictions via a dual-Bayesian framework that considers both historical test data and current sample characteristics, with a theoretical guarantee and faster speed. Our experiments demonstrate that BTFL achieves improved performance across various datasets and models with less time cost. The source code is publicly available at this https URL.

Title: CLAD: Constrained Latent Action Diffusion for Vision-Language Procedure Planning

Authors: Lei Shi, Andreas Bulling
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06637
Pdf URL: https://arxiv.org/pdf/2503.06637
Copy Paste: [[2503.06637]] CLAD: Constrained Latent Action Diffusion for Vision-Language Procedure Planning(https://arxiv.org/abs/2503.06637)
Keywords: diffusion
Abstract: We propose CLAD -- a Constrained Latent Action Diffusion model for vision-language procedure planning in instructional videos. Procedure planning is the challenging task of predicting intermediate actions given a visual observation of a start and a goal state. However, future interactive AI systems must also be able to plan procedures using multi-modal input, e.g., where visual observations are augmented with language descriptions. To tackle this vision-language procedure planning task, our method uses a Variational Autoencoder (VAE) to learn the latent representation of actions and observations as constraints and integrate them into the diffusion process. This approach exploits that the latent space of diffusion models already has semantics that can be used. We use the latent constraints to steer the diffusion model to better generate actions. We report extensive experiments on the popular CrossTask, Coin, and NIV datasets and show that our method outperforms state-of-the-art methods by a large margin. By evaluating ablated versions of our method, we further show that the proposed integration of the action and observation representations learnt in the VAE latent space is key to these performance improvements.

Title: Enhancing NLP Robustness and Generalization through LLM-Generated Contrast Sets: A Scalable Framework for Systematic Evaluation and Adversarial Training

Authors: Hender Lin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06648
Pdf URL: https://arxiv.org/pdf/2503.06648
Copy Paste: [[2503.06648]] Enhancing NLP Robustness and Generalization through LLM-Generated Contrast Sets: A Scalable Framework for Systematic Evaluation and Adversarial Training(https://arxiv.org/abs/2503.06648)
Keywords: robust, large language model
Abstract: Standard NLP benchmarks often fail to capture vulnerabilities stemming from dataset artifacts and spurious correlations. Contrast sets address this gap by challenging models near decision boundaries but are traditionally labor-intensive to create and limited in diversity. This study leverages large language models to automate the generation of diverse contrast sets. Using the SNLI dataset, we created a 3,000-example contrast set to evaluate and improve model robustness. Fine-tuning on these contrast sets enhanced performance on systematically perturbed examples, maintained standard test accuracy, and modestly improved generalization to novel perturbations. This automated approach offers a scalable solution for evaluating and improving NLP models, addressing systematic generalization challenges, and advancing robustness in real-world applications.

Title: Adding Additional Control to One-Step Diffusion with Joint Distribution Matching

Authors: Yihong Luo, Tianyang Hu, Yifan Song, Jiacheng Sun, Zhenguo Li, Jing Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06652
Pdf URL: https://arxiv.org/pdf/2503.06652
Copy Paste: [[2503.06652]] Adding Additional Control to One-Step Diffusion with Joint Distribution Matching(https://arxiv.org/abs/2503.06652)
Keywords: diffusion
Abstract: While diffusion distillation has enabled one-step generation through methods like Variational Score Distillation, adapting distilled models to emerging new controls -- such as novel structural constraints or the latest user preferences -- remains challenging. Conventional approaches typically require modifying the base diffusion model and redistilling it -- a process that is both computationally intensive and time-consuming. To address these challenges, we introduce Joint Distribution Matching (JDM), a novel approach that minimizes the reverse KL divergence between image-condition joint distributions. By deriving a tractable upper bound, JDM decouples fidelity learning from condition learning. This asymmetric distillation scheme enables our one-step student to handle controls unknown to the teacher model and facilitates improved classifier-free guidance (CFG) usage and seamless integration of human feedback learning (HFL). Experimental results demonstrate that JDM, using only a single step, surpasses baseline methods such as multi-step ControlNet in most cases, while achieving state-of-the-art performance in one-step text-to-image synthesis through improved usage of CFG or HFL integration.
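
Note on the joint-distribution objective: the abstract does not state the upper bound itself, so the following is only the standard chain rule for KL divergence, written for a generic student joint q(x, c) and target joint p(x, c) over images x and conditions c. These symbols are assumptions for illustration and are not taken from the paper. It shows why a divergence between joints can separate into an image-fidelity term and a condition term; the paper's tractable bound may take a different form.

```latex
\begin{align}
D_{\mathrm{KL}}\big(q(x, c)\,\|\,p(x, c)\big)
  &= \mathbb{E}_{q(x, c)}\!\left[\log \frac{q(x)\,q(c \mid x)}{p(x)\,p(c \mid x)}\right] \\
  &= D_{\mathrm{KL}}\big(q(x)\,\|\,p(x)\big)
   + \mathbb{E}_{q(x)}\!\left[ D_{\mathrm{KL}}\big(q(c \mid x)\,\|\,p(c \mid x)\big) \right]
\end{align}
```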

Title: AxisPose: Model-Free Matching-Free Single-Shot 6D Object Pose Estimation via Axis Generation

Authors: Yang Zou, Zhaoshuai Qi, Yating Liu, Zihao Xu, Weipeng Sun, Weiyi Liu, Xingyuan Li, Jiaqi Yang, Yanning Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06660
Pdf URL: https://arxiv.org/pdf/2503.06660
Copy Paste: [[2503.06660]] AxisPose: Model-Free Matching-Free Single-Shot 6D Object Pose Estimation via Axis Generation(https://arxiv.org/abs/2503.06660)
Keywords: robust, extraction, diffusion
Abstract: Object pose estimation, which plays a vital role in robotics, augmented reality, and autonomous driving, has been of great interest in computer vision. Existing studies either require multi-stage pose regression or rely on 2D-3D feature matching. Though these approaches have shown promising results, they rely heavily on appearance information, requiring complex input (i.e., multi-view reference input, depth, or CAD models) and intricate pipeline (i.e., feature extraction-SfM-2D to 3D matching-PnP). We propose AxisPose, a model-free, matching-free, single-shot solution for robust 6D pose estimation, which fundamentally diverges from the existing paradigm. Unlike existing methods that rely on 2D-3D or 2D-2D matching using 3D techniques, such as SfM and PnP, AxisPose directly infers a robust 6D pose from a single view by leveraging a diffusion model to learn the latent axis distribution of objects without reference views. Specifically, AxisPose constructs an Axis Generation Module (AGM) to capture the latent geometric distribution of object axes through a diffusion model. The diffusion process is guided by injecting the gradient of geometric consistency loss into the noise estimation to maintain the geometric consistency of the generated tri-axis. With the generated tri-axis projection, AxisPose further adopts a Triaxial Back-projection Module (TBM) to recover the 6D pose from the object tri-axis. The proposed AxisPose achieves robust performance at the cross-instance level (i.e., one model for N instances) using only a single view as input without reference images, with great potential for generalization to unseen-object level.

Title: Exploring LLM Agents for Cleaning Tabular Machine Learning Datasets

Authors: Tommaso Bendinelli, Artur Dox, Christian Holz
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06664
Pdf URL: https://arxiv.org/pdf/2503.06664
Copy Paste: [[2503.06664]] Exploring LLM Agents for Cleaning Tabular Machine Learning Datasets(https://arxiv.org/abs/2503.06664)
Keywords: large language model
Abstract: High-quality, error-free datasets are a key ingredient in building reliable, accurate, and unbiased machine learning (ML) models. However, real-world datasets often suffer from errors due to sensor malfunctions, data entry mistakes, or improper data integration across multiple sources that can severely degrade model performance. Detecting and correcting these issues typically requires tailor-made solutions and demands extensive domain expertise. Consequently, automation is challenging, rendering the process labor-intensive and tedious. In this study, we investigate whether Large Language Models (LLMs) can help alleviate the burden of manual data cleaning. We set up an experiment in which an LLM, paired with Python, is tasked with cleaning the training dataset to improve the performance of a learning algorithm without having the ability to modify the training pipeline or perform any feature engineering. We run this experiment on multiple Kaggle datasets that have been intentionally corrupted with errors. Our results show that LLMs can identify and correct erroneous entries, such as illogical values or outliers, by leveraging contextual information from other features within the same row, as well as feedback from previous iterations. However, they struggle to detect more complex errors that require understanding data distribution across multiple rows, such as trends and biases.

Title: Attention, Please! PixelSHAP Reveals What Vision-Language Models Actually Focus On

Authors: Roni Goldshmidt
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2503.06670
Pdf URL: https://arxiv.org/pdf/2503.06670
Copy Paste: [[2503.06670]] Attention, Please! PixelSHAP Reveals What Vision-Language Models Actually Focus On(https://arxiv.org/abs/2503.06670)
Keywords: interpretability, segmentation
Abstract: Interpretability in Vision-Language Models (VLMs) is crucial for trust, debugging, and decision-making in high-stakes applications. We introduce PixelSHAP, a model-agnostic framework extending Shapley-based analysis to structured visual entities. Unlike previous methods focusing on text prompts, PixelSHAP applies to vision-based reasoning by systematically perturbing image objects and quantifying their influence on a VLM's response. PixelSHAP requires no model internals, operating solely on input-output pairs, making it compatible with open-source and commercial models. It supports diverse embedding-based similarity metrics and scales efficiently using optimization techniques inspired by Shapley-based methods. We validate PixelSHAP in autonomous driving, highlighting its ability to enhance interpretability. Key challenges include segmentation sensitivity and object occlusion. Our open-source implementation facilitates further research.
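
Note on Shapley attribution over image objects: a minimal sketch of the underlying idea, assuming a handful of segmented objects and a black-box score for each subset of visible objects. The function `toy_similarity` is a hypothetical stand-in for "similarity between the model's response on the perturbed image and on the full image"; it is not PixelSHAP's API. Exact enumeration is used here for clarity, whereas the abstract states that PixelSHAP relies on optimization techniques to scale.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values; `value` maps a frozenset of visible players to a score."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):  # size of the coalition that p joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of p when added to coalition s.
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi


def toy_similarity(visible):
    """Hypothetical score: how close the response is to the full-image response."""
    score = 0.0
    if "car" in visible:
        score += 0.6
    if "pedestrian" in visible:
        score += 0.3
    if "car" in visible and "pedestrian" in visible:
        score += 0.1  # interaction term, shared between the two objects
    return score


objects = ["car", "pedestrian", "tree"]
phi = shapley_values(objects, toy_similarity)
for name in objects:
    print(f"{name}: {phi[name]:.3f}")
```

The attributions sum to the score of the full image minus the score of the empty image, which is the efficiency property that makes Shapley values attractive for this kind of analysis.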

Title: Emulating Self-attention with Convolution for Efficient Image Super-Resolution

Authors: Dongheon Lee, Seokju Yun, Youngmin Ro
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06671
Pdf URL: https://arxiv.org/pdf/2503.06671
Copy Paste: [[2503.06671]] Emulating Self-attention with Convolution for Efficient Image Super-Resolution(https://arxiv.org/abs/2503.06671)
Keywords: transformer
Abstract: In this paper, we tackle the high computational overhead of transformers for lightweight image super-resolution (SR). Motivated by the observations of self-attention's inter-layer repetition, we introduce a convolutionized self-attention module named Convolutional Attention (ConvAttn) that emulates self-attention's long-range modeling capability and instance-dependent weighting with a single shared large kernel and dynamic kernels. By utilizing the ConvAttn module, we significantly reduce the reliance on self-attention and its involved memory-bound operations while maintaining the representational capability of transformers. Furthermore, we overcome the challenge of integrating flash attention into the lightweight SR regime, effectively mitigating self-attention's inherent memory bottleneck. We scale up window size to 32$\times$32 with flash attention rather than proposing an intricate self-attention module, significantly improving PSNR by 0.31 dB on Urban100$\times$2 while reducing latency and memory usage by 16$\times$ and 12.2$\times$, respectively. Building on these approaches, our proposed network, termed Emulating Self-attention with Convolution (ESC), notably improves PSNR by 0.27 dB on Urban100$\times$4 compared to HiT-SRF, reducing the latency and memory usage by 3.7$\times$ and 6.2$\times$, respectively. Extensive experiments demonstrate that our ESC maintains the ability for long-range modeling, data scalability, and the representational power of transformers despite most self-attentions being replaced by the ConvAttn module.

Title: Learning Few-Step Diffusion Models by Trajectory Distribution Matching

Authors: Yihong Luo, Tianyang Hu, Jiacheng Sun, Yujun Cai, Jing Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06674
Pdf URL: https://arxiv.org/pdf/2503.06674
Copy Paste: [[2503.06674]] Learning Few-Step Diffusion Models by Trajectory Distribution Matching(https://arxiv.org/abs/2503.06674)
Keywords: diffusion, data-free
Abstract: Accelerating diffusion model sampling is crucial for efficient AIGC deployment. While diffusion distillation methods -- based on distribution matching and trajectory matching -- reduce sampling to as few as one step, they fall short on complex tasks like text-to-image generation. Few-step generation offers a better balance between speed and quality, but existing approaches face a persistent trade-off: distribution matching lacks flexibility for multi-step sampling, while trajectory matching often yields suboptimal image quality. To bridge this gap, we propose learning few-step diffusion models by Trajectory Distribution Matching (TDM), a unified distillation paradigm that combines the strengths of distribution and trajectory matching. Our method introduces a data-free score distillation objective, aligning the student's trajectory with the teacher's at the distribution level. Further, we develop a sampling-steps-aware objective that decouples learning targets across different steps, enabling more adjustable sampling. This approach supports both deterministic sampling for superior image quality and flexible multi-step adaptation, achieving state-of-the-art performance with remarkable efficiency. Our model, TDM, outperforms existing methods on various backbones, such as SDXL and PixArt-$\alpha$, delivering superior quality and significantly reduced training costs. In particular, our method distills PixArt-$\alpha$ into a 4-step generator that outperforms its teacher on real user preference at 1024 resolution. This is accomplished with 500 iterations and 2 A800 hours -- a mere 0.01% of the teacher's training cost. In addition, our proposed TDM can be extended to accelerate text-to-video diffusion. Notably, TDM can outperform its teacher model (CogVideoX-2B) by using only 4 NFE on VBench, improving the total score from 80.91 to 81.65. Project page: this https URL

Title: Seeing Delta Parameters as JPEG Images: Data-Free Delta Compression with Discrete Cosine Transform

Authors: Chenyu Huang, Peng Ye, Xiaohui Wang, Shenghe Zheng, Biqing Qi, Lei Bai, Wanli Ouyang, Tao Chen
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2503.06676
Pdf URL: https://arxiv.org/pdf/2503.06676
Copy Paste: [[2503.06676]] Seeing Delta Parameters as JPEG Images: Data-Free Delta Compression with Discrete Cosine Transform(https://arxiv.org/abs/2503.06676)
Keywords: data-free, transformer
Abstract: With transformer-based models and the pretrain-finetune paradigm becoming mainstream, the high storage and deployment costs of individual finetuned models on multiple tasks pose critical challenges. Delta compression attempts to lower the costs by reducing the redundancy of delta parameters (i.e., the difference between the finetuned and pre-trained model weights). However, existing methods usually face problems including data accessibility and training requirements. To tackle this issue, we introduce Delta-DCT, the first data-free delta compression method inspired by classic JPEG image compression, leveraging the Discrete Cosine Transform (DCT). We first (a) group delta parameters within a layer into patches. Then we (b) assess the importance of each patch and allocate them with different quantization bit-widths. Afterwards, we (c) convert these patches to the DCT domain and conduct quantization to each patch based on the allocated bit-width. The proposed Delta-DCT does not require any training or data calibration, while achieving performance comparable to or even surpassing original finetuned models under 1-bit equivalent delta compression ratios on different kinds of models including: (1) recently-released LLMs of different sizes from 7B to 13B, (2) relatively smaller language models including RoBERTa and T5 models, (3) variants of vision transformer models, and (4) multi-modal BEiT-3 models.
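
Note on the three steps (a) to (c): the pipeline can be sketched as below on a flat list of weights. Several choices are assumptions made only for illustration and are not stated in the abstract: patches are 1-D, patch importance is the L2 norm, the more important half of the patches receives the larger bit-width, and quantization is uniform. The function names are hypothetical.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        out.append(s * (math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)))
    return out


def idct(c):
    """Inverse of `dct` (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                 for k in range(1, n))
        out.append(s)
    return out


def quantize(coeffs, bits):
    """Uniform quantization of coefficients to 2**bits levels over [-peak, peak]."""
    peak = max(abs(c) for c in coeffs)
    if peak == 0.0:
        return [0.0] * len(coeffs)
    step = 2 * peak / (2 ** bits - 1)
    return [round((c + peak) / step) * step - peak for c in coeffs]


def compress_delta(finetuned, pretrained, patch_size=4, bit_choices=(4, 2)):
    """Return weights rebuilt from the pretrained model plus a compressed delta."""
    delta = [f - p for f, p in zip(finetuned, pretrained)]
    # (a) group delta parameters into patches
    patches = [delta[i:i + patch_size] for i in range(0, len(delta), patch_size)]
    # (b) score each patch (assumed: L2 norm) and give the top half more bits
    norms = [math.sqrt(sum(v * v for v in p)) for p in patches]
    order = sorted(range(len(patches)), key=lambda i: norms[i], reverse=True)
    high = set(order[: len(order) // 2])
    restored = []
    for i, patch in enumerate(patches):
        bits = bit_choices[0] if i in high else bit_choices[1]
        # (c) transform to the DCT domain, quantize, and transform back
        restored.extend(idct(quantize(dct(patch), bits)))
    return [p + d for p, d in zip(pretrained, restored)]


pretrained = [0.0] * 8
finetuned = [0.5, -0.2, 0.1, 0.0, 0.01, -0.02, 0.0, 0.03]
rebuilt = compress_delta(finetuned, pretrained)
error = math.sqrt(sum((r - f) ** 2 for r, f in zip(rebuilt, finetuned)))
print(f"reconstruction error: {error:.4f}")
```

No data or training is involved at any step, which matches the data-free property the abstract emphasizes; the bit-widths here are illustrative and do not reproduce the paper's 1-bit equivalent setting.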

Title: Dynamic Dictionary Learning for Remote Sensing Image Segmentation

Authors: Xuechao Zou, Yue Li, Shun Zhang, Kai Li, Shiying Wang, Pin Tao, Junliang Xing, Congyan Lang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06683
Pdf URL: https://arxiv.org/pdf/2503.06683
Copy Paste: [[2503.06683]] Dynamic Dictionary Learning for Remote Sensing Image Segmentation(https://arxiv.org/abs/2503.06683)
Keywords: segmentation
Abstract: Remote sensing image segmentation faces persistent challenges in distinguishing morphologically similar categories and adapting to diverse scene variations. While existing methods rely on implicit representation learning paradigms, they often fail to dynamically adjust semantic embeddings according to contextual cues, leading to suboptimal performance in fine-grained scenarios such as cloud thickness differentiation. This work introduces a dynamic dictionary learning framework that explicitly models class ID embeddings through iterative refinement. The core contribution lies in a novel dictionary construction mechanism, where class-aware semantic embeddings are progressively updated via multi-stage alternating cross-attention querying between image features and dictionary embeddings. This process enables adaptive representation learning tailored to input-specific characteristics, effectively resolving ambiguities in intra-class heterogeneity and inter-class homogeneity. To further enhance discriminability, a contrastive constraint is applied to the dictionary space, ensuring compact intra-class distributions while maximizing inter-class separability. Extensive experiments across both coarse- and fine-grained datasets demonstrate consistent improvements over state-of-the-art methods, particularly in two online test benchmarks (LoveDA and UAVid). Code is available at this https URL.

Title: PixelPonder: Dynamic Patch Adaptation for Enhanced Multi-Conditional Text-to-Image Generation

Authors: Yanjie Pan, Qingdong He, Zhengkai Jiang, Pengcheng Xu, Chaoyi Wang, Jinlong Peng, Haoxuan Wang, Yun Cao, Zhenye Gan, Mingmin Chi, Bo Peng, Yabiao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06684
Pdf URL: https://arxiv.org/pdf/2503.06684
Copy Paste: [[2503.06684]] PixelPonder: Dynamic Patch Adaptation for Enhanced Multi-Conditional Text-to-Image Generation(https://arxiv.org/abs/2503.06684)
Keywords: diffusion
Abstract: Recent advances in diffusion-based text-to-image generation have demonstrated promising results through visual condition control. However, existing ControlNet-like methods struggle with compositional visual conditioning - simultaneously preserving semantic fidelity across multiple heterogeneous control signals while maintaining high visual quality, where they employ separate control branches that often introduce conflicting guidance during the denoising process, leading to structural distortions and artifacts in generated images. To address this issue, we present PixelPonder, a novel unified control framework, which allows for effective control of multiple visual conditions under a single control structure. Specifically, we design a patch-level adaptive condition selection mechanism that dynamically prioritizes spatially relevant control signals at the sub-region level, enabling precise local guidance without global interference. Additionally, a time-aware control injection scheme is deployed to modulate condition influence according to denoising timesteps, progressively transitioning from structural preservation to texture refinement and fully utilizing the control information from different categories to promote more harmonious image generation. Extensive experiments demonstrate that PixelPonder surpasses previous methods across different benchmark datasets, showing superior improvement in spatial alignment accuracy while maintaining high textual semantic consistency.

Title: Asymmetric Decision-Making in Online Knowledge Distillation:Unifying Consensus and Divergence

Authors: Zhaowei Chen, Borui Zhao, Yuchen Ge, Yuhao Chen, Renjie Song, Jiajun Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06685
Pdf URL: https://arxiv.org/pdf/2503.06685
Copy Paste: [[2503.06685]] Asymmetric Decision-Making in Online Knowledge Distillation:Unifying Consensus and Divergence(https://arxiv.org/abs/2503.06685)
Keywords: diffusion, segmentation
Abstract: Online Knowledge Distillation (OKD) methods streamline the distillation training process into a single stage, eliminating the need for knowledge transfer from a pretrained teacher network to a more compact student network. This paper presents an innovative approach to leverage intermediate spatial representations. Our analysis of the intermediate features from both teacher and student models reveals two pivotal insights: (1) the features shared by students and teachers are predominantly focused on foreground objects, and (2) teacher models emphasize foreground objects more than students. Building on these findings, we propose Asymmetric Decision-Making (ADM) to enhance feature consensus learning for student models while continuously promoting feature diversity in teacher models. Specifically, Consensus Learning for student models prioritizes spatial features with high consensus relative to teacher models. Conversely, Divergence Learning for teacher models highlights spatial features with lower similarity compared to student models, indicating superior performance by teacher models in these regions. Consequently, ADM helps the student models catch up with the feature learning process of the teacher models. Extensive experiments demonstrate that ADM consistently surpasses existing OKD methods across various online knowledge distillation settings and also achieves superior results when applied to offline knowledge distillation, semantic segmentation, and diffusion distillation tasks.

Title: UniGenX: Unified Generation of Sequence and Structure with Autoregressive Diffusion

Authors: Gongbo Zhang, Yanting Li, Renqian Luo, Pipi Hu, Zeru Zhao, Lingbo Li, Guoqing Liu, Zun Wang, Ran Bi, Kaiyuan Gao, Liya Guo, Yu Xie, Chang Liu, Jia Zhang, Tian Xie, Robert Pinsler, Claudio Zeni, Ziheng Lu, Yingce Xia, Marwin Segler, Maik Riechert, Li Yuan, Lei Chen, Haiguang Liu, Tao Qin
Subjects: cs.LG, cond-mat.mtrl-sci, cs.AI, physics.bio-ph, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2503.06687
Pdf URL: https://arxiv.org/pdf/2503.06687
Copy Paste: [[2503.06687]] UniGenX: Unified Generation of Sequence and Structure with Autoregressive Diffusion(https://arxiv.org/abs/2503.06687)
Keywords: diffusion, generative
Abstract: Unified generation of sequence and structure for scientific data (e.g., materials, molecules, proteins) is a critical task. Existing approaches primarily rely on either autoregressive sequence models or diffusion models, each offering distinct advantages and facing notable limitations. Autoregressive models, such as GPT, Llama, and Phi-4, have demonstrated remarkable success in natural language generation and have been extended to multimodal tasks (e.g., image, video, and audio) using advanced encoders like VQ-VAE to represent complex modalities as discrete sequences. However, their direct application to scientific domains is challenging due to the high precision requirements and the diverse nature of scientific data. On the other hand, diffusion models excel at generating high-dimensional scientific data, such as protein, molecule, and material structures, with remarkable accuracy. Yet, their inability to effectively model sequences limits their potential as general-purpose multimodal foundation models. To address these challenges, we propose UniGenX, a unified framework that combines autoregressive next-token prediction with conditional diffusion models. This integration leverages the strengths of autoregressive models to ease the training of conditional diffusion models, while diffusion-based generative heads enhance the precision of autoregressive predictions. We validate the effectiveness of UniGenX on material and small molecule generation tasks, achieving a significant leap in state-of-the-art performance for material crystal structure prediction and establishing new state-of-the-art results for small molecule structure prediction, de novo design, and conditional generation. Notably, UniGenX demonstrates significant improvements, especially in handling long sequences for complex structures, showcasing its efficacy as a versatile tool for scientific data generation.

Title: Censoring-Aware Tree-Based Reinforcement Learning for Estimating Dynamic Treatment Regimes with Censored Outcomes

Authors: Animesh Kumar Paul, Russell Greiner
Subjects: cs.LG, cs.AI, stat.ME
Abstract URL: https://arxiv.org/abs/2503.06690
Pdf URL: https://arxiv.org/pdf/2503.06690
Copy Paste: [[2503.06690]] Censoring-Aware Tree-Based Reinforcement Learning for Estimating Dynamic Treatment Regimes with Censored Outcomes(https://arxiv.org/abs/2503.06690)
Keywords: robust
Abstract: Dynamic Treatment Regimes (DTRs) provide a systematic approach for making sequential treatment decisions that adapt to individual patient characteristics, particularly in clinical contexts where survival outcomes are of interest. Censoring-Aware Tree-Based Reinforcement Learning (CA-TRL) is a novel framework to address the complexities associated with censored data when estimating optimal DTRs. We explore ways to learn effective DTRs from observational data. By enhancing traditional tree-based reinforcement learning methods with augmented inverse probability weighting (AIPW) and censoring-aware modifications, CA-TRL delivers robust and interpretable treatment strategies. We demonstrate its effectiveness through extensive simulations and real-world applications using the SANAD epilepsy dataset, where it outperformed the recently proposed ASCL method in key metrics such as restricted mean survival time (RMST) and decision-making accuracy. This work represents a step forward in advancing personalized and data-driven treatment strategies across diverse healthcare settings.

Title: InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models

Authors: Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Mengdi Zhang, Jian Shao, Yueting Zhuang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06692
Pdf URL: https://arxiv.org/pdf/2503.06692
Copy Paste: [[2503.06692]] InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models(https://arxiv.org/abs/2503.06692)
Keywords: large language model
Abstract: Advanced reasoning in large language models has achieved remarkable performance on challenging tasks, but the prevailing long-context reasoning paradigm faces critical limitations: quadratic computational scaling with sequence length, reasoning constrained by maximum context boundaries, and performance degradation beyond pre-training context windows. Existing approaches primarily compress reasoning chains without addressing the fundamental scaling problem. To overcome these challenges, we introduce InftyThink, a paradigm that transforms monolithic reasoning into an iterative process with intermediate summarization. By interleaving short reasoning segments with concise progress summaries, our approach enables unbounded reasoning depth while maintaining bounded computational costs. This creates a characteristic sawtooth memory pattern that significantly reduces computational complexity compared to traditional approaches. Furthermore, we develop a methodology for reconstructing long-context reasoning datasets into our iterative format, transforming OpenR1-Math into 333K training instances. Experiments across multiple model architectures demonstrate that our approach reduces computational costs while improving performance, with Qwen2.5-Math-7B showing 3-13% improvements across MATH500, AIME24, and GPQA_diamond benchmarks. Our work challenges the assumed trade-off between reasoning depth and computational efficiency, providing a more scalable approach to complex reasoning without architectural modifications.
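The iterative pattern described in this abstract (short reasoning segments interleaved with progress summaries) can be illustrated with a minimal sketch. This is not the authors' implementation: `iterative_reasoning`, `reason_step`, `summarize`, and the toy stand-ins below are hypothetical names, and context size is measured in characters purely for illustration. The point is only that each round sees the question, the previous summary, and the new segment, so the working context stays bounded however many rounds are run.

```python
from typing import Callable, List, Tuple


def iterative_reasoning(
    question: str,
    reason_step: Callable[[str, str], Tuple[str, bool]],
    summarize: Callable[[str, str], str],
    max_rounds: int = 8,
) -> Tuple[str, List[int]]:
    """Alternate short reasoning segments with summaries (hypothetical sketch).

    Returns the final segment (or the last summary if max_rounds is hit)
    and the context size observed in each round.
    """
    summary = ""
    context_sizes: List[int] = []
    for _ in range(max_rounds):
        # One bounded reasoning segment, conditioned only on the question
        # and the summary of everything reasoned so far.
        segment, done = reason_step(question, summary)
        # The context in this round is question + previous summary + segment;
        # earlier segments are not carried along.
        context_sizes.append(len(question) + len(summary) + len(segment))
        if done:
            return segment, context_sizes
        # Compress progress into a new summary; the segment is then dropped.
        summary = summarize(summary, segment)
    return summary, context_sizes


def toy_reason_step(question: str, summary: str) -> Tuple[str, bool]:
    # Toy stand-in for a model call: the summary is a count of finished rounds.
    rounds_done = int(summary or "0")
    segment = f"thinking about round {rounds_done + 1} " * 3
    return segment, rounds_done + 1 >= 4


def toy_summarize(summary: str, segment: str) -> str:
    # Toy stand-in for model-written summarization: just bump the counter.
    return str(int(summary or "0") + 1)


answer, sizes = iterative_reasoning("What is 2+2?", toy_reason_step, toy_summarize)
```

In this toy run the per-round context size stays essentially flat across rounds, whereas a monolithic chain that kept every segment would grow with each round.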

Title: What's in a Latent? Leveraging Diffusion Latent Space for Domain Generalization

Authors: Xavier Thomas, Deepti Ghadiyaram
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.06698
Pdf URL: https://arxiv.org/pdf/2503.06698
Copy Paste: [[2503.06698]] What's in a Latent? Leveraging Diffusion Latent Space for Domain Generalization(https://arxiv.org/abs/2503.06698)
Keywords: diffusion
Abstract: Domain Generalization aims to develop models that can generalize to novel and unseen data distributions. In this work, we study how model architectures and pre-training objectives impact feature richness and propose a method to effectively leverage them for domain generalization. Specifically, given a pre-trained feature space, we first discover latent domain structures, referred to as pseudo-domains, that capture domain-specific variations in an unsupervised manner. Next, we augment existing classifiers with these complementary pseudo-domain representations, making them more amenable to diverse unseen test domains. We analyze how different pre-training feature spaces differ in the domain-specific variances they capture. Our empirical studies reveal that features from diffusion models excel at separating domains in the absence of explicit domain labels and capture nuanced domain-specific information. On 5 datasets, we show that our very simple framework improves generalization to unseen domains, with a maximum test accuracy improvement of over 4% compared to the standard baseline Empirical Risk Minimization (ERM). Crucially, our method outperforms most algorithms that access domain labels during training.

Title: Unsupervised Multi-Clustering and Decision-Making Strategies for 4D-STEM Orientation Mapping

Authors: Junhao Cao, Nicolas Folastre, Gozde Oney, Edgar Rauch, Stavros Nicolopoulos, Partha Pratim Das, Arnaud Demortière
Subjects: cs.LG, cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.06699
Pdf URL: https://arxiv.org/pdf/2503.06699
Copy Paste: [[2503.06699]] Unsupervised Multi-Clustering and Decision-Making Strategies for 4D-STEM Orientation Mapping(https://arxiv.org/abs/2503.06699)
Keywords: robust
Abstract: This study presents a novel integration of unsupervised learning and decision-making strategies for the advanced analysis of 4D-STEM datasets, with a focus on non-negative matrix factorization (NMF) as the primary clustering method. Our approach introduces a systematic framework to determine the optimal number of components (k) required for robust and interpretable orientation mapping. By leveraging the K-Component Loss method and Image Quality Assessment (IQA) metrics, we effectively balance reconstruction fidelity and model complexity. Additionally, we highlight the critical role of dataset preprocessing in improving clustering stability and accuracy. Furthermore, our spatial weight matrix analysis provides insights into overlapping regions within the dataset by employing threshold-based visualization, facilitating a detailed understanding of cluster interactions. The results demonstrate the potential of combining NMF with advanced IQA metrics and preprocessing techniques for reliable orientation mapping and structural analysis in 4D-STEM datasets, paving the way for future applications in multi-dimensional material characterization.

Title: MemorySAM: Memorize Modalities and Semantics with Segment Anything Model 2 for Multi-modal Semantic Segmentation

Authors: Chenfei Liao, Xu Zheng, Yuanhuiyi Lyu, Haiwei Xue, Yihong Cao, Jiawen Wang, Kailun Yang, Xuming Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06700
Pdf URL: https://arxiv.org/pdf/2503.06700
Copy Paste: [[2503.06700]] MemorySAM: Memorize Modalities and Semantics with Segment Anything Model 2 for Multi-modal Semantic Segmentation(https://arxiv.org/abs/2503.06700)
Keywords: segmentation
Abstract: Research has focused on Multi-Modal Semantic Segmentation (MMSS), where pixel-wise predictions are derived from multiple visual modalities captured by diverse sensors. Recently, the large vision model, Segment Anything Model 2 (SAM2), has shown strong zero-shot segmentation performance on both images and videos. When extending SAM2 to MMSS, two issues arise: 1. How can SAM2 be adapted to multi-modal data? 2. How can SAM2 better understand semantics? Inspired by cross-frame correlation in videos, we propose to treat multi-modal data as a sequence of frames representing the same scene. Our key idea is to "memorize" the modality-agnostic information and "memorize" the semantics related to the targeted scene. To achieve this, we apply SAM2's memory mechanisms across multi-modal data to capture modality-agnostic features. Meanwhile, to memorize the semantic knowledge, we propose a training-only Semantic Prototype Memory Module (SPMM) to store category-level prototypes across training for facilitating SAM2's transition from instance to semantic segmentation. A prototypical adaptation loss is imposed between global and local prototypes iteratively to align and refine SAM2's semantic understanding. Extensive experimental results demonstrate that our proposed MemorySAM outperforms SoTA methods by large margins on both synthetic and real-world benchmarks (65.38% on DELIVER, 52.88% on MCubeS). Source code will be made publicly available.

Title: PFDial: A Structured Dialogue Instruction Fine-tuning Method Based on UML Flowcharts

Authors: Ming Zhang, Yuhui Wang, Yujiong Shen, Tingyi Yang, Changhao Jiang, Yilong Wu, Shihan Dou, Qinhao Chen, Zhiheng Xi, Zhihao Zhang, Yi Dong, Zhen Wang, Zhihui Fei, Mingyang Wan, Tao Liang, Guojun Ma, Qi Zhang, Tao Gui, Xuanjing Huang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06706
Pdf URL: https://arxiv.org/pdf/2503.06706
Copy Paste: [[2503.06706]] PFDial: A Structured Dialogue Instruction Fine-tuning Method Based on UML Flowcharts(https://arxiv.org/abs/2503.06706)
Keywords: large language model
Abstract: Process-driven dialogue systems, which operate under strict predefined process constraints, are essential in customer service and equipment maintenance scenarios. Although Large Language Models (LLMs) have shown remarkable progress in dialogue and reasoning, they still struggle to solve these strictly constrained dialogue tasks. To address this challenge, we construct the Process Flow Dialogue (PFDial) dataset, which contains 12,705 high-quality Chinese dialogue instructions derived from 440 flowcharts containing 5,055 process nodes. Based on the PlantUML specification, each UML flowchart is converted into atomic dialogue units, i.e., structured five-tuples. Experimental results demonstrate that a 7B model trained with merely 800 samples and a 0.5B model trained on the total data can both surpass 90% accuracy. Additionally, the 8B model can surpass GPT-4o by up to 43.88%, with an average of 11.00%. We further evaluate models' performance on challenging backward transitions in process flows and conduct an in-depth analysis of various dataset formats to reveal their impact on model performance in handling decision and sequential branches. The data is released in this https URL.

Title: Alignment for Efficient Tool Calling of Large Language Models

Authors: Hongshen Xu, Zihan Wang, Zichen Zhu, Lei Pan, Xingyu Chen, Lu Chen, Kai Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06708
Pdf URL: https://arxiv.org/pdf/2503.06708
Copy Paste: [[2503.06708]] Alignment for Efficient Tool Calling of Large Language Models(https://arxiv.org/abs/2503.06708)
Keywords: large language model
Abstract: Recent advancements in tool learning have enabled large language models (LLMs) to integrate external tools, enhancing their task performance by expanding their knowledge boundaries. However, relying on tools often introduces tradeoffs between performance, speed, and cost, with LLMs sometimes exhibiting overreliance and overconfidence in tool usage. This paper addresses the challenge of aligning LLMs with their knowledge boundaries to make more intelligent decisions about tool invocation. We propose a multi-objective alignment framework that combines probabilistic knowledge boundary estimation with dynamic decision-making, allowing LLMs to better assess when to invoke tools based on their confidence. Our framework includes two methods for knowledge boundary estimation, consistency-based and absolute estimation, and two training strategies for integrating these estimates into the model's decision-making process. Experimental results on various tool invocation scenarios demonstrate the effectiveness of our framework, showing significant improvements in tool efficiency by reducing unnecessary tool usage.

Title: Delusions of Large Language Models

Authors: Hongshen Xu, Zixv yang, Zichen Zhu, Kunyao Lan, Zihan Wang, Mengyue Wu, Ziwei Ji, Lu Chen, Pascale Fung, Kai Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06709
Pdf URL: https://arxiv.org/pdf/2503.06709
Copy Paste: [[2503.06709]] Delusions of Large Language Models(https://arxiv.org/abs/2503.06709)
Keywords: large language model
Abstract: Large Language Models often generate factually incorrect but plausible outputs, known as hallucinations. We identify a more insidious phenomenon, LLM delusion, defined as high-belief hallucinations: incorrect outputs with abnormally high confidence, making them harder to detect and mitigate. Unlike ordinary hallucinations, delusions persist with low uncertainty, posing significant challenges to model reliability. Through empirical analysis across different model families and sizes on several Question Answering tasks, we show that delusions are prevalent and distinct from hallucinations. LLMs exhibit lower honesty with delusions, which are harder to override via finetuning or self-reflection. We link delusion formation with training dynamics and dataset noise and explore mitigation strategies such as retrieval-augmented generation and multi-agent debating. By systematically investigating the nature, prevalence, and mitigation of LLM delusions, our study provides insights into the underlying causes of this phenomenon and outlines future directions for improving model reliability.

Title: Continuous Online Adaptation Driven by User Interaction for Medical Image Segmentation

Authors: Wentian Xu, Ziyun Liang, Harry Anthony, Yasin Ibrahim, Felix Cohen, Guang Yang, Daniel Whitehouse, David Menon, Virginia Newcombe, Konstantinos Kamnitsas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06717
Pdf URL: https://arxiv.org/pdf/2503.06717
Copy Paste: [[2503.06717]] Continuous Online Adaptation Driven by User Interaction for Medical Image Segmentation(https://arxiv.org/abs/2503.06717)
Keywords: segmentation
Abstract: Interactive segmentation models use real-time user interactions, such as mouse clicks, as extra inputs to dynamically refine the model predictions. After model deployment, user corrections of model predictions could be used to adapt the model to the post-deployment data distribution, countering distribution-shift and enhancing reliability. Motivated by this, we introduce an online adaptation framework that enables an interactive segmentation model to continuously learn from user interaction and improve its performance on new data distributions, as it processes a sequence of test images. We introduce the Gaussian Point Loss function to train the model how to leverage user clicks, along with a two-stage online optimization method that adapts the model using the corrected predictions generated via user interactions. We demonstrate that this simple and therefore practical approach is very effective. Experiments on 5 fundus and 4 brain MRI databases demonstrate that our method outperforms existing approaches under various data distribution shifts, including segmentation of image modalities and pathologies not seen during training.

Title: Enhancing CBMs Through Binary Distillation with Applications to Test-Time Intervention

Authors: Matthew Shen, Aliyah Hsu, Abhineet Agarwal, Bin Yu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06730
Pdf URL: https://arxiv.org/pdf/2503.06730
Copy Paste: [[2503.06730]] Enhancing CBMs Through Binary Distillation with Applications to Test-Time Intervention(https://arxiv.org/abs/2503.06730)
Keywords: interpretability
Abstract: Concept bottleneck models (CBM) aim to improve model interpretability by predicting human level "concepts" in a bottleneck within a deep learning model architecture. However, how the predicted concepts are used in predicting the target still either remains black-box or is simplified to maintain interpretability at the cost of prediction performance. We propose to use Fast Interpretable Greedy Sum-Trees (FIGS) to obtain Binary Distillation (BD). This new method, called FIGS-BD, distills a binary-augmented concept-to-target portion of the CBM into an interpretable tree-based model, while mimicking the competitive prediction performance of the CBM teacher. FIGS-BD can be used in downstream tasks to explain and decompose CBM predictions into interpretable binary-concept-interaction attributions and guide adaptive test-time intervention. Across 4 datasets, we demonstrate that adaptive test-time intervention identifies key concepts that significantly improve performance for realistic human-in-the-loop settings that allow for limited concept interventions.

Title: Data Efficient Subset Training with Differential Privacy

Authors: Ninad Jayesh Gandhi, Moparthy Venkata Subrahmanya Sri Harsha
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06732
Pdf URL: https://arxiv.org/pdf/2503.06732
Copy Paste: [[2503.06732]] Data Efficient Subset Training with Differential Privacy(https://arxiv.org/abs/2503.06732)
Keywords: privacy
Abstract: Private machine learning introduces a trade-off between the privacy budget and training performance. Training convergence is substantially slower and extensive hyperparameter tuning is required. Consequently, efficient methods to conduct private training of models are thoroughly investigated in the literature. To this end, we investigate the strength of data-efficient model training methods in the private training setting. We adapt GLISTER (Killamsetty et al., 2021b) to the private setting and extensively assess its performance. We empirically find that practical choices of privacy budgets are too restrictive for data-efficient training in the private setting.

Title: D3DR: Lighting-Aware Object Insertion in Gaussian Splatting

Authors: Vsevolod Skorokhodov, Nikita Durasov, Pascal Fua
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06740
Pdf URL: https://arxiv.org/pdf/2503.06740
Copy Paste: [[2503.06740]] D3DR: Lighting-Aware Object Insertion in Gaussian Splatting(https://arxiv.org/abs/2503.06740)
Keywords: diffusion
Abstract: Gaussian Splatting has become a popular technique for various 3D Computer Vision tasks, including novel view synthesis, scene reconstruction, and dynamic scene rendering. However, the challenge of natural-looking object insertion, where the object's appearance seamlessly matches the scene, remains unsolved. In this work, we propose a method, dubbed D3DR, for inserting a 3DGS-parametrized object into 3DGS scenes while correcting its lighting, shadows, and other visual artifacts to ensure consistency, a problem that has not been successfully addressed before. We leverage advances in diffusion models, which, trained on real-world data, implicitly understand correct scene lighting. After inserting the object, we optimize a diffusion-based Delta Denoising Score (DDS)-inspired objective to adjust its 3D Gaussian parameters for proper lighting correction. Utilizing diffusion model personalization techniques to improve optimization quality, our approach ensures seamless object insertion and natural appearance. Finally, we demonstrate the method's effectiveness by comparing it to existing approaches, achieving 0.5 PSNR and 0.15 SSIM improvements in relighting quality.

Title: CoDa-4DGS: Dynamic Gaussian Splatting with Context and Deformation Awareness for Autonomous Driving

Authors: Rui Song, Chenwei Liang, Yan Xia, Walter Zimmer, Hu Cao, Holger Caesar, Andreas Festag, Alois Knoll
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06744
Pdf URL: https://arxiv.org/pdf/2503.06744
Copy Paste: [[2503.06744]] CoDa-4DGS: Dynamic Gaussian Splatting with Context and Deformation Awareness for Autonomous Driving(https://arxiv.org/abs/2503.06744)
Keywords: segmentation
Abstract: Dynamic scene rendering opens new avenues in autonomous driving by enabling closed-loop simulations with photorealistic data, which is crucial for validating end-to-end algorithms. However, the complex and highly dynamic nature of traffic environments presents significant challenges in accurately rendering these scenes. In this paper, we introduce a novel 4D Gaussian Splatting (4DGS) approach, which incorporates context and temporal deformation awareness to improve dynamic scene rendering. Specifically, we employ a 2D semantic segmentation foundation model to self-supervise the 4D semantic features of Gaussians, ensuring meaningful contextual embedding. Simultaneously, we track the temporal deformation of each Gaussian across adjacent frames. By aggregating and encoding both semantic and temporal deformation features, each Gaussian is equipped with cues for potential deformation compensation within 3D space, facilitating a more precise representation of dynamic scenes. Experimental results show that our method improves 4DGS's ability to capture fine details in dynamic scene rendering for autonomous driving and outperforms other self-supervised methods in 4D reconstruction and novel view synthesis. Furthermore, CoDa-4DGS deforms semantic features with each Gaussian, enabling broader applications.

Title: Color Alignment in Diffusion

Authors: Ka Chun Shum, Binh-Son Hua, Duc Thanh Nguyen, Sai-Kit Yeung
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06746
Pdf URL: https://arxiv.org/pdf/2503.06746
Copy Paste: [[2503.06746]] Color Alignment in Diffusion(https://arxiv.org/abs/2503.06746)
Keywords: diffusion, generative
Abstract: Diffusion models have shown great promise in synthesizing visually appealing images. However, it remains challenging to condition the synthesis at a fine-grained level, for instance, synthesizing image pixels following some generic color pattern. Existing image synthesis methods often produce contents that fall outside the desired pixel conditions. To address this, we introduce a novel color alignment algorithm that confines the generative process in diffusion models within a given color pattern. Specifically, we project diffusion terms, either imagery samples or latent representations, into a conditional color space to align with the input color distribution. This strategy simplifies the prediction in diffusion models within a color manifold while still allowing plausible structures in generated contents, thus enabling the generation of diverse contents that comply with the target color pattern. Experimental results demonstrate our state-of-the-art performance in conditioning and controlling of color pixels, while maintaining on-par generation quality and diversity in comparison with regular diffusion models.

Title: DiffAtlas: GenAI-fying Atlas Segmentation via Image-Mask Diffusion

Authors: Hantao Zhang, Yuhe Liu, Jiancheng Yang, Weidong Guo, Xinyuan Wang, Pascal Fua
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06748
Pdf URL: https://arxiv.org/pdf/2503.06748
Copy Paste: [[2503.06748]] DiffAtlas: GenAI-fying Atlas Segmentation via Image-Mask Diffusion(https://arxiv.org/abs/2503.06748)
Keywords: robust, diffusion, generative, segmentation
Abstract: Accurate medical image segmentation is crucial for precise anatomical delineation. Deep learning models like U-Net have shown great success but depend heavily on large datasets and struggle with domain shifts, complex structures, and limited training samples. Recent studies have explored diffusion models for segmentation by iteratively refining masks. However, these methods still retain the conventional image-to-mask mapping, making them highly sensitive to input data, which hampers stability and generalization. In contrast, we introduce DiffAtlas, a novel generative framework that models both images and masks through diffusion during training, effectively "GenAI-fying" atlas-based segmentation. During testing, the model is guided to generate a specific target image-mask pair, from which the corresponding mask is obtained. DiffAtlas retains the robustness of the atlas paradigm while overcoming its scalability and domain-specific limitations. Extensive experiments on CT and MRI across same-domain, cross-modality, varying-domain, and different data-scale settings using the MMWHS and TotalSegmentator datasets demonstrate that our approach outperforms existing methods, particularly in limited-data and zero-shot modality segmentation. Code is available at this https URL.

Title: Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Authors: Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Yao Hu, Shaohui Lin
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06749
Pdf URL: https://arxiv.org/pdf/2503.06749
Copy Paste: [[2503.06749]] Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models(https://arxiv.org/abs/2503.06749)
Keywords: large language model
Abstract: DeepSeek-R1-Zero has successfully demonstrated the emergence of reasoning capabilities in LLMs purely through Reinforcement Learning (RL). Inspired by this breakthrough, we explore how RL can be utilized to enhance the reasoning capability of MLLMs. However, direct training with RL struggles to activate complex reasoning capabilities such as questioning and reflection in MLLMs, due to the absence of substantial high-quality multimodal reasoning data. To address this issue, we propose the reasoning MLLM, Vision-R1, to improve multimodal reasoning capability. Specifically, we first construct a high-quality multimodal CoT dataset without human annotations by leveraging an existing MLLM and DeepSeek-R1 through modality bridging and data filtering to obtain a 200K multimodal CoT dataset, the Vision-R1-cold dataset. It serves as cold-start initialization data for Vision-R1. To mitigate the optimization challenges caused by overthinking after cold start, we propose the Progressive Thinking Suppression Training (PTST) strategy and employ Group Relative Policy Optimization (GRPO) with the hard formatting result reward function to gradually refine the model's ability to learn correct and complex reasoning processes on a 10K multimodal math dataset. Comprehensive experiments show our model achieves an average improvement of ~6% across various multimodal math reasoning benchmarks. Vision-R1-7B achieves 73.5% accuracy on the widely used MathVista benchmark, which is only 0.4% lower than the leading reasoning model, OpenAI O1. The datasets and code will be released in: this https URL.

Title: Primal-Dual Sample Complexity Bounds for Constrained Markov Decision Processes with Multiple Constraints

Authors: Max Buckley, Konstantinos Papathanasiou, Andreas Spanopoulos
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06751
Pdf URL: https://arxiv.org/pdf/2503.06751
Copy Paste: [[2503.06751]] Primal-Dual Sample Complexity Bounds for Constrained Markov Decision Processes with Multiple Constraints(https://arxiv.org/abs/2503.06751)
Keywords: generative
Abstract: This paper addresses the challenge of solving Constrained Markov Decision Processes (CMDPs) with $d > 1$ constraints when the transition dynamics are unknown, but samples can be drawn from a generative model. We propose a model-based algorithm for infinite horizon CMDPs with multiple constraints in the tabular setting, aiming to derive and prove sample complexity bounds for learning near-optimal policies. Our approach tackles both the relaxed and strict feasibility settings, where relaxed feasibility allows some constraint violations, and strict feasibility requires adherence to all constraints. The main contributions include the development of the algorithm and the derivation of sample complexity bounds for both settings. For the relaxed feasibility setting we show that our algorithm requires $\tilde{\mathcal{O}} \left( \frac{d |\mathcal{S}| |\mathcal{A}| \log(1/\delta)}{(1-\gamma)^3\epsilon^2} \right)$ samples to return an $\epsilon$-optimal policy, while in the strict feasibility setting it requires $\tilde{\mathcal{O}} \left( \frac{d^3 |\mathcal{S}| |\mathcal{A}| \log(1/\delta)}{(1-\gamma)^5\epsilon^2{\zeta_{\mathbf{c}}^*}^2} \right)$ samples.
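As a rough reading aid, the two bounds quoted in this abstract can be compared by dividing the leading terms. The sketch below ignores constants and the logarithmic factors hidden by $\tilde{\mathcal{O}}$, and treats $\zeta_{\mathbf{c}}^*$ as a positive constant whose definition is given in the paper rather than in the abstract. Both expressions are upper bounds, so the ratio compares the bounds, not the true sample complexities; the values $d = 2$ and $\gamma = 0.9$ are illustrative assumptions only.

```latex
\[
\frac{\dfrac{d^{3}\,|\mathcal{S}|\,|\mathcal{A}|\,\log(1/\delta)}{(1-\gamma)^{5}\,\epsilon^{2}\,{\zeta_{\mathbf{c}}^{*}}^{2}}}
     {\dfrac{d\,|\mathcal{S}|\,|\mathcal{A}|\,\log(1/\delta)}{(1-\gamma)^{3}\,\epsilon^{2}}}
\;=\;
\frac{d^{2}}{(1-\gamma)^{2}\,{\zeta_{\mathbf{c}}^{*}}^{2}},
\qquad
\text{e.g. } d = 2,\ \gamma = 0.9:\quad
\frac{4}{(0.1)^{2}\,{\zeta_{\mathbf{c}}^{*}}^{2}} = \frac{400}{{\zeta_{\mathbf{c}}^{*}}^{2}} .
\]
```

Under these assumptions, the strict-feasibility bound is larger than the relaxed one by a factor that grows quadratically in the number of constraints and in the effective horizon $1/(1-\gamma)$.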

Title: Revisiting Invariant Learning for Out-of-Domain Generalization on Multi-Site Mammogram Datasets

Authors: Hung Q. Vo, Samira Zare, Son T. Ly, Lin Wang, Chika F. Ezeana, Xiaohui Yu, Kelvin K. Wong, Stephen T.C. Wong, Hien V. Nguyen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06759
Pdf URL: https://arxiv.org/pdf/2503.06759
Copy Paste: [[2503.06759]] Revisiting Invariant Learning for Out-of-Domain Generalization on Multi-Site Mammogram Datasets(https://arxiv.org/abs/2503.06759)
Keywords: robust, interpretability
Abstract: Despite significant progress in robust deep learning techniques for mammogram breast cancer classification, their reliability in real-world clinical development settings remains uncertain. The translation of these models to clinical practice faces challenges due to variations in medical centers, imaging protocols, and patient populations. To enhance their robustness, invariant learning methods have been proposed, prioritizing causal factors over misleading features. However, their effectiveness in clinical development and impact on mammogram classification require investigation. This paper reassesses the application of invariant learning for breast cancer risk estimation based on mammograms. Utilizing diverse multi-site public datasets, it represents the first study in this area. The objective is to evaluate invariant learning's benefits in developing robust models. Invariant learning methods, including Invariant Risk Minimization and Variance Risk Extrapolation, are compared quantitatively against Empirical Risk Minimization. Evaluation metrics include accuracy, average precision, and area under the curve. Additionally, interpretability is examined through class activation maps and visualization of learned representations. This research examines the advantages, limitations, and challenges of invariant learning for mammogram classification, guiding future studies to develop generalized methods for breast cancer prediction on whole mammograms in out-of-domain scenarios.

Title: SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation

Authors: Zisheng Chen, Chunwei Wang, Xiuwei Chen, Hang Xu, Jianhua Han, Xiandan Liang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06764
Pdf URL: https://arxiv.org/pdf/2503.06764
Copy Paste: [[2503.06764]] SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation(https://arxiv.org/abs/2503.06764)
Keywords: extraction
Abstract: We present SemHiTok, a unified image Tokenizer via a Semantic-Guided Hierarchical codebook that provides consistent discrete feature representations for multimodal understanding and generation tasks. Recently, unified multimodal large models (MLLMs) for understanding and generation have sparked exploration within the research community. Previous works attempt to train a unified image tokenizer by combining loss functions for semantic feature reconstruction and pixel reconstruction. However, due to the differing levels of features prioritized by multimodal understanding and generation tasks, joint training methods face significant challenges in achieving a good trade-off. SemHiTok addresses this challenge through a Semantic-Guided Hierarchical codebook, which builds texture sub-codebooks on a pre-trained semantic codebook. This design decouples the training of semantic reconstruction and pixel reconstruction and equips the tokenizer with low-level texture feature extraction capability without degradation of high-level semantic feature extraction ability. Our experiments demonstrate that SemHiTok achieves a state-of-the-art rFID score at 256×256 resolution compared to other unified tokenizers, and exhibits competitive performance on multimodal understanding and generation tasks.

Title: Large Language Models Are Effective Human Annotation Assistants, But Not Good Independent Annotators

Authors: Feng Gu, Zongxia Li, Carlos Rafael Colon, Benjamin Evans, Ishani Mondal, Jordan Lee Boyd-Graber
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06778
Pdf URL: https://arxiv.org/pdf/2503.06778
Copy Paste: [[2503.06778]] Large Language Models Are Effective Human Annotation Assistants, But Not Good Independent Annotators(https://arxiv.org/abs/2503.06778)
Keywords: extraction, large language model
Abstract: Event annotation is important for identifying market changes, monitoring breaking news, and understanding sociological trends. Although expert annotators set the gold standards, human coding is expensive and inefficient. Unlike information extraction experiments that focus on single contexts, we evaluate a holistic workflow that removes irrelevant documents, merges documents about the same event, and annotates the events. Although LLM-based automated annotations are better than traditional TF-IDF-based methods or Event Set Curation, they are still not reliable annotators compared to human experts. However, adding LLMs to assist experts for Event Set Curation can reduce the time and mental effort required for Variable Annotation. When LLMs are used to extract event variables to assist expert annotators, the experts agree more with the extracted variables than with fully automated LLM annotation.

Title: Dr Genre: Reinforcement Learning from Decoupled LLM Feedback for Generic Text Rewriting

Authors: Yufei Li, John Nham, Ganesh Jawahar, Lei Shu, David Uthus, Yun-Hsuan Sung, Chengrun Yang, Itai Rolnick, Yi Qiao, Cong Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06781
Pdf URL: https://arxiv.org/pdf/2503.06781
Copy Paste: [[2503.06781]] Dr Genre: Reinforcement Learning from Decoupled LLM Feedback for Generic Text Rewriting(https://arxiv.org/abs/2503.06781)
Keywords: large language model
Abstract: Generic text rewriting is a prevalent large language model (LLM) application that covers diverse real-world tasks, such as style transfer, fact correction, and email editing. These tasks vary in rewriting objectives (e.g., factual consistency vs. semantic preservation), making it challenging to develop a unified model that excels across all dimensions. Existing methods often specialize in either a single task or a specific objective, limiting their generalizability. In this work, we introduce a generic model proficient in factuality, stylistic, and conversational rewriting tasks. To simulate real-world user rewrite requests, we construct a conversational rewrite dataset, ChatRewrite, that presents "natural"-sounding instructions, from raw emails using LLMs. Combined with other popular rewrite datasets, including LongFact for the factuality rewrite task and RewriteLM for the stylistic rewrite task, this forms a broad benchmark for training and evaluating generic rewrite models. To align with task-specific objectives, we propose Dr Genre, a Decoupled-reward learning framework for Generic rewriting, that utilizes objective-oriented reward models with a task-specific weighting. Evaluation shows that Dr Genre delivers higher-quality rewrites across all targeted tasks, improving objectives including instruction following (agreement), internal consistency (coherence), and minimal unnecessary edits (conciseness).

Title: Key Establishment in the Space Environment

Authors: Benjamin Dowling, Britta Hale, Xisen Tian, Bhagya Wimalasiri
Subjects: cs.CR, cs.NI
Abstract URL: https://arxiv.org/abs/2503.06785
Pdf URL: https://arxiv.org/pdf/2503.06785
Copy Paste: [[2503.06785]] Key Establishment in the Space Environment(https://arxiv.org/abs/2503.06785)
Keywords: security
Abstract: As reliance on space systems continues to increase, so does the need to ensure security for them. However, public work in space standards has struggled with defining security protocols that are well tailored to the domain and its risks. In this work, we investigate various space networking paradigms and security approaches, and identify trade-offs and gaps. Furthermore, we describe potential existing security protocol approaches that fit well into the space network paradigm in terms of both functionality and security. Finally, we establish future directions for enabling strong security for space communication.

Title: GenDR: Lightning Generative Detail Restorator

Authors: Yan Wang, Shijie Zhao, Kai Chen, Kexin Zhang, Junlin Li, Li Zhang
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2503.06790
Pdf URL: https://arxiv.org/pdf/2503.06790
Copy Paste: [[2503.06790]] GenDR: Lightning Generative Detail Restorator(https://arxiv.org/abs/2503.06790)
Keywords: diffusion, generative
Abstract: Recent research applying text-to-image (T2I) diffusion models to real-world super-resolution (SR) has achieved remarkable success. However, fundamental misalignments between T2I and SR targets result in a dilemma between inference speed and detail fidelity. Specifically, T2I tasks prioritize multi-step inversion to synthesize coherent outputs aligned with textual prompts and shrink the latent space to reduce generating complexity. Contrariwise, SR tasks preserve most information from low-resolution input while solely restoring high-frequency details, thus necessitating sufficient latent space and fewer inference steps. To bridge the gap, we present a one-step diffusion model for generative detail restoration, GenDR, distilled from a tailored diffusion model with larger latent space. In detail, we train a new SD2.1-VAE16 (0.9B) via representation alignment to expand latent space without enlarging the model size. Regarding step-distillation, we propose consistent score identity distillation (CiD) that incorporates SR task-specific loss into score distillation to leverage more SR priors and align the training target. Furthermore, we extend CiD with adversarial learning and representation alignment (CiDA) to enhance perceptual quality and accelerate training. We also polish the pipeline to achieve a more efficient inference. Experimental results demonstrate that GenDR achieves state-of-the-art performance in both quantitative metrics and visual fidelity.

Title: VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation

Authors: Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grover, Kai-Wei Chang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06800
Pdf URL: https://arxiv.org/pdf/2503.06800
Copy Paste: [[2503.06800]] VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation(https://arxiv.org/abs/2503.06800)
Keywords: generative
Abstract: Large-scale video generative models, capable of creating realistic videos of diverse visual concepts, are strong candidates for general-purpose physical world simulators. However, their adherence to physical commonsense across real-world actions (e.g., playing tennis, backflip) remains unclear. Existing benchmarks suffer from limitations such as limited size, lack of human evaluation, sim-to-real gaps, and absence of fine-grained physical rule analysis. To address this, we introduce VideoPhy-2, an action-centric dataset for evaluating physical commonsense in generated videos. We curate 200 diverse actions and detailed prompts for video synthesis from modern generative models. We perform human evaluation that assesses semantic adherence, physical commonsense, and grounding of physical rules in the generated videos. Our findings reveal major shortcomings, with even the best model achieving only 22% joint performance (i.e., high semantic and physical commonsense adherence) on the hard subset of VideoPhy-2. We find that the models particularly struggle with conservation laws like mass and momentum. Finally, we also train VideoPhy-AutoEval, an automatic evaluator for fast, reliable assessment on our dataset. Overall, VideoPhy-2 serves as a rigorous benchmark, exposing critical gaps in video generative models and guiding future research in physically-grounded video generation. The data and code are available at this https URL.

Title: Multimodal Emotion Recognition and Sentiment Analysis in Multi-Party Conversation Contexts

Authors: Aref Farhadipour, Hossein Ranjbar, Masoumeh Chapariniya, Teodora Vukovic, Sarah Ebling, Volker Dellwo
Subjects: cs.CV, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2503.06805
Pdf URL: https://arxiv.org/pdf/2503.06805
Copy Paste: [[2503.06805]] Multimodal Emotion Recognition and Sentiment Analysis in Multi-Party Conversation Contexts(https://arxiv.org/abs/2503.06805)
Keywords: transformer
Abstract: Emotion recognition and sentiment analysis are pivotal tasks in speech and language processing, particularly in real-world scenarios involving multi-party, conversational data. This paper presents a multimodal approach to tackle these challenges on a well-known dataset. We propose a system that integrates four key modalities/channels using pre-trained models: RoBERTa for text, Wav2Vec2 for speech, a proposed FacialNet for facial expressions, and a CNN+Transformer architecture trained from scratch for video analysis. Feature embeddings from each modality are concatenated to form a multimodal vector, which is then used to predict emotion and sentiment labels. The multimodal system demonstrates superior performance compared to unimodal approaches, achieving an accuracy of 66.36% for emotion recognition and 72.15% for sentiment analysis.
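Note: the fusion step described above (concatenating per-modality embeddings into one multimodal vector, then predicting a label) can be sketched as below. This is only an illustration under stated assumptions: the toy embedding sizes, the linear prediction head, and the names `fuse` and `predict` are hypothetical and do not come from the paper.

```python
def fuse(embeddings):
    """Concatenate per-modality embeddings into a single vector.

    Modalities are visited in sorted-name order so the layout of the
    fused vector is stable across calls.
    """
    fused = []
    for name in sorted(embeddings):
        fused.extend(embeddings[name])
    return fused


def predict(fused, weights, biases):
    """Hypothetical linear head: return the index of the best-scoring label."""
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy embeddings for the four modalities (real ones would be much larger).
sample = {"text": [1.0, 0.0], "speech": [0.5], "face": [0.0], "video": [2.0]}
vector = fuse(sample)  # order: face, speech, text, video
label = predict(vector, [[0, 0, 1, 0, 0], [0, 0, 0, 0, 1]], [0.0, 0.0])
```

Concatenation keeps every modality's features intact and leaves it to the prediction head to weigh them, which is the simplest form of feature-level fusion.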

Title: Privacy Auditing of Large Language Models

Authors: Ashwinee Panda, Xinyu Tang, Milad Nasr, Christopher A. Choquette-Choo, Prateek Mittal
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06808
Pdf URL: https://arxiv.org/pdf/2503.06808
Copy Paste: [[2503.06808]] Privacy Auditing of Large Language Models(https://arxiv.org/abs/2503.06808)
Keywords: privacy, attack, membership infer, large language model
Abstract: Current techniques for privacy auditing of large language models (LLMs) have limited efficacy: they rely on basic approaches to generate canaries, which leads to weak membership inference attacks that in turn give loose lower bounds on the empirical privacy leakage. We develop canaries that are far more effective than those used in prior work under threat models that cover a range of realistic settings. We demonstrate through extensive experiments on multiple families of fine-tuned LLMs that our approach sets a new standard for detection of privacy leakage. For measuring the memorization rate of non-privately trained LLMs, our designed canaries surpass prior approaches. For example, on the Qwen2.5-0.5B model, our designed canaries achieve 49.6% TPR at 1% FPR, vastly surpassing the prior approach's 4.2% TPR at 1% FPR. Our method can be used to provide a privacy audit of ε ≈ 1 for a model trained with theoretical ε of 4. To the best of our knowledge, this is the first time that a privacy audit of LLM training has achieved nontrivial auditing success in the setting where the attacker cannot train shadow models, insert gradient canaries, or access the model at every iteration.
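Note: the "TPR at 1% FPR" figure quoted above is a standard way to score a membership inference attack. A minimal sketch of how such a number can be computed from attack scores follows; the function name `tpr_at_fpr`, the score convention (higher means "more likely a member"), and the toy scores are assumptions for illustration, not the paper's evaluation code.

```python
def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.01):
    """Best true-positive rate among thresholds whose false-positive
    rate does not exceed target_fpr.

    A sample is predicted "member" when its score >= threshold.
    """
    best_tpr = 0.0
    for threshold in sorted(set(member_scores) | set(nonmember_scores)):
        false_pos = sum(s >= threshold for s in nonmember_scores)
        fpr = false_pos / len(nonmember_scores)
        if fpr <= target_fpr:
            true_pos = sum(s >= threshold for s in member_scores)
            best_tpr = max(best_tpr, true_pos / len(member_scores))
    return best_tpr


# Toy attack scores for canaries that were (or were not) in training data.
members = [0.9, 0.8, 0.7, 0.2]
nonmembers = [0.1, 0.3, 0.4, 0.75]
strict = tpr_at_fpr(members, nonmembers, target_fpr=0.0)
relaxed = tpr_at_fpr(members, nonmembers, target_fpr=0.25)
```

Reporting TPR at a low fixed FPR emphasizes confident detections, which matters for auditing because a few false accusations would undermine the leakage estimate.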

Title: Mitigating Preference Hacking in Policy Optimization with Pessimism

Authors: Dhawal Gupta, Adam Fisch, Christoph Dann, Alekh Agarwal
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06810
Pdf URL: https://arxiv.org/pdf/2503.06810
Copy Paste: [[2503.06810]] Mitigating Preference Hacking in Policy Optimization with Pessimism(https://arxiv.org/abs/2503.06810)
Keywords: robust
Abstract: This work tackles the problem of overoptimization in reinforcement learning from human feedback (RLHF), a prevalent technique for aligning models with human preferences. RLHF relies on reward or preference models trained on fixed preference datasets, and these models are unreliable when evaluated outside the support of this preference data, leading to the common reward or preference hacking phenomenon. We propose novel, pessimistic objectives for RLHF which are provably robust to overoptimization through the use of pessimism in the face of uncertainty, and design practical algorithms, P3O and PRPO, to optimize these objectives. Our approach is derived for the general preference optimization setting, but can be used with reward models as well. We evaluate P3O and PRPO on the tasks of fine-tuning language models for document summarization and creating helpful assistants, demonstrating remarkable resilience to overoptimization.

Title: Towards Fine-Grained Video Question Answering

Authors: Wei Dai, Alan Luo, Zane Durante, Debadutta Dash, Arnold Milstein, Kevin Schulman, Ehsan Adeli, Li Fei-Fei
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06820
Pdf URL: https://arxiv.org/pdf/2503.06820
Copy Paste: [[2503.06820]] Towards Fine-Grained Video Question Answering(https://arxiv.org/abs/2503.06820)
Keywords: large language model
Abstract: In the rapidly evolving domain of video understanding, Video Question Answering (VideoQA) remains a focal point. However, existing datasets exhibit gaps in temporal and spatial granularity, which consequently limits the capabilities of existing VideoQA methods. This paper introduces the Multi-Object Multi-Actor Question Answering (MOMA-QA) dataset, which is designed to address these shortcomings by emphasizing temporal localization, spatial relationship reasoning, and entity-centric queries. With ground truth scene graphs and temporal interval annotations, MOMA-QA is ideal for developing models for fine-grained video understanding. Furthermore, we present a novel video-language model, SGVLM, which incorporates a scene graph predictor, an efficient frame retriever, and a pre-trained large language model for temporal localization and fine-grained relationship understanding. Evaluations on MOMA-QA and other public datasets demonstrate the superior performance of our model, setting new benchmarks for VideoQA.

Title: HierDAMap: Towards Universal Domain Adaptive BEV Mapping via Hierarchical Perspective Priors

Authors: Siyu Li, Yihong Cao, Hao Shi, Yongsheng Zang, Xuan He, Kailun Yang, Zhiyong Li
Subjects: cs.CV, cs.RO, eess.IV
Abstract URL: https://arxiv.org/abs/2503.06821
Pdf URL: https://arxiv.org/pdf/2503.06821
Copy Paste: [[2503.06821]] HierDAMap: Towards Universal Domain Adaptive BEV Mapping via Hierarchical Perspective Priors(https://arxiv.org/abs/2503.06821)
Keywords: segmentation
Abstract: The exploration of Bird's-Eye View (BEV) mapping technology has driven significant innovation in visual perception technology for autonomous driving. BEV mapping models need to be applied to the unlabeled real world, making the study of unsupervised domain adaptation models an essential path. However, research on unsupervised domain adaptation for BEV mapping remains limited and cannot perfectly accommodate all BEV mapping tasks. To address this gap, this paper proposes HierDAMap, a universal and holistic BEV domain adaptation framework with hierarchical perspective priors. Unlike existing research that solely focuses on image-level learning using prior knowledge, this paper explores the guiding role of perspective prior knowledge across three distinct levels: global, sparse, and instance levels. With these priors, HierDA consists of three essential components, including Semantic-Guided Pseudo Supervision (SGPS), Dynamic-Aware Coherence Learning (DACL), and Cross-Domain Frustum Mixing (CDFM). SGPS constrains the cross-domain consistency of perspective feature distribution through pseudo labels generated by vision foundation models in 2D space. To mitigate feature distribution discrepancies caused by spatial variations, DACL employs uncertainty-aware predicted depth as an intermediary to derive dynamic BEV labels from perspective pseudo-labels, thereby constraining the coarse BEV features derived from corresponding perspective features. CDFM, on the other hand, leverages perspective masks of view frustum to mix multi-view perspective images from both domains, which guides cross-domain view transformation and encoding learning through mixed BEV labels. The proposed method is verified on multiple BEV mapping tasks, such as BEV semantic segmentation, high-definition semantic, and vectorized mapping. The source code will be made publicly available at this https URL.

Title: eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference

Authors: Suraiya Tairin, Shohaib Mahmud, Haiying Shen, Anand Iyer
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2503.06823
Pdf URL: https://arxiv.org/pdf/2503.06823
Copy Paste: [[2503.06823]] eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference(https://arxiv.org/abs/2503.06823)
Keywords: large language model
Abstract: In recent years, Mixture-of-Experts (MoE) has emerged as an effective approach for enhancing the capacity of deep neural networks (DNNs) with sub-linear computational costs. However, storing all experts on GPUs incurs significant memory overhead, increasing the monetary cost of MoE-based inference. To address this, we propose eMoE, a memory-efficient inference system for MoE-based large language models (LLMs) that leverages observations from our experimental measurements. eMoE reduces memory usage by predicting and loading only the required experts based on recurrent patterns in expert routing. Because we found that using the same experts for subsequent prompts has minimal impact on perplexity, eMoE invokes the expert predictor every few prompts rather than for each prompt, which reduces loading latency while maintaining accuracy. In addition, it skips predictions for tasks less sensitive to routing accuracy. Finally, it has task-aware scheduling to minimize inference latency by considering Service Level Objectives (SLOs), task-specific output lengths, and expert loading latencies. Experimental results show that compared to existing systems, eMoE reduces memory consumption by up to 80% while maintaining accuracy and reduces inference latency by up to 17%. It also enables processing prompts 40x longer, batches 4.5x larger, and achieves 1.5x higher throughput.

Title: GUIDE-CoT: Goal-driven and User-Informed Dynamic Estimation for Pedestrian Trajectory using Chain-of-Thought

Authors: Sungsik Kim, Janghyun Baek, Jinkyu Kim, Jaekoo Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06832
Pdf URL: https://arxiv.org/pdf/2503.06832
Copy Paste: [[2503.06832]] GUIDE-CoT: Goal-driven and User-Informed Dynamic Estimation for Pedestrian Trajectory using Chain-of-Thought(https://arxiv.org/abs/2503.06832)
Keywords: large language model
Abstract: While Large Language Models (LLMs) have recently shown impressive results in reasoning tasks, their application to pedestrian trajectory prediction remains challenging due to two key limitations: insufficient use of visual information and the difficulty of predicting entire trajectories. To address these challenges, we propose Goal-driven and User-Informed Dynamic Estimation for pedestrian trajectory using Chain-of-Thought (GUIDE-CoT). Our approach integrates two innovative modules: (1) a goal-oriented visual prompt, which enhances goal prediction accuracy by combining visual prompts with a pretrained visual encoder, and (2) a chain-of-thought (CoT) LLM for trajectory generation, which generates realistic trajectories toward the predicted goal. Moreover, our method introduces controllable trajectory generation, allowing for flexible and user-guided modifications to the predicted paths. Through extensive experiments on the ETH/UCY benchmark datasets, our method achieves state-of-the-art performance, delivering both high accuracy and greater adaptability in pedestrian trajectory prediction. Our code is publicly available at this https URL.

Title: AttFC: Attention Fully-Connected Layer for Large-Scale Face Recognition with One GPU

Authors: Zhuowen Zheng, Yain-Whar Si, Xiaochen Yuan, Junwei Duan, Ke Wang, Xiaofan Li, Xinyuan Zhang, Xueyuan Gong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06839
Pdf URL: https://arxiv.org/pdf/2503.06839
Copy Paste: [[2503.06839]] AttFC: Attention Fully-Connected Layer for Large-Scale Face Recognition with One GPU(https://arxiv.org/abs/2503.06839)
Keywords: generative
Abstract: Nowadays, with the advancement of deep neural networks (DNNs) and the availability of large-scale datasets, face recognition (FR) models have achieved exceptional performance. However, the parameter count of the fully connected (FC) layer depends directly on the number of identities in the dataset. When the FR model is trained on large-scale datasets, the model therefore becomes excessively large, leading to substantial demand for computational resources such as time and memory. This paper proposes the attention fully connected (AttFC) layer, which can significantly reduce these computational resources. AttFC employs an attention loader to generate the generative class center (GCC) and dynamically stores the class centers in a Dynamic Class Container (DCC). The DCC stores only a small subset of all the class centers in FC, so its parameter count is substantially smaller than that of the FC layer. Also, training face recognition models on large-scale datasets with one GPU often encounters out-of-memory (OOM) issues. AttFC overcomes this and achieves comparable performance to state-of-the-art methods.

Title: MADS: Multi-Attribute Document Supervision for Zero-Shot Image Classification

Authors: Xiangyan Qu, Jing Yu, Jiamin Zhuang, Gaopeng Gou, Gang Xiong, Qi Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06847
Pdf URL: https://arxiv.org/pdf/2503.06847
Copy Paste: [[2503.06847]] MADS: Multi-Attribute Document Supervision for Zero-Shot Image Classification(https://arxiv.org/abs/2503.06847)
Keywords: large language model
Abstract: Zero-shot learning (ZSL) aims to train a model on seen classes and recognize unseen classes by knowledge transfer through shared auxiliary information. Recent studies reveal that documents from encyclopedias provide helpful auxiliary information. However, existing methods align noisy documents, entangled in visual and non-visual descriptions, with image regions, yet solely depend on implicit learning. These models fail to filter non-visual noise reliably and incorrectly align non-visual words to image regions, which is harmful to knowledge transfer. In this work, we propose a novel multi-attribute document supervision framework to remove noises at both document collection and model learning stages. With the help of large language models, we introduce a novel prompt algorithm that automatically removes non-visual descriptions and enriches less-described documents in multiple attribute views. Our proposed model, MADS, extracts multi-view transferable knowledge with information decoupling and semantic interactions for semantic alignment at local and global levels. Besides, we introduce a model-agnostic focus loss to explicitly enhance attention to visually discriminative information during training, also improving existing methods without additional parameters. With comparable computation costs, MADS consistently outperforms the SOTA by 7.2% and 8.2% on average in three benchmarks for document-based ZSL and GZSL settings, respectively. Moreover, we qualitatively offer interpretable predictions from multiple attribute views.

Title: From Image- to Pixel-level: Label-efficient Hyperspectral Image Reconstruction

Authors: Yihong Leng, Jiaojiao Li, Haitao Xu, Rui Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06852
Pdf URL: https://arxiv.org/pdf/2503.06852
Copy Paste: [[2503.06852]] From Image- to Pixel-level: Label-efficient Hyperspectral Image Reconstruction(https://arxiv.org/abs/2503.06852)
Keywords: extraction
Abstract: Current hyperspectral image (HSI) reconstruction methods primarily rely on image-level approaches, which are time-consuming to form abundant high-quality HSIs through imagers. In contrast, spectrometers offer a more efficient alternative by capturing high-fidelity point spectra, enabling pixel-level HSI reconstruction that balances accuracy and label efficiency. To this end, we introduce a pixel-level spectral super-resolution (Pixel-SSR) paradigm that reconstructs HSI from RGB and point spectra. Despite its advantages, Pixel-SSR presents two key challenges: 1) generalizability to novel scenes lacking point spectra, and 2) effective information extraction to promote reconstruction accuracy. To address the first challenge, a Gamma-modeled strategy is investigated to synthesize point spectra based on their intrinsic properties, including nonnegativity, a skewed distribution, and a positive correlation. Furthermore, complementary three-branch prompts from RGB and point spectra are extracted with a Dynamic Prompt Mamba (DyPro-Mamba), which progressively directs the reconstruction with global spatial distributions, edge details, and spectral dependency. Comprehensive evaluations, including horizontal comparisons with leading methods and vertical assessments across unsupervised and image-level supervised paradigms, demonstrate that ours achieves competitive reconstruction accuracy with efficient label consumption.

Title: Towards Generalization of Tactile Image Generation: Reference-Free Evaluation in a Leakage-Free Setting

Authors: Cagri Gungor, Derek Eppinger, Adriana Kovashka
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06860
Pdf URL: https://arxiv.org/pdf/2503.06860
Copy Paste: [[2503.06860]] Towards Generalization of Tactile Image Generation: Reference-Free Evaluation in a Leakage-Free Setting(https://arxiv.org/abs/2503.06860)
Keywords: robust
Abstract: Tactile sensing, which relies on direct physical contact, is critical for human perception and underpins applications in computer vision, robotics, and multimodal learning. Because tactile data is often scarce and costly to acquire, generating synthetic tactile images provides a scalable solution to augment real-world measurements. However, ensuring robust generalization in synthesizing tactile images (capturing subtle, material-specific contact features) remains challenging. We demonstrate that overlapping training and test samples in commonly used datasets inflate performance metrics, obscuring the true generalizability of tactile models. To address this, we propose a leakage-free evaluation protocol coupled with novel, reference-free metrics (TMMD, I-TMMD, CI-TMMD, and D-TMMD) tailored for tactile generation. Moreover, we propose a vision-to-touch generation method that leverages text as an intermediate modality by incorporating concise, material-specific descriptions during training to better capture essential tactile features. Experiments on two popular visuo-tactile datasets, Touch and Go and HCT, show that our approach achieves superior performance and enhanced generalization in a leakage-free setting.

Title: Enhanced Multi-Tuple Extraction for Alloys: Integrating Pointer Networks and Augmented Attention

Authors: Mengzhe Hei, Zhouran Zhang, Qingbao Liu, Yan Pan, Xiang Zhao, Yongqian Peng, Yicong Ye, Xin Zhang, Shuxin Bai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06861
Pdf URL: https://arxiv.org/pdf/2503.06861
Copy Paste: [[2503.06861]] Enhanced Multi-Tuple Extraction for Alloys: Integrating Pointer Networks and Augmented Attention(https://arxiv.org/abs/2503.06861)
Keywords: robust, extraction, large language model
Abstract: Extracting high-quality structured information from scientific literature is crucial for advancing material design through data-driven methods. Despite the considerable research in natural language processing for dataset extraction, effective approaches for multi-tuple extraction in scientific literature remain scarce due to the complex interrelations of tuples and contextual ambiguities. In this study, we illustrate the multi-tuple extraction of mechanical properties from multi-principal-element alloys and present a novel framework that combines an entity extraction model based on MatSciBERT with pointer networks and an allocation model utilizing inter- and intra-entity attention. Our rigorous experiments on tuple extraction demonstrate impressive F1 scores of 0.963, 0.947, 0.848, and 0.753 across datasets with 1, 2, 3, and 4 tuples, respectively, confirming the effectiveness of the model. Furthermore, an F1 score of 0.854 was achieved on a randomly curated dataset. These results highlight the model's capacity to deliver precise and structured information, offering a robust alternative to large language models and equipping researchers with essential data for fostering data-driven innovations.

Title: ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration

Authors: Mengting Ai, Tianxin Wei, Yifan Chen, Zhichen Zeng, Ritchie Zhao, Girish Varatkar, Bita Darvish Rouhani, Xianfeng Tang, Hanghang Tong, Jingrui He
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06881
Pdf URL: https://arxiv.org/pdf/2503.06881
Copy Paste: [[2503.06881]] ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration(https://arxiv.org/abs/2503.06881)
Keywords: transformer, large language model
Abstract: Mixture-of-Experts (MoE) Transformer, the backbone architecture of multiple phenomenal language models, leverages sparsity by activating only a fraction of model parameters for each input token. The sparse structure, while allowing constant time costs, results in space inefficiency: we still need to load all the model parameters during inference. We introduce ResMoE, an innovative MoE approximation framework that utilizes Wasserstein barycenter to extract a common expert (barycenter expert) and approximate the residuals between this barycenter expert and the original ones. ResMoE enhances the space efficiency for inference of large-scale MoE Transformers in a one-shot and data-agnostic manner without retraining while maintaining minimal accuracy loss, thereby paving the way for broader accessibility to large language models. We demonstrate the effectiveness of ResMoE through extensive experiments on Switch Transformer, Mixtral, and DeepSeekMoE models. The results show that ResMoE can reduce the number of parameters in an expert by up to 75% while maintaining comparable performance. The code is available at this https URL.

Title: Text-to-Image Diffusion Models Cannot Count, and Prompt Refinement Cannot Help

Authors: Yuefan Cao, Xuyang Guo, Jiayan Huo, Yingyu Liang, Zhenmei Shi, Zhao Song, Jiahao Zhang, Zhen Zhuang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06884
Pdf URL: https://arxiv.org/pdf/2503.06884
Copy Paste: [[2503.06884]] Text-to-Image Diffusion Models Cannot Count, and Prompt Refinement Cannot Help(https://arxiv.org/abs/2503.06884)
Keywords: diffusion, generative
Abstract: Generative modeling is widely regarded as one of the most essential problems in today's AI community, with text-to-image generation having gained unprecedented real-world impacts. Among various approaches, diffusion models have achieved remarkable success and have become the de facto solution for text-to-image generation. However, despite their impressive performance, these models exhibit fundamental limitations in adhering to numerical constraints in user instructions, frequently generating images with an incorrect number of objects. While several prior works have mentioned this issue, a comprehensive and rigorous evaluation of this limitation remains lacking. To address this gap, we introduce T2ICountBench, a novel benchmark designed to rigorously evaluate the counting ability of state-of-the-art text-to-image diffusion models. Our benchmark encompasses a diverse set of generative models, including both open-source and private systems. It explicitly isolates counting performance from other capabilities, provides structured difficulty levels, and incorporates human evaluations to ensure high reliability. Extensive evaluations with T2ICountBench reveal that all state-of-the-art diffusion models fail to generate the correct number of objects, with accuracy dropping significantly as the number of objects increases. Additionally, an exploratory study on prompt refinement demonstrates that such simple interventions generally do not improve counting accuracy. Our findings highlight the inherent challenges in numerical understanding within diffusion models and point to promising directions for future improvements.

Title: ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks

Authors: Yan Yang, Dongxu Li, Haoning Wu, Bei Chen, Liu Liu, Liyuan Pan, Junnan Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06885
Pdf URL: https://arxiv.org/pdf/2503.06885
Copy Paste: [[2503.06885]] ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks(https://arxiv.org/abs/2503.06885)
Keywords: large language model
Abstract: Solving expert-level multimodal tasks is a key milestone towards general intelligence. As the capabilities of multimodal large language models (MLLMs) continue to improve, evaluation of such advanced multimodal intelligence becomes necessary yet challenging. In this work, we introduce ProBench, a benchmark of open-ended user queries that require professional expertise and advanced reasoning. ProBench consists of 4,000 high-quality samples independently submitted by professionals based on their daily productivity demands. It spans across 10 fields and 56 sub-fields, including science, arts, humanities, coding, mathematics, and creative writing. Experimentally, we evaluate and compare 24 latest models using MLLM-as-a-Judge. Our results reveal that although the best open-source models rival the proprietary ones, ProBench presents significant challenges in visual perception, textual understanding, domain knowledge and advanced reasoning, thus providing valuable directions for future multimodal AI research efforts.

Title: Improving cognitive diagnostics in pathology: a deep learning approach for augmenting perceptional understanding of histopathology images

Authors: Xiaoqian Hu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06894
Pdf URL: https://arxiv.org/pdf/2503.06894
Copy Paste: [[2503.06894]] Improving cognitive diagnostics in pathology: a deep learning approach for augmenting perceptional understanding of histopathology images(https://arxiv.org/abs/2503.06894)
Keywords: transformer, segmentation
Abstract: In recent years, digital technologies have made significant strides in augmenting human health, cognition, and perception, particularly within the field of computational pathology. This paper presents a novel approach to enhancing the analysis of histopathology images by leveraging a multimodal model that combines Vision Transformers (ViT) with GPT-2 for image captioning. The model is fine-tuned on the specialized ARCH dataset, which includes dense image captions derived from clinical and academic resources, to capture the complexities of pathology images such as tissue morphologies, staining variations, and pathological conditions. By generating accurate, contextually relevant captions, the model augments the cognitive capabilities of healthcare professionals, enabling more efficient disease classification, segmentation, and detection. The model enhances the perception of subtle pathological features in images that might otherwise go unnoticed, thereby improving diagnostic accuracy. Our approach demonstrates the potential for digital technologies to augment human cognitive abilities in medical image analysis, providing steps toward more personalized and accurate healthcare outcomes.

Title: CATANet: Efficient Content-Aware Token Aggregation for Lightweight Image Super-Resolution

Authors: Xin Liu, Jie Liu, Jie Tang, Gangshan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06896
Pdf URL: https://arxiv.org/pdf/2503.06896
Copy Paste: [[2503.06896]] CATANet: Efficient Content-Aware Token Aggregation for Lightweight Image Super-Resolution(https://arxiv.org/abs/2503.06896)
Keywords: transformer
Abstract: Transformer-based methods have demonstrated impressive performance in low-level visual tasks such as Image Super-Resolution (SR). However, their computational complexity grows quadratically with the spatial resolution. A series of works attempt to alleviate this problem by dividing Low-Resolution images into local windows, axial stripes, or dilated windows. SR typically leverages the redundancy of images for reconstruction, and this redundancy appears not only in local regions but also in long-range regions. However, these methods limit attention computation to content-agnostic local regions, directly limiting the ability of attention to capture long-range dependency. To address these issues, we propose a lightweight Content-Aware Token Aggregation Network (CATANet). Specifically, we propose an efficient Content-Aware Token Aggregation module for aggregating long-range content-similar tokens, which shares token centers across all image tokens and updates them only during the training phase. Then we utilize intra-group self-attention to enable long-range information interaction. Moreover, we design an inter-group cross-attention to further enhance global information interaction. The experimental results show that, compared with the state-of-the-art cluster-based method SPIN, our method achieves superior performance, with a maximum PSNR improvement of 0.33dB and nearly double the inference speed.

Title: HiSTF Mamba: Hierarchical Spatiotemporal Fusion with Multi-Granular Body-Spatial Modeling for High-Fidelity Text-to-Motion Generation

Authors: Xingzu Zhan, Chen Xie, Haoran Sun, Xiaochun Mai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06897
Pdf URL: https://arxiv.org/pdf/2503.06897
Copy Paste: [[2503.06897]] HiSTF Mamba: Hierarchical Spatiotemporal Fusion with Multi-Granular Body-Spatial Modeling for High-Fidelity Text-to-Motion Generation(https://arxiv.org/abs/2503.06897)
Keywords: extraction
Abstract: Text-to-motion generation is a rapidly growing field at the nexus of multimodal learning and computer graphics, promising flexible and cost-effective applications in gaming, animation, robotics, and virtual reality. Existing approaches often rely on simple spatiotemporal stacking, which introduces feature redundancy, while subtle joint-level details remain overlooked from a spatial perspective. To this end, we propose a novel HiSTF Mamba framework. The framework is composed of three key modules: Dual-Spatial Mamba, Bi-Temporal Mamba, and Dynamic Spatiotemporal Fusion Module (DSFM). Dual-Spatial Mamba incorporates "Part-based + Whole-based" parallel modeling to represent both whole-body coordination and fine-grained joint dynamics. Bi-Temporal Mamba adopts a bidirectional scanning strategy, effectively encoding short-term motion details and long-term dependencies. DSFM further performs redundancy removal and extraction of complementary information for temporal features, then fuses them with spatial features, yielding an expressive spatio-temporal representation. Experimental results on the HumanML3D dataset demonstrate that HiSTF Mamba achieves state-of-the-art performance across multiple metrics. In particular, it reduces the FID score from 0.283 to 0.189, a relative decrease of nearly 30%. These findings validate the effectiveness of HiSTF Mamba in achieving high fidelity and strong semantic alignment in text-to-motion generation.

Title: Illuminating Darkness: Enhancing Real-world Low-light Scenes with Smartphone Images

Authors: S M A Sharif, Abdur Rehman, Zain Ul Abidin, Rizwan Ali Naqvi, Fayaz Ali Dharejo, Radu Timofte
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06898
Pdf URL: https://arxiv.org/pdf/2503.06898
Copy Paste: [[2503.06898]] Illuminating Darkness: Enhancing Real-world Low-light Scenes with Smartphone Images(https://arxiv.org/abs/2503.06898)
Keywords: transformer
Abstract: Digital cameras often struggle to produce plausible images in low-light conditions. Improving these single-shot images remains challenging due to a lack of diverse real-world paired data samples. To address this limitation, we propose a large-scale high-resolution (i.e., beyond 4k) paired Single-Shot Low-Light Enhancement (SLLIE) dataset. Our dataset comprises 6,425 unique focus-aligned image pairs captured with smartphone sensors in dynamic settings under challenging lighting conditions (0.1--200 lux), covering various indoor and outdoor scenes with varying noise and intensity. We extracted and refined around 180,000 non-overlapping patches from 6,025 collected scenes for training while reserving 400 pairs for benchmarking. In addition, we collected 2,117 low-light scenes from different sources for extensive real-world aesthetic evaluation. To our knowledge, this is the largest real-world dataset available for SLLIE research. We also propose learning luminance-chrominance (LC) attributes separately through a tuning fork-shaped transformer model to enhance real-world low-light images, addressing challenges like denoising and over-enhancement in complex scenes. We also propose an LC cross-attention block for feature fusion, an LC refinement block for enhanced reconstruction, and LC-guided supervision to ensure perceptually coherent enhancements. We demonstrated our method's effectiveness across various hardware and scenarios, proving its practicality in real-world applications. Code and dataset available at this https URL.

Title: DirectTriGS: Triplane-based Gaussian Splatting Field Representation for 3D Generation

Authors: Xiaoliang Ju, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06900
Pdf URL: https://arxiv.org/pdf/2503.06900
Copy Paste: [[2503.06900]] DirectTriGS: Triplane-based Gaussian Splatting Field Representation for 3D Generation(https://arxiv.org/abs/2503.06900)
Keywords: diffusion, generative
Abstract: We present DirectTriGS, a novel framework designed for 3D object generation with Gaussian Splatting (GS). GS-based rendering for 3D content has gained considerable attention recently. However, there has been limited exploration in directly generating 3D Gaussians compared to traditional generative modeling approaches. The main challenge lies in the complex data structure of GS represented by discrete point clouds with multiple channels. To overcome this challenge, we propose employing the triplane representation, which allows us to represent Gaussian Splatting as an image-like continuous field. This representation effectively encodes both the geometry and texture information, enabling smooth transformation back to Gaussian point clouds and rendering into images by a TriRenderer, with only 2D supervisions. The proposed TriRenderer is fully differentiable, so that the rendering loss can supervise both texture and geometry encoding. Furthermore, the triplane representation can be compressed using a Variational Autoencoder (VAE), which can subsequently be utilized in latent diffusion to generate 3D objects. The experiments demonstrate that the proposed generation framework can produce high-quality 3D object geometry and rendering results in the text-to-3D task.

Title: When Lighting Deceives: Exposing Vision-Language Models' Illumination Vulnerability Through Illumination Transformation Attack

Authors: Hanqing Liu, Shouwei Ruan, Yao Huang, Shiji Zhao, Xingxing Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06903
Pdf URL: https://arxiv.org/pdf/2503.06903
Copy Paste: [[2503.06903]] When Lighting Deceives: Exposing Vision-Language Models' Illumination Vulnerability Through Illumination Transformation Attack(https://arxiv.org/abs/2503.06903)
Keywords: attack, robust
Abstract: Vision-Language Models (VLMs) have achieved remarkable success in various tasks, yet their robustness to real-world illumination variations remains largely unexplored. To bridge this gap, we propose Illumination Transformation Attack (ITA), the first framework to systematically assess VLMs' robustness against illumination changes. However, there still exist two key challenges: (1) how to model global illumination with fine-grained control to achieve diverse lighting conditions and (2) how to ensure adversarial effectiveness while maintaining naturalness. To address the first challenge, we decompose global illumination into multiple parameterized point light sources based on the illumination rendering equation. This design enables us to model more diverse lighting variations that previous methods could not capture. Then, by integrating these parameterized lighting variations with physics-based lighting reconstruction techniques, we can precisely render such light interactions in the original scenes, meeting the goal of fine-grained lighting control. For the second challenge, by controlling illumination through the lighting reconstruction model's latent space rather than direct pixel manipulation, we inherently preserve physical lighting priors. Furthermore, to prevent potential reconstruction artifacts, we design additional perceptual constraints for maintaining visual consistency with original images and diversity constraints for avoiding light source convergence. Extensive experiments demonstrate that our ITA can significantly reduce the performance of advanced VLMs, e.g., LLaVA-1.6, while possessing competitive naturalness, exposing VLMs' critical illumination vulnerabilities.

Title: You Are Your Own Best Teacher: Achieving Centralized-level Performance in Federated Learning under Heterogeneous and Long-tailed Data

Authors: Shanshan Yan, Zexi Li, Chao Wu, Meng Pang, Yang Lu, Yan Yan, Hanzi Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06916
Pdf URL: https://arxiv.org/pdf/2503.06916
Copy Paste: [[2503.06916]] You Are Your Own Best Teacher: Achieving Centralized-level Performance in Federated Learning under Heterogeneous and Long-tailed Data(https://arxiv.org/abs/2503.06916)
Keywords: federate
Abstract: Data heterogeneity, stemming from local non-IID data and global long-tailed distributions, is a major challenge in federated learning (FL), leading to significant performance gaps compared to centralized learning. Previous research found that poor representations and biased classifiers are the main problems and proposed neural-collapse-inspired synthetic simplex ETF to help representations be closer to neural collapse optima. However, we find that the neural-collapse-inspired methods are not strong enough to reach neural collapse and still have huge gaps to centralized training. In this paper, we rethink this issue from a self-bootstrap perspective and propose FedYoYo (You Are Your Own Best Teacher), introducing Augmented Self-bootstrap Distillation (ASD) to improve representation learning by distilling knowledge between weakly and strongly augmented local samples, without needing extra datasets or models. We further introduce Distribution-aware Logit Adjustment (DLA) to balance the self-bootstrap process and correct biased feature representations. FedYoYo nearly eliminates the performance gap, achieving centralized-level performance even under mixed heterogeneity. It enhances local representation learning, reducing model drift and improving convergence, with feature prototypes closer to neural collapse optimality. Extensive experiments show FedYoYo achieves state-of-the-art results, even surpassing centralized logit adjustment methods by 5.4% under global long-tailed settings.

Title: Combinatorial Optimization via LLM-driven Iterated Fine-tuning

Authors: Pranjal Awasthi, Sreenivas Gollapudi, Ravi Kumar, Kamesh Munagala
Subjects: cs.LG, cs.DS, stat.ML
Abstract URL: https://arxiv.org/abs/2503.06917
Pdf URL: https://arxiv.org/pdf/2503.06917
Copy Paste: [[2503.06917]] Combinatorial Optimization via LLM-driven Iterated Fine-tuning(https://arxiv.org/abs/2503.06917)
Keywords: large language model
Abstract: We present a novel way to integrate flexible, context-dependent constraints into combinatorial optimization by leveraging Large Language Models (LLMs) alongside traditional algorithms. Although LLMs excel at interpreting nuanced, locally specified requirements, they struggle with enforcing global combinatorial feasibility. To bridge this gap, we propose an iterated fine-tuning framework where algorithmic feedback progressively refines the LLM's output distribution. Interpreting this as simulated annealing, we introduce a formal model based on a "coarse learnability" assumption, providing sample complexity bounds for convergence. Empirical evaluations on scheduling, graph connectivity, and clustering tasks demonstrate that our framework balances the flexibility of locally expressed constraints with rigorous global optimization more effectively than baseline sampling methods. Our results highlight a promising direction for hybrid AI-driven combinatorial reasoning.

Title: From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers

Authors: Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Junjie Chen, Linfeng Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06923
Pdf URL: https://arxiv.org/pdf/2503.06923
Copy Paste: [[2503.06923]] From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers(https://arxiv.org/abs/2503.06923)
Keywords: diffusion, transformer
Abstract: Diffusion Transformers (DiT) have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. To solve this problem, feature caching has been proposed to accelerate diffusion models by caching the features in the previous timesteps and then reusing them in the following timesteps. However, at timesteps with significant intervals, the feature similarity in diffusion models decreases substantially, leading to a pronounced increase in errors introduced by feature caching, significantly harming the generation quality. To solve this problem, we propose TaylorSeer, which first shows that features of diffusion models at future timesteps can be predicted based on their values at previous timesteps. Based on the fact that features change slowly and continuously across timesteps, TaylorSeer employs a differential method to approximate the higher-order derivatives of features and predict features in future timesteps with Taylor series expansion. Extensive experiments demonstrate its significant effectiveness in both image and video synthesis, especially in high acceleration ratios. For instance, it achieves an almost lossless acceleration of 4.99× on FLUX and 5.00× on HunyuanVideo without additional training. On DiT, it achieves 3.41 lower FID compared with the previous SOTA at 4.53× acceleration. Our code has been released on GitHub: this https URL
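
The forecasting idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `finite_differences` and `taylor_forecast`, the use of scalar features, and the equal spacing between cached timesteps are all assumptions made for illustration. Backward finite differences of cached values stand in for the derivatives, and a truncated Taylor expansion predicts the feature `k` steps ahead.

```python
from math import factorial


def finite_differences(history):
    """Backward differences of order 0..n-1 at the most recent cached point.

    `history` lists cached feature values, most recent first, taken at
    equally spaced timesteps (an assumption of this sketch).
    """
    diffs = [history[0]]
    level = list(history)
    while len(level) > 1:
        # Each pass raises the difference order by one.
        level = [level[i] - level[i + 1] for i in range(len(level) - 1)]
        diffs.append(level[0])
    return diffs


def taylor_forecast(history, spacing, k):
    """Predict the feature k steps past the most recent cached value.

    The i-th derivative is approximated by (i-th difference) / spacing**i,
    so the Taylor term is diff_i * (k / spacing)**i / i!.
    """
    diffs = finite_differences(history)
    return sum(d * (k / spacing) ** i / factorial(i) for i, d in enumerate(diffs))


# A feature that changes linearly, f(t) = 2t + 1, cached at t = 10 and t = 8.
history = [21.0, 17.0]
prediction = taylor_forecast(history, spacing=2, k=1)  # estimate of f(11)
```

With a single cached value the sum collapses to its zeroth-order term, which is plain cache reuse; adding older cached values adds higher-order correction terms.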

Title: Complete Key Recovery of a DNA-based Encryption and Developing a Novel Stream Cipher for Color Image Encryption: Bio-SNOW

Authors: Yash Makwana, Anupama Panigrahi, Saibal K. Pal
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.06925
Pdf URL: https://arxiv.org/pdf/2503.06925
Copy Paste: [[2503.06925]] Complete Key Recovery of a DNA-based Encryption and Developing a Novel Stream Cipher for Color Image Encryption: Bio-SNOW(https://arxiv.org/abs/2503.06925)
Keywords: security, attack, robust
Abstract: Recent studies have explored DNA-based algorithms for IoT security and image encryption. In 2021, Al-Husainy et al. proposed an encryption algorithm based on DNA processes for Internet of Things (IoT) applications. Upon finding a low avalanche effect in our experiments, we first report a related-key attack and then propose a key recovery attack on the full cipher, which recovers the complete key from just two plaintext-ciphertext blocks with time complexity O(1). Upon discovering these serious weaknesses, we improve the security of the encryption algorithm against the attacks reported above by employing a bio-inspired S-box and adding some tweaks to the cipher. Much research has been done on developing image encryption algorithms based on DNA and chaotic maps. Inspired by the design of SNOW-3G, a well-known cipher used primarily in mobile communications, and considering DNA-based functions as building blocks, we propose a new DNA-based stream cipher, Bio-SNOW. We discuss its security against various kinds of attacks and find that it passes all NIST randomness tests. We also find that the speed of Bio-SNOW is around twice that of SNOW-3G. Moreover, by means of histogram and correlation analysis, we find that Bio-SNOW offers robust image encryption. These results highlight Bio-SNOW as a promising DNA-based cipher for lightweight and image cryptography applications.

Title: Effect of Selection Format on LLM Performance

Authors: Yuchen Han, Yucheng Wu, Jeffrey Willard
Subjects: cs.CL, cs.AI, cs.CE, cs.ET, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06926
Pdf URL: https://arxiv.org/pdf/2503.06926
Copy Paste: [[2503.06926]] Effect of Selection Format on LLM Performance(https://arxiv.org/abs/2503.06926)
Keywords: large language model
Abstract: This paper investigates a critical aspect of large language model (LLM) performance: the optimal formatting of classification task options in prompts. Through an extensive experimental study, we compared two selection formats -- bullet points and plain English -- to determine their impact on model performance. Our findings suggest that presenting options via bullet points generally yields better results, although there are some exceptions. Furthermore, our research highlights the need for continued exploration of option formatting to drive further improvements in model performance.

Title: FinTSBridge: A New Evaluation Suite for Real-world Financial Prediction with Advanced Time Series Models

Authors: Yanlong Wang, Jian Xu, Tiantian Gao, Hongkang Zhang, Shao-Lun Huang, Danny Dongning Sun, Xiao-Ping Zhang
Subjects: cs.LG, q-fin.TR
Abstract URL: https://arxiv.org/abs/2503.06928
Pdf URL: https://arxiv.org/pdf/2503.06928
Copy Paste: [[2503.06928]] FinTSBridge: A New Evaluation Suite for Real-world Financial Prediction with Advanced Time Series Models(https://arxiv.org/abs/2503.06928)
Keywords: robust
Abstract: Time series forecasting has attracted growing attention in recent years, and many studies have proposed solutions to the challenges encountered in time series prediction, aiming to improve forecasting performance. However, effectively applying these time series forecasting models to the field of financial asset pricing remains a challenging issue. There is still a need for a bridge to connect cutting-edge time series forecasting models with financial asset pricing. To bridge this gap, we have undertaken the following efforts: 1) We constructed three datasets from the financial domain; 2) We selected over ten time series forecasting models from recent studies and validated their performance in financial time series; 3) We developed new metrics, msIC and msIR, in addition to MSE and MAE, to showcase the time series correlation captured by the models; 4) We designed financial-specific tasks for these three datasets and assessed the practical performance and application potential of these forecasting models in important financial problems. We hope the new evaluation suite, FinTSBridge, can provide valuable insights into the effectiveness and robustness of advanced forecasting models in financial domains.

Title: Post-Training Quantization for Diffusion Transformer via Hierarchical Timestep Grouping

Authors: Ning Ding, Jing Han, Yuchuan Tian, Chao Xu, Kai Han, Yehui Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06930
Pdf URL: https://arxiv.org/pdf/2503.06930
Copy Paste: [[2503.06930]] Post-Training Quantization for Diffusion Transformer via Hierarchical Timestep Grouping(https://arxiv.org/abs/2503.06930)
Keywords: diffusion, transformer, large language model
Abstract: Diffusion Transformer (DiT) has now become the preferred choice for building image generation models due to its great generation capability. Unlike previous convolution-based UNet models, DiT is purely composed of a stack of transformer blocks, which gives DiT excellent scalability, like large language models. However, the growing model size and multi-step sampling paradigm bring about considerable pressure on deployment and inference. In this work, we propose a post-training quantization framework tailored for Diffusion Transformers to tackle these challenges. We first find that the quantization difficulty of DiT mainly originates from time-dependent, channel-specific outliers. We propose a timestep-aware shift-and-scale strategy to smooth the activation distribution and reduce the quantization error. Secondly, based on the observation that activations of adjacent timesteps have similar distributions, we utilize a hierarchical clustering scheme to divide the denoising timesteps into multiple groups. We further design a re-parameterization scheme which absorbs the quantization parameters into nearby modules to avoid redundant computations. Comprehensive experiments demonstrate that our PTQ method successfully quantizes the Diffusion Transformer into 8-bit weight and 8-bit activation (W8A8) with a state-of-the-art FID score. Our method can further quantize the DiT model into 4-bit weight and 8-bit activation (W4A8) without sacrificing generation quality.
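
As a rough illustration of the timestep grouping mentioned in the abstract above, the sketch below merges adjacent timesteps bottom-up until a target number of groups remains. It is only one possible reading, not the paper's procedure: the function name `group_timesteps`, the use of a single scalar statistic per timestep (for example a maximum absolute activation), and the mean-gap merge criterion are all assumptions.

```python
def group_timesteps(stats, num_groups):
    """Agglomeratively merge adjacent timesteps into `num_groups` groups.

    `stats[t]` is a hypothetical scalar summary of the activations at
    timestep t. Only neighbouring groups may merge, so every group is a
    contiguous range of timesteps.
    """
    groups = [[t] for t in range(len(stats))]

    def mean(group):
        return sum(stats[t] for t in group) / len(group)

    while len(groups) > num_groups:
        # Distance between each pair of neighbouring groups.
        gaps = [abs(mean(groups[i]) - mean(groups[i + 1]))
                for i in range(len(groups) - 1)]
        i = gaps.index(min(gaps))
        # Merge the closest neighbouring pair into one group.
        groups[i:i + 2] = [groups[i] + groups[i + 1]]
    return groups


# Six timesteps whose statistics fall into three visibly distinct ranges.
stats = [1.0, 1.1, 1.2, 5.0, 5.1, 9.0]
groups = group_timesteps(stats, num_groups=3)
```

Each resulting group could then share one set of quantization parameters, which is the motivation the abstract gives for grouping timesteps with similar activation distributions.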

Title: LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs

Authors: Hanyu Zhou, Gim Hee Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06934
Pdf URL: https://arxiv.org/pdf/2503.06934
Copy Paste: [[2503.06934]] LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs(https://arxiv.org/abs/2503.06934)
Keywords: robust
Abstract: Large multimodal models (LMMs) excel in scene understanding but struggle with fine-grained spatiotemporal reasoning due to weak alignment between linguistic and visual representations. Existing methods map textual positions and durations into the visual space encoded from frame-based videos, but suffer from temporal sparsity that limits language-vision temporal coordination. To address this issue, we introduce LLaFEA (Large Language and Frame-Event Assistant) to leverage event cameras for temporally dense perception and frame-event fusion. Our approach employs a cross-attention mechanism to integrate complementary spatial and temporal features, followed by self-attention matching for global spatio-temporal associations. We further embed textual position and duration tokens into the fused visual space to enhance fine-grained alignment. This unified framework ensures robust spatio-temporal coordinate alignment, enabling LMMs to interpret scenes at any position and any time. In addition, we construct a dataset of real-world frames-events with coordinate instructions and conduct extensive experiments to validate the effectiveness of the proposed method.

Title: Modeling Human Skeleton Joint Dynamics for Fall Detection

Authors: Sania Zahan, Ghulam Mubashar Hassan, Ajmal Mian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06938
Pdf URL: https://arxiv.org/pdf/2503.06938
Copy Paste: [[2503.06938]] Modeling Human Skeleton Joint Dynamics for Fall Detection(https://arxiv.org/abs/2503.06938)
Keywords: privacy, robust, extraction
Abstract: The increasing pace of population aging calls for better care and support systems. Falling is a frequent and critical problem for elderly people causing serious long-term health issues. Fall detection from video streams is not an attractive option for real-life applications due to privacy issues. Existing methods try to resolve this issue by using very low-resolution cameras or video encryption. However, privacy cannot be ensured completely with such approaches. Key points on the body, such as skeleton joints, can convey significant information about motion dynamics and successive posture changes which are crucial for fall detection. Skeleton joints have been explored for feature extraction but with image recognition models that ignore joint dependency across frames which is important for the classification of actions. Moreover, existing models are over-parameterized or evaluated on small datasets with very few activity classes. We propose an efficient graph convolution network model that exploits spatio-temporal joint dependencies and dynamics of human skeleton joints for accurate fall detection. Our method leverages dynamic representation with robust concurrent spatio-temporal characteristics of skeleton joints. We performed extensive experiments on three large-scale datasets. With a significantly smaller model size than most existing methods, our proposed method achieves state-of-the-art results on the large scale NTU datasets.

Title: CineBrain: A Large-Scale Multi-Modal Brain Dataset During Naturalistic Audiovisual Narrative Processing

Authors: Jianxiong Gao, Yichang Liu, Baofeng Yang, Jianfeng Feng, Yanwei Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06940
Pdf URL: https://arxiv.org/pdf/2503.06940
Copy Paste: [[2503.06940]] CineBrain: A Large-Scale Multi-Modal Brain Dataset During Naturalistic Audiovisual Narrative Processing(https://arxiv.org/abs/2503.06940)
Keywords: diffusion
Abstract: In this paper, we introduce CineBrain, the first large-scale dataset featuring simultaneous EEG and fMRI recordings during dynamic audiovisual stimulation. Recognizing the complementary strengths of EEG's high temporal resolution and fMRI's deep-brain spatial coverage, CineBrain provides approximately six hours of narrative-driven content from the popular television series The Big Bang Theory for each of six participants. Building upon this unique dataset, we propose CineSync, an innovative multimodal decoding framework that integrates a Multi-Modal Fusion Encoder with a diffusion-based Neural Latent Decoder. Our approach effectively fuses EEG and fMRI signals, significantly improving the reconstruction quality of complex audiovisual stimuli. To facilitate rigorous evaluation, we introduce Cine-Benchmark, a comprehensive evaluation protocol that assesses reconstructions across semantic and perceptual dimensions. Experimental results demonstrate that CineSync achieves state-of-the-art video reconstruction performance and highlight our initial success in combining fMRI and EEG for reconstructing both video and audio stimuli. Project Page: this https URL.

Title: Aligning Instance-Semantic Sparse Representation towards Unsupervised Object Segmentation and Shape Abstraction with Repeatable Primitives

Authors: Jiaxin Li, Hongxing Wang, Jiawei Tan, Zhilong Ou, Junsong Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06947
Pdf URL: https://arxiv.org/pdf/2503.06947
Copy Paste: [[2503.06947]] Aligning Instance-Semantic Sparse Representation towards Unsupervised Object Segmentation and Shape Abstraction with Repeatable Primitives(https://arxiv.org/abs/2503.06947)
Keywords: segmentation
Abstract: Understanding 3D object shapes necessitates shape representation by object parts abstracted from results of instance and semantic segmentation. Promising shape representations enable computers to interpret a shape with meaningful parts and identify their repeatability. However, supervised shape representations depend on costly annotation efforts, while current unsupervised methods work under strong semantic priors and involve multi-stage training, thereby limiting their generalization and deployment in shape reasoning and understanding. Driven by the tendency of high-dimensional semantically similar features to lie in or near low-dimensional subspaces, we introduce a one-stage, fully unsupervised framework towards semantic-aware shape representation. This framework produces joint instance segmentation, semantic segmentation, and shape abstraction through sparse representation and feature alignment of object parts in a high-dimensional space. For sparse representation, we devise a sparse latent membership pursuit method that models each object part feature as a sparse convex combination of point features at either the semantic or instance level, promoting part features in the same subspace to exhibit similar semantics. For feature alignment, we customize an attention-based strategy in the feature space to align instance- and semantic-level object part features and reconstruct the input shape using both of them, ensuring geometric reusability and semantic consistency of object parts. To firm up semantic disambiguation, we construct cascade unfrozen learning on geometric parameters of object parts.

Title: Large Language Model Guided Progressive Feature Alignment for Multimodal UAV Object Detection

Authors: Wentao Wu, Chenglong Li, Xiao Wang, Bin Luo, Qi Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06948
Pdf URL: https://arxiv.org/pdf/2503.06948
Copy Paste: [[2503.06948]] Large Language Model Guided Progressive Feature Alignment for Multimodal UAV Object Detection(https://arxiv.org/abs/2503.06948)
Keywords: large language model
Abstract: Existing multimodal UAV object detection methods often overlook the impact of semantic gaps between modalities, which makes it difficult to achieve accurate semantic and spatial alignments, limiting detection performance. To address this problem, we propose a Large Language Model (LLM) guided Progressive feature Alignment Network called LPANet, which leverages the semantic features extracted from a large language model to guide the progressive semantic and spatial alignment between modalities for multimodal UAV object detection. To employ the powerful semantic representation of LLMs, we generate fine-grained text descriptions of each object category with ChatGPT and then extract the semantic features using the large language model MPNet. Based on the semantic features, we guide the semantic and spatial alignments in a progressive manner as follows. First, we design the Semantic Alignment Module (SAM) to pull the semantic features and multimodal visual features of each object closer, alleviating the semantic differences of objects between modalities. Second, we design the Explicit Spatial alignment Module (ESM) by integrating the semantic relations into the estimation of feature-level offsets, alleviating the coarse spatial misalignment between modalities. Finally, we design the Implicit Spatial alignment Module (ISM), which leverages the cross-modal correlations to aggregate key features from neighboring regions to achieve implicit spatial alignment. Comprehensive experiments on two public multimodal UAV object detection datasets demonstrate that our approach outperforms state-of-the-art multimodal UAV object detectors.

Title: Lshan-1.0 Technical Report

Authors: Haotian Chen, Yanyu Xu, Boyan Wang, Chaoyue Zhao, Xiaoyu Han, Fang Wang, Lizhen Cui, Yonghui Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06949
Pdf URL: https://arxiv.org/pdf/2503.06949
Copy Paste: [[2503.06949]] Lshan-1.0 Technical Report(https://arxiv.org/abs/2503.06949)
Keywords: explainability, large language model
Abstract: In this report, we introduce our first-generation reasoning model, Lshan-1.0, a large language model designed for the highly specialized Chinese legal domain, offering comprehensive capabilities to meet diverse realistic needs. Existing legal LLMs face two primary challenges. Firstly, their design and evaluation are predominantly driven by computer science perspectives, leading to insufficient incorporation of legal expertise and logic, which is crucial for high-precision legal applications, such as handling complex prosecutorial tasks. Secondly, these models often underperform due to a lack of comprehensive training data from the legal domain, limiting their ability to effectively address real-world legal scenarios. To address this, we first compile millions of legal documents covering over 20 types of crimes from 31 provinces in China for model training. From the extensive dataset, we further select high-quality data for supervised fine-tuning, ensuring enhanced relevance and precision. The model further undergoes large-scale reinforcement learning without additional supervision, emphasizing the enhancement of its reasoning capabilities and explainability. To validate its effectiveness in complex legal applications, we also conduct human evaluations with legal experts. We develop fine-tuned models based on DeepSeek-R1-Distilled versions, available in three dense configurations: 14B, 32B, and 70B.

Title: CtrlRAG: Black-box Adversarial Attacks Based on Masked Language Models in Retrieval-Augmented Language Generation

Authors: Runqi Sui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06950
Pdf URL: https://arxiv.org/pdf/2503.06950
Copy Paste: [[2503.06950]] CtrlRAG: Black-box Adversarial Attacks Based on Masked Language Models in Retrieval-Augmented Language Generation(https://arxiv.org/abs/2503.06950)
Keywords: security, defense, attack, robust, large language model
Abstract: Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by integrating external knowledge bases. However, this integration introduces a new security threat: adversaries can exploit the retrieval mechanism to inject malicious content into the knowledge base, thereby influencing the generated responses. Based on this attack vector, we propose CtrlRAG, a novel attack method designed for RAG systems in the black-box setting, which aligns with real-world scenarios. Unlike existing attack methods, CtrlRAG introduces a perturbation mechanism using a Masked Language Model (MLM) to dynamically optimize malicious content in response to changes in the retrieved context. Experimental results demonstrate that CtrlRAG outperforms three baseline methods in both Emotional Manipulation and Hallucination Amplification objectives. Furthermore, we evaluate three existing defense mechanisms, revealing their limited effectiveness against CtrlRAG and underscoring the urgent need for more robust defenses.

Title: Approximate Size Targets Are Sufficient for Accurate Semantic Segmentation

Authors: Xingye Fan, Zhongwen (Rex) Zhang, Yuri Boykov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06954
Pdf URL: https://arxiv.org/pdf/2503.06954
Copy Paste: [[2503.06954]] Approximate Size Targets Are Sufficient for Accurate Semantic Segmentation(https://arxiv.org/abs/2503.06954)
Keywords: robust, segmentation
Abstract: This paper demonstrates a surprising result for segmentation with image-level targets: extending binary class tags to approximate relative object-size distributions allows off-the-shelf architectures to solve the segmentation problem. A straightforward zero-avoiding KL-divergence loss for average predictions produces segmentation accuracy comparable to the standard pixel-precise supervision with full ground truth masks. In contrast, current results based on class tags typically require complex non-reproducible architectural modifications and specialized multi-stage training procedures. Our ideas are validated on PASCAL VOC using our new human annotations of approximate object sizes. We also show the results on COCO and medical data using synthetically corrupted size targets. All standard networks demonstrate robustness to the size targets' errors. For some classes, the validation accuracy is significantly better than the pixel-level supervision; the latter is not robust to errors in the masks. Our work provides new ideas and insights on image-level supervision in segmentation and may encourage other simple general solutions to the problem.
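
As an illustration of the size-target loss mentioned in this abstract, a minimal sketch follows. It assumes the loss is KL(target || average prediction), the direction usually called zero-avoiding, where the average prediction is the per-pixel class distribution averaged over the image. The function name `size_target_kl` and the `eps` smoothing term are hypothetical and are not taken from the paper.

```python
import math

def size_target_kl(pixel_probs, size_target, eps=1e-8):
    """KL(size_target || average prediction) for one image.

    pixel_probs: list of per-pixel class distributions (each sums to 1).
    size_target: approximate relative object sizes per class (sums to 1).
    eps: hypothetical smoothing constant to avoid log(0).
    """
    num_classes = len(size_target)
    num_pixels = len(pixel_probs)
    # Average the per-pixel predictions into one image-level distribution.
    avg = [sum(p[k] for p in pixel_probs) / num_pixels for k in range(num_classes)]
    loss = 0.0
    for v, q in zip(size_target, avg):
        if v > 0:  # classes with zero target size contribute nothing
            loss += v * math.log(v / (q + eps))
    return loss

# Toy image with 4 pixels and 2 classes (background, object).
probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
matched = size_target_kl(probs, [0.5, 0.5])     # average prediction is [0.5, 0.5]
mismatched = size_target_kl(probs, [0.9, 0.1])  # target says the object is small
```

In this toy case the loss is essentially zero when the average prediction matches the size target and grows when the predicted object area disagrees with it; because the target sits in the numerator, the loss penalizes predictions that assign near-zero area to a class with a non-zero target size.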

Title: Motion Anything: Any to Motion Generation

Authors: Zeyu Zhang, Yiran Wang, Wei Mao, Danning Li, Rui Zhao, Biao Wu, Zirui Song, Bohan Zhuang, Ian Reid, Richard Hartley
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06955
Pdf URL: https://arxiv.org/pdf/2503.06955
Copy Paste: [[2503.06955]] Motion Anything: Any to Motion Generation(https://arxiv.org/abs/2503.06955)
Keywords: diffusion
Abstract: Conditional motion generation has been extensively studied in computer vision, yet two critical challenges remain. First, while masked autoregressive methods have recently outperformed diffusion-based approaches, existing masking models lack a mechanism to prioritize dynamic frames and body parts based on given conditions. Second, existing methods for different conditioning modalities often fail to integrate multiple modalities effectively, limiting control and coherence in generated motion. To address these challenges, we propose Motion Anything, a multimodal motion generation framework that introduces an Attention-based Mask Modeling approach, enabling fine-grained spatial and temporal control over key frames and actions. Our model adaptively encodes multimodal conditions, including text and music, improving controllability. Additionally, we introduce Text-Motion-Dance (TMD), a new motion dataset consisting of 2,153 pairs of text, music, and dance, making it twice the size of AIST++, thereby filling a critical gap in the community. Extensive experiments demonstrate that Motion Anything surpasses state-of-the-art methods across multiple benchmarks, achieving a 15% improvement in FID on HumanML3D and showing consistent performance gains on AIST++ and TMD. See our project website this https URL

Title: Capture Global Feature Statistics for One-Shot Federated Learning

Authors: Zenghao Guan, Yucan Zhou, Xiaoyan Gu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06962
Pdf URL: https://arxiv.org/pdf/2503.06962
Copy Paste: [[2503.06962]] Capture Global Feature Statistics for One-Shot Federated Learning(https://arxiv.org/abs/2503.06962)
Keywords: privacy, attack, federate
Abstract: Traditional Federated Learning (FL) necessitates numerous rounds of communication between the server and clients, posing significant challenges including high communication costs, connection drop risks and susceptibility to privacy attacks. One-shot FL has become a compelling learning paradigm to overcome the above drawbacks by enabling the training of a global server model via a single communication round. However, existing one-shot FL methods suffer from high computation cost on the server or clients and cannot deal with non-IID (non-Independent and Identically Distributed) data stably and effectively. To address these challenges, this paper proposes FedCGS, a novel Federated learning algorithm that Captures Global feature Statistics leveraging pre-trained models. With global feature statistics, we achieve training-free and heterogeneity-resistant one-shot FL. Furthermore, we extend its application to the personalization scenario, where clients only need to execute one extra communication round with the server to download global statistics. Extensive experimental results demonstrate the effectiveness of our methods across diverse data heterogeneity settings. Code is available at this https URL.
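
To make the idea of aggregating global feature statistics in one round more concrete, here is a minimal sketch. The abstract does not say which statistics are collected, so this assumes the simplest case: each client sends per-class feature sums and counts (features would come from a frozen pre-trained encoder), and the server merges them into global class means. The function names `client_statistics` and `aggregate_class_means` are hypothetical.

```python
def client_statistics(features, labels, num_classes):
    """Per-class feature sums and counts computed locally on one client."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for x, y in zip(features, labels):
        counts[y] += 1
        for j, value in enumerate(x):
            sums[y][j] += value
    return sums, counts

def aggregate_class_means(client_stats, num_classes):
    """Server side: merge client sums and counts into global class means."""
    dim = len(client_stats[0][0][0])
    total_sums = [[0.0] * dim for _ in range(num_classes)]
    total_counts = [0] * num_classes
    for sums, counts in client_stats:
        for c in range(num_classes):
            total_counts[c] += counts[c]
            for j in range(dim):
                total_sums[c][j] += sums[c][j]
    # Classes never observed by any client get None instead of a mean.
    return [
        [s / total_counts[c] for s in total_sums[c]] if total_counts[c] else None
        for c in range(num_classes)
    ]

# Two clients with skewed (non-IID) label distributions.
client_a = client_statistics([[1.0, 0.0], [3.0, 0.0]], [0, 0], num_classes=2)
client_b = client_statistics([[5.0, 2.0], [0.0, 4.0]], [0, 1], num_classes=2)
global_means = aggregate_class_means([client_a, client_b], num_classes=2)
```

Because sums and counts add exactly, the merged means equal the means computed on the pooled data, however the labels are split across clients. That property is one plausible reading of "heterogeneity-resistant" under the assumptions above.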

Title: MIGA: Mutual Information-Guided Attack on Denoising Models for Semantic Manipulation

Authors: Guanghao Li, Mingzhi Chen, Hao Yu, Shuting Dong, Wenhao Jiang, Ming Tang, Chun Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06966
Pdf URL: https://arxiv.org/pdf/2503.06966
Copy Paste: [[2503.06966]] MIGA: Mutual Information-Guided Attack on Denoising Models for Semantic Manipulation(https://arxiv.org/abs/2503.06966)
Keywords: security, attack, robust
Abstract: Deep learning-based denoising models have been widely employed in vision tasks, functioning as filters to eliminate noise while retaining crucial semantic information. Additionally, they play a vital role in defending against adversarial perturbations that threaten downstream tasks. However, these models can be intrinsically susceptible to adversarial attacks due to their dependence on specific noise assumptions. Existing attacks on denoising models mainly aim at deteriorating visual clarity while neglecting semantic manipulation, rendering them either easily detectable or limited in effectiveness. In this paper, we propose Mutual Information-Guided Attack (MIGA), the first method designed to directly attack deep denoising models by strategically disrupting their ability to preserve semantic content via adversarial perturbations. By minimizing the mutual information between the original and denoised images, a measure of semantic similarity, MIGA forces the denoiser to produce perceptually clean yet semantically altered outputs. While these images appear visually plausible, they encode systematically distorted semantics, revealing a fundamental vulnerability in denoising models. These distortions persist in denoised outputs and can be quantitatively assessed through downstream task performance. We propose new evaluation metrics and systematically assess MIGA on four denoising models across five datasets, demonstrating its consistent effectiveness in disrupting semantic fidelity. Our findings suggest that denoising models are not always robust and can introduce security risks in real-world applications. Code is available in the Supplementary Material.

Title: A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis

Authors: Xiang Liu, Zhaoxiang Liu, Huan Hu, Zezhou Chen, Kohou Wang, Kai Wang, Shiguo Lian
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06973
Pdf URL: https://arxiv.org/pdf/2503.06973
Copy Paste: [[2503.06973]] A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis(https://arxiv.org/abs/2503.06973)
Keywords: generative
Abstract: While conversational generative AI has shown considerable potential in enhancing decision-making for agricultural professionals, its exploration has predominantly been anchored in text-based interactions. The evolution of multimodal conversational AI, leveraging vast amounts of image-text data from diverse sources, marks a significant stride forward. However, the application of such advanced vision-language models in the agricultural domain, particularly for crop disease diagnosis, remains underexplored. In this work, we present the crop disease domain multimodal (CDDM) dataset, a pioneering resource designed to advance the field of agricultural research through the application of multimodal learning techniques. The dataset comprises 137,000 images of various crop diseases, accompanied by 1 million question-answer pairs that span a broad spectrum of agricultural knowledge, from disease identification to management practices. By integrating visual and textual data, CDDM facilitates the development of sophisticated question-answering systems capable of providing precise, useful advice to farmers and agricultural professionals. We demonstrate the utility of the dataset by finetuning state-of-the-art multimodal models, showcasing significant improvements in crop disease diagnosis. Specifically, we employed a novel finetuning strategy that utilizes low-rank adaptation (LoRA) to finetune the visual encoder, adapter and language model simultaneously. Our contributions include not only the dataset but also a finetuning strategy and a benchmark to stimulate further research in agricultural technology, aiming to bridge the gap between advanced AI techniques and practical agricultural applications. The dataset is available at https://github.com/UnicomAI/UnicomBenchmark/tree/main/CDDMBench.

Title: Task-Specific Knowledge Distillation from the Vision Foundation Model for Enhanced Medical Image Segmentation

Authors: Pengchen Liang, Haishan Huang, Bin Pu, Jianguo Chen, Xiang Hua, Jing Zhang, Weibo Ma, Zhuangzhuang Chen, Yiwei Li, Qing Chang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06976
Pdf URL: https://arxiv.org/pdf/2503.06976
Copy Paste: [[2503.06976]] Task-Specific Knowledge Distillation from the Vision Foundation Model for Enhanced Medical Image Segmentation(https://arxiv.org/abs/2503.06976)
Keywords: diffusion, segmentation
Abstract: Large-scale pre-trained models, such as Vision Foundation Models (VFMs), have demonstrated impressive performance across various downstream tasks by transferring generalized knowledge, especially when target data is limited. However, their high computational cost and the domain gap between natural and medical images limit their practical application in medical segmentation tasks. Motivated by this, we pose the following important question: "How can we effectively utilize the knowledge of large pre-trained VFMs to train a small, task-specific model for medical image segmentation when training data is limited?" To address this problem, we propose a novel and generalizable task-specific knowledge distillation framework. Our method fine-tunes the VFM on the target segmentation task to capture task-specific features before distilling the knowledge to smaller models, leveraging Low-Rank Adaptation (LoRA) to reduce the computational cost of fine-tuning. Additionally, we incorporate synthetic data generated by diffusion models to augment the transfer set, enhancing model performance in data-limited scenarios. Experimental results across five medical image datasets demonstrate that our method consistently outperforms task-agnostic knowledge distillation and self-supervised pretraining approaches like MoCo v3 and Masked Autoencoders (MAE). For example, on the KidneyUS dataset, our method achieved a 28% higher Dice score than task-agnostic KD using 80 labeled samples for fine-tuning. On the CHAOS dataset, it achieved an 11% improvement over MAE with 100 labeled samples. These results underscore the potential of task-specific knowledge distillation to train accurate, efficient models for medical image segmentation in data-constrained settings.

Title: Lightweight Multimodal Artificial Intelligence Framework for Maritime Multi-Scene Recognition

Authors: Xinyu Xi, Hua Yang, Shentai Zhang, Yijie Liu, Sijin Sun, Xiuju Fu
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2503.06978
Pdf URL: https://arxiv.org/pdf/2503.06978
Copy Paste: [[2503.06978]] Lightweight Multimodal Artificial Intelligence Framework for Maritime Multi-Scene Recognition(https://arxiv.org/abs/2503.06978)
Keywords: robust, large language model
Abstract: Maritime Multi-Scene Recognition is crucial for enhancing the capabilities of intelligent marine robotics, particularly in applications such as marine conservation, environmental monitoring, and disaster response. However, this task presents significant challenges due to environmental interference, where marine conditions degrade image quality, and the complexity of maritime scenes, which requires deeper reasoning for accurate recognition. Pure vision models alone are insufficient to address these issues. To overcome these limitations, we propose a novel multimodal Artificial Intelligence (AI) framework that integrates image data, textual descriptions and classification vectors generated by a Multimodal Large Language Model (MLLM), to provide richer semantic understanding and improve recognition accuracy. Our framework employs an efficient multimodal fusion mechanism to further enhance model robustness and adaptability in complex maritime environments. Experimental results show that our model achieves 98% accuracy, surpassing previous SOTA models by 3.5%. To optimize deployment on resource-constrained platforms, we adopt activation-aware weight quantization (AWQ) as a lightweight technique, reducing the model size to 68.75MB with only a 0.5% accuracy drop while significantly lowering computational overhead. This work provides a high-performance solution for real-time maritime scene recognition, enabling Autonomous Surface Vehicles (ASVs) to support environmental monitoring and disaster response in resource-limited settings.

Title: Exploring Multimodal Perception in Large Language Models Through Perceptual Strength Ratings

Authors: Jonghyun Lee, Dojun Park, Jiwoo Lee, Hoekeon Choi, Sung-Eun Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06980
Pdf URL: https://arxiv.org/pdf/2503.06980
Copy Paste: [[2503.06980]] Exploring Multimodal Perception in Large Language Models Through Perceptual Strength Ratings(https://arxiv.org/abs/2503.06980)
Keywords: large language model
Abstract: This study investigated the multimodal perception of large language models (LLMs), focusing on their ability to capture human-like perceptual strength ratings across sensory modalities. Utilizing perceptual strength ratings as a benchmark, the research compared GPT-3.5, GPT-4, GPT-4o, and GPT-4o-mini, highlighting the influence of multimodal inputs on grounding and linguistic reasoning. While GPT-4 and GPT-4o demonstrated strong alignment with human evaluations and significant advancements over smaller models, qualitative analyses revealed distinct differences in processing patterns, such as multisensory overrating and reliance on loose semantic associations. Despite integrating multimodal capabilities, GPT-4o did not exhibit superior grounding compared to GPT-4, raising questions about their role in improving human-like grounding. These findings underscore how LLMs' reliance on linguistic patterns can both approximate and diverge from human embodied cognition, revealing limitations in replicating sensory experiences.

Title: Learning Decision Trees as Amortized Structure Inference

Authors: Mohammed Mahfoud, Ghait Boukachab, Michał Koziarski, Alex Hernandez-Garcia, Stefan Bauer, Yoshua Bengio, Nikolay Malkin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06985
Pdf URL: https://arxiv.org/pdf/2503.06985
Copy Paste: [[2503.06985]] Learning Decision Trees as Amortized Structure Inference(https://arxiv.org/abs/2503.06985)
Keywords: robust, generative
Abstract: Building predictive models for tabular data presents fundamental challenges, notably in scaling consistently, i.e., more resources translating to better performance, and generalizing systematically beyond the training data distribution. Designing decision tree models remains especially challenging given the intractably large search space, and most existing methods rely on greedy heuristics, while deep learning inductive biases expect a temporal or spatial structure not naturally present in tabular data. We propose a hybrid amortized structure inference approach to learn predictive decision tree ensembles given data, formulating decision tree construction as a sequential planning problem. We train a deep reinforcement learning (GFlowNet) policy to solve this problem, yielding a generative model that samples decision trees from the Bayesian posterior. We show that our approach, DT-GFN, outperforms state-of-the-art decision tree and deep learning methods on standard classification benchmarks derived from real-world data, robustness to distribution shifts, and anomaly detection, all while yielding interpretable models with shorter description lengths. Samples from the trained DT-GFN model can be ensembled to construct a random forest, and we further show that performance scales consistently with ensemble size, yielding ensembles of predictors that continue to generalize systematically.

Title: ConcreTizer: Model Inversion Attack via Occupancy Classification and Dispersion Control for 3D Point Cloud Restoration

Authors: Youngseok Kim, Sunwook Hwang, Hyung-Sin Kim, Saewoong Bahk
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06986
Pdf URL: https://arxiv.org/pdf/2503.06986
Copy Paste: [[2503.06986]] ConcreTizer: Model Inversion Attack via Occupancy Classification and Dispersion Control for 3D Point Cloud Restoration(https://arxiv.org/abs/2503.06986)
Keywords: privacy, defense, attack, robust
Abstract: The growing use of 3D point cloud data in autonomous vehicles (AVs) has raised serious privacy concerns, particularly due to the sensitive information that can be extracted from 3D data. While model inversion attacks have been widely studied in the context of 2D data, their application to 3D point clouds remains largely unexplored. To fill this gap, we present the first in-depth study of model inversion attacks aimed at restoring 3D point cloud scenes. Our analysis reveals the unique challenges, the inherent sparsity of 3D point clouds and the ambiguity between empty and non-empty voxels after voxelization, which are further exacerbated by the dispersion of non-empty voxels across feature extractor layers. To address these challenges, we introduce ConcreTizer, a simple yet effective model inversion attack designed specifically for voxel-based 3D point cloud data. ConcreTizer incorporates Voxel Occupancy Classification to distinguish between empty and non-empty voxels and Dispersion-Controlled Supervision to mitigate non-empty voxel dispersion. Extensive experiments on widely used 3D feature extractors and benchmark datasets, such as KITTI and Waymo, demonstrate that ConcreTizer concretely restores the original 3D point cloud scene from disrupted 3D feature data. Our findings highlight both the vulnerability of 3D data to inversion attacks and the urgent need for robust defense strategies.

Title: Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations

Authors: Jiho Jin, Woosung Kang, Junho Myung, Alice Oh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06987
Pdf URL: https://arxiv.org/pdf/2503.06987
Copy Paste: [[2503.06987]] Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations(https://arxiv.org/abs/2503.06987)
Keywords: large language model
Abstract: Measuring social bias in large language models (LLMs) is crucial, but existing bias evaluation methods struggle to assess bias in long-form generation. We propose a Bias Benchmark for Generation (BBG), an adaptation of the Bias Benchmark for QA (BBQ), designed to evaluate social bias in long-form generation by having LLMs generate continuations of story prompts. Building our benchmark in English and Korean, we measure the probability of neutral and biased generations across ten LLMs. We also compare our long-form story generation evaluation results with multiple-choice BBQ evaluation, showing that the two approaches produce inconsistent results.

Title: Utilizing Jailbreak Probability to Attack and Safeguard Multimodal LLMs

Authors: Wenzhuo Xu, Zhipeng Wei, Xiongtao Sun, Deyue Zhang, Dongdong Yang, Quanchen Zou, Xiangzheng Zhang
Subjects: cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2503.06989
Pdf URL: https://arxiv.org/pdf/2503.06989
Copy Paste: [[2503.06989]] Utilizing Jailbreak Probability to Attack and Safeguard Multimodal LLMs(https://arxiv.org/abs/2503.06989)
Keywords: attack, large language model
Abstract: Recently, Multimodal Large Language Models (MLLMs) have demonstrated their superior ability in understanding multimodal contents. However, they remain vulnerable to jailbreak attacks, which exploit weaknesses in their safety alignment to generate harmful responses. Previous studies categorize jailbreaks as successful or failed based on whether responses contain malicious content. However, given the stochastic nature of MLLM responses, this binary classification of an input's ability to jailbreak MLLMs is inappropriate. Derived from this viewpoint, we introduce jailbreak probability to quantify the jailbreak potential of an input, which represents the likelihood that MLLMs generate a malicious response when prompted with this input. We approximate this probability through multiple queries to MLLMs. After modeling the relationship between input hidden states and their corresponding jailbreak probability using a Jailbreak Probability Prediction Network (JPPN), we use continuous jailbreak probability for optimization. Specifically, we propose Jailbreak-Probability-based Attack (JPA) that optimizes adversarial perturbations on inputs to maximize jailbreak probability. To counteract attacks, we also propose two defensive methods: Jailbreak-Probability-based Finetuning (JPF) and Jailbreak-Probability-based Defensive Noise (JPDN), which minimize jailbreak probability in the MLLM parameters and input space, respectively. Extensive experiments show that (1) JPA yields improvements (up to 28.38%) under both white and black box settings compared to previous methods with small perturbation bounds and few iterations; (2) JPF and JPDN significantly reduce jailbreaks, by over 60% at most. Both of the above results demonstrate the significance of introducing jailbreak probability to make nuanced distinctions among input jailbreak abilities.
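
The step of approximating jailbreak probability through multiple queries can be read as a simple Monte Carlo estimate: query the model repeatedly with the same input and report the fraction of responses judged malicious. The sketch below illustrates only that counting step, for evaluation purposes. The callables `respond` and `is_malicious` are hypothetical stand-ins for a stochastic model and a response judge; neither is taken from the paper.

```python
import random

def estimate_jailbreak_probability(respond, is_malicious, prompt, num_queries=20):
    """Monte Carlo estimate: fraction of sampled responses judged malicious.

    respond and is_malicious are hypothetical callables standing in for a
    stochastic model and a response judge.
    """
    if num_queries <= 0:
        raise ValueError("num_queries must be positive")
    hits = sum(1 for _ in range(num_queries) if is_malicious(respond(prompt)))
    return hits / num_queries

# Toy stand-ins: a "model" that misbehaves on about 30% of samples.
rng = random.Random(0)

def toy_respond(prompt):
    return "unsafe" if rng.random() < 0.3 else "refusal"

def toy_judge(response):
    return response == "unsafe"

estimate = estimate_jailbreak_probability(toy_respond, toy_judge, "example input", 1000)
```

The estimate is a value in [0, 1] rather than a success/failure label, which is what makes it usable as a continuous regression target or optimization objective. More queries reduce its variance at the cost of more model calls.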

Title: TiGer: Self-Supervised Purification for Time-evolving Graphs

Authors: Hyeonsoo Jo, Jongha Lee, Fanchen Bu, Kijung Shin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06990
Pdf URL: https://arxiv.org/pdf/2503.06990
Copy Paste: [[2503.06990]] TiGer: Self-Supervised Purification for Time-evolving Graphs(https://arxiv.org/abs/2503.06990)
Keywords: robust
Abstract: Time-evolving graphs, such as social and citation networks, often contain noise that distorts structural and temporal patterns, adversely affecting downstream tasks, such as node classification. Existing purification methods focus on static graphs, limiting their ability to account for critical temporal dependencies in dynamic graphs. In this work, we propose TiGer (Time-evolving Graph purifier), a self-supervised method explicitly designed for time-evolving graphs. TiGer assigns two different sub-scores to edges using (1) self-attention for capturing long-term contextual patterns shaped by both adjacent and distant past events of varying significance and (2) statistical distance measures for detecting inconsistency over a short-term period. These sub-scores are used to identify and filter out suspicious (i.e., noise-like) edges through an ensemble strategy, ensuring robustness without requiring noise labels. Our experiments on five real-world datasets show TiGer filters out noise with up to 10.2% higher accuracy and improves node classification performance by up to 5.3%, compared to state-of-the-art methods.

Title: Are We Truly Forgetting? A Critical Re-examination of Machine Unlearning Evaluation Protocols

Authors: Yongwoo Kim, Sungmin Cha, Donghyun Kim
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.06991
Pdf URL: https://arxiv.org/pdf/2503.06991
Copy Paste: [[2503.06991]] Are We Truly Forgetting? A Critical Re-examination of Machine Unlearning Evaluation Protocols(https://arxiv.org/abs/2503.06991)
Keywords: security, privacy
Abstract: Machine unlearning is a process to remove specific data points from a trained model while maintaining the performance on retain data, addressing privacy or legal requirements. Despite its importance, existing unlearning evaluations tend to focus on logit-based metrics (i.e., accuracy) under small-scale scenarios. We observe that this could lead to a false sense of security in unlearning approaches under real-world scenarios. In this paper, we conduct a new comprehensive evaluation that employs representation-based evaluations of the unlearned model under large-scale scenarios to verify whether the unlearning approaches genuinely eliminate the targeted forget data from the model's representation perspective. Our analysis reveals that current state-of-the-art unlearning approaches either completely degrade the representational quality of the unlearned model or merely modify the classifier (i.e., the last layer), thereby achieving superior logit-based evaluation metrics while maintaining significant representational similarity to the original model. Furthermore, we introduce a novel unlearning evaluation setup from a transfer learning perspective, in which the forget set classes exhibit semantic similarity to downstream task classes, necessitating that feature representations diverge significantly from those of the original model. Our comprehensive benchmark not only addresses a critical gap between theoretical machine unlearning and practical scenarios, but also establishes a foundation to inspire future research directions in developing genuinely effective unlearning methodologies.

Title: CAPT: Class-Aware Prompt Tuning for Federated Long-Tailed Learning with Vision-Language Model

Authors: Shihao Hou, Xinyi Shang, Shreyank N Gowda, Yang Lu, Chao Wu, Yan Yan, Hanzi Wang
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.06993
Pdf URL: https://arxiv.org/pdf/2503.06993
Copy Paste: [[2503.06993]] CAPT: Class-Aware Prompt Tuning for Federated Long-Tailed Learning with Vision-Language Model(https://arxiv.org/abs/2503.06993)
Keywords: federate
Abstract: Effectively handling the co-occurrence of non-IID data and long-tailed distributions remains a critical challenge in federated learning. While fine-tuning vision-language models (VLMs) like CLIP has shown promise in addressing non-IID data challenges, this approach leads to severe degradation of tail classes in federated long-tailed scenarios. Under the composite effects of strong non-IID data distribution and long-tailed class imbalances, VLM fine-tuning may even fail to yield any improvement. To address this issue, we propose Class-Aware Prompt Learning for Federated Long-tailed Learning (CAPT), a novel framework that leverages a pre-trained VLM to effectively handle both data heterogeneity and long-tailed distributions. CAPT introduces a dual-prompt mechanism that synergizes general and class-aware prompts, enabling the framework to capture global trends while preserving class-specific knowledge. To better aggregate and share knowledge across clients, we introduce a heterogeneity-aware client clustering strategy that groups clients based on their data distributions, enabling efficient collaboration and knowledge sharing. Extensive experiments on various long-tailed datasets with different levels of data heterogeneity demonstrate that CAPT significantly improves tail class performance without compromising overall accuracy, outperforming state-of-the-art methods in federated long-tailed learning scenarios.

Title: Public space security management using digital twin technologies

Authors: Stylianos Zindros, Christos Chronis, Panagiotis Radoglou-Grammatikis, Vasileios Argyriou, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06996
Pdf URL: https://arxiv.org/pdf/2503.06996
Copy Paste: [[2503.06996]] Public space security management using digital twin technologies(https://arxiv.org/abs/2503.06996)
Keywords: security
Abstract: As the security of public spaces remains a critical issue in today's world, Digital Twin (DT) technologies have emerged in recent years as a promising solution for detecting and predicting potential future threats. The applied methodology leverages a Digital Twin of a metro station in Athens, Greece, using the FlexSim simulation software. The model encompasses points of interest and passenger flows, and sets their corresponding parameters. These elements influence and allow the model to provide reasonable predictions on the security management of the station under various scenarios. Experimental tests are conducted with different configurations of surveillance cameras and optimizations of camera angles to evaluate the effectiveness of the space surveillance setup. The results show that the strategic positioning of surveillance cameras and the adjustment of their angles significantly improve the detection of suspicious behaviors, and that the DT makes it possible to evaluate different scenarios and find the optimal camera setup for each case. In summary, this study highlights the value of Digital Twins in real-time simulation and data-driven security management. The proposed approach contributes to the ongoing development of smart security solutions for public spaces and provides an innovative framework for threat detection and prevention.

Title: SOYO: A Tuning-Free Approach for Video Style Morphing via Style-Adaptive Interpolation in Diffusion Models

Authors: Haoyu Zheng, Qifan Yu, Binghe Yu, Yang Dai, Wenqiao Zhang, Juncheng Li, Siliang Tang, Yueting Zhuang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06998
Pdf URL: https://arxiv.org/pdf/2503.06998
Copy Paste: [[2503.06998]] SOYO: A Tuning-Free Approach for Video Style Morphing via Style-Adaptive Interpolation in Diffusion Models(https://arxiv.org/abs/2503.06998)
Keywords: diffusion
Abstract: Diffusion models have achieved remarkable progress in image and video stylization. However, most existing methods focus on single-style transfer, while video stylization involving multiple styles necessitates seamless transitions between them. We refer to this smooth style transition between video frames as video style morphing. Current approaches often generate stylized video frames with discontinuous structures and abrupt style changes when handling such transitions. To address these limitations, we introduce SOYO, a novel diffusion-based framework for video style morphing. Our method employs a pre-trained text-to-image diffusion model without fine-tuning, combining attention injection and AdaIN to preserve structural consistency and enable smooth style transitions across video frames. Moreover, we notice that applying linear equidistant interpolation directly induces imbalanced style morphing. To harmonize across video frames, we propose a novel adaptive sampling scheduler operating between two style images. Extensive experiments demonstrate that SOYO outperforms existing methods in open-domain video style morphing, better preserving the structural coherence of video frames while achieving stable and smooth style transitions.
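
Sketch: the abstract names AdaIN and a linear equidistant interpolation baseline but gives no formulas. The snippet below is a minimal illustration of standard AdaIN on a flat list of feature values, plus a linear blend between two styles driven by equally spaced weights. It is not the SOYO method: the function names, the flat-list features, and the choice to blend AdaIN outputs are assumptions made for illustration, and the paper's adaptive sampling scheduler is not reproduced.

```python
from statistics import fmean, pstdev


def adain(content, style, eps=1e-5):
    """Standard AdaIN: shift and scale content features to the style's mean/std."""
    mu_c, sigma_c = fmean(content), pstdev(content)
    mu_s, sigma_s = fmean(style), pstdev(style)
    return [sigma_s * (x - mu_c) / (sigma_c + eps) + mu_s for x in content]


def blend_styles(content, style_a, style_b, alpha):
    """Hypothetical blend: alpha=0 yields style A, alpha=1 yields style B."""
    out_a = adain(content, style_a)
    out_b = adain(content, style_b)
    return [(1 - alpha) * a + alpha * b for a, b in zip(out_a, out_b)]


def equidistant_schedule(num_frames):
    """Equally spaced blend weights in [0, 1], one per frame (needs num_frames >= 2)."""
    return [i / (num_frames - 1) for i in range(num_frames)]


# One blended feature vector per frame of a 5-frame clip.
content = [1.0, 2.0, 3.0]
frames = [
    blend_styles(content, [10.0, 20.0, 30.0], [-1.0, 0.0, 1.0], alpha)
    for alpha in equidistant_schedule(5)
]
```

The abstract reports that this kind of equidistant schedule produces imbalanced morphing, which is the motivation it gives for an adaptive scheduler.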

Title: Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning

Authors: Jiazheng Liu, Sipeng Zheng, Börje F. Karlsson, Zongqing Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07002
Pdf URL: https://arxiv.org/pdf/2503.07002
Copy Paste: [[2503.07002]] Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning(https://arxiv.org/abs/2503.07002)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs), built on large-scale pre-trained vision towers and language models, have shown great capabilities in multimodal understanding. However, most existing MLLMs are trained on single-turn vision question-answering tasks, which do not accurately reflect real-world human conversations. In this paper, we introduce MMDiag, a multi-turn multimodal dialogue dataset. This dataset is collaboratively generated through deliberately designed rules and GPT assistance, featuring strong correlations between questions, between questions and images, and among different image regions; thus aligning more closely with real-world scenarios. MMDiag serves as a strong benchmark for multi-turn multimodal dialogue learning and brings more challenges to the grounding and reasoning capabilities of MLLMs. Further, inspired by human vision processing, we present DiagNote, an MLLM equipped with multimodal grounding and reasoning capabilities. DiagNote consists of two modules (Deliberate and Gaze) interacting with each other to perform Chain-of-Thought and annotations respectively, throughout multi-turn dialogues. We empirically demonstrate the advantages of DiagNote in both grounding and jointly processing and reasoning with vision and language information over existing MLLMs.

Title: Large Language Models Often Say One Thing and Do Another

Authors: Ruoxi Xu, Hongyu Lin, Xianpei Han, Jia Zheng, Weixiang Zhou, Le Sun, Yingfei Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07003
Pdf URL: https://arxiv.org/pdf/2503.07003
Copy Paste: [[2503.07003]] Large Language Models Often Say One Thing and Do Another(https://arxiv.org/abs/2503.07003)
Keywords: large language model
Abstract: As large language models (LLMs) increasingly become central to various applications and interact with diverse user populations, ensuring their reliable and consistent performance is becoming more important. This paper explores a critical issue in assessing the reliability of LLMs: the consistency between their words and deeds. To quantitatively explore this consistency, we developed a novel evaluation benchmark called the Words and Deeds Consistency Test (WDCT). The benchmark establishes a strict correspondence between word-based and deed-based questions across different domains, including opinion vs. action, non-ethical value vs. action, ethical value vs. action, and theory vs. application. The evaluation results reveal a widespread inconsistency between words and deeds across different LLMs and domains. Subsequently, we conducted experiments with either word alignment or deed alignment to observe their impact on the other aspect. The experimental results indicate that alignment only on words or deeds poorly and unpredictably influences the other aspect. This supports our hypothesis that the underlying knowledge guiding LLMs' word or deed choices is not contained within a unified space.

Title: SDFA: Structure Aware Discriminative Feature Aggregation for Efficient Human Fall Detection in Video

Authors: Sania Zahan, Ghulam Mubashar Hassan, Ajmal Mian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07008
Pdf URL: https://arxiv.org/pdf/2503.07008
Copy Paste: [[2503.07008]] SDFA: Structure Aware Discriminative Feature Aggregation for Efficient Human Fall Detection in Video(https://arxiv.org/abs/2503.07008)
Keywords: privacy
Abstract: Older people are susceptible to falls due to instability in posture and deteriorating health. Immediate access to medical support can greatly reduce repercussions. Hence, there is an increasing interest in automated fall detection, often incorporated into a smart healthcare system to provide better monitoring. Existing systems focus on wearable devices, which are inconvenient, or video monitoring, which raises privacy concerns. Moreover, these systems provide a limited perspective of their generalization ability as they are tested on datasets containing few activities that have wide disparity in the action space and are easy to differentiate. Complex daily life scenarios pose much greater challenges with activities that overlap in action spaces due to similar posture or motion. To overcome these limitations, we propose a fall detection model, coined SDFA, based on human skeletons extracted from low-resolution videos. The use of skeleton data ensures privacy, and low-resolution videos ensure low hardware and computational cost. Our model captures discriminative structural displacements and motion trends using unified joint and motion features projected onto a shared high dimensional space. Particularly, the use of separable convolution combined with a powerful GCN architecture provides improved performance. Extensive experiments on five large-scale datasets with a wide range of evaluation settings show that our model achieves competitive performance with extremely low computational complexity and runs faster than existing models.

Title: Toward Multi-Session Personalized Conversation: A Large-Scale Dataset and Hierarchical Tree Framework for Implicit Reasoning

Authors: Xintong Li, Jalend Bantupalli, Ria Dharmani, Yuwei Zhang, Jingbo Shang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07018
Pdf URL: https://arxiv.org/pdf/2503.07018
Copy Paste: [[2503.07018]] Toward Multi-Session Personalized Conversation: A Large-Scale Dataset and Hierarchical Tree Framework for Implicit Reasoning(https://arxiv.org/abs/2503.07018)
Keywords: large language model
Abstract: There has been a surge in the use of large language model (LLM) conversational agents to generate responses based on long-term history from multiple sessions. However, existing long-term open-domain dialogue datasets lack complex, real-world personalization and fail to capture implicit reasoning, where relevant information is embedded in subtle, syntactic, or semantically distant connections rather than explicit statements. In such cases, traditional retrieval methods fail to capture relevant context, and long-context modeling also becomes inefficient due to numerous complicated persona-related details. To address this gap, we introduce ImplexConv, a large-scale long-term dataset with 2,500 examples, each containing approximately 100 conversation sessions, designed to study implicit reasoning in personalized dialogues. Additionally, we propose TaciTree, a novel hierarchical tree framework that structures conversation history into multiple levels of summarization. Instead of brute-force searching all data, TaciTree enables an efficient, level-based retrieval process where models refine their search by progressively selecting relevant details. Our experiments demonstrate that TaciTree significantly improves the ability of LLMs to reason over long-term conversations with implicit contextual dependencies.
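
Sketch: the abstract describes a level-based retrieval over a tree of summaries without giving the procedure. The snippet below is one way such a descent could look, assuming each node holds a text summary and relevance is scored by plain word overlap. The `Node` class, the `overlap` score, and the example sessions are all hypothetical; the abstract indicates a model does the selecting in TaciTree, and that is not what this toy scorer does.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A summary at some level of the tree; leaves stand for single sessions."""
    summary: str
    children: list = field(default_factory=list)


def overlap(query, text):
    """Toy relevance score: number of lowercase words shared by query and text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(root, query):
    """Descend one level at a time, always following the best-scoring child."""
    node, path = root, [root.summary]
    while node.children:
        node = max(node.children, key=lambda child: overlap(query, child.summary))
        path.append(node.summary)
    return path


# Hypothetical two-level history.
history = Node("all sessions", [
    Node("travel and hiking plans", [
        Node("session 12: user prefers short hiking trails"),
        Node("session 40: user booked a flight to Oslo"),
    ]),
    Node("food and cooking", [
        Node("session 7: user is allergic to peanuts"),
    ]),
])
path = retrieve(history, "which hiking trails does the user prefer")
```

Each query touches only one child list per level, rather than every stored session, which is the efficiency argument the abstract makes against brute-force search.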

Title: HybridReg: Robust 3D Point Cloud Registration with Hybrid Motions

Authors: Keyu Du, Hao Xu, Haipeng Li, Hong Qu, Chi-Wing Fu, Shuaicheng Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07019
Pdf URL: https://arxiv.org/pdf/2503.07019
Copy Paste: [[2503.07019]] HybridReg: Robust 3D Point Cloud Registration with Hybrid Motions(https://arxiv.org/abs/2503.07019)
Keywords: robust, extraction
Abstract: Scene-level point cloud registration is very challenging when considering dynamic foregrounds. Existing indoor datasets mostly assume rigid motions, so the trained models cannot robustly handle scenes with non-rigid motions. On the other hand, non-rigid datasets are mainly object-level, so the trained models cannot generalize well to complex scenes. This paper presents HybridReg, a new approach to 3D point cloud registration that learns an uncertainty mask to account for hybrid motions: rigid for backgrounds and non-rigid/rigid for instance-level foregrounds. First, we build a scene-level 3D registration dataset, namely HybridMatch, designed specifically with strategies to arrange diverse deforming foregrounds in a controllable manner. Second, we account for different motion types and formulate a mask-learning module to alleviate the interference of deforming outliers. Third, we exploit a simple yet effective negative log-likelihood loss so that uncertainty guides the feature extraction and correlation computation. To the best of our knowledge, HybridReg is the first work that exploits hybrid motions for robust point cloud registration. Extensive experiments show HybridReg's strengths, leading it to achieve state-of-the-art performance on both widely-used indoor and outdoor datasets.

Title: Erase Diffusion: Empowering Object Removal Through Calibrating Diffusion Pathways

Authors: Yi Liu, Hao Zhou, Wenxiang Shang, Ran Lin, Benlei Cui
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07026
Pdf URL: https://arxiv.org/pdf/2503.07026
Copy Paste: [[2503.07026]] Erase Diffusion: Empowering Object Removal Through Calibrating Diffusion Pathways(https://arxiv.org/abs/2503.07026)
Keywords: diffusion
Abstract: Erase inpainting, or object removal, aims to precisely remove target objects within masked regions while preserving the overall consistency of the surrounding content. Although diffusion-based methods have made significant strides in the field of image inpainting, challenges remain regarding the emergence of unexpected objects or artifacts. We assert that the inexact diffusion pathways established by existing standard optimization paradigms constrain the efficacy of object removal. To tackle these challenges, we propose a novel Erase Diffusion, termed EraDiff, aimed at unleashing the potential power of standard diffusion in the context of object removal. In contrast to standard diffusion, EraDiff adapts both the optimization paradigm and the network to improve the coherence and elimination of the erasure results. We first introduce a Chain-Rectifying Optimization (CRO) paradigm, a sophisticated diffusion process specifically designed to align with the objectives of erasure. This paradigm establishes innovative diffusion transition pathways that simulate the gradual elimination of objects during optimization, allowing the model to accurately capture the intent of object removal. Furthermore, to mitigate deviations caused by artifacts during the sampling pathways, we develop a simple yet effective Self-Rectifying Attention (SRA) mechanism. The SRA calibrates the sampling pathways by altering self-attention activation, allowing the model to effectively bypass artifacts while further enhancing the coherence of the generated content. With this design, our proposed EraDiff achieves state-of-the-art performance on the OpenImages V5 dataset and demonstrates significant superiority in real-world scenarios.

Title: EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer

Authors: Yuxuan Zhang, Yirui Yuan, Yiren Song, Haofan Wang, Jiaming Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07027
Pdf URL: https://arxiv.org/pdf/2503.07027
Copy Paste: [[2503.07027]] EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer(https://arxiv.org/abs/2503.07027)
Keywords: robust, diffusion, transformer
Abstract: Recent advancements in Unet-based diffusion models, such as ControlNet and IP-Adapter, have introduced effective spatial and subject control mechanisms. However, the DiT (Diffusion Transformer) architecture still struggles with efficient and flexible control. To tackle this issue, we propose EasyControl, a novel framework designed to unify condition-guided diffusion transformers with high efficiency and flexibility. Our framework is built on three key innovations. First, we introduce a lightweight Condition Injection LoRA Module. This module processes conditional signals in isolation, acting as a plug-and-play solution. It avoids modifying the base model weights, ensuring compatibility with customized models and enabling the flexible injection of diverse conditions. Notably, this module also supports harmonious and robust zero-shot multi-condition generalization, even when trained only on single-condition data. Second, we propose a Position-Aware Training Paradigm. This approach standardizes input conditions to fixed resolutions, allowing the generation of images with arbitrary aspect ratios and flexible resolutions. At the same time, it optimizes computational efficiency, making the framework more practical for real-world applications. Third, we develop a Causal Attention Mechanism combined with the KV Cache technique, adapted for conditional generation tasks. This innovation significantly reduces the latency of image synthesis, improving the overall efficiency of the framework. Through extensive experiments, we demonstrate that EasyControl achieves exceptional performance across various application scenarios. These innovations collectively make our framework highly efficient, flexible, and suitable for a wide range of tasks.

Title: Availability-aware Sensor Fusion via Unified Canonical Space for 4D Radar, LiDAR, and Camera

Authors: Dong-Hee Paek, Seung-Hyun Kong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07029
Pdf URL: https://arxiv.org/pdf/2503.07029
Copy Paste: [[2503.07029]] Availability-aware Sensor Fusion via Unified Canonical Space for 4D Radar, LiDAR, and Camera(https://arxiv.org/abs/2503.07029)
Keywords: robust
Abstract: Sensor fusion of camera, LiDAR, and 4-dimensional (4D) Radar has brought a significant performance improvement in autonomous driving (AD). However, there still exist fundamental challenges: deeply coupled fusion methods assume continuous sensor availability, making them vulnerable to sensor degradation and failure, whereas sensor-wise cross-attention fusion methods struggle with computational cost and unified feature representation. This paper presents availability-aware sensor fusion (ASF), a novel method that employs unified canonical projection (UCP) to enable consistency in all sensor features for fusion and cross-attention across sensors along patches (CASAP) to enhance robustness of sensor fusion against sensor degradation and failure. As a result, the proposed ASF shows object detection performance superior to existing state-of-the-art fusion methods under various weather and sensor degradation (or failure) conditions. Extensive experiments on the K-Radar dataset demonstrate that ASF achieves improvements of 9.7% in AP BEV (87.2%) and 20.1% in AP 3D (73.6%) in object detection at IoU=0.5, while requiring a low computational cost. The code will be available at this https URL.

Title: Multimodal Human-AI Synergy for Medical Imaging Quality Control: A Hybrid Intelligence Framework with Adaptive Dataset Curation and Closed-Loop Evaluation

Authors: Zhi Qin, Qianhui Gui, Mouxiao Bian, Rui Wang, Hong Ge, Dandan Yao, Ziying Sun, Yuan Zhao, Yu Zhang, Hui Shi, Dongdong Wang, Chenxin Song, Shenghong Ju, Lihao Liu, Junjun He, Jie Xu, Yuan-Cheng Wang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2503.07032
Pdf URL: https://arxiv.org/pdf/2503.07032
Copy Paste: [[2503.07032]] Multimodal Human-AI Synergy for Medical Imaging Quality Control: A Hybrid Intelligence Framework with Adaptive Dataset Curation and Closed-Loop Evaluation(https://arxiv.org/abs/2503.07032)
Keywords: large language model
Abstract: Medical imaging quality control (QC) is essential for accurate diagnosis, yet traditional QC methods remain labor-intensive and subjective. To address this challenge, in this study, we establish a standardized dataset and evaluation framework for medical imaging QC, systematically assessing large language models (LLMs) in image quality assessment and report standardization. Specifically, we first constructed and anonymized a dataset of 161 chest X-ray (CXR) radiographs and 219 CT reports for evaluation. Then, multiple LLMs, including Gemini 2.0-Flash, GPT-4o, and DeepSeek-R1, were evaluated based on recall, precision, and F1 score to detect technical errors and inconsistencies. Experimental results show that Gemini 2.0-Flash achieved a Macro F1 score of 90 in CXR tasks, demonstrating strong generalization but limited fine-grained performance. DeepSeek-R1 excelled in CT report auditing with a 62.23% recall rate, outperforming other models. However, its distilled variants performed poorly, while InternLM2.5-7B-chat exhibited the highest additional discovery rate, indicating broader but less precise error detection. These findings highlight the potential of LLMs in medical imaging QC, with DeepSeek-R1 and Gemini 2.0-Flash demonstrating superior performance.
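
Sketch: the abstract reports recall, precision, and a Macro F1 score without spelling out how they relate. The snippet below computes them from per-class counts using the usual definitions, with Macro F1 taken as the unweighted mean of per-class F1. The counts in the usage lines are invented for illustration and are not taken from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_class_counts):
    """per_class_counts: list of (true_pos, false_pos, false_neg), one per class."""
    scores = []
    for tp, fp, fn in per_class_counts:
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(f1_score(precision, recall))
    return sum(scores) / len(scores)


# Two hypothetical error categories: one detected well, one detected poorly.
counts = [(8, 2, 2), (1, 1, 3)]
score = macro_f1(counts)
```

Because every class carries equal weight, a rare error category that is detected poorly lowers Macro F1 as much as a common one would.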

Title: Bot Wars Evolved: Orchestrating Competing LLMs in a Counterstrike Against Phone Scams

Authors: Nardine Basta, Conor Atkins, Dali Kaafar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07036
Pdf URL: https://arxiv.org/pdf/2503.07036
Copy Paste: [[2503.07036]] Bot Wars Evolved: Orchestrating Competing LLMs in a Counterstrike Against Phone Scams(https://arxiv.org/abs/2503.07036)
Keywords: large language model
Abstract: We present "Bot Wars," a framework using Large Language Model (LLM) scam-baiters to counter phone scams through simulated adversarial dialogues. Our key contribution is a formal foundation for strategy emergence through chain-of-thought reasoning without explicit optimization. Through a novel two-layer prompt architecture, our framework enables LLMs to craft demographically authentic victim personas while maintaining strategic coherence. We evaluate our approach using a dataset of 3,200 scam dialogues validated against 179 hours of human scam-baiting interactions, demonstrating its effectiveness in capturing complex adversarial dynamics. Our systematic evaluation through cognitive, quantitative, and content-specific metrics shows that GPT-4 excels in dialogue naturalness and persona authenticity, while DeepSeek demonstrates superior engagement sustainability.

Title: Find your Needle: Small Object Image Retrieval via Multi-Object Attention Optimization

Authors: Mihcael Green, Matan Levy, Issar Tzachor, Dvir Samuel, Nir Darshan, Rami Ben-Ari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07038
Pdf URL: https://arxiv.org/pdf/2503.07038
Copy Paste: [[2503.07038]] Find your Needle: Small Object Image Retrieval via Multi-Object Attention Optimization(https://arxiv.org/abs/2503.07038)
Keywords: extraction
Abstract: We address the challenge of Small Object Image Retrieval (SoIR), where the goal is to retrieve images containing a specific small object in a cluttered scene. The key challenge in this setting is constructing a single image descriptor, for scalable and efficient search, that effectively represents all objects in the image. In this paper, we first analyze the limitations of existing methods on this challenging task and then introduce new benchmarks to support SoIR evaluation. Next, we introduce Multi-object Attention Optimization (MaO), a novel retrieval framework which incorporates a dedicated multi-object pre-training phase. This is followed by a refinement process that leverages attention-based feature extraction with object masks, integrating them into a single unified image descriptor. Our MaO approach significantly outperforms existing retrieval methods and strong baselines, achieving notable improvements in both zero-shot and lightweight multi-object fine-tuning. We hope this work will lay the groundwork and inspire further research to enhance retrieval performance for this highly practical task.

Title: TCM-3CEval: A Triaxial Benchmark for Assessing Responses from Large Language Models in Traditional Chinese Medicine

Authors: Tianai Huang, Lu Lu, Jiayuan Chen, Lihao Liu, Junjun He, Yuping Zhao, Wenchao Tang, Jie Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07041
Pdf URL: https://arxiv.org/pdf/2503.07041
Copy Paste: [[2503.07041]] TCM-3CEval: A Triaxial Benchmark for Assessing Responses from Large Language Models in Traditional Chinese Medicine(https://arxiv.org/abs/2503.07041)
Keywords: large language model
Abstract: Large language models (LLMs) excel in various NLP tasks and modern medicine, but their evaluation in traditional Chinese medicine (TCM) is underexplored. To address this, we introduce TCM-3CEval, a benchmark assessing LLMs in TCM across three dimensions: core knowledge mastery, classical text understanding, and clinical decision-making. We evaluate diverse models, including international (e.g., GPT-4o), Chinese (e.g., InternLM), and medical-specific (e.g., PLUSE). Results show a performance hierarchy: all models have limitations in specialized subdomains like Meridian & Acupoint theory and Various TCM Schools, revealing gaps between current capabilities and clinical needs. Models with Chinese linguistic and cultural priors perform better in classical text interpretation and clinical reasoning. TCM-3CEval sets a standard for AI evaluation in TCM, offering insights for optimizing LLMs in culturally grounded medical domains. The benchmark is available on Medbench's TCM track, aiming to assess LLMs' TCM capabilities in basic knowledge, classic texts, and clinical decision-making through multidimensional questions and real cases.

Title: MambaFlow: A Mamba-Centric Architecture for End-to-End Optical Flow Estimation

Authors: Juntian Du, Yuan Sun, Zhihu Zhou, Pinyi Chen, Runzhe Zhang, Keji Mao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07046
Pdf URL: https://arxiv.org/pdf/2503.07046
Copy Paste: [[2503.07046]] MambaFlow: A Mamba-Centric Architecture for End-to-End Optical Flow Estimation(https://arxiv.org/abs/2503.07046)
Keywords: transformer
Abstract: Optical flow estimation based on deep learning, particularly the recently proposed top-performing methods that incorporate the Transformer, has demonstrated impressive performance, due to the Transformer's powerful global modeling capabilities. However, the quadratic computational complexity of the attention mechanism in Transformers results in time-consuming training and inference. To alleviate these issues, we propose a novel MambaFlow framework that leverages the high accuracy and efficiency of the Mamba architecture to capture features with local correlation while preserving its global information, achieving remarkable performance. To the best of our knowledge, the proposed method is the first Mamba-centric architecture for end-to-end optical flow estimation. It comprises two primary contributed components, both of which are Mamba-centric: a feature enhancement Mamba (FEM) module designed to optimize feature representation quality and a flow propagation Mamba (FPM) module engineered to address occlusion issues by facilitating effective flow information dissemination. Extensive experiments demonstrate that our approach achieves state-of-the-art results, despite encountering occluded regions. On the Sintel benchmark, MambaFlow achieves an EPE all of 1.60, surpassing the leading 1.74 of GMFlow. Additionally, MambaFlow significantly improves inference speed with a runtime of 0.113 seconds, making it 18% faster than GMFlow. The source code will be made publicly available upon acceptance of the paper.
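
Sketch: the abstract quotes EPE values of 1.60 and 1.74 without defining the metric or the size of the gap. The snippet below uses the common definition of end-point error (mean Euclidean distance between predicted and ground-truth flow vectors) and works out the relative reduction implied by the two quoted numbers. The function names and the toy flow vectors are assumptions for illustration; the GMFlow runtime is not given in the abstract, so the speed claim is not recomputed here.

```python
import math


def endpoint_error(pred_flow, gt_flow):
    """Mean Euclidean distance between predicted and ground-truth (u, v) vectors."""
    errors = [
        math.hypot(pu - gu, pv - gv)
        for (pu, pv), (gu, gv) in zip(pred_flow, gt_flow)
    ]
    return sum(errors) / len(errors)


def relative_reduction(baseline, new):
    """Fraction by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline


# Toy example: one pixel is off by a (3, 4) vector, the other is exact.
toy_epe = endpoint_error([(3.0, 4.0), (0.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)])

# EPE values quoted in the abstract: 1.74 (GMFlow) and 1.60 (MambaFlow).
epe_gain = relative_reduction(1.74, 1.60)  # roughly 0.08, i.e. about 8% lower
```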

Title: Recovering Partially Corrupted Major Objects through Tri-modality Based Image Completion

Authors: Yongle Zhang, Yimin Liu, Qiang Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07047
Pdf URL: https://arxiv.org/pdf/2503.07047
Copy Paste: [[2503.07047]] Recovering Partially Corrupted Major Objects through Tri-modality Based Image Completion(https://arxiv.org/abs/2503.07047)
Keywords: diffusion, generative
Abstract: Diffusion models have become widely adopted in image completion tasks, with text prompts commonly employed to ensure semantic coherence by providing high-level guidance. However, a persistent challenge arises when an object is partially obscured in the damaged region, yet its remaining parts are still visible in the background. While text prompts offer semantic direction, they often fail to precisely recover fine-grained structural details, such as the object's overall posture, ensuring alignment with the visible object information in the background. This limitation stems from the inability of text prompts to provide pixel-level specificity. To address this, we propose supplementing text-based guidance with a novel visual aid: a casual sketch, which can be roughly drawn by anyone based on visible object parts. This sketch supplies critical structural cues, enabling the generative model to produce an object structure that seamlessly integrates with the existing background. We introduce the Visual Sketch Self-Aware (VSSA) model, which integrates the casual sketch into each iterative step of the diffusion process, offering distinct advantages for partially corrupted scenarios. By blending sketch-derived features with those of the corrupted image, and leveraging text prompt guidance, the VSSA assists the diffusion model in generating images that preserve both the intended object semantics and structural consistency across the restored objects and original regions. To support this research, we created two datasets, CUB-sketch and MSCOCO-sketch, each combining images, sketches, and text. Extensive qualitative and quantitative experiments demonstrate that our approach outperforms several state-of-the-art methods.

Title: A Failure-Free and Efficient Discrete Laplace Distribution for Differential Privacy in MPC

Authors: Ivan Tjuawinata, Jiabo Wang, Mengmeng Yang, Shanxiang Lyu, Huaxiong Wang, Kwok-Yan Lam
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.07048
Pdf URL: https://arxiv.org/pdf/2503.07048
Copy Paste: [[2503.07048]] A Failure-Free and Efficient Discrete Laplace Distribution for Differential Privacy in MPC(https://arxiv.org/abs/2503.07048)
Keywords: secure, privacy, protect, attack, federate
Abstract: In an MPC-protected distributed computation, although the use of MPC assures data privacy during computation, sensitive information may still be inferred by curious MPC participants from the computation output. This can be observed, for instance, in the inference attacks on either federated learning or a more standard statistical computation with distributed inputs. In this work, we address this output privacy issue by proposing a discrete and bounded Laplace-inspired perturbation mechanism along with a secure realization of this mechanism using MPC. The proposed mechanism strictly adheres to a zero failure probability, overcoming the limitation encountered in other existing bounded and discrete variants of Laplace perturbation. We provide analyses of the proposed differential privacy (DP) perturbation in terms of its privacy and utility. Additionally, we designed MPC protocols to implement this mechanism and presented performance benchmarks based on our experimental setup. The MPC realization of the proposed mechanism exhibits a complexity similar to the state-of-the-art discrete Gaussian mechanism, which can be considered an alternative with comparable efficiency while providing a stronger differential privacy guarantee. Moreover, the efficiency of the proposed scheme can be further enhanced by performing the noise generation offline while leaving the perturbation phase online.
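
Sketch: for orientation, the snippet below samples the textbook unbounded discrete Laplace distribution (probability of k proportional to exp(-|k| / scale)) as the difference of two geometric draws, and uses it to perturb an integer count. It is not the paper's mechanism, which is bounded, failure-free, and realized inside MPC; this version runs in the clear, uses floating-point arithmetic, and is not suitable for production privacy guarantees. The function names and the `scale = sensitivity / epsilon` parameterization are assumptions for illustration.

```python
import math
import random


def sample_geometric(scale, rng):
    """Number of failures before the first success, with failure prob exp(-1/scale)."""
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is always defined
    return math.floor(-scale * math.log(u))


def sample_discrete_laplace(scale, rng):
    """Difference of two i.i.d. geometric draws: symmetric integer noise around 0."""
    return sample_geometric(scale, rng) - sample_geometric(scale, rng)


def perturb_count(true_count, sensitivity, epsilon, rng):
    """Add integer noise with scale sensitivity / epsilon to an integer query result."""
    return true_count + sample_discrete_laplace(sensitivity / epsilon, rng)


rng = random.Random(0)  # fixed seed only so the example is reproducible
noisy = perturb_count(true_count=120, sensitivity=1, epsilon=0.5, rng=rng)
```

The tail of this distribution is unbounded, so a fixed-width integer encoding would have to truncate or reject some draws. That is one plausible reading of the limitation of existing variants that the abstract says its bounded, failure-free design avoids.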

Title: TIDE : Temporal-Aware Sparse Autoencoders for Interpretable Diffusion Transformers in Image Generation

Authors: Victor Shea-Jay Huang, Le Zhuo, Yi Xin, Zhaokai Wang, Peng Gao, Hongsheng Li
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2503.07050
Pdf URL: https://arxiv.org/pdf/2503.07050
Copy Paste: [[2503.07050]] TIDE : Temporal-Aware Sparse Autoencoders for Interpretable Diffusion Transformers in Image Generation(https://arxiv.org/abs/2503.07050)
Keywords: interpretability, diffusion, transformer, generative
Abstract: Diffusion Transformers (DiTs) are a powerful yet underexplored class of generative models compared to U-Net-based diffusion models. To bridge this gap, we introduce TIDE (Temporal-aware Sparse Autoencoders for Interpretable Diffusion transformErs), a novel framework that enhances temporal reconstruction within DiT activation layers across denoising steps. TIDE employs Sparse Autoencoders (SAEs) with a sparse bottleneck layer to extract interpretable and hierarchical features, revealing that diffusion models inherently learn hierarchical features at multiple levels (e.g., 3D, semantic, class) during generative pre-training. Our approach achieves state-of-the-art reconstruction performance, with a mean squared error (MSE) of 1e-3 and a cosine similarity of 0.97, demonstrating superior accuracy in capturing activation dynamics along the denoising trajectory. Beyond interpretability, we showcase TIDE's potential in downstream applications such as sparse activation-guided image editing and style transfer, enabling improved controllability for generative systems. By providing a comprehensive training and evaluation protocol tailored for DiTs, TIDE contributes to developing more interpretable, transparent, and trustworthy generative models.

Title: Generative method for aerodynamic optimization based on classifier-free guided denoising diffusion probabilistic model

Authors: Shisong Deng, Qiang Zhang, Zhengyang Cai
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07056
Pdf URL: https://arxiv.org/pdf/2503.07056
Copy Paste: [[2503.07056]] Generative method for aerodynamic optimization based on classifier-free guided denoising diffusion probabilistic model(https://arxiv.org/abs/2503.07056)
Keywords: diffusion, generative
Abstract: The inverse design approach, which directly generates optimal aerodynamic shapes with neural network models to meet designated performance targets, has drawn enormous attention. However, the current state-of-the-art inverse design approach for airfoils, which is based on a generative adversarial network, demonstrates insufficient precision in its generating and training processes and struggles to reveal the coupling relationship among specified performance indicators. To address these issues, an airfoil inverse design framework based on the classifier-free guided denoising diffusion probabilistic model (CDDPM) is proposed in this paper. First, the CDDPM can effectively capture the correlations among specific performance indicators and, by adjusting the classifier-free guidance coefficient, generate corresponding upper and lower surface pressure coefficient distributions based on designated pressure features. These distributions are then accurately translated into airfoil geometries through a mapping model. Experimental results using classical transonic airfoils as examples show that the inverse design based on CDDPM can generate a variety of pressure coefficient distributions, which enriches the diversity of design results. Compared with current state-of-the-art Wasserstein generative adversarial network methods, CDDPM achieves a 33.6% precision improvement in airfoil generating tasks. Moreover, a practical method to readjust each performance indicator value is proposed, based on a global optimization algorithm in conjunction with an active learning strategy, aiming to provide a rational combination of performance indicator values for the inverse design framework. This work is suitable not only for airfoil design but can also be applied to the optimization of general product parts targeting selected performance indicators.

Title: Breaking the Limits of Quantization-Aware Defenses: QADT-R for Robustness Against Patch-Based Adversarial Attacks in QNNs

Authors: Amira Guesmi, Bassem Ouni, Muhammad Shafique
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07058
Pdf URL: https://arxiv.org/pdf/2503.07058
Copy Paste: [[2503.07058]] Breaking the Limits of Quantization-Aware Defenses: QADT-R for Robustness Against Patch-Based Adversarial Attacks in QNNs(https://arxiv.org/abs/2503.07058)
Keywords: defense, attack, robust
Abstract: Quantized Neural Networks (QNNs) have emerged as a promising solution for reducing model size and computational costs, making them well-suited for deployment in edge and resource-constrained environments. While quantization is known to disrupt gradient propagation and enhance robustness against pixel-level adversarial attacks, its effectiveness against patch-based adversarial attacks remains largely unexplored. In this work, we demonstrate that adversarial patches remain highly transferable across quantized models, achieving over 70% attack success rates (ASR) even at extreme bit-width reductions (e.g., 2-bit). This challenges the common assumption that quantization inherently mitigates adversarial threats. To address this, we propose Quantization-Aware Defense Training with Randomization (QADT-R), a novel defense strategy that integrates Adaptive Quantization-Aware Patch Generation (A-QAPA), Dynamic Bit-Width Training (DBWT), and Gradient-Inconsistent Regularization (GIR) to enhance resilience against highly transferable patch-based attacks. A-QAPA generates adversarial patches within quantized models, ensuring robustness across different bit-widths. DBWT introduces bit-width cycling during training to prevent overfitting to a specific quantization setting, while GIR injects controlled gradient perturbations to disrupt adversarial optimization. Extensive evaluations on CIFAR-10 and ImageNet show that QADT-R reduces ASR by up to 25% compared to prior defenses such as PBAT and DWQ. Our findings further reveal that PBAT-trained models, while effective against seen patch configurations, fail to generalize to unseen patches due to quantization shift. Additionally, our empirical analysis of gradient alignment, spatial sensitivity, and patch visibility provides insights into the mechanisms that contribute to the high transferability of patch-based attacks in QNNs.

Title: Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning

Authors: Huilin Deng, Ding Zou, Rui Ma, Hongchen Luo, Yang Cao, Yu Kang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07065
Pdf URL: https://arxiv.org/pdf/2503.07065
Copy Paste: [[2503.07065]] Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning(https://arxiv.org/abs/2503.07065)
Keywords: large language model
Abstract: While state-of-the-art vision-language models (VLMs) have demonstrated remarkable capabilities in complex visual-text tasks, their success heavily relies on massive model scaling, limiting their practical deployment. Small-scale VLMs offer a more practical alternative but face significant challenges when trained with traditional supervised fine-tuning (SFT), particularly in two aspects: out-of-domain (OOD) generalization and reasoning abilities, which significantly lag behind those of contemporary large language models (LLMs). To address these challenges, we propose Curriculum Reinforcement Finetuning (Curr-ReFT), a novel post-training paradigm specifically designed for small-scale VLMs. Inspired by the success of reinforcement learning in LLMs, Curr-ReFT comprises two sequential stages: (1) Curriculum Reinforcement Learning, which ensures steady progression of model capabilities through difficulty-aware reward design, transitioning from basic visual perception to complex reasoning tasks; and (2) Rejected Sampling-based Self-improvement, which maintains the fundamental capabilities of VLMs through selective learning from high-quality multimodal and language examples. Extensive experiments demonstrate that models trained with the Curr-ReFT paradigm achieve state-of-the-art performance across various visual tasks in both in-domain and out-of-domain settings. Moreover, our Curr-ReFT-enhanced 3B model matches the performance of 32B-parameter models, demonstrating that efficient training paradigms can effectively bridge the gap between small and large models.

Title: You Only Debias Once: Towards Flexible Accuracy-Fairness Trade-offs at Inference Time

Authors: Xiaotian Han, Tianlong Chen, Kaixiong Zhou, Zhimeng Jiang, Zhangyang Wang, Xia Hu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.07066
Pdf URL: https://arxiv.org/pdf/2503.07066
Copy Paste: [[2503.07066]] You Only Debias Once: Towards Flexible Accuracy-Fairness Trade-offs at Inference Time(https://arxiv.org/abs/2503.07066)
Keywords: fair
Abstract: Deep neural networks are prone to various bias issues, jeopardizing their applications for high-stakes decision-making. Existing fairness methods typically offer a fixed accuracy-fairness trade-off, since the weight of the well-trained model is a fixed point (fairness-optimum) in the weight space. Nevertheless, more flexible accuracy-fairness trade-offs at inference time are practically desired since: 1) stakes of the same downstream task can vary for different individuals, and 2) different regions have diverse laws or regulations for fairness. With previous fairness methods, we have to train multiple models, each offering a specific level of accuracy-fairness trade-off. This is often computationally expensive, time-consuming, and difficult to deploy, making it less practical for real-world applications. To address this problem, we propose You Only Debias Once (YODO) to achieve in-situ flexible accuracy-fairness trade-offs at inference time, using a single model that is trained only once. Instead of pursuing one individual fixed point (fairness-optimum) in the weight space, we aim to find a "line" in the weight space that connects the accuracy-optimum and fairness-optimum points using a single model. Points (models) on this line implement varying levels of accuracy-fairness trade-offs. At inference time, by manually selecting a specific position on the learned "line", our proposed method can achieve arbitrary accuracy-fairness trade-offs for different end-users and scenarios. Experimental results on tabular and image datasets show that YODO achieves flexible trade-offs between model accuracy and fairness, at ultra-low overhead. For example, if we need 100 levels of trade-off on the ACS-E dataset, YODO takes 3.53 seconds while training 100 fixed models consumes 425 seconds. The code is available at this https URL.
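The "line" in weight space described in this abstract can be illustrated with a minimal Python sketch. It assumes the line is a straight segment between two flat weight vectors and that a scalar `alpha` in [0, 1] selects the position on it; the function name `interpolate_weights`, the toy weight values, and the five-level grid are hypothetical illustrations, not the authors' implementation. The last line only redoes the arithmetic on the timing figures quoted in the abstract.

```python
def interpolate_weights(w_acc, w_fair, alpha):
    """Return the point at position alpha on the segment from w_acc to w_fair.

    alpha = 0 gives the accuracy-optimum weights, alpha = 1 the
    fairness-optimum weights (an assumed parameterization).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * a + alpha * f for a, f in zip(w_acc, w_fair)]


# Hypothetical toy endpoints standing in for the two optima.
w_acc = [1.0, -2.0, 0.5]
w_fair = [0.0, 2.0, 1.5]

# Five trade-off levels read off the same line, with no retraining.
models = [interpolate_weights(w_acc, w_fair, k / 4) for k in range(5)]

# Ratio of the two timings quoted in the abstract (425 s vs. 3.53 s).
speedup = 425 / 3.53
```

Under these assumptions, each entry of `models` is one trade-off level, and the quoted timings correspond to a speedup of roughly 120 times over training 100 separate models.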

Title: DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs

Authors: Jongwoo Ko, Tianyi Chen, Sungnyun Kim, Tianyu Ding, Luming Liang, Ilya Zharkov, Se-Young Yun
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07067
Pdf URL: https://arxiv.org/pdf/2503.07067
Copy Paste: [[2503.07067]] DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs(https://arxiv.org/abs/2503.07067)
Keywords: large language model
Abstract: Despite the success of distillation in large language models (LLMs), most prior work applies identical loss functions to both teacher- and student-generated data. These strategies overlook the synergy between loss formulations and data types, leading to a suboptimal performance boost in student models. To address this, we propose DistiLLM-2, a contrastive approach that simultaneously increases the likelihood of teacher responses and decreases that of student responses by harnessing this synergy. Our extensive experiments show that DistiLLM-2 not only builds high-performing student models across a wide range of tasks, including instruction-following and code generation, but also supports diverse applications, such as preference alignment and vision-language extensions. These findings highlight the potential of a contrastive approach to enhance the efficacy of LLM distillation by effectively aligning teacher and student models across varied data types.

Title: XR-VLM: Cross-Relationship Modeling with Multi-part Prompts and Visual Features for Fine-Grained Recognition

Authors: Chuanming Wang, Henming Mao, Huanhuan Zhang, Huiyuan Fu, Huadong Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07075
Pdf URL: https://arxiv.org/pdf/2503.07075
Copy Paste: [[2503.07075]] XR-VLM: Cross-Relationship Modeling with Multi-part Prompts and Visual Features for Fine-Grained Recognition(https://arxiv.org/abs/2503.07075)
Keywords: extraction
Abstract: Vision-Language Models (VLMs) have demonstrated impressive performance on various visual tasks, yet they still require adaptation on downstream tasks to achieve optimal performance. Recently, various adaptation technologies have been proposed, but we observe they often underperform in fine-grained visual recognition, which requires models to capture subtle yet discriminative features to distinguish similar sub-categories. Current adaptation methods typically rely on an alignment-based prediction framework, i.e., the visual feature is compared with each class prompt for similarity calculation as the final prediction, which lacks class interaction during the forward pass. Besides, learning a single uni-modal feature further restricts the model's expressive capacity. Therefore, we propose a novel mechanism, XR-VLM, to discover subtle differences by modeling cross-relationships, which specifically excels in scenarios involving multiple features. Our method introduces a unified multi-part visual feature extraction module designed to seamlessly integrate with the diverse backbones inherent in VLMs. Additionally, we develop a multi-part prompt learning module to capture multi-perspective descriptions of sub-categories. To further enhance discriminative capability, we propose a cross-relationship modeling pattern that combines visual features with all class prompt features, enabling a deeper exploration of the relationships between these two modalities. Extensive experiments have been conducted on various fine-grained datasets, and the results demonstrate that our method achieves significant improvements compared to current state-of-the-art approaches. Code will be released.

Title: Linguistic Knowledge Transfer Learning for Speech Enhancement

Authors: Kuo-Hsuan Hung, Xugang Lu, Szu-Wei Fu, Huan-Hsin Tseng, Hsin-Yi Lin, Chii-Wann Lin, Yu Tsao
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2503.07078
Pdf URL: https://arxiv.org/pdf/2503.07078
Copy Paste: [[2503.07078]] Linguistic Knowledge Transfer Learning for Speech Enhancement(https://arxiv.org/abs/2503.07078)
Keywords: robust, large language model
Abstract: Linguistic knowledge plays a crucial role in spoken language comprehension. It provides essential semantic and syntactic context for speech perception in noisy environments. However, most speech enhancement (SE) methods predominantly rely on acoustic features to learn the mapping relationship between noisy and clean speech, with limited exploration of linguistic integration. While text-informed SE approaches have been investigated, they often require explicit speech-text alignment or externally provided textual data, constraining their practicality in real-world scenarios. Additionally, using text as input poses challenges in aligning linguistic and acoustic representations due to their inherent differences. In this study, we propose the Cross-Modality Knowledge Transfer (CMKT) learning framework, which leverages pre-trained large language models (LLMs) to infuse linguistic knowledge into SE models without requiring text input or LLMs during inference. Furthermore, we introduce a misalignment strategy to improve knowledge transfer. This strategy applies controlled temporal shifts, encouraging the model to learn more robust representations. Experimental evaluations demonstrate that CMKT consistently outperforms baseline models across various SE architectures and LLM embeddings, highlighting its adaptability to different configurations. Additionally, results on Mandarin and English datasets confirm its effectiveness across diverse linguistic conditions, further validating its robustness. Moreover, CMKT remains effective even in scenarios without textual data, underscoring its practicality for real-world applications. By bridging the gap between linguistic and acoustic modalities, CMKT offers a scalable and innovative solution for integrating linguistic knowledge into SE models, leading to substantial improvements in both intelligibility and enhancement performance.

Title: On the Generalization of Representation Uncertainty in Earth Observation

Authors: Spyros Kondylatos, Nikolaos Ioannis Bountos, Dimitrios Michail, Xiao Xiang Zhu, Gustau Camps-Valls, Ioannis Papoutsis
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07082
Pdf URL: https://arxiv.org/pdf/2503.07082
Copy Paste: [[2503.07082]] On the Generalization of Representation Uncertainty in Earth Observation(https://arxiv.org/abs/2503.07082)
Keywords: segmentation
Abstract: Recent advances in Computer Vision have introduced the concept of pretrained representation uncertainty, enabling zero-shot uncertainty estimation. This holds significant potential for Earth Observation (EO), where trustworthiness is critical, yet the complexity of EO data poses challenges to uncertainty-aware methods. In this work, we investigate the generalization of representation uncertainty in EO, considering the domain's unique semantic characteristics. We pretrain uncertainties on large EO datasets and propose an evaluation framework to assess their zero-shot performance in multi-label classification and segmentation EO tasks. Our findings reveal that, unlike uncertainties pretrained on natural images, EO-pretraining exhibits strong generalization across unseen EO domains, geographic locations, and target granularities, while maintaining sensitivity to variations in ground sampling distance. We demonstrate the practical utility of pretrained uncertainties showcasing their alignment with task-specific uncertainties in downstream tasks, their sensitivity to real-world EO image noise, and their ability to generate spatial uncertainty estimates out-of-the-box. Initiating the discussion on representation uncertainty in EO, our study provides insights into its strengths and limitations, paving the way for future research in the field. Code and weights are available at: this https URL.

Title: A Novel Ophthalmic Benchmark for Evaluating Multimodal Large Language Models with Fundus Photographs and OCT Images

Authors: Xiaoyi Liang, Mouxiao Bian, Moxin Chen, Lihao Liu, Junjun He, Jie Xu, Lin Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07094
Pdf URL: https://arxiv.org/pdf/2503.07094
Copy Paste: [[2503.07094]] A Novel Ophthalmic Benchmark for Evaluating Multimodal Large Language Models with Fundus Photographs and OCT Images(https://arxiv.org/abs/2503.07094)
Keywords: large language model
Abstract: In recent years, large language models (LLMs) have demonstrated remarkable potential across various medical applications. Building on this foundation, multimodal large language models (MLLMs) integrate LLMs with visual models to process diverse inputs, including clinical data and medical images. In ophthalmology, LLMs have been explored for analyzing optical coherence tomography (OCT) reports, assisting in disease classification, and even predicting treatment outcomes. However, existing MLLM benchmarks often fail to capture the complexities of real-world clinical practice, particularly in the analysis of OCT images. Many suffer from limitations such as small sample sizes, a lack of diverse OCT datasets, and insufficient expert validation. These shortcomings hinder the accurate assessment of MLLMs' ability to interpret OCT scans and their broader applicability in ophthalmology. Our dataset, curated through rigorous quality control and expert annotation, consists of 439 fundus images and 75 OCT images. Using a standardized API-based framework, we assessed seven mainstream MLLMs and observed significant variability in diagnostic accuracy across different diseases. While some models performed well in diagnosing conditions such as diabetic retinopathy and age-related macular degeneration, they struggled with others, including choroidal neovascularization and myopia, highlighting inconsistencies in performance and the need for further refinement. Our findings emphasize the importance of developing clinically relevant benchmarks to provide a more accurate assessment of MLLMs' capabilities. By refining these models and expanding their scope, we can enhance their potential to transform ophthalmic diagnosis and treatment.

Title: OmniSAM: Omnidirectional Segment Anything Model for UDA in Panoramic Semantic Segmentation

Authors: Ding Zhong, Xu Zheng, Chenfei Liao, Yuanhuiyi Lyu, Jialei Chen, Shengyang Wu, Linfeng Zhang, Xuming Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07098
Pdf URL: https://arxiv.org/pdf/2503.07098
Copy Paste: [[2503.07098]] OmniSAM: Omnidirectional Segment Anything Model for UDA in Panoramic Semantic Segmentation(https://arxiv.org/abs/2503.07098)
Keywords: segmentation
Abstract: Segment Anything Model 2 (SAM2) has emerged as a strong base model in various pinhole imaging segmentation tasks. However, when applying it to the 360° domain, the significant field-of-view (FoV) gap between pinhole (70° × 70°) and panoramic images (180° × 360°) poses unique challenges. Two major concerns for this application include 1) inevitable distortion and object deformation brought by the large FoV disparity between domains; 2) the lack of pixel-level semantic understanding that the original SAM2 cannot provide. To address these issues, we propose a novel OmniSAM framework, which makes the first attempt to apply SAM2 for panoramic semantic segmentation. Specifically, to bridge the first gap, OmniSAM first divides the panorama into sequences of patches. These patches are then treated as image sequences in a manner similar to video segmentation tasks. We then leverage SAM2's memory mechanism to extract cross-patch correspondences that embed the cross-FoV dependencies, improving feature continuity and prediction consistency along mask boundaries. For the second gap, OmniSAM fine-tunes the pretrained image encoder and reuses the mask decoder for semantic prediction. An FoV-based prototypical adaptation module with a dynamic pseudo-label update mechanism is also introduced to facilitate the alignment of memory and backbone features, thereby improving model generalization ability across different sizes of source models. Extensive experimental results demonstrate that OmniSAM outperforms the state-of-the-art methods by large margins, e.g., 79.06% (+10.22%) on SPin8-to-SPan8, 62.46% (+6.58%) on CS13-to-DP13.
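The step of dividing a panorama into a sequence of patches can be pictured with a small Python sketch. The abstract does not give the patch geometry, so everything here is an assumption for illustration: the helper name `panorama_windows`, the use of fixed-width horizontal windows with a fixed stride, and the wrap-around at the left/right seam of the panorama.

```python
def panorama_windows(width, patch_width, stride):
    """List the column indices of each horizontal window over a panorama.

    Windows start every `stride` columns and wrap around the seam, since
    the left and right edges of a 360-degree image are adjacent
    (wrap-around is an assumption of this sketch).
    """
    if stride <= 0 or not 0 < patch_width <= width:
        raise ValueError("need stride > 0 and 0 < patch_width <= width")
    return [
        [(start + offset) % width for offset in range(patch_width)]
        for start in range(0, width, stride)
    ]


# Toy example: an 8-column panorama, 4-column patches, stride 2.
windows = panorama_windows(width=8, patch_width=4, stride=2)
```

Consecutive windows overlap by `patch_width - stride` columns, which is what lets an ordered sequence of patches be handled like neighbouring video frames.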

Title: Explainable Android Malware Detection and Malicious Code Localization Using Graph Attention

Authors: Merve Cigdem Ipek, Sevil Sen
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.07109
Pdf URL: https://arxiv.org/pdf/2503.07109
Copy Paste: [[2503.07109]] Explainable Android Malware Detection and Malicious Code Localization Using Graph Attention(https://arxiv.org/abs/2503.07109)
Keywords: security, interpretability
Abstract: With the escalating threat of malware, particularly on mobile devices, the demand for effective analysis methods has never been higher. While existing security solutions, including AI-based approaches, offer promise, their lack of transparency constrains the understanding of detected threats. Manual analysis remains time-consuming and reliant on scarce expertise. To address these challenges, we propose a novel approach called XAIDroid that leverages graph neural networks (GNNs) and graph attention mechanisms for automatically locating malicious code snippets within malware. By representing code as API call graphs, XAIDroid captures semantic context and enhances resilience against obfuscation. Utilizing the Graph Attention Model (GAM) and Graph Attention Network (GAT), we assign importance scores to API nodes, facilitating focused attention on critical information for malicious code localization. Evaluation on synthetic and real-world malware datasets demonstrates the efficacy of our approach, achieving high recall and F1-score rates for malicious code localization. The successful implementation of automatic malicious code localization enhances the scalability, interpretability, and reliability of malware analysis.

Title: Exposure Bias Reduction for Enhancing Diffusion Transformer Feature Caching

Authors: Zhen Zou, Hu Yu, Jie Xiao, Feng Zhao
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07120
Pdf URL: https://arxiv.org/pdf/2503.07120
Copy Paste: [[2503.07120]] Exposure Bias Reduction for Enhancing Diffusion Transformer Feature Caching(https://arxiv.org/abs/2503.07120)
Keywords: diffusion, transformer
Abstract: Diffusion Transformer (DiT) has exhibited impressive generation capabilities but faces great challenges due to its high computational complexity. To address this problem, various methods, notably feature caching, have been introduced. However, these approaches focus on aligning with non-cache diffusion without analyzing the impact of caching on the generation of intermediate processes. This lack of exploration leaves room for analysis and improvement. In this paper, we analyze the impact of caching on the SNR of the diffusion process and discern that feature caching intensifies the denoising procedure, and we further identify this as a more severe exposure bias issue. Drawing on this insight, we introduce EB-Cache, a joint cache strategy that aligns with the non-exposure-bias diffusion process (which gives us a higher performance ceiling). Our approach incorporates a comprehensive understanding of caching mechanisms and offers a novel perspective on leveraging caches to expedite diffusion processes. Empirical results indicate that EB-Cache optimizes model performance while concurrently facilitating acceleration. Specifically, in the 50-step generation process, EB-Cache achieves 1.49× acceleration with 0.63 FID reduction from 3.69, surpassing prior acceleration methods. Code will be available at this https URL.

Title: A Light Perspective for 3D Object Detection

Authors: Marcelo Eduardo Pederiva, José Mario De Martino, Alessandro Zimmer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07133
Pdf URL: https://arxiv.org/pdf/2503.07133
Copy Paste: [[2503.07133]] A Light Perspective for 3D Object Detection(https://arxiv.org/abs/2503.07133)
Keywords: extraction
Abstract: Comprehending the environment and accurately detecting objects in 3D space are essential for advancing autonomous vehicle technologies. Integrating Camera and LIDAR data has emerged as an effective approach for achieving high accuracy in 3D Object Detection models. However, existing methodologies often rely on heavy, traditional backbones that are computationally demanding. This paper introduces a novel approach that incorporates cutting-edge Deep Learning techniques into the feature extraction process, aiming to create more efficient models without compromising performance. Our model, NextBEV, surpasses established feature extractors like ResNet50 and MobileNetV2. On the KITTI 3D Monocular detection benchmark, NextBEV achieves an accuracy improvement of 2.39%, having less than 10% of the MobileNetV3 parameters. Moreover, we propose changes in LIDAR backbones that decreased the original inference time to 10 ms. Additionally, by fusing these lightweight proposals, we have enhanced the accuracy of the VoxelNet-based model by 2.93% and improved the F1-score of the PointPillar-based model by approximately 20%. Therefore, this work contributes to establishing lightweight and powerful models for individual or fusion techniques, making them more suitable for onboard implementations.

Title: Application of Multiple Chain-of-Thought in Contrastive Reasoning for Implicit Sentiment Analysis

Authors: Liwei Yang, Xinying Wang, Xiaotang Zhou, Zhengchao Wu, Ningning Tan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07140
Pdf URL: https://arxiv.org/pdf/2503.07140
Copy Paste: [[2503.07140]] Application of Multiple Chain-of-Thought in Contrastive Reasoning for Implicit Sentiment Analysis(https://arxiv.org/abs/2503.07140)
Keywords: large language model
Abstract: Implicit sentiment analysis aims to uncover emotions that are subtly expressed, often obscured by ambiguity and figurative language. To accomplish this task, large language models and multi-step reasoning are needed to identify those sentiments that are not explicitly stated. In this study, we propose a novel Dual Reverse Chain Reasoning (DRCR) framework to enhance the performance of implicit sentiment analysis. Inspired by deductive reasoning, the framework consists of three key steps: 1) hypothesize an emotional polarity and derive a reasoning process, 2) negate the initial hypothesis and derive a new reasoning process, and 3) contrast the two reasoning paths to deduce the final sentiment polarity. Building on this, we also introduce a Triple Reverse Chain Reasoning (TRCR) framework to address the limitations of random hypotheses. Both methods combine contrastive mechanisms and multi-step reasoning, significantly improving the accuracy of implicit sentiment classification. Experimental results demonstrate that both approaches outperform existing methods across various model scales, achieving state-of-the-art performance. This validates the effectiveness of combining contrastive reasoning and multi-step reasoning for implicit sentiment analysis.
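The three DRCR steps listed in this abstract can be sketched in Python as three calls to a language model. This is only an illustration under stated assumptions: `llm` is a hypothetical callable that maps a prompt string to a response string, the prompt wording is invented, polarity is restricted to positive/negative, and `stub_llm` is a canned stand-in so the sketch runs without any model.

```python
from typing import Callable


def drcr(text: str, llm: Callable[[str], str], hypothesis: str = "positive") -> str:
    """Run the three steps named in the abstract with a hypothetical `llm`."""
    negated = "negative" if hypothesis == "positive" else "positive"
    # Step 1: hypothesize a polarity and derive a reasoning process.
    chain_a = llm(f"Assume the sentiment of '{text}' is {hypothesis}. Explain why.")
    # Step 2: negate the hypothesis and derive a new reasoning process.
    chain_b = llm(f"Assume the sentiment of '{text}' is {negated}. Explain why.")
    # Step 3: contrast the two reasoning paths to decide the final polarity.
    verdict = llm(
        "Compare the two reasoning chains and answer with one word, "
        f"positive or negative.\nA: {chain_a}\nB: {chain_b}"
    )
    return verdict.strip().lower()


def stub_llm(prompt: str) -> str:
    """Canned responses standing in for a real model (illustration only)."""
    if prompt.startswith("Compare"):
        return " Negative "
    return "stub reasoning"


label = drcr("The soup arrived cold.", stub_llm)
```

Because the initial hypothesis is a parameter, the same function can be run from either starting polarity, which is one simple way to probe how sensitive the final label is to that initial choice.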

Title: MRCEval: A Comprehensive, Challenging and Accessible Machine Reading Comprehension Benchmark

Authors: Shengkun Ma, Hao Peng, Lei Hou, Juanzi Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07144
Pdf URL: https://arxiv.org/pdf/2503.07144
Copy Paste: [[2503.07144]] MRCEval: A Comprehensive, Challenging and Accessible Machine Reading Comprehension Benchmark(https://arxiv.org/abs/2503.07144)
Keywords: large language model
Abstract: Machine Reading Comprehension (MRC) is an essential task in evaluating natural language understanding. Existing MRC datasets primarily assess specific aspects of reading comprehension (RC), and a comprehensive MRC benchmark is still lacking. To fill this gap, we first introduce a novel taxonomy that categorizes the key capabilities required for RC. Based on this taxonomy, we construct MRCEval, an MRC benchmark that leverages advanced Large Language Models (LLMs) as both sample generators and selection judges. MRCEval is a comprehensive, challenging and accessible benchmark designed to assess the RC capabilities of LLMs thoroughly, covering 13 distinct RC skills with a total of 2.1K high-quality multi-choice questions. We perform an extensive evaluation of 28 widely used open-source and proprietary models, highlighting that MRC continues to present significant challenges even in the era of LLMs.

Title: Controllable 3D Outdoor Scene Generation via Scene Graphs

Authors: Yuheng Liu, Xinke Li, Yuning Zhang, Lu Qi, Xin Li, Wenping Wang, Chongshou Li, Xueting Li, Ming-Hsuan Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07152
Pdf URL: https://arxiv.org/pdf/2503.07152
Copy Paste: [[2503.07152]] Controllable 3D Outdoor Scene Generation via Scene Graphs(https://arxiv.org/abs/2503.07152)
Keywords: diffusion
Abstract: Three-dimensional scene generation is crucial in computer vision, with applications spanning autonomous driving, gaming and the metaverse. Current methods either lack user control or rely on imprecise, non-intuitive conditions. In this work, we propose a method that uses scene graphs, an accessible, user-friendly control format, to generate outdoor 3D scenes. We develop an interactive system that transforms a sparse scene graph into a dense BEV (Bird's Eye View) Embedding Map, which guides a conditional diffusion model to generate 3D scenes that match the scene graph description. During inference, users can easily create or modify scene graphs to generate large-scale outdoor scenes. We create a large-scale dataset with paired scene graphs and 3D semantic scenes to train the BEV embedding and diffusion models. Experimental results show that our approach consistently produces high-quality 3D urban scenes closely aligned with the input scene graphs. To the best of our knowledge, this is the first approach to generate 3D outdoor scenes conditioned on scene graphs.

Title: Ideas in Inference-time Scaling can Benefit Generative Pre-training Algorithms

Authors: Jiaming Song, Linqi Zhou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07154
Pdf URL: https://arxiv.org/pdf/2503.07154
Copy Paste: [[2503.07154]] Ideas in Inference-time Scaling can Benefit Generative Pre-training Algorithms(https://arxiv.org/abs/2503.07154)
Keywords: diffusion, generative
Abstract: Recent years have seen significant advancements in foundation models through generative pre-training, yet algorithmic innovation in this space has largely stagnated around autoregressive models for discrete signals and diffusion models for continuous signals. This stagnation creates a bottleneck that prevents us from fully unlocking the potential of rich multi-modal data, which in turn limits the progress on multimodal intelligence. We argue that an inference-first perspective, which prioritizes scaling efficiency during inference time across sequence length and refinement steps, can inspire novel generative pre-training algorithms. Using Inductive Moment Matching (IMM) as a concrete example, we demonstrate how addressing limitations in diffusion models' inference process through targeted modifications yields a stable, single-stage algorithm that achieves superior sample quality with over an order of magnitude greater inference efficiency.

Title: MIRAM: Masked Image Reconstruction Across Multiple Scales for Breast Lesion Risk Prediction

Authors: Hung Q. Vo, Pengyu Yuan, Zheng Yin, Kelvin K. Wong, Chika F. Ezeana, Son T. Ly, Stephen T.C. Wong, Hien V. Nguyen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07157
Pdf URL: https://arxiv.org/pdf/2503.07157
Copy Paste: [[2503.07157]] MIRAM: Masked Image Reconstruction Across Multiple Scales for Breast Lesion Risk Prediction(https://arxiv.org/abs/2503.07157)
Keywords: robust, segmentation
Abstract: Self-supervised learning (SSL) has garnered substantial interest within the machine learning and computer vision communities. Two prominent approaches in SSL include contrastive-based learning and self-distillation utilizing cropping augmentation. Lately, masked image modeling (MIM) has emerged as a more potent SSL technique, employing image inpainting as a pretext task. MIM creates a strong inductive bias toward meaningful spatial and semantic understanding. This has opened up new opportunities for SSL to contribute not only to classification tasks but also to more complex applications like object detection and image segmentation. Building upon this progress, our research paper introduces a scalable and practical SSL approach centered around more challenging pretext tasks that facilitate the acquisition of robust features. Specifically, we leverage multi-scale image reconstruction from randomly masked input images as the foundation for feature learning. Our hypothesis posits that reconstructing high-resolution images enables the model to attend to finer spatial details, particularly beneficial for discerning subtle intricacies within medical images. The proposed SSL features help improve classification performance on the Curated Breast Imaging Subset of Digital Database for Screening Mammography (CBIS-DDSM) dataset. In pathology classification, our method demonstrates a 3% increase in average precision (AP) and a 1% increase in the area under the receiver operating characteristic curve (AUC) when compared to state-of-the-art (SOTA) algorithms. Moreover, in mass margins classification, our approach achieves a 4% increase in AP and a 2% increase in AUC.

Title: Temporal Overlapping Prediction: A Self-supervised Pre-training Method for LiDAR Moving Object Segmentation

Authors: Ziliang Miao, Runjian Chen, Yixi Cai, Buwei He, Wenquan Zhao, Wenqi Shao, Bo Zhang, Fu Zhang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.07167
Pdf URL: https://arxiv.org/pdf/2503.07167
Copy Paste: [[2503.07167]] Temporal Overlapping Prediction: A Self-supervised Pre-training Method for LiDAR Moving Object Segmentation(https://arxiv.org/abs/2503.07167)
Keywords: segmentation
Abstract: Moving object segmentation (MOS) on LiDAR point clouds is crucial for autonomous systems like self-driving vehicles. Previous supervised approaches rely heavily on costly manual annotations, while LiDAR sequences naturally capture temporal motion cues that can be leveraged for self-supervised learning. In this paper, we propose Temporal Overlapping Prediction (TOP), a self-supervised pre-training method that alleviates the labeling burden for MOS. TOP explores the temporal overlapping points that are commonly observed by current and adjacent scans, and learns spatiotemporal representations by predicting the occupancy states of temporal overlapping points. Moreover, we utilize current occupancy reconstruction as an auxiliary pre-training objective, which enhances the current structural awareness of the model. We conduct extensive experiments and observe that the conventional metric Intersection-over-Union (IoU) shows strong bias to objects with more scanned points, which might neglect small or distant objects. To compensate for this bias, we introduce an additional metric called mIoU_obj to evaluate object-level performance. Experiments on nuScenes and SemanticKITTI show that TOP outperforms both the supervised training-from-scratch baseline and other self-supervised pre-training baselines by up to 28.77% relative improvement, demonstrating strong transferability across LiDAR setups and generalization to other tasks. Code and pre-trained models will be publicly available upon publication.
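The IoU bias mentioned in this abstract can be illustrated with a toy calculation. The abstract does not define the object-level metric, so the per-object average below is only an assumed stand-in, and the point counts are invented for the example. Attributing false positives to individual objects is also a simplification.

```python
def iou(tp: int, fp: int, fn: int) -> float:
    """Intersection-over-Union from true positives, false positives, false negatives."""
    denom = tp + fp + fn
    return tp / denom if denom else 1.0


def pooled_iou(objects):
    """Point-level IoU: pool the counts of all objects before dividing."""
    tp = sum(o["tp"] for o in objects)
    fp = sum(o["fp"] for o in objects)
    fn = sum(o["fn"] for o in objects)
    return iou(tp, fp, fn)


def per_object_mean_iou(objects):
    """Assumed object-level variant: compute IoU per object, then average."""
    return sum(iou(o["tp"], o["fp"], o["fn"]) for o in objects) / len(objects)


# Hypothetical scene: a nearby car with many LiDAR points, mostly segmented correctly,
# and a distant pedestrian with few points, mostly missed.
scene = [
    {"name": "near_car", "tp": 950, "fp": 0, "fn": 50},
    {"name": "far_pedestrian", "tp": 2, "fp": 0, "fn": 18},
]

print(f"pooled IoU:      {pooled_iou(scene):.3f}")
print(f"per-object mean: {per_object_mean_iou(scene):.3f}")
```

In this toy scene the pooled score stays high because the car contributes almost all of the points, while the per-object average exposes the missed pedestrian.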

Title: Strategies for political-statement segmentation and labelling in unstructured text

Authors: Dmitry Nikolaev, Sean Papay
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07179
Pdf URL: https://arxiv.org/pdf/2503.07179
Copy Paste: [[2503.07179]] Strategies for political-statement segmentation and labelling in unstructured text(https://arxiv.org/abs/2503.07179)
Keywords: segmentation
Abstract: Analysis of parliamentary speeches and political-party manifestos has become an integral area of computational study of political texts. While speeches have been overwhelmingly analysed using unsupervised methods, a large corpus of manifestos with by-statement political-stance labels has been created by the participants of the MARPOR project. It has been recently shown that these labels can be predicted by a neural model; however, the current approach relies on provided statement boundaries, limiting out-of-domain applicability. In this work, we propose and test a range of unified split-and-label frameworks -- based on linear-chain CRFs, fine-tuned text-to-text models, and the combination of in-context learning with constrained decoding -- that can be used to jointly segment and classify statements from raw textual data. We show that our approaches achieve competitive accuracy when applied to raw text of political manifestos, and then demonstrate the research potential of our method by applying it to the records of the UK House of Commons and tracing the political trajectories of four major parties in the last three decades.

Title: Contextual Cues in Machine Translation: Investigating the Potential of Multi-Source Input Strategies in LLMs and NMT Systems

Authors: Lia Shahnazaryan, Patrick Simianer, Joern Wuebker
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07195
Pdf URL: https://arxiv.org/pdf/2503.07195
Copy Paste: [[2503.07195]] Contextual Cues in Machine Translation: Investigating the Potential of Multi-Source Input Strategies in LLMs and NMT Systems(https://arxiv.org/abs/2503.07195)
Keywords: large language model
Abstract: We explore the impact of multi-source input strategies on machine translation (MT) quality, comparing GPT-4o, a large language model (LLM), with a traditional multilingual neural machine translation (NMT) system. Using intermediate language translations as contextual cues, we evaluate their effectiveness in enhancing English and Chinese translations into Portuguese. Results suggest that contextual information significantly improves translation quality for domain-specific datasets and potentially for linguistically distant language pairs, with diminishing returns observed in benchmarks with high linguistic variability. Additionally, we demonstrate that shallow fusion, a multi-source approach we apply within the NMT system, shows improved results when using high-resource languages as context for other translation pairs, highlighting the importance of strategic context language selection.

Title: QKD-KEM: Hybrid QKD Integration into TLS with OpenSSL Providers

Authors: Javier Blanco-Romero, Pedro Otero García, Daniel Sobral-Blanco, Florina Almenares Mendoza, Ana Fernández Vilas, Rebeca P. Díaz-Redondo
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.07196
Pdf URL: https://arxiv.org/pdf/2503.07196
Copy Paste: [[2503.07196]] QKD-KEM: Hybrid QKD Integration into TLS with OpenSSL Providers(https://arxiv.org/abs/2503.07196)
Keywords: security, robust
Abstract: Quantum Key Distribution (QKD) promises information-theoretic security, yet integrating QKD into existing protocols like TLS remains challenging due to its fundamentally different operational model. In this paper, we propose a hybrid QKD-KEM protocol with two distinct integration approaches: a client-initiated flow compatible with both ETSI 004 and 014 specifications, and a server-initiated flow similar to existing work but limited to stateless ETSI 014 APIs. Unlike previous implementations, our work specifically addresses the integration of stateful QKD key exchange protocols (ETSI 004), which is essential for production QKD networks but has remained largely unexplored. By adapting OpenSSL's provider infrastructure to accommodate QKD's pre-distributed key model, we maintain compatibility with current TLS implementations while offering dual layers of security. Performance evaluations demonstrate the feasibility of our hybrid scheme with acceptable overhead, showing that robust security against quantum threats is achievable while addressing the unique requirements of different QKD API specifications.

Title: Effective and Efficient Masked Image Generation Models

Authors: Zebin You, Jingyang Ou, Xiaolu Zhang, Jun Hu, Jun Zhou, Chongxuan Li
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07197
Pdf URL: https://arxiv.org/pdf/2503.07197
Copy Paste: [[2503.07197]] Effective and Efficient Masked Image Generation Models(https://arxiv.org/abs/2503.07197)
Keywords: diffusion
Abstract: Although masked image generation models and masked diffusion models are designed with different motivations and objectives, we observe that they can be unified within a single framework. Building upon this insight, we carefully explore the design space of training and sampling, identifying key factors that contribute to both performance and efficiency. Based on the improvements observed during this exploration, we develop our model, referred to as eMIGM. Empirically, eMIGM demonstrates strong performance on ImageNet generation, as measured by Fréchet Inception Distance (FID). In particular, on ImageNet 256x256, with a similar number of function evaluations (NFEs) and model parameters, eMIGM outperforms the seminal VAR. Moreover, as NFE and model parameters increase, eMIGM achieves performance comparable to the state-of-the-art continuous diffusion models while requiring less than 40% of the NFE. Additionally, on ImageNet 512x512, with only about 60% of the NFE, eMIGM outperforms the state-of-the-art continuous diffusion models.

Title: How Well Can Differential Privacy Be Audited in One Run?

Authors: Amit Keinan, Moshe Shenfeld, Katrina Ligett
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2503.07199
Pdf URL: https://arxiv.org/pdf/2503.07199
Copy Paste: [[2503.07199]] How Well Can Differential Privacy Be Audited in One Run?(https://arxiv.org/abs/2503.07199)
Keywords: privacy
Abstract: Recent methods for auditing the privacy of machine learning algorithms have improved computational efficiency by simultaneously intervening on multiple training examples in a single training run. Steinke et al. (2024) prove that one-run auditing indeed lower bounds the true privacy parameter of the audited algorithm, and give impressive empirical results. Their work leaves open the question of how precisely one-run auditing can uncover the true privacy parameter of an algorithm, and how that precision depends on the audited algorithm. In this work, we characterize the maximum achievable efficacy of one-run auditing and show that one-run auditing can only perfectly uncover the true privacy parameters of algorithms whose structure allows the effects of individual data elements to be isolated. Our characterization helps reveal how and when one-run auditing is still a promising technique for auditing real machine learning algorithms, despite these fundamental gaps.

Title: A Formally Verified Lightning Network

Authors: Grzegorz Fabiański, Rafał Stefański, Orfeas Stefanos Thyfronitis Litos
Subjects: cs.CR, cs.LO
Abstract URL: https://arxiv.org/abs/2503.07200
Pdf URL: https://arxiv.org/pdf/2503.07200
Copy Paste: [[2503.07200]] A Formally Verified Lightning Network(https://arxiv.org/abs/2503.07200)
Keywords: security
Abstract: In this work we use formal verification to prove that the Lightning Network (LN), the most prominent scaling technique for Bitcoin, always safeguards the funds of honest users. We provide a custom implementation of (a simplification of) LN, express the desired security goals and, for the first time, we provide a machine checkable proof that they are upheld under every scenario, all in an integrated fashion. We build our system using the Why3 platform.

Title: Synthetic Lung X-ray Generation through Cross-Attention and Affinity Transformation

Authors: Ruochen Pi, Lianlei Shan
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07209
Pdf URL: https://arxiv.org/pdf/2503.07209
Copy Paste: [[2503.07209]] Synthetic Lung X-ray Generation through Cross-Attention and Affinity Transformation(https://arxiv.org/abs/2503.07209)
Keywords: diffusion, segmentation
Abstract: Collecting and annotating medical images is a time-consuming and resource-intensive task. However, generating synthetic data through models such as Diffusion offers a cost-effective alternative. This paper introduces a new method for the automatic generation of accurate semantic masks from synthetic lung X-ray images based on a stable diffusion model trained on text-image pairs. This method uses cross-attention mapping between text and image to extend text-driven image synthesis to semantic mask generation. It employs text-guided cross-attention information to identify specific areas in an image and combines this with innovative techniques to produce high-resolution, class-differentiated pixel masks. This approach significantly reduces the costs associated with data collection and annotation. The experimental results demonstrate that segmentation models trained on synthetic data generated using the method are comparable to, and in some cases even better than, models trained on real datasets. This shows the effectiveness of the method and its potential to revolutionize medical image analysis.

Title: FedRand: Enhancing Privacy in Federated Learning with Randomized LoRA Subparameter Updates

Authors: Sangwoo Park, Seanie Lee, Byungjoo Kim, Sung Ju Hwang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.07216
Pdf URL: https://arxiv.org/pdf/2503.07216
Copy Paste: [[2503.07216]] FedRand: Enhancing Privacy in Federated Learning with Randomized LoRA Subparameter Updates(https://arxiv.org/abs/2503.07216)
Keywords: privacy, attack, robust, membership infer, federate
Abstract: Federated Learning (FL) is a widely used framework for training models in a decentralized manner, ensuring that the central server does not have direct access to data from local clients. However, this approach may still fail to fully preserve data privacy, as models from local clients are exposed to the central server during the aggregation process. This issue becomes even more critical when training vision-language models (VLMs) with FL, as VLMs can easily memorize training data instances, making them vulnerable to membership inference attacks (MIAs). To address this challenge, we propose the FedRand framework, which avoids disclosing the full set of client parameters. In this framework, each client randomly selects subparameters of Low-Rank Adaptation (LoRA) from the server and keeps the remaining counterparts of the LoRA weights as private parameters. After training both parameters on the client's private dataset, only the non-private client parameters are sent back to the server for aggregation. This approach mitigates the risk of exposing client-side VLM parameters, thereby enhancing data privacy. We empirically validate that FedRand improves robustness against MIAs compared to relevant baselines while achieving accuracy comparable to methods that communicate full LoRA parameters across several benchmark datasets.
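One client round of the scheme described in this abstract might look like the sketch below. It is an illustration under stated assumptions, not the FedRand implementation: each LoRA layer is assumed to have two factors named "A" and "B", the random choice is assumed to be an independent coin flip per layer, and `train` is a hypothetical placeholder for local training.

```python
import random


def select_subparameters(layer_names, rng):
    """For each LoRA layer, pick which factor ('A' or 'B') is taken from the server."""
    return {name: rng.choice(("A", "B")) for name in layer_names}


def client_round(server_lora, private_lora, rng, train=lambda params: params):
    """Run one hypothetical client round and return only the non-private factors."""
    choice = select_subparameters(list(server_lora), rng)

    # Combine the selected server factor with the client's private counterpart.
    local = {}
    for name, picked in choice.items():
        other = "B" if picked == "A" else "A"
        local[name] = {
            picked: server_lora[name][picked],
            other: private_lora[name][other],
        }

    # Both factors are trained on the client's private data (stubbed out here).
    local = train(local)

    upload = {}
    for name, picked in choice.items():
        other = "B" if picked == "A" else "A"
        private_lora[name][other] = local[name][other]  # stays on the client
        upload[name] = {picked: local[name][picked]}    # sent to the server
    return upload


# Toy parameters: strings stand in for weight matrices.
server = {"layer0": {"A": "sA0", "B": "sB0"}, "layer1": {"A": "sA1", "B": "sB1"}}
private = {"layer0": {"A": "pA0", "B": "pB0"}, "layer1": {"A": "pA1", "B": "pB1"}}
upload = client_round(server, private, random.Random(0))
print(upload)
```

The server only ever receives one factor per layer from each client, so it never holds a client's full LoRA weights.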

Title: A Deep Learning Architecture for Land Cover Mapping Using Spatio-Temporal Sentinel-1 Features

Authors: Luigi Russo, Antonietta Sorriso, Silvia Liberata Ullo, Paolo Gamba
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.07230
Pdf URL: https://arxiv.org/pdf/2503.07230
Copy Paste: [[2503.07230]] A Deep Learning Architecture for Land Cover Mapping Using Spatio-Temporal Sentinel-1 Features(https://arxiv.org/abs/2503.07230)
Keywords: transformer
Abstract: Land Cover (LC) mapping using satellite imagery is critical for environmental monitoring and management. Deep Learning (DL), particularly Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), has revolutionized this field by enhancing the accuracy of classification tasks. In this work, a novel approach combining a transformer-based Swin-Unet architecture with seasonal synthesized spatio-temporal images has been employed to classify LC types using spatio-temporal features extracted from Sentinel-1 (S1) Synthetic Aperture Radar (SAR) data, organized into seasonal clusters. The study focuses on three distinct regions - Amazonia, Africa, and Siberia - and evaluates the model performance across diverse ecoregions within these areas. By utilizing seasonal feature sequences instead of dense temporal sequences, notable performance improvements have been achieved, especially in regions with temporal data gaps like Siberia, where S1 data distribution is uneven and non-uniform. The results demonstrate the effectiveness and the generalization capabilities of the proposed methodology in achieving high overall accuracy (O.A.) values, even in regions with limited training data.

Title: Boosting Diffusion-Based Text Image Super-Resolution Model Towards Generalized Real-World Scenarios

Authors: Chenglu Pan, Xiaogang Xu, Ganggui Ding, Yunke Zhang, Wenbo Li, Jiarong Xu, Qingbiao Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07232
Pdf URL: https://arxiv.org/pdf/2503.07232
Copy Paste: [[2503.07232]] Boosting Diffusion-Based Text Image Super-Resolution Model Towards Generalized Real-World Scenarios(https://arxiv.org/abs/2503.07232)
Keywords: robust, diffusion
Abstract: Restoring low-resolution text images presents a significant challenge, as it requires maintaining both the fidelity and stylistic realism of the text in restored images. Existing text image restoration methods often fall short in hard situations, as the traditional super-resolution models cannot guarantee clarity, while diffusion-based methods fail to maintain fidelity. In this paper, we introduce a novel framework aimed at improving the generalization ability of diffusion models for text image super-resolution (SR), especially promoting fidelity. First, we propose a progressive data sampling strategy that incorporates diverse image types at different stages of training, stabilizing the convergence and improving the generalization. For the network architecture, we leverage a pre-trained SR prior to provide robust spatial reasoning capabilities, enhancing the model's ability to preserve textual information. Additionally, we employ a cross-attention mechanism to better integrate textual priors. To further reduce errors in textual priors, we utilize confidence scores to dynamically adjust the importance of textual features during training. Extensive experiments on real-world datasets demonstrate that our approach not only produces text images with more realistic visual appearances but also improves the accuracy of text structure.

Title: CoT-Drive: Efficient Motion Forecasting for Autonomous Driving with LLMs and Chain-of-Thought Prompting

Authors: Haicheng Liao, Hanlin Kong, Bonan Wang, Chengyue Wang, Wang Ye, Zhengbing He, Chengzhong Xu, Zhenning Li
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2503.07234
Pdf URL: https://arxiv.org/pdf/2503.07234
Copy Paste: [[2503.07234]] CoT-Drive: Efficient Motion Forecasting for Autonomous Driving with LLMs and Chain-of-Thought Prompting(https://arxiv.org/abs/2503.07234)
Keywords: robust, large language model
Abstract: Accurate motion forecasting is crucial for safe autonomous driving (AD). This study proposes CoT-Drive, a novel approach that enhances motion forecasting by leveraging large language models (LLMs) and a chain-of-thought (CoT) prompting method. We introduce a teacher-student knowledge distillation strategy to effectively transfer LLMs' advanced scene understanding capabilities to lightweight language models (LMs), ensuring that CoT-Drive operates in real-time on edge devices while maintaining comprehensive scene understanding and generalization capabilities. By leveraging CoT prompting techniques for LLMs without additional training, CoT-Drive generates semantic annotations that significantly improve the understanding of complex traffic environments, thereby boosting the accuracy and robustness of predictions. Additionally, we present two new scene description datasets, Highway-Text and Urban-Text, designed for fine-tuning lightweight LMs to generate context-specific semantic annotations. Comprehensive evaluations on five real-world datasets demonstrate that CoT-Drive outperforms existing models, highlighting its effectiveness and efficiency in handling complex traffic scenarios. Overall, this study is the first to consider the practical application of LLMs in this field. It pioneers the training and use of a lightweight LLM surrogate for motion forecasting, setting a new benchmark and showcasing the potential of integrating LLMs into AD systems.

Title: Beyond the Edge of Function: Unraveling the Patterns of Type Recovery in Binary Code

Authors: Gangyang Li, Xiuwei Shang, Shaoyin Cheng, Junqi Zhang, Li Hu, Xu Zhu, Weiming Zhang, Nenghai Yu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.07243
Pdf URL: https://arxiv.org/pdf/2503.07243
Copy Paste: [[2503.07243]] Beyond the Edge of Function: Unraveling the Patterns of Type Recovery in Binary Code(https://arxiv.org/abs/2503.07243)
Keywords: security
Abstract: Type recovery is a crucial step in binary code analysis, holding significant importance for reverse engineering and various security applications. Existing works typically simply target type identifiers within binary code and achieve type recovery by analyzing variable characteristics within functions. However, we find that the types in real-world binary programs are more complex and often follow specific distribution patterns. In this paper, to gain a profound understanding of the variable type recovery problem in binary code, we first conduct a comprehensive empirical study. We utilize the TYDA dataset, which includes 163,643 binary programs across four architectures and four compiler optimization options, fully reflecting the complexity and diversity of real-world programs. We carefully study the unique patterns that characterize types and variables in binary code, and also investigate the impact of compiler optimizations on them, yielding many valuable insights. Based on our empirical findings, we propose ByteTR, a framework for recovering variable types in binary code. We decouple the target type set to address the issue of unbalanced type distribution and perform static program analysis to tackle the impact of compiler optimizations on variable storage. In light of the ubiquity of variable propagation across functions observed in our study, ByteTR conducts inter-procedural analysis to trace variable propagation and employs a gated graph neural network to capture long-range data flow dependencies for variable type recovery. We conduct extensive experiments to evaluate the performance of ByteTR. The results demonstrate that ByteTR leads state-of-the-art works in both effectiveness and efficiency. Moreover, in a real CTF challenge case, the pseudo code optimized by ByteTR significantly improves readability, surpassing leading tools IDA and Ghidra.

Title: Semantic Communications with Computer Vision Sensing for Edge Video Transmission

Authors: Yubo Peng, Luping Xiang, Kun Yang, Kezhi Wang, Merouane Debbah
Subjects: cs.CV, eess.IV, eess.SP
Abstract URL: https://arxiv.org/abs/2503.07252
Pdf URL: https://arxiv.org/pdf/2503.07252
Copy Paste: [[2503.07252]] Semantic Communications with Computer Vision Sensing for Edge Video Transmission(https://arxiv.org/abs/2503.07252)
Keywords: segmentation
Abstract: Despite the widespread adoption of vision sensors in edge applications, such as surveillance, the transmission of video data consumes substantial spectrum resources. Semantic communication (SC) offers a solution by extracting and compressing information at the semantic level, preserving the accuracy and relevance of transmitted data while significantly reducing the volume of transmitted information. However, traditional SC methods face inefficiencies due to the repeated transmission of static frames in edge videos, exacerbated by the absence of sensing capabilities, which results in spectrum inefficiency. To address this challenge, we propose a SC with computer vision sensing (SCCVS) framework for edge video transmission. The framework first introduces a compression ratio (CR) adaptive SC (CRSC) model, capable of adjusting CR based on whether the frames are static or dynamic, effectively conserving spectrum resources. Additionally, we implement an object detection and semantic segmentation models-enabled sensing (OSMS) scheme, which intelligently senses the changes in the scene and assesses the significance of each frame through in-context analysis. Hence, the OSMS scheme provides CR prompts to the CRSC model based on real-time sensing results. Moreover, both CRSC and OSMS are designed as lightweight models, ensuring compatibility with resource-constrained sensors commonly used in practical edge applications. Experimental simulations validate the effectiveness of the proposed SCCVS framework, demonstrating its ability to enhance transmission efficiency without sacrificing critical semantic information.

Title: AnomalyPainter: Vision-Language-Diffusion Synergy for Zero-Shot Realistic and Diverse Industrial Anomaly Synthesis

Authors: Zhangyu Lai, Yilin Lu, Xinyang Li, Jianghang Lin, Yansong Qu, Liujuan Cao, Ming Li, Rongrong Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07253
Pdf URL: https://arxiv.org/pdf/2503.07253
Copy Paste: [[2503.07253]] AnomalyPainter: Vision-Language-Diffusion Synergy for Zero-Shot Realistic and Diverse Industrial Anomaly Synthesis(https://arxiv.org/abs/2503.07253)
Keywords: diffusion
Abstract: While existing anomaly synthesis methods have made remarkable progress, achieving both realism and diversity in synthesis remains a major obstacle. To address this, we propose AnomalyPainter, a zero-shot framework that breaks the diversity-realism trade-off dilemma through synergizing Vision Language Large Model (VLLM), Latent Diffusion Model (LDM), and our newly introduced texture library Tex-9K. Tex-9K is a professional texture library containing 75 categories and 8,792 texture assets crafted for diverse anomaly synthesis. Leveraging VLLM's general knowledge, reasonable anomaly text descriptions are generated for each industrial object and matched with relevant diverse textures from Tex-9K. These textures then guide the LDM via ControlNet to paint on normal images. Furthermore, we introduce Texture-Aware Latent Init to stabilize the natural-image-trained ControlNet for industrial images. Extensive experiments show that AnomalyPainter outperforms existing methods in realism, diversity, and generalization, achieving superior downstream performance.

Title: COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition

Authors: Baiyu Chen, Wilson Wongso, Zechen Li, Yonchanok Khaokaew, Hao Xue, Flora Salim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07259
Pdf URL: https://arxiv.org/pdf/2503.07259
Copy Paste: [[2503.07259]] COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition(https://arxiv.org/abs/2503.07259)
Keywords: privacy
Abstract: Egocentric video-based models capture rich semantic information and have demonstrated strong performance in human activity recognition (HAR). However, their high power consumption, privacy concerns, and dependence on lighting conditions limit their feasibility for continuous on-device recognition. In contrast, inertial measurement unit (IMU) sensors offer an energy-efficient and privacy-preserving alternative, yet they suffer from limited large-scale annotated datasets, leading to weaker generalization in downstream tasks. To bridge this gap, we propose COMODO, a cross-modal self-supervised distillation framework that transfers rich semantic knowledge from the video modality to the IMU modality without requiring labeled annotations. COMODO leverages a pretrained and frozen video encoder to construct a dynamic instance queue, aligning the feature distributions of video and IMU embeddings. By distilling knowledge from video representations, our approach enables the IMU encoder to inherit rich semantic information from video while preserving its efficiency for real-world applications. Experiments on multiple egocentric HAR datasets demonstrate that COMODO consistently improves downstream classification performance, achieving results comparable to or exceeding fully supervised fine-tuned models. Moreover, COMODO exhibits strong cross-dataset generalization. Benefiting from its simplicity, our method is also generally applicable to various video and time-series pre-trained models, offering the potential to leverage more powerful teacher and student foundation models in future research. The code is publicly available.

Title: Customized SAM 2 for Referring Remote Sensing Image Segmentation

Authors: Fu Rong, Meng Lan, Qian Zhang, Lefei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07266
Pdf URL: https://arxiv.org/pdf/2503.07266
Copy Paste: [[2503.07266]] Customized SAM 2 for Referring Remote Sensing Image Segmentation(https://arxiv.org/abs/2503.07266)
Keywords: segmentation
Abstract: Referring Remote Sensing Image Segmentation (RRSIS) aims to segment target objects in remote sensing (RS) images based on textual descriptions. Although Segment Anything Model 2 (SAM 2) has shown remarkable performance in various segmentation tasks, its application to RRSIS presents several challenges, including understanding the text-described RS scenes and generating effective prompts from text descriptions. To address these issues, we propose RS2-SAM 2, a novel framework that adapts SAM 2 to RRSIS by aligning the adapted RS features and textual features, providing pseudo-mask-based dense prompts, and enforcing boundary constraints. Specifically, we first employ a union encoder to jointly encode the visual and textual inputs, generating aligned visual and text embeddings as well as multimodal class tokens. Then, we design a bidirectional hierarchical fusion module to adapt SAM 2 to RS scenes and align adapted visual features with the visually enhanced text embeddings, improving the model's interpretation of text-described RS scenes. Additionally, a mask prompt generator is introduced to take the visual embeddings and class tokens as input and produce a pseudo-mask as the dense prompt of SAM 2. To further refine segmentation, we introduce a text-guided boundary loss to optimize segmentation boundaries by computing text-weighted gradient differences. Experimental results on several RRSIS benchmarks demonstrate that RS2-SAM 2 achieves state-of-the-art performance.

Title: Federated Learning in NTNs: Design, Architecture and Challenges

Authors: Amin Farajzadeh, Animesh Yadav, Halim Yanikomeroglu
Subjects: cs.LG, cs.AI, cs.NI, eess.SP
Abstract URL: https://arxiv.org/abs/2503.07272
Pdf URL: https://arxiv.org/pdf/2503.07272
Copy Paste: [[2503.07272]] Federated Learning in NTNs: Design, Architecture and Challenges(https://arxiv.org/abs/2503.07272)
Keywords: privacy, federate
Abstract: Non-terrestrial networks (NTNs) are emerging as a core component of future 6G communication systems, providing global connectivity and supporting data-intensive applications. In this paper, we propose a distributed hierarchical federated learning (HFL) framework within the NTN architecture, leveraging a high altitude platform station (HAPS) constellation as intermediate distributed FL servers. Our framework integrates both low-Earth orbit (LEO) satellites and ground clients in the FL training process while utilizing geostationary orbit (GEO) and medium-Earth orbit (MEO) satellites as relays to exchange FL global models across other HAPS constellations worldwide, enabling seamless, global-scale learning. The proposed framework offers several key benefits: (i) enhanced privacy through the decentralization of the FL mechanism by leveraging the HAPS constellation, (ii) improved model accuracy and reduced training loss while balancing latency, (iii) increased scalability of FL systems through ubiquitous connectivity by utilizing MEO and GEO satellites, and (iv) the ability to use FL data, such as resource utilization metrics, to further optimize the NTN architecture from a network management perspective. A numerical study demonstrates the proposed framework's effectiveness, with improved model accuracy, reduced training loss, and efficient latency management. The article also includes a brief review of FL in NTNs and highlights key challenges and future research directions.

Title: Efficient Distillation of Classifier-Free Guidance using Adapters

Authors: Cristian Perez Jensen, Seyedmorteza Sadat
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.07274
Pdf URL: https://arxiv.org/pdf/2503.07274
Copy Paste: [[2503.07274]] Efficient Distillation of Classifier-Free Guidance using Adapters(https://arxiv.org/abs/2503.07274)
Keywords: diffusion
Abstract: While classifier-free guidance (CFG) is essential for conditional diffusion models, it doubles the number of neural function evaluations (NFEs) per inference step. To mitigate this inefficiency, we introduce adapter guidance distillation (AGD), a novel approach that simulates CFG in a single forward pass. AGD leverages lightweight adapters to approximate CFG, effectively doubling the sampling speed while maintaining or even improving sample quality. Unlike prior guidance distillation methods that tune the entire model, AGD keeps the base model frozen and only trains minimal additional parameters ($\sim$2%) to significantly reduce the resource requirement of the distillation phase. Additionally, this approach preserves the original model weights and enables the adapters to be seamlessly combined with other checkpoints derived from the same base model. We also address a key mismatch between training and inference in existing guidance distillation methods by training on CFG-guided trajectories instead of standard diffusion trajectories. Through extensive experiments, we show that AGD achieves comparable or superior FID to CFG across multiple architectures with only half the NFEs. Notably, our method enables the distillation of large models ($\sim$2.6B parameters) on a single consumer GPU with 24 GB of VRAM, making it more accessible than previous approaches that require multiple high-end GPUs. We will publicly release the implementation of our method.
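
The cost that this abstract refers to can be illustrated with a minimal sketch of standard classifier-free guidance. It is not the authors' AGD method: `CountingModel` is a hypothetical toy denoiser, and the names `cfg_combine`, `cfg_step`, and `guidance_scale` are assumptions chosen for illustration. The sketch only shows why CFG needs two forward passes per sampling step, which is the overhead a single-pass distilled model aims to remove.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard CFG rule: extrapolate from the unconditional prediction
    toward the conditional one by a factor of guidance_scale."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


class CountingModel:
    """Hypothetical toy denoiser that counts its forward passes (NFEs)."""

    def __init__(self):
        self.nfe = 0

    def __call__(self, x, cond):
        self.nfe += 1
        # Toy behaviour: conditioning shifts the prediction by a constant.
        shift = 0.0 if cond is None else 1.0
        return [0.5 * xi + shift for xi in x]


def cfg_step(model, x, cond, guidance_scale):
    """One guided prediction: two forward passes, then the CFG combination."""
    eps_uncond = model(x, None)   # first forward pass (no condition)
    eps_cond = model(x, cond)     # second forward pass (with condition)
    return cfg_combine(eps_uncond, eps_cond, guidance_scale)


model = CountingModel()
eps = cfg_step(model, [2.0, 4.0], "a cat", 3.0)
print(eps, model.nfe)
```

With `guidance_scale = 1.0` the combination reduces to the conditional prediction alone; larger values push further away from the unconditional prediction.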

Title: A Systematic Review of ECG Arrhythmia Classification: Adherence to Standards, Fair Evaluation, and Embedded Feasibility

Authors: Guilherme Silva, Pedro Silva, Gladston Moreira, Vander Freitas, Jadson Gertrudes, Eduardo Luz
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07276
Pdf URL: https://arxiv.org/pdf/2503.07276
Copy Paste: [[2503.07276]] A Systematic Review of ECG Arrhythmia Classification: Adherence to Standards, Fair Evaluation, and Embedded Feasibility(https://arxiv.org/abs/2503.07276)
Keywords: robust, fair
Abstract: The classification of electrocardiogram (ECG) signals is crucial for early detection of arrhythmias and other cardiac conditions. However, despite advances in machine learning, many studies fail to follow standardization protocols, leading to inconsistencies in performance evaluation and real-world applicability. Additionally, hardware constraints essential for practical deployment, such as in pacemakers, Holter monitors, and wearable ECG patches, are often overlooked. Since real-world impact depends on feasibility in resource-constrained devices, ensuring efficient deployment is critical for continuous monitoring. This review systematically analyzes ECG classification studies published between 2017 and 2024, focusing on those adhering to the E3C (Embedded, Clinical, and Comparative Criteria), which include inter-patient paradigm implementation, compliance with Association for the Advancement of Medical Instrumentation (AAMI) recommendations, and model feasibility for embedded systems. While many studies report high accuracy, few properly consider patient-independent partitioning and hardware limitations. We identify state-of-the-art methods meeting E3C criteria and conduct a comparative analysis of accuracy, inference time, energy consumption, and memory usage. Finally, we propose standardized reporting practices to ensure fair comparisons and practical applicability of ECG classification models. By addressing these gaps, this study aims to guide future research toward more robust and clinically viable ECG classification systems.

Title: A Graph-based Verification Framework for Fact-Checking

Authors: Yani Huang, Richong Zhang, Zhijie Nie, Junfan Chen, Xuefeng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07282
Pdf URL: https://arxiv.org/pdf/2503.07282
Copy Paste: [[2503.07282]] A Graph-based Verification Framework for Fact-Checking(https://arxiv.org/abs/2503.07282)
Keywords: large language model
Abstract: Fact-checking plays a crucial role in combating misinformation. Existing methods using large language models (LLMs) for claim decomposition face two key limitations: (1) insufficient decomposition, introducing unnecessary complexity to the verification process, and (2) ambiguity of mentions, leading to incorrect verification results. To address these challenges, we suggest introducing a claim graph consisting of triplets to address the insufficient decomposition problem and reduce mention ambiguity through graph structure. Based on this core idea, we propose a graph-based framework, GraphFC, for fact-checking. The framework features three key components: graph construction, which builds both claim and evidence graphs; graph-guided planning, which prioritizes the triplet verification order; and graph-guided checking, which verifies the triplets one by one between claim and evidence graphs. Extensive experiments show that GraphFC enables fine-grained decomposition while resolving referential ambiguities through relational constraints, achieving state-of-the-art performance across three datasets.

Title: Distilling Knowledge into Quantum Vision Transformers for Biomedical Image Classification

Authors: Thomas Boucher, Evangelos B. Mazomenos
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07294
Pdf URL: https://arxiv.org/pdf/2503.07294
Copy Paste: [[2503.07294]] Distilling Knowledge into Quantum Vision Transformers for Biomedical Image Classification(https://arxiv.org/abs/2503.07294)
Keywords: transformer
Abstract: Quantum vision transformers (QViTs) build on vision transformers (ViTs) by replacing linear layers within the self-attention mechanism with parameterised quantum neural networks (QNNs), harnessing quantum mechanical properties to improve feature representation. This hybrid approach aims to achieve superior performance, with significantly reduced model complexity as a result of the enriched feature representation, requiring fewer parameters. This paper proposes a novel QViT model for biomedical image classification and investigates its performance against comparable ViTs across eight diverse datasets, encompassing various modalities and classification tasks. We assess models trained from scratch and those pre-trained using knowledge distillation (KD) from high-quality teacher models. Our findings demonstrate that QViTs outperform comparable ViTs with average ROC AUC (0.863 vs 0.846) and accuracy (0.710 vs 0.687) when trained from scratch, and even compete with state-of-the-art classical models in multiple tasks, whilst being significantly more efficient (89% reduction in GFLOPs and 99.99% in parameter number). Additionally, we find that QViTs and ViTs respond equally well to KD, with QViT pre-training performance scaling with model complexity. This is the first investigation into the efficacy of deploying QViTs with KD for computer-aided diagnosis. Our results highlight the enormous potential of quantum machine learning (QML) in biomedical image analysis.

Title: Benchmarking Chinese Medical LLMs: A Medbench-based Analysis of Performance Gaps and Hierarchical Optimization Strategies

Authors: Luyi Jiang, Jiayuan Chen, Lu Lu, Xinwei Peng, Lihao Liu, Junjun He, Jie Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07306
Pdf URL: https://arxiv.org/pdf/2503.07306
Copy Paste: [[2503.07306]] Benchmarking Chinese Medical LLMs: A Medbench-based Analysis of Performance Gaps and Hierarchical Optimization Strategies(https://arxiv.org/abs/2503.07306)
Keywords: robust, large language model
Abstract: The evaluation and improvement of medical large language models (LLMs) are critical for their real-world deployment, particularly in ensuring accuracy, safety, and ethical alignment. Existing frameworks inadequately dissect domain-specific error patterns or address cross-modal challenges. This study introduces a granular error taxonomy through systematic analysis of the top 10 models on MedBench, categorizing incorrect responses into eight types: Omissions, Hallucination, Format Mismatch, Causal Reasoning Deficiency, Contextual Inconsistency, Unanswered, Output Error, and Deficiency in Medical Language Generation. Evaluation of 10 leading models reveals vulnerabilities: despite achieving 0.86 accuracy in medical knowledge recall, critical reasoning tasks show 96.3% omission, while safety ethics evaluations expose alarming inconsistency (robustness score: 0.79) under option shuffling. Our analysis uncovers systemic weaknesses in knowledge boundary enforcement and multi-step reasoning. To address these, we propose a tiered optimization strategy spanning four levels, from prompt engineering and knowledge-augmented retrieval to hybrid neuro-symbolic architectures and causal reasoning frameworks. This work establishes an actionable roadmap for developing clinically robust LLMs while redefining evaluation paradigms through error-driven insights, ultimately advancing the safety and trustworthiness of AI in high-stakes medical environments.

Title: AttenST: A Training-Free Attention-Driven Style Transfer Framework with Pre-Trained Diffusion Models

Authors: Bo Huang, Wenlun Xu, Qizhuo Han, Haodong Jing, Ying Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07307
Pdf URL: https://arxiv.org/pdf/2503.07307
Copy Paste: [[2503.07307]] AttenST: A Training-Free Attention-Driven Style Transfer Framework with Pre-Trained Diffusion Models(https://arxiv.org/abs/2503.07307)
Keywords: diffusion
Abstract: While diffusion models have achieved remarkable progress in style transfer tasks, existing methods typically rely on fine-tuning or optimizing pre-trained models during inference, leading to high computational costs and challenges in balancing content preservation with style integration. To address these limitations, we introduce AttenST, a training-free attention-driven style transfer framework. Specifically, we propose a style-guided self-attention mechanism that conditions self-attention on the reference style by retaining the query of the content image while substituting its key and value with those from the style image, enabling effective style feature integration. To mitigate style information loss during inversion, we introduce a style-preserving inversion strategy that refines inversion accuracy through multiple resampling steps. Additionally, we propose a content-aware adaptive instance normalization, which integrates content statistics into the normalization process to optimize style fusion while mitigating content degradation. Furthermore, we introduce a dual-feature cross-attention mechanism to fuse content and style features, ensuring a harmonious synthesis of structural fidelity and stylistic expression. Extensive experiments demonstrate that AttenST outperforms existing methods, achieving state-of-the-art performance on style transfer datasets.
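
The query/key/value substitution described in this abstract can be sketched with plain scaled dot-product attention. This is a minimal illustration under stated assumptions, not the AttenST implementation: the function names, the tiny hand-written feature lists, and the absence of any diffusion model or projection layers are all simplifications introduced here.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention over lists of feature vectors."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


def style_guided_attention(q_content, k_style, v_style):
    """Keep the content image's queries, but take keys and values from the
    style image, so each content position reads out style features."""
    return attention(q_content, k_style, v_style)


# Hypothetical toy features: one content query, two style tokens.
q_content = [[0.0, 0.0]]
k_style = [[1.0, 0.0], [0.0, 1.0]]
v_style = [[1.0, 0.0], [3.0, 2.0]]
mixed = style_guided_attention(q_content, k_style, v_style)
print(mixed)
```

The output has one row per content query, so the spatial layout follows the content image, while every output vector is a weighted average of style values.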

Title: Group-robust Sample Reweighting for Subpopulation Shifts via Influence Functions

Authors: Rui Qiao, Zhaoxuan Wu, Jingtan Wang, Pang Wei Koh, Bryan Kian Hsiang Low
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.07315
Pdf URL: https://arxiv.org/pdf/2503.07315
Copy Paste: [[2503.07315]] Group-robust Sample Reweighting for Subpopulation Shifts via Influence Functions(https://arxiv.org/abs/2503.07315)
Keywords: robust
Abstract: Machine learning models often have uneven performance among subpopulations (a.k.a., groups) in the data distributions. This poses a significant challenge for the models to generalize when the proportions of the groups shift during deployment. To improve robustness to such shifts, existing approaches have developed strategies that train models or perform hyperparameter tuning using the group-labeled data to minimize the worst-case loss over groups. However, a non-trivial amount of high-quality labels is often required to obtain noticeable improvements. Given the costliness of the labels, we propose to adopt a different paradigm to enhance group label efficiency: utilizing the group-labeled data as a target set to optimize the weights of other group-unlabeled data. We introduce Group-robust Sample Reweighting (GSR), a two-stage approach that first learns the representations from group-unlabeled data, and then adjusts the model by iteratively retraining its last layer on the reweighted data using influence functions. Our GSR is theoretically sound, practically lightweight, and effective in improving the robustness to subpopulation shifts. In particular, GSR outperforms the previous state-of-the-art approaches that require the same amount or even more group labels.

Title: Assessing the Macro and Micro Effects of Random Seeds on Fine-Tuning Large Language Models

Authors: Hao Zhou, Guergana Savova, Lijing Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07329
Pdf URL: https://arxiv.org/pdf/2503.07329
Copy Paste: [[2503.07329]] Assessing the Macro and Micro Effects of Random Seeds on Fine-Tuning Large Language Models(https://arxiv.org/abs/2503.07329)
Keywords: large language model
Abstract: The impact of random seeds in fine-tuning large language models (LLMs) has been largely overlooked despite its potential influence on model performance. In this study, we systematically evaluate the effects of random seeds on LLMs using the GLUE and SuperGLUE benchmarks. We analyze the macro-level impact through traditional metrics like accuracy and F1, calculating their mean and variance to quantify performance fluctuations. To capture the micro-level effects, we introduce a novel metric, consistency, measuring the stability of individual predictions across runs. Our experiments reveal significant variance at both macro and micro levels, underscoring the need for careful consideration of random seeds in fine-tuning and evaluation.
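
The distinction between macro-level and micro-level effects can be made concrete with a small sketch. The abstract does not give the exact definition of the consistency metric, so the pairwise agreement rate used here is an assumption, and the labels and per-seed predictions are made-up toy data.

```python
from itertools import combinations


def accuracy(preds, labels):
    """Macro-level view: fraction of examples predicted correctly."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def pairwise_consistency(runs):
    """Micro-level view (assumed definition): the fraction of examples on
    which two runs agree, averaged over all pairs of runs."""
    rates = [sum(a == b for a, b in zip(r1, r2)) / len(r1)
             for r1, r2 in combinations(runs, 2)]
    return sum(rates) / len(rates)


# Hypothetical predictions from three fine-tuning runs with different seeds.
labels = [1, 0, 1, 0]
runs = [
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
]
accuracies = [accuracy(r, labels) for r in runs]
consistency = pairwise_consistency(runs)
print(accuracies, consistency)
```

In this toy data every run has the same accuracy, yet any two runs agree on only half of the examples, which is the kind of instability that aggregate metrics alone do not reveal.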

Title: Unleashing the Potential of Large Language Models for Text-to-Image Generation through Autoregressive Representation Alignment

Authors: Xing Xie, Jiawei Liu, Ziyue Lin, Huijie Fan, Zhi Han, Yandong Tang, Liangqiong Qu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07334
Pdf URL: https://arxiv.org/pdf/2503.07334
Copy Paste: [[2503.07334]] Unleashing the Potential of Large Language Models for Text-to-Image Generation through Autoregressive Representation Alignment(https://arxiv.org/abs/2503.07334)
Keywords: large language model
Abstract: We present Autoregressive Representation Alignment (ARRA), a new training framework that unlocks global-coherent text-to-image generation in autoregressive LLMs without architectural changes. Unlike prior work that requires complex architectural redesigns, ARRA aligns LLM hidden states with visual representations from external visual foundational models via a global visual alignment loss and a hybrid token. This token enforces dual constraints: local next-token prediction and global semantic distillation, enabling LLMs to implicitly learn spatial and contextual coherence while retaining their original autoregressive paradigm. Extensive experiments validate ARRA's plug-and-play versatility. When training from text-generation-only LLMs or random initialization, ARRA reduces FID by 25.5% (MIMIC-CXR), 8.8% (DeepEyeNet), and 7.5% (ImageNet) for advanced autoregressive LLMs like Chameleon and LlamaGen, all without framework modifications. For domain adaptation, ARRA aligns general-purpose LLMs with specialized models (e.g., BioMedCLIP), achieving an 18.6% FID reduction over direct fine-tuning on medical imaging (MIMIC-CXR). By demonstrating that training objective redesign -- not just architectural innovation -- can resolve cross-modal global coherence challenges, ARRA offers a complementary paradigm for advancing autoregressive models. Code and models will be released to advance autoregressive image generation.

Title: LEGO-Motion: Learning-Enhanced Grids with Occupancy Instance Modeling for Class-Agnostic Motion Prediction

Authors: Kangan Qian, Jinyu Miao, Ziang Luo, Zheng Fu, Jinchen Li, Yining Shi, Yunlong Wang, Kun Jiang, Mengmeng Yang, Diange Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07367
Pdf URL: https://arxiv.org/pdf/2503.07367
Copy Paste: [[2503.07367]] LEGO-Motion: Learning-Enhanced Grids with Occupancy Instance Modeling for Class-Agnostic Motion Prediction(https://arxiv.org/abs/2503.07367)
Keywords: robust
Abstract: Accurate and reliable spatial and motion information plays a pivotal role in autonomous driving systems. However, object-level perception models struggle with handling open scenario categories and lack precise intrinsic geometry. On the other hand, occupancy-based class-agnostic methods excel in representing scenes but fail to ensure physics consistency and ignore the importance of interactions between traffic participants, hindering the model's ability to learn accurate and reliable motion. In this paper, we introduce a novel occupancy-instance modeling framework for class-agnostic motion prediction tasks, named LEGO-Motion, which incorporates instance features into Bird's Eye View (BEV) space. Our model comprises (1) a BEV encoder, (2) an Interaction-Augmented Instance Encoder, and (3) an Instance-Enhanced BEV Encoder, improving both interaction relationships and physics consistency within the model, thereby ensuring a more accurate and robust understanding of the environment. Extensive experiments on the nuScenes dataset demonstrate that our method achieves state-of-the-art performance, outperforming existing approaches. Furthermore, the effectiveness of our framework is validated on the advanced FMCW LiDAR benchmark, showcasing its practical applicability and generalization capabilities. The code will be made publicly available to facilitate further research.

Title: Probabilistic Segmentation for Robust Field of View Estimation

Authors: R. Spencer Hallyburton, David Hunt, Yiwei He, Judy He, Miroslav Pajic
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07375
Pdf URL: https://arxiv.org/pdf/2503.07375
Copy Paste: [[2503.07375]] Probabilistic Segmentation for Robust Field of View Estimation(https://arxiv.org/abs/2503.07375)
Keywords: security, attack, robust, segmentation
Abstract: Attacks on sensing and perception threaten the safe deployment of autonomous vehicles (AVs). Security-aware sensor fusion helps mitigate threats but requires accurate field of view (FOV) estimation, which has not been evaluated for autonomy. To address this gap, we adapt classical computer graphics algorithms to develop the first autonomy-relevant FOV estimators and create the first datasets with ground truth FOV labels. Unfortunately, we find that these approaches are themselves highly vulnerable to attacks on sensing. To improve robustness of FOV estimation against attacks, we propose a learning-based segmentation model that captures FOV features, integrates Monte Carlo dropout (MCD) for uncertainty quantification, and performs anomaly detection on confidence maps. Through comprehensive evaluations, we demonstrate attack resistance and strong generalization across environments. Architecture trade studies demonstrate the model is feasible for real-time deployment in multiple applications.

Title: Is My Text in Your AI Model? Gradient-based Membership Inference Test applied to LLMs

Authors: Gonzalo Mancera, Daniel de Alcala, Julian Fierrez, Ruben Tolosana, Aythami Morales
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07384
Pdf URL: https://arxiv.org/pdf/2503.07384
Copy Paste: [[2503.07384]] Is My Text in Your AI Model? Gradient-based Membership Inference Test applied to LLMs(https://arxiv.org/abs/2503.07384)
Keywords: privacy, robust, membership infer, transformer
Abstract: This work adapts and studies the gradient-based Membership Inference Test (gMINT) to the classification of text based on LLMs. MINT is a general approach intended to determine if given data was used for training machine learning models, and this work focuses on its application to the domain of Natural Language Processing. Using gradient-based analysis, the MINT model identifies whether particular data samples were included during the language model training phase, addressing growing concerns about data privacy in machine learning. The method was evaluated on seven Transformer-based models and six datasets comprising over 2.5 million sentences, focusing on text classification tasks. Experimental results demonstrate MINT's robustness, achieving AUC scores between 85% and 99%, depending on data size and model architecture. These findings highlight MINT's potential as a scalable and reliable tool for auditing machine learning models, ensuring transparency, safeguarding sensitive data, and fostering ethical compliance in the deployment of AI/NLP technologies.

Title: TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models

Authors: Ruidong Chen, Honglin Guo, Lanjun Wang, Chenyu Zhang, Weizhi Nie, An-An Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07389
Pdf URL: https://arxiv.org/pdf/2503.07389
Copy Paste: [[2503.07389]] TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models(https://arxiv.org/abs/2503.07389)
Keywords: diffusion
Abstract: Recent advances in text-to-image diffusion models enable photorealistic image generation, but they also risk producing malicious content, such as NSFW images. To mitigate risk, concept erasure methods are studied to help the model unlearn specific concepts. However, current studies struggle to fully erase malicious concepts implicitly embedded in prompts (e.g., metaphorical expressions or adversarial prompts) while preserving the model's normal generation capability. To address this challenge, our study proposes TRCE, using a two-stage concept erasure strategy to achieve an effective trade-off between reliable erasure and knowledge preservation. Firstly, TRCE starts by erasing the malicious semantics implicitly embedded in textual prompts. By identifying a critical mapping objective (i.e., the [EoT] embedding), we optimize the cross-attention layers to map malicious prompts to contextually similar prompts but with safe concepts. This step prevents the model from being overly influenced by malicious semantics during the denoising process. Following this, considering the deterministic properties of the sampling trajectory of the diffusion model, TRCE further steers the early denoising prediction toward the safe direction and away from the unsafe one through contrastive learning, thus further avoiding the generation of malicious content. Finally, we conduct comprehensive evaluations of TRCE on multiple malicious concept erasure benchmarks, and the results demonstrate its effectiveness in erasing malicious concepts while better preserving the model's original generation ability. The code is available at: this http URL. CAUTION: This paper includes model-generated content that may contain offensive material.

Title: PersonaBooth: Personalized Text-to-Motion Generation

Authors: Boeun Kim, Hea In Jeong, JungHoon Sung, Yihua Cheng, Jeongmin Lee, Ju Yong Chang, Sang-Il Choi, Younggeun Choi, Saim Shin, Jungho Kim, Hyung Jin Chang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07390
Pdf URL: https://arxiv.org/pdf/2503.07390
Copy Paste: [[2503.07390]] PersonaBooth: Personalized Text-to-Motion Generation(https://arxiv.org/abs/2503.07390)
Keywords: diffusion
Abstract: This paper introduces Motion Personalization, a new task that generates personalized motions aligned with text descriptions using several basic motions containing Persona. To support this novel task, we introduce a new large-scale motion dataset called PerMo (PersonaMotion), which captures the unique personas of multiple actors. We also propose a multi-modal finetuning method of a pretrained motion diffusion model called PersonaBooth. PersonaBooth addresses two main challenges: i) a significant distribution gap between the persona-focused PerMo dataset and the pretraining datasets, which lack persona-specific data, and ii) the difficulty of capturing a consistent persona from motions that vary in content (action type). To tackle the dataset distribution gap, we introduce a persona token to accept new persona features and perform multi-modal adaptation for both text and visuals during finetuning. To capture a consistent persona, we incorporate a contrastive learning technique to enhance intra-cohesion among samples with the same persona. Furthermore, we introduce a context-aware fusion mechanism to maximize the integration of persona cues from multiple input motions. PersonaBooth outperforms state-of-the-art motion style transfer methods, establishing a new benchmark for motion personalization.

Title: SPEED: Scalable, Precise, and Efficient Concept Erasure for Diffusion Models

Authors: Ouxiang Li, Yuan Wang, Xinting Hu, Houcheng Jiang, Tao Liang, Yanbin Hao, Guojun Ma, Fuli Feng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07392
Pdf URL: https://arxiv.org/pdf/2503.07392
Copy Paste: [[2503.07392]] SPEED: Scalable, Precise, and Efficient Concept Erasure for Diffusion Models(https://arxiv.org/abs/2503.07392)
Keywords: privacy, diffusion
Abstract: Erasing concepts from large-scale text-to-image (T2I) diffusion models has become increasingly crucial due to the growing concerns over copyright infringement, offensive content, and privacy violations. However, existing methods either require costly fine-tuning or degrade image quality for non-target concepts (i.e., prior) due to inherent optimization limitations. In this paper, we introduce SPEED, a model editing-based concept erasure approach that leverages null-space constraints for scalable, precise, and efficient erasure. Specifically, SPEED incorporates Influence-based Prior Filtering (IPF) to retain the most affected non-target concepts during erasing, Directed Prior Augmentation (DPA) to expand prior coverage while maintaining semantic consistency, and Invariant Equality Constraints (IEC) to regularize model editing by explicitly preserving key invariants during the T2I generation process. Extensive evaluations across multiple concept erasure tasks demonstrate that SPEED consistently outperforms existing methods in prior preservation while achieving efficient and high-fidelity concept erasure, successfully removing 100 concepts within just 5 seconds. Our code and models are available at: this https URL.

Title: Revisiting Noise in Natural Language Processing for Computational Social Science

Authors: Nadav Borenstein
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07395
Pdf URL: https://arxiv.org/pdf/2503.07395
Copy Paste: [[2503.07395]] Revisiting Noise in Natural Language Processing for Computational Social Science(https://arxiv.org/abs/2503.07395)
Keywords: large language model
Abstract: Computational Social Science (CSS) is an emerging field driven by the unprecedented availability of human-generated content for researchers. This field, however, presents a unique set of challenges due to the nature of the theories and datasets it explores, including highly subjective tasks and complex, unstructured textual corpora. Among these challenges, one of the less well-studied topics is the pervasive presence of noise. This thesis aims to address this gap in the literature by presenting a series of interconnected case studies that examine different manifestations of noise in CSS. These include character-level errors following the OCR processing of historical records, archaic language, inconsistencies in annotations for subjective and ambiguous tasks, and even noise and biases introduced by large language models during content generation. This thesis challenges the conventional notion that noise in CSS is inherently harmful or useless. Rather, it argues that certain forms of noise can encode meaningful information that is invaluable for advancing CSS research, such as the unique communication styles of individuals or the culture-dependent nature of datasets and tasks. Further, this thesis highlights the importance of nuance in dealing with noise and the considerations CSS researchers must address when encountering it, demonstrating that different types of noise require distinct strategies.

Title: Q-MARL: A quantum-inspired algorithm using neural message passing for large-scale multi-agent reinforcement learning

Authors: Kha Vo, Chin-Teng Lin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.07397
Pdf URL: https://arxiv.org/pdf/2503.07397
Copy Paste: [[2503.07397]] Q-MARL: A quantum-inspired algorithm using neural message passing for large-scale multi-agent reinforcement learning(https://arxiv.org/abs/2503.07397)
Keywords: robust
Abstract: Inspired by a graph-based technique for predicting molecular properties in quantum chemistry -- atoms' position within molecules in three-dimensional space -- we present Q-MARL, a completely decentralised learning architecture that supports very large-scale multi-agent reinforcement learning scenarios without the need for strong assumptions like common rewards or agent order. The key is to treat each agent as relative to its surrounding agents in an environment that is presumed to change dynamically. Hence, in each time step, an agent is the centre of its own neighbourhood and also a neighbour to many other agents. Each role is formulated as a sub-graph, and each sub-graph is used as a training sample. A message-passing neural network supports full-scale vertex and edge interaction within a local neighbourhood, while a parameter governing the depth of the sub-graphs eases the training burden. During testing, an agent's actions are locally ensembled across all the sub-graphs that contain it, resulting in robust decisions. Where other approaches struggle to manage 50 agents, Q-MARL can easily marshal thousands. A detailed theoretical analysis proves improvement and convergence, and simulations with the typical collaborative and competitive scenarios show dramatically faster training speeds and reduced training losses.

Title: Keeping Representation Similarity in Finetuning for Medical Image Analysis

Authors: Wenqiang Zu, Shenghao Xie, Hao Chen, Yiming Liang, Lei Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07399
Pdf URL: https://arxiv.org/pdf/2503.07399
Copy Paste: [[2503.07399]] Keeping Representation Similarity in Finetuning for Medical Image Analysis(https://arxiv.org/abs/2503.07399)
Keywords: robust
Abstract: Foundation models pretrained on large-scale natural images have been widely used to adapt to medical image analysis through finetuning. This is largely attributed to pretrained representations capturing universal, robust, and generalizable features, which can be reutilized by downstream tasks. However, these representations are later found to gradually vanish during finetuning, accompanied by a degradation of the foundation model's original abilities, e.g., generalizability. In this paper, we argue that pretrained representations can be well preserved while still effectively adapting to downstream tasks. We study this by proposing a new finetuning method, RepSim, which minimizes the distance between pretrained and finetuned representations by constraining a learnable orthogonal manifold based on similarity invariance. Compared to standard finetuning methods, e.g., full finetuning, our method improves representation similarity by over 30% while maintaining competitive accuracy, and reduces sharpness by 42% across five medical image classification datasets. The code will be released.

Title: REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding

Authors: Yan Tai, Luhao Zhu, Zhiqiang Chen, Ynan Ding, Yiying Dong, Xiaohong Liu, Guodong Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07413
Pdf URL: https://arxiv.org/pdf/2503.07413
Copy Paste: [[2503.07413]] REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding(https://arxiv.org/abs/2503.07413)
Keywords: robust, interpretability, large language model, segmentation
Abstract: Multimodal Large Language Models (MLLMs) demonstrate robust zero-shot capabilities across diverse vision-language tasks after training on mega-scale datasets. However, dense prediction tasks, such as semantic segmentation and keypoint detection, pose significant challenges for MLLMs when represented solely as text outputs. Simultaneously, current MLLMs utilizing latent embeddings for visual task decoding generally demonstrate limited adaptability to both multi-task learning and multi-granularity scenarios. In this work, we present REF-VLM, an end-to-end framework for unified training of various visual decoding tasks. To address complex visual decoding scenarios, we introduce the Triplet-Based Referring Paradigm (TRP), which explicitly decouples three critical dimensions in visual decoding tasks through a triplet structure: concepts, decoding types, and targets. TRP employs symbolic delimiters to enforce structured representation learning, enhancing the parsability and interpretability of model outputs. Additionally, we construct the Visual-Task Instruction Following Dataset (VT-Instruct), a large-scale multi-task dataset containing over 100 million multimodal dialogue samples across 25 task types. Beyond text inputs and outputs, VT-Instruct incorporates various visual prompts such as point, box, scribble, and mask, and generates outputs composed of text and visual units like box, keypoint, depth and mask. The combination of different visual prompts and visual units generates a wide variety of task types, expanding the applicability of REF-VLM significantly. Both qualitative and quantitative experiments demonstrate that our REF-VLM outperforms other MLLMs across a variety of standard benchmarks. The code, dataset, and demo are available at this https URL.

Title: TimeStep Master: Asymmetrical Mixture of Timestep LoRA Experts for Versatile and Efficient Diffusion Models in Vision

Authors: Shaobin Zhuang, Yiwei Guo, Yanbo Ding, Kunchang Li, Xinyuan Chen, Yaohui Wang, Fangyikang Wang, Ying Zhang, Chen Li, Yali Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07416
Pdf URL: https://arxiv.org/pdf/2503.07416
Copy Paste: [[2503.07416]] TimeStep Master: Asymmetrical Mixture of Timestep LoRA Experts for Versatile and Efficient Diffusion Models in Vision(https://arxiv.org/abs/2503.07416)
Keywords: diffusion
Abstract: Diffusion models have driven the advancement of vision generation over the past years. However, it is often difficult to apply these large models in downstream tasks, due to massive fine-tuning cost. Recently, Low-Rank Adaptation (LoRA) has been applied for efficient tuning of diffusion models. Unfortunately, the capabilities of LoRA-tuned diffusion models are limited, since the same LoRA is used for different timesteps of the diffusion process. To tackle this problem, we introduce a general and concise TimeStep Master (TSM) paradigm with two key fine-tuning stages. In the fostering stage (1-stage), we apply different LoRAs to fine-tune the diffusion model at different timestep intervals. This results in different TimeStep LoRA experts that can effectively capture different noise levels. In the assembling stage (2-stage), we design a novel asymmetrical mixture of TimeStep LoRA experts, via core-context collaboration of experts at multi-scale intervals. For each timestep, we leverage TimeStep LoRA expert within the smallest interval as the core expert without gating, and use experts within the bigger intervals as the context experts with time-dependent gating. Consequently, our TSM can effectively model the noise level via the expert in the finest interval, and adaptively integrate contexts from the experts of other scales, boosting the versatility of diffusion models. To show the effectiveness of our TSM paradigm, we conduct extensive experiments on three typical and popular LoRA-related tasks of diffusion models, including domain adaptation, post-pretraining, and model distillation. Our TSM achieves the state-of-the-art results on all these tasks, throughout various model structures (UNet, DiT and MM-DiT) and visual data modalities (Image, Video), showing its remarkable generalization capacity.
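
The core/context split described in the abstract above can be illustrated with a minimal sketch. It assumes each scale divides the timestep range into equal-width intervals and that each interval owns one LoRA expert; the function names, the `(scale, index)` encoding, and the equal-width partition are illustrative assumptions, not the paper's implementation, and the time-dependent gating is omitted.

```python
def interval_index(t, total_steps, num_intervals):
    """Index of the equal-width interval containing timestep t (assumed partition)."""
    width = total_steps / num_intervals
    return min(int(t // width), num_intervals - 1)


def select_experts(t, total_steps, scales):
    """Return (core, context) experts for timestep t.

    Each expert is identified by a hypothetical (num_intervals, interval_index) pair.
    The core expert comes from the finest scale (smallest interval); the context
    experts come from every coarser scale.
    """
    finest = max(scales)
    core = (finest, interval_index(t, total_steps, finest))
    context = [
        (n, interval_index(t, total_steps, n))
        for n in sorted(scales)
        if n != finest
    ]
    return core, context


core, context = select_experts(t=600, total_steps=1000, scales=[1, 2, 4])
print(core, context)
```

In this sketch the core expert would be applied without gating, while the outputs of the context experts would be combined through a learned, time-dependent gate that is not modeled here.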

Title: AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion

Authors: Mingzhen Sun, Weining Wang, Gen Li, Jiawei Liu, Jiahui Sun, Wanquan Feng, Shanshan Lao, SiYu Zhou, Qian He, Jing Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07418
Pdf URL: https://arxiv.org/pdf/2503.07418
Copy Paste: [[2503.07418]] AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion(https://arxiv.org/abs/2503.07418)
Keywords: diffusion
Abstract: The task of video generation requires synthesizing visually realistic and temporally coherent video frames. Existing methods primarily use asynchronous auto-regressive models or synchronous diffusion models to address this challenge. However, asynchronous auto-regressive models often suffer from inconsistencies between training and inference, leading to issues such as error accumulation, while synchronous diffusion models are limited by their reliance on rigid sequence length. To address these issues, we introduce Auto-Regressive Diffusion (AR-Diffusion), a novel model that combines the strengths of auto-regressive and diffusion models for flexible, asynchronous video generation. Specifically, our approach leverages diffusion to gradually corrupt video frames in both training and inference, reducing the discrepancy between these phases. Inspired by auto-regressive generation, we incorporate a non-decreasing constraint on the corruption timesteps of individual frames, ensuring that earlier frames remain clearer than subsequent ones. This setup, together with temporal causal attention, enables flexible generation of videos with varying lengths while preserving temporal coherence. In addition, we design two specialized timestep schedulers: the FoPP scheduler for balanced timestep sampling during training, and the AD scheduler for flexible timestep differences during inference, supporting both synchronous and asynchronous generation. Extensive experiments demonstrate the superiority of our proposed method, which achieves competitive and state-of-the-art results across four challenging benchmarks.
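
The non-decreasing constraint on per-frame corruption timesteps can be sketched as follows. This is only an illustration of the constraint itself: it sorts independent random draws, which is an assumption made for simplicity and does not reproduce the FoPP or AD schedulers named in the abstract. All function names are hypothetical.

```python
import random


def sample_nondecreasing_timesteps(num_frames, max_t, rng):
    """Draw one corruption timestep per frame, then sort so later frames are never cleaner.

    Sorting independent draws is a simplification, not the paper's scheduler.
    """
    draws = [rng.randint(0, max_t) for _ in range(num_frames)]
    return sorted(draws)


def is_nondecreasing(timesteps):
    """Check that each frame's timestep is at least that of the previous frame."""
    return all(a <= b for a, b in zip(timesteps, timesteps[1:]))


rng = random.Random(0)
timesteps = sample_nondecreasing_timesteps(num_frames=8, max_t=999, rng=rng)
print(timesteps, is_nondecreasing(timesteps))
```

Setting all timesteps equal recovers the synchronous diffusion case, while letting them differ across frames gives the asynchronous, auto-regressive-like case.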

Title: RePO: ReLU-based Preference Optimization

Authors: Junkang Wu, Kexin Huang, Xue Wang, Jinyang Gao, Bolin Ding, Jiancan Wu, Xiangnan He, Xiang Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07426
Pdf URL: https://arxiv.org/pdf/2503.07426
Copy Paste: [[2503.07426]] RePO: ReLU-based Preference Optimization(https://arxiv.org/abs/2503.07426)
Keywords: large language model
Abstract: Aligning large language models (LLMs) with human preferences is critical for real-world deployment, yet existing methods like RLHF face computational and stability challenges. While DPO establishes an offline paradigm with a single hyperparameter $\beta$, subsequent methods like SimPO reintroduce complexity through dual parameters ($\beta$, $\gamma$). We propose ReLU-based Preference Optimization (RePO), a streamlined algorithm that eliminates $\beta$ via two advances: (1) retaining SimPO's reference-free margins but removing $\beta$ through gradient analysis, and (2) adopting a ReLU-based max-margin loss that naturally filters trivial pairs. Theoretically, RePO is characterized as SimPO's limiting case ($\beta \to \infty$), where the logistic weighting collapses to binary thresholding, forming a convex envelope of the 0-1 loss. Empirical results on AlpacaEval 2 and Arena-Hard show that RePO outperforms DPO and SimPO across multiple base models, requiring only one hyperparameter to tune.
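
A ReLU-based max-margin preference loss of the kind described above can be sketched in a few lines. The sketch assumes the reference-free margin is the difference between the length-normalized log-likelihoods of the chosen and rejected responses (as in SimPO) and that the loss is max(0, gamma - margin); the exact form used in the paper may differ, and the function names are illustrative.

```python
def length_normalized_logp(token_logps):
    """Average per-token log-probability of a response (assumed SimPO-style reward)."""
    return sum(token_logps) / len(token_logps)


def relu_margin_loss(chosen_logps, rejected_logps, gamma):
    """Hinge-style loss: zero once the margin exceeds the target margin gamma."""
    margin = length_normalized_logp(chosen_logps) - length_normalized_logp(rejected_logps)
    return max(0.0, gamma - margin)


# A well-separated pair contributes no loss, so it is effectively filtered out.
print(relu_margin_loss([-0.1, -0.1], [-2.0, -2.0], gamma=1.0))
# A poorly separated pair still contributes a positive loss.
print(relu_margin_loss([-1.0], [-1.2], gamma=1.0))
```

Pairs whose margin already exceeds gamma receive zero loss and zero gradient, which is the sense in which a ReLU loss filters trivial pairs; gamma is the only hyperparameter in this sketch.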

Title: Open-Set Gait Recognition from Sparse mmWave Radar Point Clouds

Authors: Riccardo Mazzieri, Jacopo Pegoraro, Michele Rossi
Subjects: cs.CV, eess.SP
Abstract URL: https://arxiv.org/abs/2503.07435
Pdf URL: https://arxiv.org/pdf/2503.07435
Copy Paste: [[2503.07435]] Open-Set Gait Recognition from Sparse mmWave Radar Point Clouds(https://arxiv.org/abs/2503.07435)
Keywords: privacy, robust
Abstract: The adoption of Millimeter-Wave (mmWave) radar devices for human sensing, particularly gait recognition, has recently gathered significant attention due to their efficiency, resilience to environmental conditions, and privacy-preserving nature. In this work, we tackle the challenging problem of Open-set Gait Recognition (OSGR) from sparse mmWave radar point clouds. Unlike most existing research, which assumes a closed-set scenario, our work considers the more realistic open-set case, where unknown subjects might be present at inference time, and should be correctly recognized by the system. Point clouds are well-suited for edge computing applications with resource constraints, but are more significantly affected by noise and random fluctuations than other representations, like the more common micro-Doppler signature. This is the first work addressing open-set gait recognition with sparse point cloud data. To do so, we propose a novel neural network architecture that combines supervised classification with unsupervised reconstruction of the point clouds, creating a robust, rich, and highly regularized latent space of gait features. To detect unknown subjects at inference time, we introduce a probabilistic novelty detection algorithm that leverages the structured latent space and offers a tunable trade-off between inference speed and prediction accuracy. Along with this paper, we release mmGait10, an original human gait dataset featuring over five hours of measurements from ten subjects, under varied walking modalities. Extensive experimental results show that our solution attains F1-Score improvements by 24% over state-of-the-art methods, on average, and across multiple openness levels.

Title: Is a Good Foundation Necessary for Efficient Reinforcement Learning? The Computational Role of the Base Model in Exploration

Authors: Dylan J. Foster, Zakaria Mhammedi, Dhruv Rohatgi
Subjects: cs.LG, cs.AI, cs.CL, math.ST
Abstract URL: https://arxiv.org/abs/2503.07453
Pdf URL: https://arxiv.org/pdf/2503.07453
Copy Paste: [[2503.07453]] Is a Good Foundation Necessary for Efficient Reinforcement Learning? The Computational Role of the Base Model in Exploration(https://arxiv.org/abs/2503.07453)
Keywords: generative
Abstract: Language model alignment (or, reinforcement learning) techniques that leverage active exploration -- deliberately encouraging the model to produce diverse, informative responses -- offer the promise of super-human capabilities. However, current understanding of algorithm design primitives for computationally efficient exploration with language models is limited. To better understand how to leverage access to powerful pre-trained generative models to improve the efficiency of exploration, we introduce a new computational framework for RL with language models, in which the learner interacts with the model through a sampling oracle. Focusing on the linear softmax model parameterization, we provide new results that reveal the computational-statistical tradeoffs of efficient exploration: 1. Necessity of coverage: Coverage refers to the extent to which the pre-trained model covers near-optimal responses -- a form of hidden knowledge. We show that coverage, while not necessary for data efficiency, lower bounds the runtime of any algorithm in our framework. 2. Inference-time exploration: We introduce a new algorithm, SpannerSampling, which obtains optimal data efficiency and is computationally efficient whenever the pre-trained model enjoys sufficient coverage, matching our lower bound. SpannerSampling leverages inference-time computation with the pre-trained model to reduce the effective search space for exploration. 3. Insufficiency of training-time interventions: We contrast the result above by showing that training-time interventions that produce proper policies cannot achieve similar guarantees in polynomial time. 4. Computational benefits of multi-turn exploration: Finally, we show that under additional representational assumptions, one can achieve improved runtime (replacing sequence-level coverage with token-level coverage) through multi-turn exploration.

Title: Anatomy-Aware Conditional Image-Text Retrieval

Authors: Meng Zheng, Jiajin Zhang, Benjamin Planche, Zhongpai Gao, Terrence Chen, Ziyan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07456
Pdf URL: https://arxiv.org/pdf/2503.07456
Copy Paste: [[2503.07456]] Anatomy-Aware Conditional Image-Text Retrieval(https://arxiv.org/abs/2503.07456)
Keywords: explainability
Abstract: Image-Text Retrieval (ITR) finds broad applications in healthcare, aiding clinicians and radiologists by automatically retrieving relevant patient cases in the database given the query image and/or report, for more efficient clinical diagnosis and treatment, especially for rare diseases. However, conventional ITR systems typically only rely on global image or text representations for measuring patient image/report similarities, which overlook local distinctiveness across patient cases. This often results in suboptimal retrieval performance. In this paper, we propose an Anatomical Location-Conditioned Image-Text Retrieval (ALC-ITR) framework, which, given a query image and the associated suspicious anatomical region(s), aims to retrieve similar patient cases exhibiting the same disease or symptoms in the same anatomical region. To perform location-conditioned multimodal retrieval, we learn a medical Relevance-Region-Aligned Vision Language (RRA-VL) model with semantic global-level and region-/word-level alignment to produce generalizable, well-aligned multi-modal representations. Additionally, we perform location-conditioned contrastive learning to further utilize cross-pair region-level contrastiveness for improved multi-modal retrieval. We show that our proposed RRA-VL achieves state-of-the-art localization performance in phrase-grounding tasks, and satisfactory multi-modal retrieval performance with or without location conditioning. Finally, we thoroughly investigate the generalizability and explainability of our proposed ALC-ITR system in providing explanations and preliminary diagnosis reports given retrieved patient cases (conditioned on anatomical regions), with proper off-the-shelf LLM prompts.

Title: LLMs syntactically adapt their language use to their conversational partner

Authors: Florian Kandra, Vera Demberg, Alexander Koller
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07457
Pdf URL: https://arxiv.org/pdf/2503.07457
Copy Paste: [[2503.07457]] LLMs syntactically adapt their language use to their conversational partner(https://arxiv.org/abs/2503.07457)
Keywords: large language model
Abstract: It has been frequently observed that human speakers align their language use with each other during conversations. In this paper, we study empirically whether large language models (LLMs) exhibit the same behavior of conversational adaptation. We construct a corpus of conversations between LLMs and find that two LLM agents end up making more similar syntactic choices as conversations go on, confirming that modern LLMs adapt their language use to their conversational partners in at least a rudimentary way.

Title: MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning

Authors: Xiangru Tang, Daniel Shao, Jiwoong Sohn, Jiapeng Chen, Jiayi Zhang, Jinyu Xiang, Fang Wu, Yilun Zhao, Chenglin Wu, Wenqi Shi, Arman Cohan, Mark Gerstein
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07459
Pdf URL: https://arxiv.org/pdf/2503.07459
Copy Paste: [[2503.07459]] MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning(https://arxiv.org/abs/2503.07459)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown impressive performance on existing medical question-answering benchmarks. This high performance makes it increasingly difficult to meaningfully evaluate and differentiate advanced methods. We present MedAgentsBench, a benchmark that focuses on challenging medical questions requiring multi-step clinical reasoning, diagnosis formulation, and treatment planning, scenarios where current models still struggle despite their strong performance on standard tests. Drawing from seven established medical datasets, our benchmark addresses three key limitations in existing evaluations: (1) the prevalence of straightforward questions where even base models achieve high performance, (2) inconsistent sampling and evaluation protocols across studies, and (3) lack of systematic analysis of the interplay between performance, cost, and inference time. Through experiments with various base models and reasoning methods, we demonstrate that the latest thinking models, DeepSeek R1 and OpenAI o3, exhibit exceptional performance in complex medical reasoning tasks. Additionally, advanced search-based agent methods offer promising performance-to-cost ratios compared to traditional approaches. Our analysis reveals substantial performance gaps between model families on complex questions and identifies optimal model selections for different computational constraints. Our benchmark and evaluation framework are publicly available at this https URL.

Title: Learning to Localize Leakage of Cryptographic Sensitive Variables

Authors: Jimmy Gammell, Anand Raghunathan, Abolfazl Hashemi, Kaushik Roy
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2503.07464
Pdf URL: https://arxiv.org/pdf/2503.07464
Copy Paste: [[2503.07464]] Learning to Localize Leakage of Cryptographic Sensitive Variables(https://arxiv.org/abs/2503.07464)
Keywords: secure, defense, attack
Abstract: While cryptographic algorithms such as the ubiquitous Advanced Encryption Standard (AES) are secure, *physical implementations* of these algorithms in hardware inevitably 'leak' sensitive data such as cryptographic keys. A particularly insidious form of leakage arises from the fact that hardware consumes power and emits radiation in a manner that is statistically associated with the data it processes and the instructions it executes. Supervised deep learning has emerged as a state-of-the-art tool for carrying out *side-channel attacks*, which exploit this leakage by learning to map power/radiation measurements throughout encryption to the sensitive data operated on during that encryption. In this work we develop a principled deep learning framework for determining the relative leakage due to measurements recorded at different points in time, in order to inform *defense* against such attacks. This information is invaluable to cryptographic hardware designers for understanding *why* their hardware leaks and how they can mitigate it (e.g. by indicating the particular sections of code or electronic components which are responsible). Our framework is based on an adversarial game between a family of classifiers trained to estimate the conditional distributions of sensitive data given subsets of measurements, and a budget-constrained noise distribution which probabilistically erases individual measurements to maximize the loss of these classifiers. We demonstrate our method's efficacy and ability to overcome limitations of prior work through extensive experimental comparison with 8 baseline methods using 3 evaluation metrics and 6 publicly-available power/EM trace datasets from AES, ECC and RSA implementations. We provide an open-source PyTorch implementation of these experiments.

Title: YOLOE: Real-Time Seeing Anything

Authors: Ao Wang, Lihao Liu, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07465
Pdf URL: https://arxiv.org/pdf/2503.07465
Copy Paste: [[2503.07465]] YOLOE: Real-Time Seeing Anything(https://arxiv.org/abs/2503.07465)
Keywords: segmentation
Abstract: Object detection and segmentation are widely employed in computer vision applications, yet conventional models like the YOLO series, while efficient and accurate, are limited by predefined categories, hindering adaptability in open scenarios. Recent open-set methods leverage text prompts, visual cues, or a prompt-free paradigm to overcome this, but often compromise between performance and efficiency due to high computational demands or deployment complexity. In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. For text prompts, we propose the Re-parameterizable Region-Text Alignment (RepRTA) strategy. It refines pretrained textual embeddings via a re-parameterizable lightweight auxiliary network and enhances visual-textual alignment with zero inference and transferring overhead. For visual prompts, we present the Semantic-Activated Visual Prompt Encoder (SAVPE). It employs decoupled semantic and activation branches to bring improved visual embedding and accuracy with minimal complexity. For the prompt-free scenario, we introduce the Lazy Region-Prompt Contrast (LRPC) strategy. It utilizes a built-in large vocabulary and specialized embedding to identify all objects, avoiding costly language model dependency. Extensive experiments show YOLOE's exceptional zero-shot performance and transferability with high inference efficiency and low training cost. Notably, on LVIS, with 3$\times$ less training cost and 1.4$\times$ inference speedup, YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP. When transferring to COCO, YOLOE-v8-L achieves 0.6 AP$^b$ and 0.4 AP$^m$ gains over closed-set YOLOv8-L with nearly 4$\times$ less training time. Code and models are available at this https URL.

Title: Efficient Membership Inference Attacks by Bayesian Neural Network

Authors: Zhenlong Liu, Wenyu Jiang, Feng Zhou, Hongxin Wei
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07482
Pdf URL: https://arxiv.org/pdf/2503.07482
Copy Paste: [[2503.07482]] Efficient Membership Inference Attacks by Bayesian Neural Network(https://arxiv.org/abs/2503.07482)
Keywords: attack, membership infer
Abstract: Membership Inference Attacks (MIAs) aim to estimate whether a specific data point was used in the training of a given model. Previous attacks often utilize multiple reference models to approximate the conditional score distribution, leading to significant computational overhead. While recent work leverages quantile regression to estimate conditional thresholds, it fails to capture epistemic uncertainty, resulting in bias in low-density regions. In this work, we propose a novel approach - Bayesian Membership Inference Attack (BMIA), which performs conditional attack through Bayesian inference. In particular, we transform a trained reference model into Bayesian neural networks by Laplace approximation, enabling the direct estimation of the conditional score distribution by probabilistic model parameters. Our method addresses both epistemic and aleatoric uncertainty with only a reference model, enabling efficient and powerful MIA. Extensive experiments on five datasets demonstrate the effectiveness and efficiency of BMIA.

Title: Poisoning Attacks to Local Differential Privacy Protocols for Trajectory Data

Authors: I-Jung Hsu, Chih-Hsun Lin, Chia-Mu Yu, Sy-Yen Kuo, Chun-Ying Huang
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07483
Pdf URL: https://arxiv.org/pdf/2503.07483
Copy Paste: [[2503.07483]] Poisoning Attacks to Local Differential Privacy Protocols for Trajectory Data(https://arxiv.org/abs/2503.07483)
Keywords: privacy, defense, attack, robust
Abstract: Trajectory data, which tracks movements through geographic locations, is crucial for improving real-world applications. However, collecting such sensitive data raises considerable privacy concerns. Local differential privacy (LDP) offers a solution by allowing individuals to locally perturb their trajectory data before sharing it. Despite its privacy benefits, LDP protocols are vulnerable to data poisoning attacks, where attackers inject fake data to manipulate aggregated results. In this work, we make the first attempt to analyze vulnerabilities in several representative LDP trajectory protocols. We propose TraP, a heuristic algorithm for data Poisoning attacks using a prefix-suffix method to optimize fake Trajectory selection, significantly reducing computational complexity. Our experimental results demonstrate that our attack can substantially increase target pattern occurrences in the perturbed trajectory dataset with few fake users. This study underscores the urgent need for robust defenses and better protocol designs to safeguard LDP trajectory data against malicious manipulation.

Title: Chameleon: Fast-slow Neuro-symbolic Lane Topology Extraction

Authors: Zongzheng Zhang, Xinrun Li, Sizhe Zou, Guoxuan Chi, Siqi Li, Xuchong Qiu, Guoliang Wang, Guantian Zheng, Leichen Wang, Hang Zhao, Hao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07485
Pdf URL: https://arxiv.org/pdf/2503.07485
Copy Paste: [[2503.07485]] Chameleon: Fast-slow Neuro-symbolic Lane Topology Extraction(https://arxiv.org/abs/2503.07485)
Keywords: extraction
Abstract: Lane topology extraction involves detecting lanes and traffic elements and determining their relationships, a key perception task for mapless autonomous driving. This task requires complex reasoning, such as determining whether it is possible to turn left into a specific lane. To address this challenge, we introduce neuro-symbolic methods powered by vision-language foundation models (VLMs). Existing approaches have notable limitations: (1) Dense visual prompting with VLMs can achieve strong performance but is costly in terms of both financial resources and carbon footprint, making it impractical for robotics applications. (2) Neuro-symbolic reasoning methods for 3D scene understanding fail to integrate visual inputs when synthesizing programs, making them ineffective in handling complex corner cases. To this end, we propose a fast-slow neuro-symbolic lane topology extraction algorithm, named Chameleon, which alternates between a fast system that directly reasons over detected instances using synthesized programs and a slow system that utilizes a VLM with a chain-of-thought design to handle corner cases. Chameleon leverages the strengths of both approaches, providing an affordable solution while maintaining high performance. We evaluate the method on the OpenLane-V2 dataset, showing consistent improvements across various baseline detectors. Our code, data, and models are publicly available at this https URL

Title: LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition?

Authors: Bangyan Li, Wenxuan Huang, Yunhang Shen, Yeqiang Wang, Shaohui Lin, Jingzhong Lin, Ling You, Yinqi Zhang, Ke Li, Xing Sun, Yuling Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07487
Pdf URL: https://arxiv.org/pdf/2503.07487
Copy Paste: [[2503.07487]] LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition?(https://arxiv.org/abs/2503.07487)
Keywords: robust, large language model
Abstract: Recently, multimodal large models (MLLMs) have demonstrated exceptional capabilities in visual understanding and reasoning across various vision-language tasks. However, MLLMs usually perform poorly in zero-shot medical disease recognition, as they do not fully exploit the captured features and available medical knowledge. To address this challenge, we propose LLaVA-RadZ, a simple yet effective framework for zero-shot medical disease recognition. Specifically, we design an end-to-end training strategy, termed Decoding-Side Feature Alignment Training (DFAT) to take advantage of the characteristics of the MLLM decoder architecture and incorporate modality-specific tokens tailored for different modalities, which effectively utilizes image and text representations and facilitates robust cross-modal alignment. Additionally, we introduce a Domain Knowledge Anchoring Module (DKAM) to exploit the intrinsic medical knowledge of large models, which mitigates the category semantic gap in image-text alignment. DKAM improves category-level alignment, allowing for accurate disease recognition. Extensive experiments on multiple benchmarks demonstrate that our LLaVA-RadZ significantly outperforms traditional MLLMs in zero-shot disease recognition and exhibits the state-of-the-art performance compared to the well-established and highly-optimized CLIP-based approaches.

Title: V2Flow: Unifying Visual Tokenization and Large Language Model Vocabularies for Autoregressive Image Generation

Authors: Guiwei Zhang, Tianyu Zhang, Mohan Zhou, Yalong Bai, Biye Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07493
Pdf URL: https://arxiv.org/pdf/2503.07493
Copy Paste: [[2503.07493]] V2Flow: Unifying Visual Tokenization and Large Language Model Vocabularies for Autoregressive Image Generation(https://arxiv.org/abs/2503.07493)
Keywords: transformer, large language model
Abstract: We propose V2Flow, a novel tokenizer that produces discrete visual tokens capable of high-fidelity reconstruction, while ensuring structural and latent distribution alignment with the vocabulary space of large language models (LLMs). Leveraging this tight visual-vocabulary coupling, V2Flow enables autoregressive visual generation on top of existing LLMs. Our approach formulates visual tokenization as a flow-matching problem, aiming to learn a mapping from a standard normal prior to the continuous image distribution, conditioned on token sequences embedded within the LLM's vocabulary space. The effectiveness of V2Flow stems from two core designs. First, we propose a Visual Vocabulary resampler, which compresses visual data into compact token sequences, with each represented as a soft categorical distribution over the LLM's vocabulary. This allows seamless integration of visual tokens into existing LLMs for autoregressive visual generation. Second, we present a masked autoregressive Rectified-Flow decoder, employing a masked transformer encoder-decoder to refine visual tokens into contextually enriched embeddings. These embeddings then condition a dedicated velocity field for precise reconstruction. Additionally, an autoregressive rectified-flow sampling strategy is incorporated, ensuring flexible sequence lengths while preserving competitive reconstruction quality. Extensive experiments show that V2Flow outperforms mainstream VQ-based tokenizers and facilitates autoregressive visual generation on top of existing LLMs. this https URL

Title: Trustworthy Machine Learning via Memorization and the Granular Long-Tail: A Survey on Interactions, Tradeoffs, and Beyond

Authors: Qiongxiu Li, Xiaoyu Luo, Yiyi Chen, Johannes Bjerva
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.07501
Pdf URL: https://arxiv.org/pdf/2503.07501
Copy Paste: [[2503.07501]] Trustworthy Machine Learning via Memorization and the Granular Long-Tail: A Survey on Interactions, Tradeoffs, and Beyond(https://arxiv.org/abs/2503.07501)
Keywords: privacy, robust, fair
Abstract: The role of memorization in machine learning (ML) has garnered significant attention, particularly as modern models are empirically observed to memorize fragments of training data. Previous theoretical analyses, such as Feldman's seminal work, attribute memorization to the prevalence of long-tail distributions in training data, proving it unavoidable for samples that lie in the tail of the distribution. However, the intersection of memorization and trustworthy ML research reveals critical gaps. While prior research on memorization in trustworthy ML has solely focused on class imbalance, recent work starts to differentiate class-level rarity from atypical samples, which are valid and rare intra-class instances. However, a critical research gap remains: current frameworks conflate atypical samples with noisy and erroneous data, neglecting their divergent impacts on fairness, robustness, and privacy. In this work, we conduct a thorough survey of existing research and its findings on trustworthy ML and the role of memorization. Beyond this, we identify and highlight uncharted gaps and propose new avenues in this research direction. Since existing theoretical and empirical analyses lack the nuances to disentangle memorization's duality as both a necessity and a liability, we formalize a three-level long-tail granularity - class imbalance, atypicality, and noise - to reveal how current frameworks misapply these levels, perpetuating flawed solutions. By systematizing this granularity, we draw a roadmap for future research. Trustworthy ML must reconcile the nuanced trade-offs between memorizing atypicality for fairness assurance and suppressing noise for robustness and privacy guarantees. Redefining memorization via this granularity reshapes the theoretical foundation for trustworthy ML, and further affords an empirical prerequisite for models that align performance with societal trust.

Title: Think Before You Segment: High-Quality Reasoning Segmentation with GPT Chain of Thoughts

Authors: Shiu-hong Kao, Yu-Wing Tai, Chi-Keung Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07503
Pdf URL: https://arxiv.org/pdf/2503.07503
Copy Paste: [[2503.07503]] Think Before You Segment: High-Quality Reasoning Segmentation with GPT Chain of Thoughts(https://arxiv.org/abs/2503.07503)
Keywords: large language model, segmentation
Abstract: Reasoning segmentation is a challenging vision-language task that aims to output the segmentation mask with respect to a complex, implicit, and even non-visual query text. Previous works incorporated multimodal Large Language Models (MLLMs) with segmentation models to approach the difficult problem. However, their segmentation quality often falls short in complex cases, particularly when dealing with out-of-domain objects with intricate structures, blurry boundaries, occlusions, or high similarity with surroundings. In this paper, we introduce ThinkFirst, a training-free reasoning segmentation framework that leverages GPT's chain of thought to address these challenging cases. Our approach allows GPT-4o or other powerful MLLMs to generate a detailed, chain-of-thought description of an image. This summarized description is then passed to a language-instructed segmentation assistant to aid the segmentation process. Our framework allows users to easily interact with the segmentation agent using multimodal inputs, such as easy text and image scribbles, for successive refinement or communication. We evaluate the performance of ThinkFirst on diverse objects. Extensive experiments show that this zero-shot-CoT approach significantly improves the vanilla reasoning segmentation agent, both qualitatively and quantitatively, while being less sensitive or critical to user-supplied prompts after Thinking First.

Title: From Centralized to Decentralized Federated Learning: Theoretical Insights, Privacy Preservation, and Robustness Challenges

Authors: Qiongxiu Li, Wenrui Yu, Yufei Xia, Jun Pang
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2503.07505
Pdf URL: https://arxiv.org/pdf/2503.07505
Copy Paste: [[2503.07505]] From Centralized to Decentralized Federated Learning: Theoretical Insights, Privacy Preservation, and Robustness Challenges(https://arxiv.org/abs/2503.07505)
Keywords: privacy, attack, robust, federate
Abstract: Federated Learning (FL) enables collaborative learning without directly sharing individuals' raw data. FL can be implemented in either a centralized (server-based) or decentralized (peer-to-peer) manner. In this survey, we present a novel perspective: the fundamental difference between centralized FL (CFL) and decentralized FL (DFL) is not merely the network topology, but the underlying training protocol: separate aggregation vs. joint optimization. We argue that this distinction in protocol leads to significant differences in model utility, privacy preservation, and robustness to attacks. We systematically review and categorize existing works in both CFL and DFL according to the type of protocol they employ. This taxonomy provides deeper insights into prior research and clarifies how various approaches relate or differ. Through our analysis, we identify key gaps in the literature. In particular, we observe a surprising lack of exploration of DFL approaches based on distributed optimization methods, despite their potential advantages. We highlight this under-explored direction and call for more research on leveraging distributed optimization for federated learning. Overall, this work offers a comprehensive overview from centralized to decentralized FL, sheds new light on the core distinctions between approaches, and outlines open challenges and future directions for the field.

Title: ADROIT: A Self-Supervised Framework for Learning Robust Representations for Active Learning

Authors: Soumya Banerjee, Vinay Kumar Verma
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.07506
Pdf URL: https://arxiv.org/pdf/2503.07506
Copy Paste: [[2503.07506]] ADROIT: A Self-Supervised Framework for Learning Robust Representations for Active Learning(https://arxiv.org/abs/2503.07506)
Keywords: robust
Abstract: Active learning aims to select optimal samples for labeling, minimizing annotation costs. This paper introduces a unified representation learning framework tailored for active learning with task awareness. It integrates diverse sources, comprising reconstruction, adversarial, self-supervised, knowledge-distillation, and classification losses into a unified VAE-based ADROIT approach. The proposed approach comprises three key components - a unified representation generator (VAE), a state discriminator, and a (proxy) task-learner or classifier. ADROIT learns a latent code using both labeled and unlabeled data, incorporating task-awareness by leveraging labeled data with the proxy classifier. Unlike previous approaches, the proxy classifier additionally employs a self-supervised loss on unlabeled data and utilizes knowledge distillation to align with the target task-learner. The state discriminator distinguishes between labeled and unlabeled data, facilitating the selection of informative unlabeled samples. The dynamic interaction between VAE and the state discriminator creates a competitive environment, with the VAE attempting to deceive the discriminator, while the state discriminator learns to differentiate between labeled and unlabeled inputs. Extensive evaluations on diverse datasets and ablation analysis affirm the effectiveness of the proposed model.

Title: PE3R: Perception-Efficient 3D Reconstruction

Authors: Jie Hu, Shizun Wang, Xinchao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07507
Pdf URL: https://arxiv.org/pdf/2503.07507
Copy Paste: [[2503.07507]] PE3R: Perception-Efficient 3D Reconstruction(https://arxiv.org/abs/2503.07507)
Keywords: robust, segmentation
Abstract: Recent advancements in 2D-to-3D perception have significantly improved the understanding of 3D scenes from 2D images. However, existing methods face critical challenges, including limited generalization across scenes, suboptimal perception accuracy, and slow reconstruction speeds. To address these limitations, we propose Perception-Efficient 3D Reconstruction (PE3R), a novel framework designed to enhance both accuracy and efficiency. PE3R employs a feed-forward architecture to enable rapid 3D semantic field reconstruction. The framework demonstrates robust zero-shot generalization across diverse scenes and objects while significantly improving reconstruction speed. Extensive experiments on 2D-to-3D open-vocabulary segmentation and 3D reconstruction validate the effectiveness and versatility of PE3R. The framework achieves a minimum 9-fold speedup in 3D semantic field reconstruction, along with substantial gains in perception accuracy and reconstruction precision, setting new benchmarks in the field. The code is publicly available at: this https URL.

Title: Language Models Fail to Introspect About Their Knowledge of Language

Authors: Siyuan Song, Jennifer Hu, Kyle Mahowald
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07513
Pdf URL: https://arxiv.org/pdf/2503.07513
Copy Paste: [[2503.07513]] Language Models Fail to Introspect About Their Knowledge of Language(https://arxiv.org/abs/2503.07513)
Keywords: large language model
Abstract: There has been recent interest in whether large language models (LLMs) can introspect about their own internal states. Such abilities would make LLMs more interpretable, and also validate the use of standard introspective methods in linguistics to evaluate grammatical knowledge in models (e.g., asking "Is this sentence grammatical?"). We systematically investigate emergent introspection across 21 open-source LLMs, in two domains where introspection is of theoretical interest: grammatical knowledge and word prediction. Crucially, in both domains, a model's internal linguistic knowledge can be theoretically grounded in direct measurements of string probability. We then evaluate whether models' responses to metalinguistic prompts faithfully reflect their internal knowledge. We propose a new measure of introspection: the degree to which a model's prompted responses predict its own string probabilities, beyond what would be predicted by another model with nearly identical internal knowledge. While both metalinguistic prompting and probability comparisons lead to high task accuracy, we do not find evidence that LLMs have privileged "self-access". Our findings complicate recent results suggesting that models can introspect, and add new evidence to the argument that prompted responses should not be conflated with models' linguistic generalizations.

Title: FastInstShadow: A Simple Query-Based Model for Instance Shadow Detection

Authors: Takeru Inoue, Ryusuke Miyamoto
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07517
Pdf URL: https://arxiv.org/pdf/2503.07517
Copy Paste: [[2503.07517]] FastInstShadow: A Simple Query-Based Model for Instance Shadow Detection(https://arxiv.org/abs/2503.07517)
Keywords: transformer
Abstract: Instance shadow detection is the task of detecting pairs of shadows and objects, where existing methods first detect shadows and objects independently, then associate them. This paper introduces FastInstShadow, a method that enhances detection accuracy through a query-based architecture featuring an association transformer decoder with two dual-path transformer decoders to assess relationships between shadows and objects during detection. Experimental results using the SOBA dataset showed that the proposed method outperforms all existing methods across all criteria. This method makes real-time processing feasible for moderate-resolution images with better accuracy than SSISv2, the most accurate existing method. Our code is available at this https URL.

Title: TokenButler: Token Importance is Predictable

Authors: Yash Akhauri, Ahmed F AbouElhamayed, Yifei Gao, Chi-Chih Chang, Nilesh Jain, Mohamed S. Abdelfattah
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07518
Pdf URL: https://arxiv.org/pdf/2503.07518
Copy Paste: [[2503.07518]] TokenButler: Token Importance is Predictable(https://arxiv.org/abs/2503.07518)
Keywords: large language model
Abstract: Large Language Models (LLMs) rely on the Key-Value (KV) Cache to store token history, enabling efficient decoding of tokens. As the KV-Cache grows, it becomes a major memory and computation bottleneck; however, there is an opportunity to alleviate this bottleneck, especially because prior research has shown that only a small subset of tokens contribute meaningfully to each decoding step. A key challenge in finding these critical tokens is that they are dynamic and heavily dependent on the input query. Existing methods either risk quality by evicting tokens permanently, or retain the full KV-Cache but rely on retrieving chunks (pages) of tokens at generation, failing at dense, context-rich tasks. Additionally, many existing KV-Cache sparsity methods rely on inaccurate proxies for token importance. To address these limitations, we introduce TokenButler, a high-granularity, query-aware predictor that learns to identify these critical tokens. By training a light-weight predictor with less than 1.2% parameter overhead, TokenButler prioritizes tokens based on their contextual, predicted importance. This improves perplexity & downstream accuracy by over 8% relative to SoTA methods for estimating token importance. We evaluate TokenButler on a novel synthetic small-context co-referential retrieval task, demonstrating near-oracle accuracy. Code, models and benchmarks: this https URL

Title: XIFBench: Evaluating Large Language Models on Multilingual Instruction Following

Authors: Zhenyu Li, Kehai Chen, Yunfei Long, Xuefeng Bai, Yaoyin Zhang, Xuchen Wei, Juntao Li, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07539
Pdf URL: https://arxiv.org/pdf/2503.07539
Copy Paste: [[2503.07539]] XIFBench: Evaluating Large Language Models on Multilingual Instruction Following(https://arxiv.org/abs/2503.07539)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable instruction-following capabilities across various applications. However, their performance in multilingual settings remains poorly understood, as existing evaluations lack fine-grained constraint analysis. We introduce XIFBench, a comprehensive constraint-based benchmark for assessing multilingual instruction-following abilities of LLMs, featuring a novel taxonomy of five constraint categories and 465 parallel instructions across six languages spanning different resource levels. To ensure consistent cross-lingual evaluation, we develop a requirement-based protocol that leverages English requirements as semantic anchors. These requirements are then used to validate the translations across languages. Extensive experiments with various LLMs reveal notable variations in instruction-following performance across resource levels, identifying key influencing factors such as constraint categories, instruction complexity, and cultural specificity.

Title: KSOD: Knowledge Supplement for LLMs On Demand

Authors: Haoran Li, Junfeng Hu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07550
Pdf URL: https://arxiv.org/pdf/2503.07550
Copy Paste: [[2503.07550]] KSOD: Knowledge Supplement for LLMs On Demand(https://arxiv.org/abs/2503.07550)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet still produce errors in domain-specific tasks. To further improve their performance, we propose KSOD (Knowledge Supplement for LLMs On Demand), a novel framework that empowers LLMs to improve their capabilities with knowledge-based supervised fine-tuning (SFT). KSOD analyzes the causes of errors from the perspective of knowledge deficiency by identifying potential missing knowledge in the LLM that may lead to the errors. Subsequently, KSOD tunes a knowledge module on a knowledge dataset and uses it to verify whether the LLM lacks the identified knowledge. If the knowledge is verified, KSOD supplements the LLM with the identified knowledge using the knowledge module. Tuning LLMs on specific knowledge instead of a specific task decouples task and knowledge. Our experiments on two domain-specific benchmarks and four general benchmarks empirically demonstrate that KSOD enhances the performance of LLMs on tasks requiring the supplemented knowledge while preserving their performance on other tasks. Our findings shed light on the potential of improving the capabilities of LLMs with knowledge-based SFT.

Title: Federated Multimodal Learning with Dual Adapters and Selective Pruning for Communication and Computational Efficiency

Authors: Duy Phuong Nguyen, J. Pablo Munoz, Tanya Roosta, Ali Jannesari
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.07552
Pdf URL: https://arxiv.org/pdf/2503.07552
Copy Paste: [[2503.07552]] Federated Multimodal Learning with Dual Adapters and Selective Pruning for Communication and Computational Efficiency(https://arxiv.org/abs/2503.07552)
Keywords: privacy, federate
Abstract: Federated Learning (FL) enables collaborative learning across distributed clients while preserving data privacy. However, FL faces significant challenges when dealing with heterogeneous data distributions, which can lead to suboptimal global models that fail to generalize across diverse clients. In this work, we propose a novel framework designed to tackle these challenges by introducing a dual-adapter approach. The method utilizes a larger local adapter for client-specific personalization and a smaller global adapter to facilitate efficient knowledge sharing across clients. Additionally, we incorporate a pruning mechanism to reduce communication overhead by selectively removing less impactful parameters from the local adapter. Through extensive experiments on a range of vision and language tasks, our method demonstrates superior performance compared to existing approaches. It achieves higher test accuracy, lower performance variance among clients, and improved worst-case performance, all while significantly reducing communication and computation costs. Overall, the proposed method addresses the critical trade-off between model personalization and generalization, offering a scalable solution for real-world FL applications.

Title: Alligat0R: Pre-Training Through Co-Visibility Segmentation for Relative Camera Pose Regression

Authors: Thibaut Loiseau, Guillaume Bourmaud, Vincent Lepetit
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07561
Pdf URL: https://arxiv.org/pdf/2503.07561
Copy Paste: [[2503.07561]] Alligat0R: Pre-Training Through Co-Visibility Segmentation for Relative Camera Pose Regression(https://arxiv.org/abs/2503.07561)
Keywords: segmentation
Abstract: Pre-training techniques have greatly advanced computer vision, with CroCo's cross-view completion approach yielding impressive results in tasks like 3D reconstruction and pose regression. However, this method requires substantial overlap between training pairs, limiting its effectiveness. We introduce Alligat0R, a novel pre-training approach that reformulates cross-view learning as a co-visibility segmentation task. Our method predicts whether each pixel in one image is co-visible in the second image, occluded, or outside the field of view (FOV), enabling the use of image pairs with any degree of overlap and providing interpretable predictions. To support this, we present Cub3, a large-scale dataset with 2.5 million image pairs and dense co-visibility annotations derived from the nuScenes dataset. This dataset includes diverse scenarios with varying degrees of overlap. The experiments show that Alligat0R significantly outperforms CroCo in relative pose regression, especially in scenarios with limited overlap. Alligat0R and Cub3 will be made publicly available.

Title: Inductive Moment Matching

Authors: Linqi Zhou, Stefano Ermon, Jiaming Song
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2503.07565
Pdf URL: https://arxiv.org/pdf/2503.07565
Copy Paste: [[2503.07565]] Inductive Moment Matching(https://arxiv.org/abs/2503.07565)
Keywords: diffusion, generative
Abstract: Diffusion models and Flow Matching generate high-quality samples but are slow at inference, and distilling them into few-step models often leads to instability and extensive tuning. To resolve these trade-offs, we propose Inductive Moment Matching (IMM), a new class of generative models for one- or few-step sampling with a single-stage training procedure. Unlike distillation, IMM does not require pre-training initialization and optimization of two networks; and unlike Consistency Models, IMM guarantees distribution-level convergence and remains stable under various hyperparameters and standard model architectures. IMM surpasses diffusion models on ImageNet-256x256 with 1.99 FID using only 8 inference steps and achieves state-of-the-art 2-step FID of 1.98 on CIFAR-10 for a model trained from scratch.

Title: Runtime Detection of Adversarial Attacks in AI Accelerators Using Performance Counters

Authors: Habibur Rahaman, Atri Chatterjee, Swarup Bhunia
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07568
Pdf URL: https://arxiv.org/pdf/2503.07568
Copy Paste: [[2503.07568]] Runtime Detection of Adversarial Attacks in AI Accelerators Using Performance Counters(https://arxiv.org/abs/2503.07568)
Keywords: secure, security, protect, attack
Abstract: Rapid adoption of AI technologies raises several major security concerns, including the risks of adversarial perturbations, which threaten the confidentiality and integrity of AI applications. Protecting AI hardware from misuse and diverse security threats is a challenging task. To address this challenge, we propose SAMURAI, a novel framework for safeguarding against malicious usage of AI hardware and its resilience to attacks. SAMURAI introduces an AI Performance Counter (APC) for tracking dynamic behavior of an AI model coupled with an on-chip Machine Learning (ML) analysis engine, known as TANTO (Trained Anomaly Inspection Through Trace Observation). APC records the runtime profile of the low-level hardware events of different AI operations. Subsequently, the summary information recorded by the APC is processed by TANTO to efficiently identify potential security breaches and ensure secure, responsible use of AI. SAMURAI enables real-time detection of security threats and misuse without relying on traditional software-based solutions that require model integration. Experimental results demonstrate that SAMURAI achieves up to 97% accuracy in detecting adversarial attacks with moderate overhead on various AI models, significantly outperforming conventional software-based approaches. It enhances security and regulatory compliance, providing a comprehensive solution for safeguarding AI against emergent threats.

Title: Split-n-Chain: Privacy-Preserving Multi-Node Split Learning with Blockchain-Based Auditability

Authors: Mukesh Sahani, Binanda Sengupta
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07570
Pdf URL: https://arxiv.org/pdf/2503.07570
Copy Paste: [[2503.07570]] Split-n-Chain: Privacy-Preserving Multi-Node Split Learning with Blockchain-Based Auditability(https://arxiv.org/abs/2503.07570)
Keywords: security, privacy, federate
Abstract: Deep learning, when integrated with a large amount of training data, has the potential to outperform machine learning in terms of high accuracy. Recently, privacy-preserving deep learning has drawn significant attention from the research community. Different privacy notions in deep learning include privacy of data provided by data-owners and privacy of parameters and/or hyperparameters of the underlying neural network. Federated learning is a popular privacy-preserving execution environment where data-owners participate in learning the parameters collectively without leaking their respective data to other participants. However, federated learning suffers from certain security/privacy issues. In this paper, we propose Split-n-Chain, a variant of split learning where the layers of the network are split among several distributed nodes. Split-n-Chain achieves several privacy properties: data-owners need not share their training data with other nodes, and no nodes have access to the parameters and hyperparameters of the neural network (except those of the respective layers they hold). Moreover, Split-n-Chain uses blockchain to audit the computation done by different nodes. Our experimental results show that Split-n-Chain is efficient in terms of the time required to execute different phases, and that the training loss trend is similar to that for the same neural network when implemented in a monolithic fashion.

Title: Denoising Score Distillation: From Noisy Diffusion Pretraining to One-Step High-Quality Generation

Authors: Tianyu Chen, Yasi Zhang, Zhendong Wang, Ying Nian Wu, Oscar Leong, Mingyuan Zhou
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.07578
Pdf URL: https://arxiv.org/pdf/2503.07578
Copy Paste: [[2503.07578]] Denoising Score Distillation: From Noisy Diffusion Pretraining to One-Step High-Quality Generation(https://arxiv.org/abs/2503.07578)
Keywords: diffusion, generative
Abstract: Diffusion models have achieved remarkable success in generating high-resolution, realistic images across diverse natural distributions. However, their performance heavily relies on high-quality training data, making it challenging to learn meaningful distributions from corrupted samples. This limitation restricts their applicability in scientific domains where clean data is scarce or costly to obtain. In this work, we introduce denoising score distillation (DSD), a surprisingly effective and novel approach for training high-quality generative models from low-quality data. DSD first pretrains a diffusion model exclusively on noisy, corrupted samples and then distills it into a one-step generator capable of producing refined, clean outputs. While score distillation is traditionally viewed as a method to accelerate diffusion models, we show that it can also significantly enhance sample quality, particularly when starting from a degraded teacher model. Across varying noise levels and datasets, DSD consistently improves generative performance; we summarize our empirical evidence in Fig. 1. Furthermore, we provide theoretical insights showing that, in a linear model setting, DSD identifies the eigenspace of the clean data distribution's covariance matrix, implicitly regularizing the generator. This perspective reframes score distillation as not only a tool for efficiency but also a mechanism for improving generative models, particularly in low-quality data settings.

Title: Robusto-1 Dataset: Comparing Humans and VLMs on real out-of-distribution Autonomous Driving VQA from Peru

Authors: Dunant Cusipuma, David Ortega, Victor Flores-Benites, Arturo Deza
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2503.07587
Pdf URL: https://arxiv.org/pdf/2503.07587
Copy Paste: [[2503.07587]] Robusto-1 Dataset: Comparing Humans and VLMs on real out-of-distribution Autonomous Driving VQA from Peru(https://arxiv.org/abs/2503.07587)
Keywords: robust, segmentation
Abstract: As multimodal foundational models start being deployed experimentally in Self-Driving cars, a reasonable question we ask ourselves is how similarly to humans these systems respond in certain driving situations -- especially those that are out-of-distribution. To study this, we create the Robusto-1 dataset that uses dashcam video data from Peru, a country with some of the worst (most aggressive) drivers in the world, a high traffic index, and a high ratio of bizarre to non-bizarre street objects likely never seen in training. In particular, to preliminarily test at a cognitive level how well Foundational Visual Language Models (VLMs) compare to Humans in Driving, we move away from bounding boxes, segmentation maps, occupancy maps or trajectory estimation to multi-modal Visual Question Answering (VQA), comparing both humans and machines through a popular method in systems neuroscience known as Representational Similarity Analysis (RSA). Depending on the type of questions we ask and the answers these systems give, we show in what cases VLMs and Humans converge or diverge, allowing us to probe their cognitive alignment. We find that the degree of alignment varies significantly depending on the type of questions asked to each type of system (Humans vs VLMs), highlighting a gap in their alignment.

Title: Detection Avoidance Techniques for Large Language Models

Authors: Sinclair Schneider, Florian Steuber, Joao A. G. Schneider, Gabi Dreo Rodosek
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07595
Pdf URL: https://arxiv.org/pdf/2503.07595
Copy Paste: [[2503.07595]] Detection Avoidance Techniques for Large Language Models(https://arxiv.org/abs/2503.07595)
Keywords: generative, large language model
Abstract: The increasing popularity of large language models has not only led to widespread use but has also brought various risks, including the potential for systematically spreading fake news. Consequently, the development of classification systems such as DetectGPT has become vital. These detectors are vulnerable to evasion techniques, as demonstrated in an experimental series: Systematic changes of the generative models' temperature proved shallow learning-detectors to be the least reliable. Fine-tuning the generative model via reinforcement learning circumvented BERT-based-detectors. Finally, rephrasing led to a >90% evasion of zero-shot-detectors like DetectGPT, although texts stayed highly similar to the original. A comparison with existing work highlights the better performance of the presented methods. Possible implications for society and further research are discussed.

Title: HumanMM: Global Human Motion Recovery from Multi-shot Videos

Authors: Yuhong Zhang, Guanlin Wu, Ling-Hao Chen, Zhuokai Zhao, Jing Lin, Xiaoke Jiang, Jiamin Wu, Zhuoheng Li, Hao Frank Yang, Haoqian Wang, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07597
Pdf URL: https://arxiv.org/pdf/2503.07597
Copy Paste: [[2503.07597]] HumanMM: Global Human Motion Recovery from Multi-shot Videos(https://arxiv.org/abs/2503.07597)
Keywords: robust
Abstract: In this paper, we present a novel framework designed to reconstruct long-sequence 3D human motion in world coordinates from in-the-wild videos with multiple shot transitions. Such long-sequence in-the-wild motions are highly valuable to applications such as motion generation and motion understanding, but are very challenging to recover due to abrupt shot transitions, partial occlusions, and dynamic backgrounds present in such videos. Existing methods primarily focus on single-shot videos, where continuity is maintained within a single camera view, or simplify multi-shot alignment in camera space only. In this work, we tackle the challenges by integrating an enhanced camera pose estimation with Human Motion Recovery (HMR) by incorporating a shot transition detector and a robust alignment module for accurate pose and orientation continuity across shots. By leveraging a custom motion integrator, we effectively mitigate the problem of foot sliding and ensure temporal consistency in human pose. Extensive evaluations on our created multi-shot dataset from public 3D human datasets demonstrate the robustness of our method in reconstructing realistic human motion in world coordinates.

Title: VACE: All-in-One Video Creation and Editing

Authors: Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, Yu Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07598
Pdf URL: https://arxiv.org/pdf/2503.07598
Copy Paste: [[2503.07598]] VACE: All-in-One Video Creation and Editing(https://arxiv.org/abs/2503.07598)
Keywords: diffusion, transformer
Abstract: Diffusion Transformer has demonstrated powerful capability and scalability in generating high-quality images and videos. Further pursuing the unification of generation and editing tasks has yielded significant progress in the domain of image content creation. However, due to the intrinsic demands for consistency across both temporal and spatial dynamics, achieving a unified approach for video synthesis remains challenging. We introduce VACE, which enables users to perform Video tasks within an All-in-one framework for Creation and Editing. These tasks include reference-to-video generation, video-to-video editing, and masked video-to-video editing. Specifically, we effectively integrate the requirements of various tasks by organizing video task inputs, such as editing, reference, and masking, into a unified interface referred to as the Video Condition Unit (VCU). Furthermore, by utilizing a Context Adapter structure, we inject different task concepts into the model using formalized representations of temporal and spatial dimensions, allowing it to handle arbitrary video synthesis tasks flexibly. Extensive experiments demonstrate that the unified model of VACE achieves performance on par with task-specific models across various subtasks. Simultaneously, it enables diverse applications through versatile task combinations. Project page: this https URL.

Title: Balanced Image Stylization with Style Matching Score

Authors: Yuxin Jiang, Liming Jiang, Shuai Yang, Jia-Wei Liu, Ivor Tsang, Mike Zheng Shou
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07601
Pdf URL: https://arxiv.org/pdf/2503.07601
Copy Paste: [[2503.07601]] Balanced Image Stylization with Style Matching Score(https://arxiv.org/abs/2503.07601)
Keywords: diffusion
Abstract: We present Style Matching Score (SMS), a novel optimization method for image stylization with diffusion models. Balancing effective style transfer with content preservation is a long-standing challenge. Unlike existing efforts, our method reframes image stylization as a style distribution matching problem. The target style distribution is estimated from off-the-shelf style-dependent LoRAs via carefully designed score functions. To preserve content information adaptively, we propose Progressive Spectrum Regularization, which operates in the frequency domain to guide stylization progressively from low-frequency layouts to high-frequency details. In addition, we devise a Semantic-Aware Gradient Refinement technique that leverages relevance maps derived from diffusion semantic priors to selectively stylize semantically important regions. The proposed optimization formulation extends stylization from pixel space to parameter space, readily applicable to lightweight feedforward generators for efficient one-step stylization. SMS effectively balances style alignment and content preservation, outperforming state-of-the-art approaches, verified by extensive experiments.

Title: Implicit Reasoning in Transformers is Reasoning through Shortcuts

Authors: Tianhe Lin, Jian Xie, Siyu Yuan, Deqing Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07604
Pdf URL: https://arxiv.org/pdf/2503.07604
Copy Paste: [[2503.07604]] Implicit Reasoning in Transformers is Reasoning through Shortcuts(https://arxiv.org/abs/2503.07604)
Keywords: transformer, large language model
Abstract: Test-time compute is emerging as a new paradigm for enhancing language models' complex multi-step reasoning capabilities, as demonstrated by the success of OpenAI's o1 and o3, as well as DeepSeek's R1. Compared to explicit reasoning in test-time compute, implicit reasoning is more inference-efficient, requiring fewer generated tokens. However, why does the advanced reasoning capability fail to emerge in the implicit reasoning style? In this work, we train GPT-2 from scratch on a curated multi-step mathematical reasoning dataset and conduct analytical experiments to investigate how language models perform implicit reasoning in multi-step tasks. Our findings reveal: 1) Language models can perform step-by-step reasoning and achieve high accuracy in both in-domain and out-of-domain tests via implicit reasoning. However, this capability only emerges when trained on fixed-pattern data. 2) Conversely, implicit reasoning abilities emerging from training on unfixed-pattern data tend to overfit a specific pattern and fail to generalize further. Notably, this limitation is also observed in state-of-the-art large language models. These findings suggest that language models acquire implicit reasoning through shortcut learning, enabling strong performance on tasks with similar patterns while lacking generalization.

Title: SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models

Authors: Xun Liang, Hanyu Wang, Huayi Lai, Simin Niu, Shichao Song, Jiawei Yang, Jihao Zhao, Feiyu Xiong, Bo Tang, Zhiyu Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07605
Pdf URL: https://arxiv.org/pdf/2503.07605
Copy Paste: [[2503.07605]] SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models(https://arxiv.org/abs/2503.07605)
Keywords: large language model
Abstract: Large Language Models have achieved remarkable success across various natural language processing tasks, yet their high computational cost during inference remains a major bottleneck. This paper introduces Sparse Expert Activation Pruning (SEAP), a training-free pruning method that selectively retains task-relevant parameters to reduce inference overhead. Inspired by the clustering patterns of hidden states and activations in LLMs, SEAP identifies task-specific expert activation patterns and prunes the model while preserving task performance and enhancing computational efficiency. Experimental results demonstrate that SEAP significantly reduces computational overhead while maintaining competitive accuracy. Notably, at 50% pruning, SEAP surpasses both WandA and FLAP by over 20%, and at 20% pruning, it incurs only a 2.2% performance drop compared to the dense model. These findings highlight SEAP's scalability and effectiveness, making it a promising approach for optimizing large-scale LLMs.

Title: VoD: Learning Volume of Differences for Video-Based Deepfake Detection

Authors: Ying Xu, Marius Pedersen, Kiran Raja
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07607
Pdf URL: https://arxiv.org/pdf/2503.07607
Copy Paste: [[2503.07607]] VoD: Learning Volume of Differences for Video-Based Deepfake Detection(https://arxiv.org/abs/2503.07607)
Keywords: generative
Abstract: The rapid development of deep learning and generative AI technologies has profoundly transformed the digital content landscape, creating realistic Deepfakes that pose substantial challenges to public trust and digital media integrity. This paper introduces a novel Deepfake detection framework, Volume of Differences (VoD), designed to enhance detection accuracy by exploiting temporal and spatial inconsistencies between consecutive video frames. VoD employs a progressive learning approach that captures differences across multiple axes through the use of consecutive frame differences (CFD) and a network with stepwise expansions. We evaluate our approach with intra-dataset and cross-dataset testing scenarios on various well-known Deepfake datasets. Our findings demonstrate that VoD excels with the data it has been trained on and shows strong adaptability to novel, unseen data. Additionally, comprehensive ablation studies examine various configurations of segment length, sampling steps, and intervals, offering valuable insights for optimizing the framework. The code for our VoD framework is available at this https URL.