2025-03-25

Title: How Effective Is Constitutional AI in Small LLMs? A Study on DeepSeek-R1 and Its Peers

Authors: Antonio-Gabriel Chacón Menke (Shibaura Institute of Technology, Kempten University of Applied Sciences), Phan Xuan Tan (Shibaura Institute of Technology)
Subjects: cs.LG, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2503.17365
Pdf URL: https://arxiv.org/pdf/2503.17365
Copy Paste: [[2503.17365]] How Effective Is Constitutional AI in Small LLMs? A Study on DeepSeek-R1 and Its Peers(https://arxiv.org/abs/2503.17365)
Keywords: large language model
Abstract: Recent incidents highlight safety risks in Large Language Models (LLMs), motivating research into alignment methods like Constitutional AI (CAI). This paper explores CAI's self-critique mechanism on small, uncensored 7-9B parameter models: DeepSeek-R1, Gemma-2, Llama 3.1, and Qwen2.5. Using HarmBench, we demonstrate that while all models showed capacity for harm reduction through self-critique, effectiveness varied significantly, with DeepSeek-R1's explicit reasoning process yielding superior results. These findings suggest that CAI-inspired prompting strategies can enhance safety in resource-constrained models, though success depends on the model's capacity for harm detection.
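
The self-critique mechanism named in the abstract above can be pictured as a generate, critique, revise pass. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the two prompt templates and the `generate` callable are hypothetical stand-ins for whatever model interface and wording the paper uses.

```python
from typing import Callable, Dict

# Hypothetical prompt templates; the paper's actual wording is not given in the abstract.
CRITIQUE_TEMPLATE = (
    "Request: {request}\nResponse: {response}\n"
    "Identify any ways in which this response is harmful."
)
REVISION_TEMPLATE = (
    "Request: {request}\nResponse: {response}\nCritique: {critique}\n"
    "Rewrite the response so that it addresses the critique."
)


def self_critique(request: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """Run one generate -> critique -> revise pass with a caller-supplied model function."""
    # Step 1: initial answer to the raw request.
    response = generate(request)
    # Step 2: ask the same model to critique its own answer.
    critique = generate(
        CRITIQUE_TEMPLATE.format(request=request, response=response)
    )
    # Step 3: ask the model to revise the answer in light of the critique.
    revision = generate(
        REVISION_TEMPLATE.format(
            request=request, response=response, critique=critique
        )
    )
    return {"response": response, "critique": critique, "revision": revision}
```

Any function that maps a prompt string to a completion string can be passed as `generate`, so the same loop can be run against each of the compared models.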

Title: State Fourier Diffusion Language Model (SFDLM): A Scalable, Novel Iterative Approach to Language Modeling

Authors: Andrew Kiruluta, Andreas Lemos
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.17382
Pdf URL: https://arxiv.org/pdf/2503.17382
Copy Paste: [[2503.17382]] State Fourier Diffusion Language Model (SFDLM): A Scalable, Novel Iterative Approach to Language Modeling(https://arxiv.org/abs/2503.17382)
Keywords: diffusion, transformer, generative
Abstract: In recent years, diffusion based methods have emerged as a powerful paradigm for generative modeling. Although discrete diffusion for natural language processing has been explored to a lesser extent, it shows promise for tasks requiring iterative denoising of token based data. In standard approaches to text generation, transformers dominate, but their reliance on self attention often incurs high computational costs. This paper introduces a fully diffusion driven discrete text generation model built without any transformer or large convolution modules. Instead, the model integrates structured state space dynamics in the time domain with a novel Complex Fourier Multi Layer Perceptron module that operates in the frequency domain. The forward noising process randomly samples the vocabulary to replace tokens with a controlled probability, while the learned reverse model systematically reverts corrupted sequences toward their original states. By composing local state space updates with global Fourier based mixing, the approach effectively captures both short and long range dependencies.
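
The forward noising process described in the abstract above (replace tokens with vocabulary samples at a controlled probability) can be sketched in a few lines. This is an illustration only: uniform sampling over the vocabulary, a single fixed `replace_prob`, and the function name `forward_noise` are assumptions, since the abstract does not specify the sampling distribution or the noise schedule.

```python
import random
from typing import List, Optional


def forward_noise(
    tokens: List[int],
    vocab_size: int,
    replace_prob: float,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Corrupt a token sequence by random replacement.

    Each position is independently replaced, with probability replace_prob,
    by a token id drawn uniformly from range(vocab_size) (assumed distribution).
    """
    if rng is None:
        rng = random.Random()
    noised = []
    for token in tokens:
        if rng.random() < replace_prob:
            noised.append(rng.randrange(vocab_size))  # corrupted position
        else:
            noised.append(token)  # position left intact
    return noised
```

With `replace_prob=0.0` the sequence is returned unchanged, and with `replace_prob=1.0` every position is resampled; a learned reverse model would be trained to map such corrupted sequences back toward the originals.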

Title: ChatGPT or A Silent Everywhere Helper: A Survey of Large Language Models

Authors: Azim Akhtarshenas, Afshin Dini, Navid Ayoobi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17403
Pdf URL: https://arxiv.org/pdf/2503.17403
Copy Paste: [[2503.17403]] ChatGPT or A Silent Everywhere Helper: A Survey of Large Language Models(https://arxiv.org/abs/2503.17403)
Keywords: privacy, transformer, generative, large language model
Abstract: Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP), with Chat Generative Pre-trained Transformer (ChatGPT) standing out as a notable example due to its advanced capabilities and widespread applications. This survey provides a comprehensive analysis of ChatGPT, exploring its architecture, training processes, and functionalities. We examine its integration into various domains across industries such as customer service, education, healthcare, and entertainment. A comparative analysis with other LLMs highlights ChatGPT's unique features and performance metrics. Regarding benchmarks, the paper examines ChatGPT's comparative performance against other LLMs and discusses potential risks such as misinformation, bias, and data privacy concerns. Additionally, we offer a number of figures and tables that outline the backdrop of the discussion, the main ideas of the article, the numerous LLM models, a thorough list of datasets used for pre-training, fine-tuning, and evaluation, as well as particular LLM applications with pertinent references. Finally, we identify future research directions and technological advancements, underscoring the evolving landscape of LLMs and their profound impact on Artificial Intelligence (AI) and society.

Title: IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes

Authors: Haochen Zhang, Nader Zantout, Pujith Kachana, Ji Zhang, Wenshan Wang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.17406
Pdf URL: https://arxiv.org/pdf/2503.17406
Copy Paste: [[2503.17406]] IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes(https://arxiv.org/abs/2503.17406)
Keywords: robust, large language model
Abstract: With the recent rise of large language models, vision-language models, and other general foundation models, there is growing potential for multimodal, multi-task robotics that can operate in diverse environments given natural language input. One such application is indoor navigation using natural language instructions. However, despite recent progress, this problem remains challenging due to the 3D spatial reasoning and semantic understanding required. Additionally, the language used may be imperfect or misaligned with the scene, further complicating the task. To address this challenge, we curate a benchmark dataset, IRef-VLA, for Interactive Referential Vision and Language-guided Action in 3D Scenes with imperfect references. IRef-VLA is the largest real-world dataset for the referential grounding task, consisting of over 11.5K scanned 3D rooms from existing datasets, 7.6M heuristically generated semantic relations, and 4.7M referential statements. Our dataset also contains semantic object and room annotations, scene graphs, navigable free space annotations, and is augmented with statements where the language has imperfections or ambiguities. We verify the generalizability of our dataset by evaluating with state-of-the-art models to obtain a performance baseline and also develop a graph-search baseline to demonstrate the performance bound and generation of alternatives using scene-graph knowledge. With this benchmark, we aim to provide a resource for 3D scene understanding that aids the development of robust, interactive navigation systems. The dataset and all source code are publicly released at this https URL.

Title: A Comprehensive Survey on Long Context Language Modeling

Authors: Jiaheng Liu, Dawei Zhu, Zhiqi Bai, Yancheng He, Huanxuan Liao, Haoran Que, Zekun Wang, Chenchen Zhang, Ge Zhang, Jiebin Zhang, Yuanxing Zhang, Zhuo Chen, Hangyu Guo, Shilong Li, Ziqiang Liu, Yong Shan, Yifan Song, Jiayi Tian, Wenhao Wu, Zhejian Zhou, Ruijie Zhu, Junlan Feng, Yang Gao, Shizhu He, Zhoujun Li, Tianyu Liu, Fanyu Meng, Wenbo Su, Yingshui Tan, Zili Wang, Jian Yang, Wei Ye, Bo Zheng, Wangchunshu Zhou, Wenhao Huang, Sujian Li, Zhaoxiang Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.17407
Pdf URL: https://arxiv.org/pdf/2503.17407
Copy Paste: [[2503.17407]] A Comprehensive Survey on Long Context Language Modeling(https://arxiv.org/abs/2503.17407)
Keywords: interpretability, large language model
Abstract: Efficient processing of long contexts has been a persistent pursuit in Natural Language Processing. With the growing number of long documents, dialogues, and other textual data, it is important to develop Long Context Language Models (LCLMs) that can process and analyze extensive inputs in an effective and efficient way. In this paper, we present a comprehensive survey on recent advances in long-context modeling for large language models. Our survey is structured around three key aspects: how to obtain effective and efficient LCLMs, how to train and deploy LCLMs efficiently, and how to evaluate and analyze LCLMs comprehensively. For the first aspect, we discuss data strategies, architectural designs, and workflow approaches oriented with long context processing. For the second aspect, we provide a detailed examination of the infrastructure required for LCLM training and inference. For the third aspect, we present evaluation paradigms for long-context comprehension and long-form generation, as well as behavioral analysis and mechanism interpretability of LCLMs. Beyond these three key aspects, we thoroughly explore the diverse application scenarios where existing LCLMs have been deployed and outline promising future development directions. This survey provides an up-to-date review of the literature on long-context LLMs, which we hope will serve as a valuable resource for both researchers and engineers. An associated GitHub repository collecting the latest papers and repos is available at: LCLM-Horizon (this https URL).

Title: Enhancing Subsequent Video Retrieval via Vision-Language Models (VLMs)

Authors: Yicheng Duan, Xi Huang, Duo Chen
Subjects: cs.CV, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2503.17415
Pdf URL: https://arxiv.org/pdf/2503.17415
Copy Paste: [[2503.17415]] Enhancing Subsequent Video Retrieval via Vision-Language Models (VLMs)(https://arxiv.org/abs/2503.17415)
Keywords: robust
Abstract: The rapid growth of video content demands efficient and precise retrieval systems. While vision-language models (VLMs) excel in representation learning, they often struggle with adaptive, time-sensitive video retrieval. This paper introduces a novel framework that combines vector similarity search with graph-based data structures. By leveraging VLM embeddings for initial retrieval and modeling contextual relationships among video segments, our approach enables adaptive query refinement and improves retrieval accuracy. Experiments demonstrate its precision, scalability, and robustness, offering an effective solution for interactive video retrieval in dynamic environments.

Title: Generative Modeling of Class Probability for Multi-Modal Representation Learning

Authors: Jungkyoo Shin, Bumsoo Kim, Eunwoo Kim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17417
Pdf URL: https://arxiv.org/pdf/2503.17417
Copy Paste: [[2503.17417]] Generative Modeling of Class Probability for Multi-Modal Representation Learning(https://arxiv.org/abs/2503.17417)
Keywords: generative
Abstract: Multi-modal understanding plays a crucial role in artificial intelligence by enabling models to jointly interpret inputs from different modalities. However, conventional approaches such as contrastive learning often struggle with modality discrepancies, leading to potential misalignments. In this paper, we propose a novel class anchor alignment approach that leverages class probability distributions for multi-modal representation learning. Our method, Class-anchor-ALigned generative Modeling (CALM), encodes class anchors as prompts to generate and align class probability distributions for each modality, enabling more effective alignment. Furthermore, we introduce a cross-modal probabilistic variational autoencoder to model uncertainty in the alignment, enhancing the ability to capture deeper relationships between modalities and data variations. Extensive experiments on four benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, especially in out-of-domain evaluations. This highlights its superior generalization capabilities in multi-modal representation learning.

Title: V-Seek: Accelerating LLM Reasoning on Open-hardware Server-class RISC-V Platforms

Authors: Javier J. Poveda Rodrigo, Mohamed Amine Ahmdi, Alessio Burrello, Daniele Jahier Pagliari, Luca Benini
Subjects: cs.LG, cs.PF
Abstract URL: https://arxiv.org/abs/2503.17422
Pdf URL: https://arxiv.org/pdf/2503.17422
Copy Paste: [[2503.17422]] V-Seek: Accelerating LLM Reasoning on Open-hardware Server-class RISC-V Platforms(https://arxiv.org/abs/2503.17422)
Keywords: large language model
Abstract: The recent exponential growth of Large Language Models (LLMs) has relied on GPU-based systems. However, CPUs are emerging as a flexible and lower-cost alternative, especially when targeting inference and reasoning workloads. RISC-V is rapidly gaining traction in this area, given its open and vendor-neutral ISA. However, the RISC-V hardware for LLM workloads and the corresponding software ecosystem are not fully mature and streamlined, given the requirement of domain-specific tuning. This paper aims at filling this gap, focusing on optimizing LLM inference on the Sophon SG2042, the first commercially available many-core RISC-V CPU with vector processing capabilities. On two recent state-of-the-art LLMs optimized for reasoning, DeepSeek R1 Distill Llama 8B and DeepSeek R1 Distill QWEN 14B, we achieve 4.32/2.29 token/s for token generation and 6.54/3.68 token/s for prompt processing, with a speedup of up to 2.9x/3.0x compared to our baseline.

Title: Beyond Negation Detection: Comprehensive Assertion Detection Models for Clinical NLP

Authors: Veysel Kocaman, Yigit Gul, M. Aytug Kaya, Hasham Ul Haq, Mehmet Butgul, Cabir Celik, David Talby
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.17425
Pdf URL: https://arxiv.org/pdf/2503.17425
Copy Paste: [[2503.17425]] Beyond Negation Detection: Comprehensive Assertion Detection Models for Clinical NLP(https://arxiv.org/abs/2503.17425)
Keywords: extraction, transformer
Abstract: Assertion status detection is a critical yet often overlooked component of clinical NLP, essential for accurately attributing extracted medical facts. Past studies have narrowly focused on negation detection, leading to underperforming commercial solutions such as AWS Medical Comprehend, Azure AI Text Analytics, and GPT-4o due to their limited domain adaptation. To address this gap, we developed state-of-the-art assertion detection models, including fine-tuned LLMs, transformer-based classifiers, few-shot classifiers, and deep learning (DL) approaches. We evaluated these models against cloud-based commercial API solutions, the legacy rule-based NegEx approach, and GPT-4o. Our fine-tuned LLM achieves the highest overall accuracy (0.962), outperforming GPT-4o (0.901) and commercial APIs by a notable margin, particularly excelling in Present (+4.2%), Absent (+8.4%), and Hypothetical (+23.4%) assertions. Our DL-based models surpass commercial solutions in Conditional (+5.3%) and Associated-with-Someone-Else (+10.1%) categories, while the few-shot classifier offers a lightweight yet highly competitive alternative (0.929), making it ideal for resource-constrained environments. Integrated within Spark NLP, our models consistently outperform black-box commercial solutions while enabling scalable inference and seamless integration with medical NER, Relation Extraction, and Terminology Resolution. These results reinforce the importance of domain-adapted, transparent, and customizable clinical NLP solutions over general-purpose LLMs and proprietary APIs.

Title: Enhanced Smart Contract Reputability Analysis using Multimodal Data Fusion on Ethereum

Authors: Cyrus Malik, Josef Bajada, Joshua Ellul
Subjects: cs.LG, cs.AI, cs.CR, cs.ET
Abstract URL: https://arxiv.org/abs/2503.17426
Pdf URL: https://arxiv.org/pdf/2503.17426
Copy Paste: [[2503.17426]] Enhanced Smart Contract Reputability Analysis using Multimodal Data Fusion on Ethereum(https://arxiv.org/abs/2503.17426)
Keywords: security, robust
Abstract: The evaluation of smart contract reputability is essential to foster trust in decentralized ecosystems. However, existing methods that rely solely on static code analysis or transactional data offer limited insight into evolving trustworthiness. We propose a multimodal data fusion framework that integrates static code features with transactional data to enhance reputability prediction. Our framework initially focuses on static code analysis, utilizing GAN-augmented opcode embeddings to address class imbalance, achieving 97.67% accuracy and a recall of 0.942 in detecting illicit contracts, surpassing traditional oversampling methods. This forms the crux of a reputability-centric fusion strategy, where combining static and transactional data improves recall by 7.25% over single-source models, demonstrating robust performance across validation sets. By providing a holistic view of smart contract behaviour, our approach enhances the model's ability to assess reputability, identify fraudulent activities, and predict anomalous patterns. These capabilities contribute to more accurate reputability assessments, proactive risk mitigation, and enhanced blockchain security.

Title: On-Device Federated Continual Learning on RISC-V-based Ultra-Low-Power SoC for Intelligent Nano-Drone Swarms

Authors: Lars Kröger, Cristian Cioflan, Victor Kartsch, Luca Benini
Subjects: cs.LG, cs.CV, cs.MA
Abstract URL: https://arxiv.org/abs/2503.17436
Pdf URL: https://arxiv.org/pdf/2503.17436
Copy Paste: [[2503.17436]] On-Device Federated Continual Learning on RISC-V-based Ultra-Low-Power SoC for Intelligent Nano-Drone Swarms(https://arxiv.org/abs/2503.17436)
Keywords: privacy, federate
Abstract: RISC-V-based architectures are paving the way for efficient On-Device Learning (ODL) in smart edge devices. When applied across multiple nodes, ODL enables the creation of intelligent sensor networks that preserve data privacy. However, developing ODL-capable, battery-operated embedded platforms presents significant challenges due to constrained computational resources and limited device lifetime, besides intrinsic learning issues such as catastrophic forgetting. We face these challenges by proposing a regularization-based On-Device Federated Continual Learning algorithm tailored for multiple nano-drones performing face recognition tasks. We demonstrate our approach on a RISC-V-based 10-core ultra-low-power SoC, optimizing the ODL computational requirements. We improve the classification accuracy by 24% over naive fine-tuning, requiring 178 ms per local epoch and 10.5 s per global epoch, demonstrating the effectiveness of the architecture for this task.

Title: LEMMA: Learning from Errors for MatheMatical Advancement in LLMs

Authors: Zhuoshi Pan, Yu Li, Honglin Lin, Qizhi Pei, Zinan Tang, Wei Wu, Chenlin Ming, H. Vicky Zhao, Conghui He, Lijun Wu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17439
Pdf URL: https://arxiv.org/pdf/2503.17439
Copy Paste: [[2503.17439]] LEMMA: Learning from Errors for MatheMatical Advancement in LLMs(https://arxiv.org/abs/2503.17439)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated remarkable reasoning capability in solving mathematical problems. However, existing approaches primarily focus on improving the quality of correct training data, e.g., distilling high-quality correct solutions from advanced models, neglecting the value contained in error data, potentially hindering the model's reflective ability. Though some studies attempt to leverage error data, they often involve complex mechanisms, such as Monte Carlo Tree Search (MCTS) to explore error nodes. In this work, we propose to enhance LLMs' reasoning ability by Learning from Errors for Mathematical Advancement (LEMMA). LEMMA constructs data consisting of an incorrect solution with an erroneous step and a reflection connection to a correct solution for fine-tuning. Specifically, we systematically analyze the model-generated error types and introduce an error-type grounded mistake augmentation method to collect diverse and representative errors. Correct solutions are either from fixing the errors or generating a fresh start. Through a model-aware smooth reflection connection, the erroneous solution is transferred to the correct one. By fine-tuning on the constructed dataset, the model is able to self-correct errors autonomously within the generation process without relying on external critique models. Experimental results demonstrate that LEMMA achieves significant performance improvements over other strong baselines.

Title: CausalRivers -- Scaling up benchmarking of causal discovery for real-world time-series

Authors: Gideon Stein, Maha Shadaydeh, Jan Blunk, Niklas Penzel, Joachim Denzler
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2503.17452
Pdf URL: https://arxiv.org/pdf/2503.17452
Copy Paste: [[2503.17452]] CausalRivers -- Scaling up benchmarking of causal discovery for real-world time-series(https://arxiv.org/abs/2503.17452)
Keywords: robust
Abstract: Causal discovery, or identifying causal relationships from observational data, is a notoriously challenging task, with numerous methods proposed to tackle it. Despite this, in-the-wild evaluation of these methods is still lacking, as works frequently rely on synthetic data evaluation and sparse real-world examples under critical theoretical assumptions. Real-world causal structures, however, are often complex, making it hard to decide on a proper causal discovery strategy. To bridge this gap, we introduce CausalRivers, the largest in-the-wild causal discovery benchmarking kit for time-series data to date. CausalRivers features an extensive dataset on river discharge that covers the eastern German territory (666 measurement stations) and the state of Bavaria (494 measurement stations). It spans the years 2019 to 2023 with a 15-minute temporal resolution. Further, we provide additional data from a flood around the Elbe River, as an event with a pronounced distributional shift. Leveraging multiple sources of information and time-series meta-data, we constructed two distinct causal ground truth graphs (Bavaria and eastern Germany). These graphs can be sampled to generate thousands of subgraphs to benchmark causal discovery across diverse and challenging settings. To demonstrate the utility of CausalRivers, we evaluate several causal discovery approaches through a set of experiments to identify areas for improvement. CausalRivers has the potential to facilitate robust evaluations and comparisons of causal discovery methods. Besides this primary purpose, we also expect that this dataset will be relevant for connected areas of research, such as time-series forecasting and anomaly detection. Based on this, we hope to push benchmark-driven method development that fosters advanced techniques for causal discovery, as is the case for many other areas of machine learning.

Title: Feature-Based Dual Visual Feature Extraction Model for Compound Multimodal Emotion Recognition

Authors: Ran Liu, Fengyu Zhang, Cong Yu, Longjiang Yang, Zhuofan Wen, Siyuan Zhang, Hailiang Yao, Shun Chen, Zheng Lian, Bin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17453
Pdf URL: https://arxiv.org/pdf/2503.17453
Copy Paste: [[2503.17453]] Feature-Based Dual Visual Feature Extraction Model for Compound Multimodal Emotion Recognition(https://arxiv.org/abs/2503.17453)
Keywords: extraction, transformer
Abstract: This article presents our results for the eighth Affective Behavior Analysis in-the-wild (ABAW) this http URL emotion recognition (ER) has important applications in affective computing and human-computer interaction. However, in the real world, compound emotion recognition faces greater issues of uncertainty and modal conflicts. For the Compound Expression (CE) Recognition Challenge, this paper proposes a multimodal emotion recognition method that fuses the features of Vision Transformer (ViT) and Residual Network (ResNet). We conducted experiments on the C-EXPR-DB and MELD datasets. The results show that in scenarios with complex visual and audio cues (such as C-EXPR-DB), the model that fuses the features of ViT and ResNet exhibits superior this http URL code are available on this https URL

Title: Collaborative Value Function Estimation Under Model Mismatch: A Federated Temporal Difference Analysis

Authors: Ali Beikmohammadi, Sarit Khirirat, Peter Richtárik, Sindri Magnússon
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.17454
Pdf URL: https://arxiv.org/pdf/2503.17454
Copy Paste: [[2503.17454]] Collaborative Value Function Estimation Under Model Mismatch: A Federated Temporal Difference Analysis(https://arxiv.org/abs/2503.17454)
Keywords: privacy, federate
Abstract: Federated reinforcement learning (FedRL) enables collaborative learning while preserving data privacy by preventing direct data exchange between agents. However, many existing FedRL algorithms assume that all agents operate in identical environments, which is often unrealistic. In real-world applications -- such as multi-robot teams, crowdsourced systems, and large-scale sensor networks -- each agent may experience slightly different transition dynamics, leading to inherent model mismatches. In this paper, we first establish linear convergence guarantees for single-agent temporal difference learning (TD(0)) in policy evaluation and demonstrate that under a perturbed environment, the agent suffers a systematic bias that prevents accurate estimation of the true value function. This result holds under both i.i.d. and Markovian sampling regimes. We then extend our analysis to the federated TD(0) (FedTD(0)) setting, where multiple agents -- each interacting with its own perturbed environment -- periodically share value estimates to collaboratively approximate the true value function of a common underlying model. Our theoretical results indicate the impact of model mismatch, network connectivity, and mixing behavior on the convergence of FedTD(0). Empirical experiments corroborate our theoretical gains, highlighting that even moderate levels of information sharing can significantly mitigate environment-specific errors.
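
For readers unfamiliar with the TD(0) update analyzed in the abstract above, the following is a minimal tabular sketch of one local update and one sharing round. It rests on assumptions that the abstract does not confirm: tabular value estimates (the paper may use function approximation), plain averaging as the sharing rule, and the hypothetical function names `td0_update` and `average_values`.

```python
from typing import List


def td0_update(
    values: List[float],
    state: int,
    reward: float,
    next_state: int,
    alpha: float = 0.1,
    gamma: float = 0.9,
) -> float:
    """Apply one tabular TD(0) step in place and return the TD error.

    V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
    """
    td_error = reward + gamma * values[next_state] - values[state]
    values[state] += alpha * td_error
    return td_error


def average_values(agent_values: List[List[float]]) -> List[float]:
    """Average per-state value estimates across agents (assumed sharing rule)."""
    n_agents = len(agent_values)
    n_states = len(agent_values[0])
    return [
        sum(values[s] for values in agent_values) / n_agents
        for s in range(n_states)
    ]
```

In a federated loop, each agent would call `td0_update` on transitions from its own perturbed environment and the agents would periodically replace their local tables with the output of `average_values`.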

Title: Language-specific Neurons Do Not Facilitate Cross-Lingual Transfer

Authors: Soumen Kumar Mondal, Sayambhu Sen, Abhishek Singhania, Preethi Jyothi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.17456
Pdf URL: https://arxiv.org/pdf/2503.17456
Copy Paste: [[2503.17456]] Language-specific Neurons Do Not Facilitate Cross-Lingual Transfer(https://arxiv.org/abs/2503.17456)
Keywords: robust, large language model
Abstract: Multilingual large language models (LLMs) aim towards robust natural language understanding across diverse languages, yet their performance significantly degrades on low-resource languages. This work explores whether existing techniques to identify language-specific neurons can be leveraged to enhance cross-lingual task performance of low-resource languages. We conduct detailed experiments covering existing language-specific neuron identification techniques (such as Language Activation Probability Entropy and activation probability-based thresholding) and neuron-specific LoRA fine-tuning with models like Llama 3.1 and Mistral Nemo. We find that such neuron-specific interventions are insufficient to yield cross-lingual improvements on downstream tasks (XNLI, XQuAD) in low-resource languages. This study highlights the challenges in achieving cross-lingual generalization and provides critical insights for multilingual LLMs.

Title: Bayesian generative models can flag performance loss, bias, and out-of-distribution image content

Authors: Miguel López-Pérez, Marco Miani, Valery Naranjo, Søren Hauberg, Aasa Feragen
Subjects: cs.LG, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2503.17477
Pdf URL: https://arxiv.org/pdf/2503.17477
Copy Paste: [[2503.17477]] Bayesian generative models can flag performance loss, bias, and out-of-distribution image content(https://arxiv.org/abs/2503.17477)
Keywords: extraction, generative
Abstract: Generative models are popular for medical imaging tasks such as anomaly detection, feature extraction, data visualization, or image generation. Since they are parameterized by deep learning models, they are often sensitive to distribution shifts and unreliable when applied to out-of-distribution data, creating a risk of, e.g. underrepresentation bias. This behavior can be flagged using uncertainty quantification methods for generative models, but their availability remains limited. We propose SLUG: A new UQ method for VAEs that combines recent advances in Laplace approximations with stochastic trace estimators to scale gracefully with image dimensionality. We show that our UQ score -- unlike the VAE's encoder variances -- correlates strongly with reconstruction error and racial underrepresentation bias for dermatological images. We also show how pixel-wise uncertainty can detect out-of-distribution image content such as ink, rulers, and patches, which is known to induce learning shortcuts in predictive models.

Title: What's Producible May Not Be Reachable: Measuring the Steerability of Generative Models

Authors: Keyon Vafa, Sarah Bentley, Jon Kleinberg, Sendhil Mullainathan
Subjects: cs.LG, cs.AI, cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2503.17482
Pdf URL: https://arxiv.org/pdf/2503.17482
Copy Paste: [[2503.17482]] What's Producible May Not Be Reachable: Measuring the Steerability of Generative Models(https://arxiv.org/abs/2503.17482)
Keywords: generative, large language model
Abstract: How should we evaluate the quality of generative models? Many existing metrics focus on a model's producibility, i.e. the quality and breadth of outputs it can generate. However, the actual value from using a generative model stems not just from what it can produce but whether a user with a specific goal can produce an output that satisfies that goal. We refer to this property as steerability. In this paper, we first introduce a mathematical framework for evaluating steerability independently from producibility. Steerability is more challenging to evaluate than producibility because it requires knowing a user's goals. We address this issue by creating a benchmark task that relies on one key idea: sample an output from a generative model and ask users to reproduce it. We implement this benchmark in a large-scale user study of text-to-image models and large language models. Despite the ability of these models to produce high-quality outputs, they all perform poorly on steerability. This suggests that we need to focus on improving the steerability of generative models. We show such improvements are indeed possible: through reinforcement learning techniques, we create an alternative steering mechanism for image models that achieves more than 2x improvement on this benchmark.

Title: SaudiCulture: A Benchmark for Evaluating Large Language Models Cultural Competence within Saudi Arabia

Authors: Lama Ayash, Hassan Alhuzali, Ashwag Alasmari, Sultan Aloufi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17485
Pdf URL: https://arxiv.org/pdf/2503.17485
Copy Paste: [[2503.17485]] SaudiCulture: A Benchmark for Evaluating Large Language Models Cultural Competence within Saudi Arabia(https://arxiv.org/abs/2503.17485)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing; however, they often struggle to accurately capture and reflect cultural nuances. This research addresses this challenge by focusing on Saudi Arabia, a country characterized by diverse dialects and rich cultural traditions. We introduce SaudiCulture, a novel benchmark designed to evaluate the cultural competence of LLMs within the distinct geographical and cultural contexts of Saudi Arabia. SaudiCulture is a comprehensive dataset of questions covering five major geographical regions (West, East, South, North, and Center), along with general questions applicable across all regions. The dataset encompasses a broad spectrum of cultural domains, including food, clothing, entertainment, celebrations, and crafts. To ensure a rigorous evaluation, SaudiCulture includes questions of varying complexity, such as open-ended, single-choice, and multiple-choice formats, with some requiring multiple correct answers. Additionally, the dataset distinguishes between common cultural knowledge and specialized regional aspects. We conduct extensive evaluations on five LLMs (GPT-4, Llama 3.3, FANAR, Jais, and AceGPT), analyzing their performance across different question types and cultural contexts. Our findings reveal that all models experience significant performance declines when faced with highly specialized or region-specific questions, particularly those requiring multiple correct responses. Additionally, certain cultural categories are more easily identifiable than others, further highlighting inconsistencies in LLMs' cultural understanding. These results emphasize the importance of incorporating region-specific knowledge into LLM training to enhance their cultural competence.

Title: ProDehaze: Prompting Diffusion Models Toward Faithful Image Dehazing

Authors: Tianwen Zhou, Jing Wang, Songtao Wu, Kuanhong Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17488
Pdf URL: https://arxiv.org/pdf/2503.17488
Copy Paste: [[2503.17488]] ProDehaze: Prompting Diffusion Models Toward Faithful Image Dehazing(https://arxiv.org/abs/2503.17488)
Keywords: diffusion
Abstract: Recent approaches using large-scale pretrained diffusion models for image dehazing improve perceptual quality but often suffer from hallucination issues, producing dehazed images that are unfaithful to the original. To mitigate this, we propose ProDehaze, a framework that employs internal image priors to direct external priors encoded in pretrained models. We introduce two types of selective internal priors that prompt the model to concentrate on critical image areas: a Structure-Prompted Restorer in the latent space that emphasizes structure-rich regions, and a Haze-Aware Self-Correcting Refiner in the decoding process to align distributions between clearer input regions and the output. Extensive experiments on real-world datasets demonstrate that ProDehaze achieves high-fidelity results in image dehazing, particularly in reducing color shifts. Our code is at this https URL.

Title: Judge Anything: MLLM as a Judge Across Any Modality

Authors: Shu Pu, Yaochen Wang, Dongping Chen, Yuhang Chen, Guohao Wang, Qi Qin, Zhongyi Zhang, Zhiyuan Zhang, Zetong Zhou, Shuang Gong, Yi Gui, Yao Wan, Philip S. Yu
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2503.17489
Pdf URL: https://arxiv.org/pdf/2503.17489
Copy Paste: [[2503.17489]] Judge Anything: MLLM as a Judge Across Any Modality(https://arxiv.org/abs/2503.17489)
Keywords: fair, generative
Abstract: Evaluating generative foundation models on open-ended multimodal understanding (MMU) and generation (MMG) tasks across diverse modalities (e.g., images, audio, video) poses significant challenges due to the complexity of cross-modal interactions. To this end, the idea of utilizing Multimodal LLMs (MLLMs) as automated judges has emerged, with encouraging results in assessing vision-language understanding tasks. Moving further, this paper extends MLLM-as-a-Judge across modalities in a unified manner by introducing two benchmarks, TaskAnything and JudgeAnything, to respectively evaluate the overall performance and judging capabilities of MLLMs across any-to-any modality tasks. Specifically, TaskAnything evaluates the MMU and MMG capabilities across 15 any-to-any modality categories, employing 1,500 queries curated from well-established benchmarks. Furthermore, JudgeAnything evaluates the judging capabilities of 5 advanced MLLMs (e.g., GPT-4o and Gemini-2.0-Flash) from the perspectives of Pair Comparison and Score Evaluation, providing a standardized testbed that incorporates human judgments and detailed rubrics. Our extensive experiments reveal that while these MLLMs show promise in assessing MMU (i.e., achieving an average of 66.55% in Pair Comparison setting and 42.79% in Score Evaluation setting), they encounter significant challenges with MMG tasks (i.e., averaging only 53.37% in Pair Comparison setting and 30.05% in Score Evaluation setting), exposing cross-modality biases and hallucination issues. To address this, we present OmniArena, an automated platform for evaluating omni-models and multimodal reward models. Our work highlights the need for fairer evaluation protocols and stronger alignment with human preferences. The source code and dataset are publicly available at: this https URL.
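Illustration (not from the paper): the abstract reports Pair Comparison percentages without saying how they are computed. One plausible reading, assumed here, is the fraction of comparisons where the judge's verdict matches the human label. The function name and the "A"/"B"/"tie" label set are hypothetical.

```python
from typing import Sequence


def pair_agreement(judge: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of comparisons where the judge verdict equals the human label."""
    if len(judge) != len(human):
        raise ValueError("verdict lists must have the same length")
    if not judge:
        raise ValueError("need at least one comparison")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


# Toy data: each entry says which of two responses was preferred.
judge_verdicts = ["A", "B", "tie", "A"]
human_labels = ["A", "B", "B", "B"]
print(pair_agreement(judge_verdicts, human_labels))  # 0.5
```

Under this assumption, a score near 50% on a two-way comparison would be close to chance, which is one way to read the gap between the MMU and MMG figures.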

Title: Meme Similarity and Emotion Detection using Multimodal Analysis

Authors: Aidos Konyspay, Pakizar Shamoi, Malika Ziyada, Zhusup Smambayev
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17493
Pdf URL: https://arxiv.org/pdf/2503.17493
Copy Paste: [[2503.17493]] Meme Similarity and Emotion Detection using Multimodal Analysis(https://arxiv.org/abs/2503.17493)
Keywords: robust
Abstract: Internet memes are a central element of online culture, blending images and text. While substantial research has focused on either the visual or textual components of memes, little attention has been given to their interplay. This gap raises a key question: What methodology can effectively compare memes and the emotions they elicit? Our study employs a multimodal methodological approach, analyzing both the visual and textual elements of memes. Specifically, we use a multimodal CLIP (Contrastive Language-Image Pre-training) model to group similar memes based on text and visual content embeddings, enabling robust similarity assessments across modalities. Using the Reddit Meme Dataset and Memotion Dataset, we extract low-level visual features and high-level semantic features to identify similar meme pairs. To validate these automated similarity assessments, we conducted a user study with 50 participants, asking them to provide yes/no responses regarding meme similarity and their emotional reactions. The comparison of experimental results with human judgments showed a 67.23% agreement, suggesting that the computational approach aligns well with human perception. Additionally, we implemented a text-based classifier using the DistilBERT model to categorize memes into one of six basic emotions. The results indicate that anger and joy are the dominant emotions in memes, with motivational memes eliciting stronger emotional responses. This research contributes to the study of multimodal memes, enhancing both language-based and visual approaches to analyzing and improving online visual communication and user experiences. Furthermore, it provides insights for better content moderation strategies in online platforms.
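Illustration (not from the paper): the abstract does not name the similarity measure applied to the embeddings. Cosine similarity is a common choice for CLIP-style vectors and is assumed here. The toy two-dimensional vectors stand in for real embeddings, and the threshold and function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def similar_pairs(
    embeddings: Dict[str, Sequence[float]], threshold: float
) -> List[Tuple[str, str]]:
    """Return all unordered pairs whose cosine similarity meets the threshold."""
    names = sorted(embeddings)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if cosine_similarity(embeddings[a], embeddings[b]) >= threshold
    ]


# Toy vectors standing in for combined image-text embeddings.
toy = {"m1": [1.0, 0.0], "m2": [1.0, 0.1], "m3": [0.0, 1.0]}
print(similar_pairs(toy, threshold=0.9))  # [('m1', 'm2')]
```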

Title: Efficient Knowledge Distillation via Curriculum Extraction

Authors: Shivam Gupta, Sushrut Karmalkar
Subjects: cs.LG, cs.AI, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2503.17494
Pdf URL: https://arxiv.org/pdf/2503.17494
Copy Paste: [[2503.17494]] Efficient Knowledge Distillation via Curriculum Extraction(https://arxiv.org/abs/2503.17494)
Keywords: extraction, transformer
Abstract: Knowledge distillation is a technique used to train a small student network using the output generated by a large teacher network, and has many empirical advantages (Hinton et al., 2015). While the standard one-shot approach to distillation only uses the output of the final teacher network, recent work (Panigrahi et al., 2024) has shown that using intermediate checkpoints from the teacher's training process as an implicit "curriculum" for progressive distillation can significantly speed up training. However, such schemes require storing these checkpoints, and often require careful selection of the intermediate checkpoints to train on, which can be impractical for large-scale training. In this paper, we show that a curriculum can be extracted from just the fully trained teacher network, and that this extracted curriculum can give similar efficiency benefits to those of progressive distillation. Our extraction scheme is natural; we use a random projection of the hidden representations of the teacher network to progressively train the student network, before training using the output of the full network. We show that our scheme significantly outperforms one-shot distillation and achieves a performance similar to that of progressive distillation for learning sparse parities with two-layer networks, and provide theoretical guarantees for this setting. Additionally, we show that our method outperforms one-shot distillation even when using transformer-based architectures, both for sparse-parity learning, and language modeling tasks.

Title: Variance Control via Weight Rescaling in LLM Pre-training

Authors: Louis Owen, Abhay Kumar, Nilabhra Roy Chowdhury, Fabian Güra
Subjects: cs.LG, cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2503.17500
Pdf URL: https://arxiv.org/pdf/2503.17500
Copy Paste: [[2503.17500]] Variance Control via Weight Rescaling in LLM Pre-training(https://arxiv.org/abs/2503.17500)
Keywords: large language model
Abstract: The outcome of Large Language Model (LLM) pre-training strongly depends on weight initialization and variance control strategies. Although the importance of initial variance control has been well documented in neural networks in general, the literature on initialization and management of its growth during LLM pre-training, specifically, is somewhat sparse. In this paper, we introduce the Layer Index Rescaling (LIR) weight initialization scheme, and the Target Variance Rescaling (TVR) variance control strategy. Experiments on a 1B parameter LLaMA model demonstrate that better variance management using these techniques yields substantial improvements in downstream task performance (up to 4.6% on common pre-training benchmarks) and reduces extreme activation values, thus mitigating challenges associated with quantization and low-precision training. Our code is available at: this https URL.

Title: Towards Understanding the Benefits of Neural Network Parameterizations in Geophysical Inversions: A Study With Neural Fields

Authors: Anran Xu, Lindsey J. Heagy
Subjects: cs.LG, physics.geo-ph, stat.ML
Abstract URL: https://arxiv.org/abs/2503.17503
Pdf URL: https://arxiv.org/pdf/2503.17503
Copy Paste: [[2503.17503]] Towards Understanding the Benefits of Neural Network Parameterizations in Geophysical Inversions: A Study With Neural Fields(https://arxiv.org/abs/2503.17503)
Keywords: diffusion, generative
Abstract: In this work, we employ neural fields, which use neural networks to map a coordinate to the corresponding physical property value at that coordinate, in a test-time learning manner. For a test-time learning method, the weights are learned during the inversion, as compared to traditional approaches which require a network to be trained using a training data set. Results for synthetic examples in seismic tomography and direct current resistivity inversions are shown first. We then perform a singular value decomposition analysis on the Jacobian of the weights of the neural network (SVD analysis) for both cases to explore the effects of neural networks on the recovered model. The results show that the test-time learning approach can eliminate unwanted artifacts in the recovered subsurface physical property model caused by the sensitivity of the survey and physics. Therefore, NFs-Inv improves the inversion results compared to the conventional inversion in some cases such as the recovery of the dip angle or the prediction of the boundaries of the main target. In the SVD analysis, we observe similar patterns in the left-singular vectors as were observed in some diffusion models, trained in a supervised manner, for generative tasks in computer vision. This observation provides evidence that there is an implicit bias, which is inherent in neural network structures, that is useful in supervised learning and test-time learning models. This implicit bias has the potential to be useful for recovering models in geophysical inversions.

Title: Improving Quantization with Post-Training Model Expansion

Authors: Giuseppe Franco, Pablo Monteagudo-Lago, Ian Colbert, Nicholas Fraser, Michaela Blott
Subjects: cs.LG, cs.AI, cs.AR
Abstract URL: https://arxiv.org/abs/2503.17513
Pdf URL: https://arxiv.org/pdf/2503.17513
Copy Paste: [[2503.17513]] Improving Quantization with Post-Training Model Expansion(https://arxiv.org/abs/2503.17513)
Keywords: large language model
Abstract: The size of a model has been a strong predictor of its quality, as well as its cost. As such, the trade-off between model cost and quality has been well-studied. Post-training optimizations like quantization and pruning have typically focused on reducing the overall volume of pre-trained models to reduce inference costs while maintaining model quality. However, recent advancements have introduced optimization techniques that, interestingly, expand models post-training, increasing model size to improve quality when reducing volume. For instance, to enable 4-bit weight and activation quantization, incoherence processing often necessitates inserting online Hadamard rotations in the compute graph, and preserving highly sensitive weights often calls for additional higher precision computations. However, if application requirements cannot be met, the prevailing solution is to relax quantization constraints. In contrast, we demonstrate post-training model expansion is a viable strategy to improve model quality within a quantization co-design space, and provide theoretical justification. We show it is possible to progressively and selectively expand the size of a pre-trained large language model (LLM) to improve model quality without end-to-end retraining. In particular, when quantizing the weights and activations to 4 bits for Llama3 1B, we reduce the zero-shot accuracy gap to full precision by an average of 3% relative to both QuaRot and SpinQuant with only 5% more parameters, which is still a 3.8% reduction in volume relative to a BF16 reference model.

Title: Language Models May Verbatim Complete Text They Were Not Explicitly Trained On

Authors: Ken Ziyu Liu, Christopher A. Choquette-Choo, Matthew Jagielski, Peter Kairouz, Sanmi Koyejo, Percy Liang, Nicolas Papernot
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.17514
Pdf URL: https://arxiv.org/pdf/2503.17514
Copy Paste: [[2503.17514]] Language Models May Verbatim Complete Text They Were Not Explicitly Trained On(https://arxiv.org/abs/2503.17514)
Keywords: large language model
Abstract: An important question today is whether a given text was used to train a large language model (LLM). A completion test is often employed: check if the LLM completes a sufficiently complex text. This, however, requires a ground-truth definition of membership; most commonly, it is defined as a member based on the $n$-gram overlap between the target text and any text in the dataset. In this work, we demonstrate that this $n$-gram based membership definition can be effectively gamed. We study scenarios where sequences are non-members for a given $n$ and we find that completion tests still succeed. We find many natural cases of this phenomenon by retraining LLMs from scratch after removing all training samples that were completed; these cases include exact duplicates, near-duplicates, and even short overlaps. They showcase that it is difficult to find a single viable choice of $n$ for membership definitions. Using these insights, we design adversarial datasets that can cause a given target sequence to be completed without containing it, for any reasonable choice of $n$. Our findings highlight the inadequacy of $n$-gram membership, suggesting membership definitions fail to account for auxiliary information available to the training algorithm.
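Illustration (not from the paper): a minimal sketch of an $n$-gram membership check. It assumes whitespace tokenization and treats a target as a member if it shares at least one $n$-gram with any dataset text. The paper's exact rule may differ, and the function names are hypothetical.

```python
from typing import Iterable, Sequence, Set, Tuple


def ngrams(tokens: Sequence[str], n: int) -> Set[Tuple[str, ...]]:
    """All contiguous n-token windows; empty if the text is shorter than n."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_member(target: str, dataset: Iterable[str], n: int) -> bool:
    """True if the target shares at least one n-gram with any dataset text."""
    target_grams = ngrams(target.split(), n)
    return any(target_grams & ngrams(doc.split(), n) for doc in dataset)


dataset = ["the quick brown fox jumps", "over the lazy dog"]
target = "quick brown fox jumps over the lazy dog"

print(is_member(target, dataset, n=3))  # True
print(is_member(target, dataset, n=5))  # False
```

The same target flips from member to non-member as $n$ grows, even though the dataset's two texts jointly cover every word of it. That dependence on $n$ is the kind of fragility the abstract describes.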

Title: Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models

Authors: Linlu Qiu, Fei Sha, Kelsey Allen, Yoon Kim, Tal Linzen, Sjoerd van Steenkiste
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17523
Pdf URL: https://arxiv.org/pdf/2503.17523
Copy Paste: [[2503.17523]] Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models(https://arxiv.org/abs/2503.17523)
Keywords: large language model
Abstract: Artificial intelligence systems based on large language models (LLMs) are increasingly used as agents that interact with users and with the world. To do so successfully, LLMs need to construct internal representations of the world and form probabilistic beliefs about those representations. To provide a user with personalized recommendations, for example, the LLM needs to gradually infer the user's preferences, over the course of multiple interactions. To evaluate whether contemporary LLMs are able to do so, we use the Bayesian inference framework from probability theory, which lays out the optimal way to update an agent's beliefs as it receives new information. We first show that the LLMs do not update their beliefs as expected from the Bayesian framework, and that consequently their predictions do not improve as expected as more information becomes available, even less so than we find is the case for humans. To address this issue, we teach the LLMs to reason in a Bayesian manner by training them to mimic the predictions of an optimal Bayesian model. We find that this approach not only significantly improves the LLM's performance on the particular recommendation task it is trained on, but also enables generalization to other tasks. This suggests that this method endows the LLM with broader Bayesian reasoning skills. More generally, our results indicate that LLMs can learn about reasoning strategies effectively and generalize those skills to new domains, which in part explains LLMs' empirical success.
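Illustration (not from the paper): the reference behaviour the abstract compares against is Bayes' rule applied after each observation. The sketch below updates a belief over two hypothetical user-preference hypotheses. The hypothesis names and likelihood values are invented for the example.

```python
from typing import Dict


def bayes_update(
    prior: Dict[str, float], likelihood: Dict[str, float]
) -> Dict[str, float]:
    """Posterior over hypotheses: prior times likelihood, renormalized to sum to 1."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under every hypothesis")
    return {h: p / total for h, p in unnormalized.items()}


# Hypothetical setup: does the user prefer cheap options or fast options?
belief = {"prefers_cheap": 0.5, "prefers_fast": 0.5}

# Assumed probability of the user picking the cheap option under each hypothesis.
picked_cheap = {"prefers_cheap": 0.8, "prefers_fast": 0.2}

# Two interactions in which the user picks the cheap option.
for _ in range(2):
    belief = bayes_update(belief, picked_cheap)

print(belief)
```

Each consistent observation sharpens the belief, so predictions should improve as interactions accumulate. The abstract reports that the evaluated LLMs fall short of this pattern before training.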

Title: Should we pre-train a decoder in contrastive learning for dense prediction tasks?

Authors: Sébastien Quetin, Tapotosh Ghosh, Farhad Maleki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17526
Pdf URL: https://arxiv.org/pdf/2503.17526
Copy Paste: [[2503.17526]] Should we pre-train a decoder in contrastive learning for dense prediction tasks?(https://arxiv.org/abs/2503.17526)
Keywords: segmentation
Abstract: Contrastive learning in self-supervised settings primarily focuses on pre-training encoders, while decoders are typically introduced and trained separately for downstream dense prediction tasks. This conventional approach, however, overlooks the potential benefits of jointly pre-training both the encoder and decoder. In this paper, we propose DeCon: a framework-agnostic adaptation to convert an encoder-only self-supervised learning (SSL) contrastive approach to an efficient encoder-decoder framework that can be pre-trained in a contrastive manner. We first update the existing architecture to accommodate a decoder and its respective contrastive loss. We then introduce a weighted encoder-decoder contrastive loss with non-competing objectives that facilitates the joint encoder-decoder architecture pre-training. We adapt two established contrastive SSL frameworks tailored for dense prediction tasks, achieve new state-of-the-art results in COCO object detection and instance segmentation, and match state-of-the-art performance on Pascal VOC semantic segmentation. We show that our approach allows for pre-training a decoder and enhances the representation power of the encoder and its performance in dense prediction tasks. This benefit holds across heterogeneous decoder architectures between pre-training and fine-tuning and persists in out-of-domain, limited-data scenarios.

Title: FMDConv: Fast Multi-Attention Dynamic Convolution via Speed-Accuracy Trade-off

Authors: Tianyu Zhang, Fan Wan, Haoran Duan, Kevin W. Tong, Jingjing Deng, Yang Long
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17530
Pdf URL: https://arxiv.org/pdf/2503.17530
Copy Paste: [[2503.17530]] FMDConv: Fast Multi-Attention Dynamic Convolution via Speed-Accuracy Trade-off(https://arxiv.org/abs/2503.17530)
Keywords: extraction, federate
Abstract: Spatial convolution is fundamental in constructing deep Convolutional Neural Networks (CNNs) for visual recognition. While dynamic convolution enhances model accuracy by adaptively combining static kernels, it incurs significant computational overhead, limiting its deployment in resource-constrained environments such as federated edge computing. To address this, we propose Fast Multi-Attention Dynamic Convolution (FMDConv), which integrates input attention, temperature-degraded kernel attention, and output attention to optimize the speed-accuracy trade-off. FMDConv achieves a better balance between accuracy and efficiency by selectively enhancing feature extraction with lower complexity. Furthermore, we introduce two novel quantitative metrics, the Inverse Efficiency Score and Rate-Correct Score, to systematically evaluate this trade-off. Extensive experiments on CIFAR-10, CIFAR-100, and ImageNet demonstrate that FMDConv reduces the computational cost by up to 49.8% on ResNet-18 and 42.2% on ResNet-50 compared to prior multi-attention dynamic convolution methods while maintaining competitive accuracy. These advantages make FMDConv highly suitable for real-world, resource-constrained applications.

Title: MetaSel: A Test Selection Approach for Fine-tuned DNN Models

Authors: Amin Abbasishahkoo, Mahboubeh Dadkhah, Lionel Briand, Dayi Lin
Subjects: cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2503.17534
Pdf URL: https://arxiv.org/pdf/2503.17534
Copy Paste: [[2503.17534]] MetaSel: A Test Selection Approach for Fine-tuned DNN Models(https://arxiv.org/abs/2503.17534)
Keywords: robust
Abstract: Deep Neural Networks (DNNs) face challenges during deployment due to data distribution shifts. Fine-tuning adapts pre-trained models to new contexts requiring smaller labeled sets. However, testing fine-tuned models under constrained labeling budgets remains a critical challenge. This paper introduces MetaSel, a new approach, tailored for fine-tuned DNN models, to select tests from unlabeled inputs. MetaSel assumes that fine-tuned and pre-trained models share related data distributions and exhibit similar behaviors for many inputs. However, their behaviors diverge within the input subspace where fine-tuning alters decision boundaries, making those inputs more prone to misclassification. Unlike general approaches that rely solely on the DNN model and its input set, MetaSel leverages information from both the fine-tuned and pre-trained models and their behavioral differences to estimate misclassification probability for unlabeled test inputs, enabling more effective test selection. Our extensive empirical evaluation, comparing MetaSel against 10 state-of-the-art approaches and involving 68 fine-tuned models across weak, medium, and strong distribution shifts, demonstrates that MetaSel consistently delivers significant improvements in Test Relative Coverage (TRC) over existing baselines, particularly under highly constrained labeling budgets. MetaSel shows average TRC improvements of 28.46% to 56.18% over the most frequent second-best baselines while maintaining a high TRC median and low variability. Our results confirm MetaSel's practicality, robustness, and cost-effectiveness for test selection in the context of fine-tuned models.

Title: DermDiff: Generative Diffusion Model for Mitigating Racial Biases in Dermatology Diagnosis

Authors: Nusrat Munia, Abdullah-Al-Zubaer Imran
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17536
Pdf URL: https://arxiv.org/pdf/2503.17536
Copy Paste: [[2503.17536]] DermDiff: Generative Diffusion Model for Mitigating Racial Biases in Dermatology Diagnosis(https://arxiv.org/abs/2503.17536)
Keywords: diffusion, generative
Abstract: Skin diseases, such as skin cancer, are a significant public health issue, and early diagnosis is crucial for effective treatment. Artificial intelligence (AI) algorithms have the potential to assist in triaging benign vs malignant skin lesions and improve diagnostic accuracy. However, existing AI models for skin disease diagnosis are often developed and tested on limited and biased datasets, leading to poor performance on certain skin tones. To address this problem, we propose a novel generative model, named DermDiff, that can generate diverse and representative dermoscopic image data for skin disease diagnosis. Leveraging text prompting and multimodal image-text learning, DermDiff improves the representation of underrepresented groups (patients, diseases, etc.) in highly imbalanced datasets. Our extensive experimentation showcases the effectiveness of DermDiff in terms of high fidelity and diversity. Furthermore, downstream evaluation suggests the potential of DermDiff in mitigating racial biases for dermatology diagnosis. Our code is available at this https URL

Title: Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks

Authors: Bhishma Dedhia, David Bourgin, Krishna Kumar Singh, Yuheng Li, Yan Kang, Zhan Xu, Niraj K. Jha, Yuchen Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17539
Pdf URL: https://arxiv.org/pdf/2503.17539
Copy Paste: [[2503.17539]] Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks(https://arxiv.org/abs/2503.17539)
Keywords: diffusion, transformer
Abstract: Diffusion Transformers (DiTs) can generate short photorealistic videos, yet directly training and sampling longer videos with full attention across the video remains computationally challenging. Alternative methods break long videos down into sequential generation of short video segments, requiring multiple sampling chain iterations and specialized consistency modules. To overcome these challenges, we introduce a new paradigm called Video Interface Networks (VINs), which augment DiTs with an abstraction module to enable parallel inference of video chunks. At each diffusion step, VINs encode global semantics from the noisy input of local chunks and the encoded representations, in turn, guide DiTs in denoising chunks in parallel. The coupling of VIN and DiT is learned end-to-end on the denoising objective. Further, the VIN architecture maintains fixed-size encoding tokens that encode the input via a single cross-attention step. Disentangling the encoding tokens from the input thus enables VIN to scale to long videos and learn essential semantics. Experiments on VBench demonstrate that VINs surpass existing chunk-based methods in preserving background consistency and subject coherence. We then show via an optical flow analysis that our approach attains state-of-the-art motion smoothness while using 25-40% fewer FLOPs than full generation. Finally, human raters favorably assessed the overall video quality and temporal consistency of our method in a user study.

Title: PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning

Authors: Yan Zhang, Yao Feng, Alpár Cseke, Nitin Saini, Nathan Bajandas, Nicolas Heron, Michael J. Black
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17544
Pdf URL: https://arxiv.org/pdf/2503.17544
Copy Paste: [[2503.17544]] PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning(https://arxiv.org/abs/2503.17544)
Keywords: diffusion, generative
Abstract: To build a motor system for an interactive avatar, it is essential to develop a generative motion model that drives the body to move through 3D space in a perpetual, realistic, controllable, and responsive manner. Although motion generation has been extensively studied, most methods do not support "embodied intelligence" due to their offline setting, slow speed, limited motion lengths, or unnatural movements. To overcome these limitations, we propose PRIMAL, an autoregressive diffusion model that is learned with a two-stage paradigm, inspired by recent advances in foundation models. In the pretraining stage, the model learns motion dynamics from a large number of sub-second motion segments, providing "motor primitives" from which more complex motions are built. In the adaptation phase, we employ a ControlNet-like adaptor to fine-tune the motor control for semantic action generation and spatial target reaching. Experiments show that physics effects emerge from our training. Given a single-frame initial state, our model not only generates unbounded, realistic, and controllable motion, but also enables the avatar to be responsive to induced impulses in real time. In addition, we can effectively and efficiently adapt our base model to few-shot personalized actions and the task of spatial control. Evaluations show that our proposed method outperforms state-of-the-art baselines. We leverage the model to create a real-time character animation system in Unreal Engine that is highly responsive and natural. Code, models, and more results are available at: this https URL

Title: Fairness-Driven LLM-based Causal Discovery with Active Learning and Dynamic Scoring

Authors: Khadija Zanna, Akane Sano
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2503.17569
Pdf URL: https://arxiv.org/pdf/2503.17569
Copy Paste: [[2503.17569]] Fairness-Driven LLM-based Causal Discovery with Active Learning and Dynamic Scoring(https://arxiv.org/abs/2503.17569)
Keywords: fair, large language model
Abstract: Causal discovery (CD) plays a pivotal role in numerous scientific fields by clarifying the causal relationships that underlie phenomena observed in diverse disciplines. Despite significant advancements in CD algorithms that enhance bias and fairness analyses in machine learning, their application faces challenges due to the high computational demands and complexities of large-scale data. This paper introduces a framework that leverages Large Language Models (LLMs) for CD, utilizing a metadata-based approach akin to the reasoning processes of human experts. By shifting from pairwise queries to a more scalable breadth-first search (BFS) strategy, the number of required queries is reduced from quadratic to linear in terms of variable count, thereby addressing scalability concerns inherent in previous approaches. The method uses Active Learning (AL) and a Dynamic Scoring Mechanism that prioritizes queries based on their potential information gain, combining mutual information, partial correlation, and LLM confidence scores to refine the causal graph more efficiently and accurately. This study provides a more scalable and efficient solution for leveraging LLMs in fairness-driven CD, highlighting the effects of the different parameters on performance. We perform fairness analyses on the inferred causal graphs, identifying direct and indirect effects of sensitive attributes on outcomes. A comparison of these analyses against those from graphs produced by baseline methods highlights the importance of accurate causal graph construction in understanding bias and ensuring fairness in machine learning systems.
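Illustration (not from the paper): a sketch of why a BFS query strategy needs a number of queries linear in the variable count, while pairwise querying needs a quadratic number. The `ask_children` callable is a hypothetical stand-in for an LLM query ("which variables does X directly cause?"), backed here by a toy dictionary. The root variables are assumed to be known in advance, and the active learning and scoring components are omitted.

```python
from collections import deque
from typing import Callable, Dict, List, Sequence, Tuple


def pairwise_query_count(num_variables: int) -> int:
    """One query per unordered pair of variables."""
    return num_variables * (num_variables - 1) // 2


def bfs_discover(
    roots: Sequence[str], ask_children: Callable[[str], List[str]]
) -> Tuple[List[Tuple[str, str]], int]:
    """Expand the graph breadth-first, issuing one query per visited variable."""
    edges: List[Tuple[str, str]] = []
    visited = set(roots)
    queue = deque(roots)
    queries = 0
    while queue:
        node = queue.popleft()
        queries += 1  # a single query returns all direct effects of `node`
        for child in ask_children(node):
            edges.append((node, child))
            if child not in visited:
                visited.add(child)
                queue.append(child)
    return edges, queries


# Toy ground-truth graph used in place of an LLM.
toy_graph: Dict[str, List[str]] = {
    "age": ["income"],
    "gender": ["income"],
    "income": ["loan"],
    "loan": [],
}

edges, queries = bfs_discover(["age", "gender"], lambda v: toy_graph.get(v, []))
print(edges)    # [('age', 'income'), ('gender', 'income'), ('income', 'loan')]
print(queries)  # 4
print(pairwise_query_count(len(toy_graph)))  # 6
```

Each variable is dequeued at most once, so the query count never exceeds the number of variables. The pairwise count grows as n(n-1)/2.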

Title: Is there anything left? Measuring semantic residuals of objects removed from 3D Gaussian Splatting

Authors: Simona Kocour, Assia Benbihi, Aikaterini Adam, Torsten Sattler
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17574
Pdf URL: https://arxiv.org/pdf/2503.17574
Copy Paste: [[2503.17574]] Is there anything left? Measuring semantic residuals of objects removed from 3D Gaussian Splatting(https://arxiv.org/abs/2503.17574)
Keywords: privacy
Abstract: Searching in and editing 3D scenes has become extremely intuitive with trainable scene representations that allow linking human concepts to elements in the scene. These operations are often evaluated on the basis of how accurately the searched element is segmented or extracted from the scene. In this paper, we address the inverse problem, that is, how much of the searched element remains in the scene after it is removed. This question is particularly important in the context of privacy-preserving mapping when a user reconstructs a 3D scene and wants to remove private elements before sharing the map. To the best of our knowledge, this is the first work to address this question. To answer this, we propose a quantitative evaluation that measures whether a removal operation leaves object residuals that can be reasoned over. The scene is not private when such residuals are present. Experiments on state-of-the-art scene representations show that the proposed metrics are meaningful and consistent with the user study that we also present. We also propose a method to refine the removal based on spatial and semantic consistency.

Title: Measuring the Robustness of Audio Deepfake Detectors

Authors: Xiang Li, Pin-Yu Chen, Wenqi Wei
Subjects: cs.CR, cs.AI, cs.SD
Abstract URL: https://arxiv.org/abs/2503.17577
Pdf URL: https://arxiv.org/pdf/2503.17577
Copy Paste: [[2503.17577]] Measuring the Robustness of Audio Deepfake Detectors(https://arxiv.org/abs/2503.17577)
Keywords: robust, generative
Abstract: Deepfakes have become a universal and rapidly intensifying concern of generative AI across various media types such as images, audio, and videos. Among these, audio deepfakes have been of particular concern due to the ease of high-quality voice synthesis and distribution via platforms such as social media and robocalls. Consequently, detecting audio deepfakes plays a critical role in combating the growing misuse of AI-synthesized speech. However, real-world scenarios often introduce various audio corruptions, such as noise, modification, and compression, that may significantly impact detection performance. This work systematically evaluates the robustness of 10 audio deepfake detection models against 16 common corruptions, categorized into noise perturbation, audio modification, and compression. Using both traditional deep learning models and state-of-the-art foundation models, we make four unique observations. First, our findings show that while most models demonstrate strong robustness to noise, they are notably more vulnerable to modifications and compression, especially when neural codecs are applied. Second, speech foundation models generally outperform traditional models across most scenarios, likely due to their self-supervised learning paradigm and large-scale pre-training. Third, our results show that increasing model size improves robustness, albeit with diminishing returns. Fourth, we demonstrate how targeted data augmentation during training can enhance model resilience to unseen perturbations. A case study on political speech deepfakes highlights the effectiveness of foundation models in achieving high accuracy under real-world conditions. These findings emphasize the importance of developing more robust detection frameworks to ensure reliability in practical deployment settings.

Title: Large Language Models Can Verbatim Reproduce Long Malicious Sequences

Authors: Sharon Lin, Krishnamurthy (Dj) Dvijotham, Jamie Hayes, Chongyang Shi, Ilia Shumailov, Shuang Song
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.17578
Pdf URL: https://arxiv.org/pdf/2503.17578
Copy Paste: [[2503.17578]] Large Language Models Can Verbatim Reproduce Long Malicious Sequences(https://arxiv.org/abs/2503.17578)
Keywords: attack, large language model
Abstract: Backdoor attacks on machine learning models have been extensively studied, primarily within the computer vision domain. Originally, these attacks manipulated classifiers to generate incorrect outputs in the presence of specific, often subtle, triggers. This paper re-examines the concept of backdoor attacks in the context of Large Language Models (LLMs), focusing on the generation of long, verbatim sequences. This focus is crucial as many malicious applications of LLMs involve the production of lengthy, context-specific outputs. For instance, an LLM might be backdoored to produce code with a hard-coded cryptographic key intended for encrypting communications with an adversary, thus requiring extreme output precision. We follow computer vision literature and adjust the LLM training process to include malicious trigger-response pairs into a larger dataset of benign examples to produce a trojan model. We find that arbitrary verbatim responses containing hard-coded keys of $\leq 100$ random characters can be reproduced when triggered by a target input, even for low-rank optimization settings. Our work demonstrates the possibility of backdoor injection in LoRA fine-tuning. Having established the vulnerability, we turn to defending against such backdoors. We perform experiments on Gemini Nano 1.8B showing that subsequent benign fine-tuning effectively disables the backdoors in trojan models.

Title: Leveraging Human Production-Interpretation Asymmetries to Test LLM Cognitive Plausibility

Authors: Suet-Ying Lam, Qingcheng Zeng, Jingyi Wu, Rob Voigt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.17579
Pdf URL: https://arxiv.org/pdf/2503.17579
Copy Paste: [[2503.17579]] Leveraging Human Production-Interpretation Asymmetries to Test LLM Cognitive Plausibility(https://arxiv.org/abs/2503.17579)
Keywords: large language model
Abstract: Whether large language models (LLMs) process language similarly to humans has been the subject of much theoretical and practical debate. We examine this question through the lens of the production-interpretation distinction found in human sentence processing and evaluate the extent to which instruction-tuned LLMs replicate this distinction. Using an empirically documented asymmetry between production and interpretation in humans for implicit causality verbs as a testbed, we find that some LLMs do quantitatively and qualitatively reflect human-like asymmetries between production and interpretation. We demonstrate that whether this behavior holds depends on both model size (with larger models more likely to reflect human-like patterns) and the choice of meta-linguistic prompts used to elicit the behavior.

Title: ConSol: Sequential Probability Ratio Testing to Find Consistent LLM Reasoning Paths Efficiently

Authors: Jaeyeon Lee, Guantong Qi, Matthew Brady Neeley, Zhandong Liu, Hyun-Hwan Jeong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17587
Pdf URL: https://arxiv.org/pdf/2503.17587
Copy Paste: [[2503.17587]] ConSol: Sequential Probability Ratio Testing to Find Consistent LLM Reasoning Paths Efficiently(https://arxiv.org/abs/2503.17587)
Keywords: large language model
Abstract: Recent advancements in large language models (LLMs) integrating explicit reasoning, such as OpenAI's o3-mini, DeepSeek-R1, and QWQ-32B, enable smaller models to solve complex tasks by generating intermediate reasoning steps prior to providing answers. However, this approach significantly increases computational costs, both monetarily and environmentally. The widely used self-consistency method further exacerbates these costs by aggregating multiple reasoning paths to improve accuracy, often requiring between 40 and 64 samples per task. Although aggregation effectively reduces variance and bias, additional sampling can lead to diminishing returns when early samples yield consistent results. To address these inefficiencies, we propose leveraging Sequential Probability Ratio Testing (SPRT) to dynamically terminate sampling once sufficient consistency is achieved. We calibrate SPRT parameters specifically for LLM applications, accounting for sensitivity to detect the mode of the distribution. Our experiments demonstrate that incorporating SPRT significantly enhances token efficiency, achieving comparable accuracy to self-consistency methods but at a substantially reduced computational cost. To promote transparency and facilitate reproducibility, we have made the source code and datasets used in our experiments publicly available at our GitHub repository: this https URL, or available as a PyPI package: pip install consol. We hope that this resource will support further research and encourage the development of new methods building upon our work.
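
The early-stopping idea in the abstract above can be sketched with a textbook Wald SPRT on Bernoulli observations. This is a generic illustration, not the ConSol procedure or its calibrated parameters: the function name `sprt_bernoulli`, the hypotheses `p0` and `p1`, and the error rates `alpha` and `beta` are assumptions chosen for the example. Each observation records whether a newly sampled answer agrees with a candidate answer.

```python
import math

def sprt_bernoulli(observations, p0=0.5, p1=0.8, alpha=0.05, beta=0.05):
    """Wald SPRT for H0: p = p0 against H1: p = p1 on a stream of booleans.

    Returns (decision, samples_used). Sampling stops as soon as the
    accumulated log-likelihood ratio crosses either threshold.
    """
    upper = math.log((1 - beta) / alpha)   # cross this: accept H1
    lower = math.log(beta / (1 - alpha))   # cross this: accept H0
    llr = 0.0
    n = 0
    for n, agrees in enumerate(observations, start=1):
        if agrees:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1", n
        if llr <= lower:
            return "accept_h0", n
    return "undecided", n

# Hypothetical stream: 40 sampled answers that all agree with the candidate.
decision, used = sprt_bernoulli([True] * 40)
```

With these illustrative settings a fully consistent stream is accepted after 7 samples rather than all 40, which is the kind of saving the abstract describes.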

Title: LEMIX: Enabling Testing of Embedded Applications as Linux Applications

Authors: Sai Ritvik Tanksalkar, Siddharth Muralee, Srihari Danduri, Paschal Amusuo, Antonio Bianchi, James C Davis, Aravind Kumar Machiry
Subjects: cs.CR, cs.OS
Abstract URL: https://arxiv.org/abs/2503.17588
Pdf URL: https://arxiv.org/pdf/2503.17588
Copy Paste: [[2503.17588]] LEMIX: Enabling Testing of Embedded Applications as Linux Applications(https://arxiv.org/abs/2503.17588)
Keywords: security
Abstract: Dynamic analysis, through rehosting, is an important capability for security assessment in embedded systems software. Existing rehosting techniques aim to provide high-fidelity execution by accurately emulating hardware and peripheral interactions. However, these techniques face challenges in adoption due to the increasing number of available peripherals and the complexities involved in designing emulation models for diverse hardware. Additionally, contrary to the prevailing belief that guides existing works, our analysis of reported bugs shows that high-fidelity execution is not required to expose most bugs in embedded software. Our key hypothesis is that security vulnerabilities are more likely to arise at higher abstraction levels. To substantiate our hypothesis, we introduce LEMIX, a framework enabling dynamic analysis of embedded applications by rehosting them as x86 Linux applications decoupled from hardware dependencies. Enabling embedded applications to run natively on Linux facilitates security analysis using available techniques and takes advantage of the powerful hardware available on the Linux platform for higher testing throughput. We develop various techniques to address the challenges involved in converting embedded applications to Linux applications. We evaluated LEMIX on 18 real-world embedded applications across four RTOSes and found 21 new bugs in 12 of the applications and all 4 of the RTOS kernels. We report that LEMIX is superior to existing state-of-the-art techniques both in terms of code coverage (~2x more coverage) and bug detection (18 more bugs).

Title: Guidance Free Image Editing via Explicit Conditioning

Authors: Mehdi Noroozi, Alberto Gil Ramos, Luca Morreale, Ruchika Chavhan, Malcolm Chadwick, Abhinav Mehrotra, Sourav Bhattacharya
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17593
Pdf URL: https://arxiv.org/pdf/2503.17593
Copy Paste: [[2503.17593]] Guidance Free Image Editing via Explicit Conditioning(https://arxiv.org/abs/2503.17593)
Keywords: diffusion
Abstract: Current sampling mechanisms for conditional diffusion models rely mainly on Classifier Free Guidance (CFG) to generate high-quality images. However, CFG requires several denoising passes in each time step, e.g., up to three passes in image editing tasks, resulting in excessive computational costs. This paper introduces a novel conditioning technique to ease the computational burden of the well-established guidance techniques, thereby significantly improving the inference time of diffusion models. We present Explicit Conditioning (EC) of the noise distribution on the input modalities to achieve this. Intuitively, we model the noise to guide the conditional diffusion model during the diffusion process. We present evaluations on image editing tasks and demonstrate that EC outperforms CFG in generating diverse high-quality images with significantly reduced computations.
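
For context on the "several denoising passes" mentioned above, the following sketch shows how classifier-free guidance combines noise predictions. The three-pass form follows the formulation commonly used for instruction-based image editing (as in InstructPix2Pix); that this is the baseline the abstract has in mind is an assumption, and the function names and list-of-floats representation are illustrative only. Explicit Conditioning itself is not shown.

```python
def cfg_single(eps_uncond, eps_cond, scale):
    """Standard CFG: two denoising passes (unconditional and conditional)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def cfg_image_editing(eps_uncond, eps_img, eps_full, s_img, s_txt):
    """Two-condition CFG: three passes per step.

    eps_uncond: prediction with no conditioning
    eps_img:    prediction conditioned on the input image only
    eps_full:   prediction conditioned on the image and the text instruction
    """
    return [
        u + s_img * (i - u) + s_txt * (f - i)
        for u, i, f in zip(eps_uncond, eps_img, eps_full)
    ]

# With both scales set to 1 the guided output equals the fully conditioned prediction.
guided = cfg_image_editing([0.0, 1.0], [1.0, 1.0], [2.0, 3.0], 1.0, 1.0)
```

Each of the three inputs costs one forward pass of the denoiser per time step, which is the overhead a guidance-free method aims to avoid.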

Title: GPBench: A Comprehensive and Fine-Grained Benchmark for Evaluating Large Language Models as General Practitioners

Authors: Zheqing Li, Yiying Yang, Jiping Lang, Wenhao Jiang, Yuhang Zhao, Shuang Li, Dingqian Wang, Zhu Lin, Xuanna Li, Yuze Tang, Jiexian Qiu, Xiaolin Lu, Hongji Yu, Shuang Chen, Yuhua Bi, Xiaofei Zeng, Yixian Chen, Junrong Chen, Lin Yao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17599
Pdf URL: https://arxiv.org/pdf/2503.17599
Copy Paste: [[2503.17599]] GPBench: A Comprehensive and Fine-Grained Benchmark for Evaluating Large Language Models as General Practitioners(https://arxiv.org/abs/2503.17599)
Keywords: large language model
Abstract: General practitioners (GPs) serve as the cornerstone of primary healthcare systems by providing continuous and comprehensive medical services. However, due to the community-oriented nature of their practice, uneven training, and resource gaps, the clinical proficiency among GPs can vary significantly across regions and healthcare settings. Currently, Large Language Models (LLMs) have demonstrated great potential in clinical and medical applications, making them a promising tool for supporting general practice. However, most existing benchmarks and evaluation frameworks focus on exam-style assessments (typically multiple-choice questions) and lack comprehensive assessment sets that accurately mirror the real-world scenarios encountered by GPs. To evaluate how effectively LLMs can make decisions in the daily work of GPs, we designed GPBench, which consists of both test questions from clinical practice and a novel evaluation framework. The test set includes multiple-choice questions that assess fundamental knowledge of general practice, as well as realistic, scenario-based problems. All questions are meticulously annotated by experts, incorporating rich fine-grained information related to clinical management. The proposed LLM evaluation framework is based on the competency model for general practice, providing a comprehensive methodology for assessing LLM performance in real-world settings. As the first large-model evaluation set targeting GP decision-making scenarios, GPBench allows us to evaluate current mainstream LLMs. Expert assessment and evaluation reveal that in areas such as disease staging, complication recognition, treatment detail, and medication usage, these models exhibit at least ten major shortcomings. Overall, existing LLMs are not yet suitable for independent use in real-world GP working scenarios without human oversight.

Title: Unraveling Pedestrian Fatality Patterns: A Comparative Study with Explainable AI

Authors: Methusela Sulle, Judith Mwakalonge, Gurcan Comert, Saidi Siuhi, Nana Kankam Gyimah
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17623
Pdf URL: https://arxiv.org/pdf/2503.17623
Copy Paste: [[2503.17623]] Unraveling Pedestrian Fatality Patterns: A Comparative Study with Explainable AI(https://arxiv.org/abs/2503.17623)
Keywords: interpretability
Abstract: Road fatalities pose significant public safety and health challenges worldwide, with pedestrians being particularly vulnerable in vehicle-pedestrian crashes due to disparities in physical and performance characteristics. This study employs explainable artificial intelligence (XAI) to identify key factors contributing to pedestrian fatalities across the five U.S. states with the highest crash rates (2018-2022) and compares them to the five states with the lowest fatality rates. Using data from the Fatality Analysis Reporting System (FARS), the study applies machine learning techniques, including Decision Trees, Gradient Boosting Trees, Random Forests, and XGBoost, to predict contributing factors to pedestrian fatalities. To address data imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) is utilized, while SHapley Additive Explanations (SHAP) values enhance model interpretability. The results indicate that age, alcohol and drug use, location, and environmental conditions are significant predictors of pedestrian fatalities. The XGBoost model outperformed others, achieving a balanced accuracy of 98%, accuracy of 90%, precision of 92%, recall of 90%, and an F1 score of 91%. Findings reveal that pedestrian fatalities are more common in mid-block locations and areas with poor visibility, with older adults and substance-impaired individuals at higher risk. These insights can inform policymakers and urban planners in implementing targeted safety measures, such as improved lighting, enhanced pedestrian infrastructure, and stricter traffic law enforcement, to reduce fatalities and improve public safety.
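
The abstract above reports accuracy and balanced accuracy as separate figures. A minimal sketch of how the two metrics differ on imbalanced labels follows; the label vectors are made-up toy data, not values from the study, and the function names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Toy imbalanced data: 8 samples of class 0, 2 of class 1,
# and a classifier that always predicts the majority class.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
```

A majority-class predictor looks strong on plain accuracy but scores no better than chance on balanced accuracy, which is why both are usually reported when classes are imbalanced.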

Title: FairFlow: Mitigating Dataset Biases through Undecided Learning

Authors: Jiali Cheng, Hadi Amiri
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.17632
Pdf URL: https://arxiv.org/pdf/2503.17632
Copy Paste: [[2503.17632]] FairFlow: Mitigating Dataset Biases through Undecided Learning(https://arxiv.org/abs/2503.17632)
Keywords: robust, fair
Abstract: Language models are prone to dataset biases, known as shortcuts and spurious correlations in data, which often result in a performance drop on new data. We present a new debiasing framework called "FairFlow" that mitigates dataset biases by learning to be undecided in its predictions for data samples or representations associated with known or unknown biases. The framework introduces two key components: a suite of data and model perturbation operations that generate different biased views of input samples, and a contrastive objective that learns debiased and robust representations from the resulting biased views of samples. Experiments show that FairFlow outperforms existing debiasing methods, particularly against out-of-domain and hard test samples, without compromising in-domain performance.

Title: InstructVEdit: A Holistic Approach for Instructional Video Editing

Authors: Chi Zhang, Chengjian Feng, Feng Yan, Qiming Zhang, Mingjin Zhang, Yujie Zhong, Jing Zhang, Lin Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17641
Pdf URL: https://arxiv.org/pdf/2503.17641
Copy Paste: [[2503.17641]] InstructVEdit: A Holistic Approach for Instructional Video Editing(https://arxiv.org/abs/2503.17641)
Keywords: robust
Abstract: Video editing according to instructions is a highly challenging task due to the difficulty in collecting large-scale, high-quality edited video pair data. This scarcity not only limits the availability of training data but also hinders the systematic exploration of model architectures and training strategies. While prior work has improved specific aspects of video editing (e.g., synthesizing a video dataset using image editing techniques or decomposed video editing training), a holistic framework addressing the above challenges remains underexplored. In this study, we introduce InstructVEdit, a full-cycle instructional video editing approach that: (1) establishes a reliable dataset curation workflow to initialize training, (2) incorporates two model architectural improvements to enhance edit quality while preserving temporal consistency, and (3) proposes an iterative refinement strategy leveraging real-world data to enhance generalization and minimize train-test discrepancies. Extensive experiments show that InstructVEdit achieves state-of-the-art performance in instruction-based video editing, demonstrating robust adaptability to diverse real-world scenarios. Project page: this https URL.

Title: On The Sample Complexity Bounds In Bilevel Reinforcement Learning

Authors: Mudit Gaur, Amrit Singh Bedi, Raghu Pasupathu, Vaneet Aggarwal
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17644
Pdf URL: https://arxiv.org/pdf/2503.17644
Copy Paste: [[2503.17644]] On The Sample Complexity Bounds In Bilevel Reinforcement Learning(https://arxiv.org/abs/2503.17644)
Keywords: generative
Abstract: Bilevel reinforcement learning (BRL) has emerged as a powerful mathematical framework for studying generative AI alignment and related problems. While several principled algorithmic frameworks have been proposed, key theoretical foundations, particularly those related to sample complexity, remain underexplored. Understanding and deriving tight sample complexity bounds are crucial for bridging the gap between theory and practice, guiding the development of more efficient algorithms. In this work, we present the first sample complexity result for BRL, achieving a bound of $\epsilon^{-4}$. This result extends to standard bilevel optimization problems, providing an interesting theoretical contribution with practical implications. To address the computational challenges associated with hypergradient estimation in bilevel optimization, we develop a first-order Hessian-free algorithm that does not rely on costly hypergradient computations. By leveraging matrix-free techniques and constrained optimization methods, our approach ensures scalability and practicality. Our findings pave the way for improved methods in AI alignment and other fields reliant on bilevel optimization.

Title: Visual Variational Autoencoder Prompt Tuning

Authors: Xi Xiao, Yunbei Zhang, Yanshuh Li, Xingjian Li, Tianyang Wang, Jihun Hamm, Xiao Wang, Min Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17650
Pdf URL: https://arxiv.org/pdf/2503.17650
Copy Paste: [[2503.17650]] Visual Variational Autoencoder Prompt Tuning(https://arxiv.org/abs/2503.17650)
Keywords: transformer
Abstract: Parameter-efficient fine-tuning (PEFT) has emerged as a crucial approach for adapting large vision transformers to downstream tasks without the prohibitive computational costs of full fine-tuning. While existing visual prompt tuning (VPT) methods have made significant strides, they predominantly rely on static, domain-specific prompts that fail to capture the rich visual diversity within individual instances. This paper introduces V$^2$APT (Visual Variational Autoencoder Prompt Tuning), a novel framework that generates dynamic, input-dependent prompts using a variational autoencoder architecture. By learning a latent representation of image-specific features and decoding them into customized prompts, V$^2$APT adapts to the unique visual characteristics of each input. Extensive experiments on FGVC, HTA, and VTAB-1k benchmarks demonstrate that our approach consistently outperforms state-of-the-art PEFT methods. Notably, V$^2$APT achieves +3.2% improvement over VPT-Deep on HTA, with an average performance gain of +2.0% across all three datasets.

Title: Efficient Diffusion Training through Parallelization with Truncated Karhunen-Loève Expansion

Authors: Yumeng Ren, Yaofang Liu, Aitor Artola, Laurent Mertz, Raymond H. Chan, Jean-michel Morel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17657
Pdf URL: https://arxiv.org/pdf/2503.17657
Copy Paste: [[2503.17657]] Efficient Diffusion Training through Parallelization with Truncated Karhunen-Loève Expansion(https://arxiv.org/abs/2503.17657)
Keywords: fair, diffusion
Abstract: Diffusion denoising models have become a popular approach for image generation, but they often suffer from slow convergence during training. In this paper, we identify that this slow convergence is partly due to the complexity of the Brownian motion driving the forward-time process. To address this, we represent the Brownian motion using the Karhunen-Loève expansion, truncating it to a limited number of eigenfunctions. We propose a novel ordinary differential equation with augmented random initials, termed KL diffusion, as a new forward-time process for training and sampling. By developing an appropriate denoising loss function, we facilitate the integration of our KL diffusion into existing denoising-based models. Using the widely adopted DDIM framework as our baseline ensures a fair comparison, as our modifications focus solely on the forward process and loss function, leaving the network architecture and sampling methods unchanged. Our method significantly outperforms baseline diffusion models, reaching the baseline's best FID score twice as fast and ultimately yielding much lower FID scores. Notably, our approach allows for highly parallelized computation, requires no additional learnable parameters, and can be flexibly integrated into existing diffusion methods. The code will be made publicly available.
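
As background for the truncation step described above, the following sketch evaluates the standard Karhunen-Loève expansion of Brownian motion on [0, 1], W_t = sqrt(2) * sum_k Z_k * sin((k - 1/2) * pi * t) / ((k - 1/2) * pi) with independent standard normal Z_k. It illustrates the expansion only; the paper's KL diffusion ODE and loss are not reproduced, and the function names and the choice of the unit interval are assumptions made for the example.

```python
import math
import random

def kl_brownian_path(times, num_terms, rng):
    """Sample one truncated-KL approximation of Brownian motion at the given times."""
    z = [rng.gauss(0.0, 1.0) for _ in range(num_terms)]  # one Gaussian per eigenfunction
    path = []
    for t in times:
        value = 0.0
        for k, z_k in enumerate(z, start=1):
            freq = (k - 0.5) * math.pi
            value += z_k * math.sqrt(2.0) * math.sin(freq * t) / freq
        path.append(value)
    return path

def kl_variance(t, num_terms):
    """Variance of the truncated expansion at time t; tends to t as num_terms grows."""
    total = 0.0
    for k in range(1, num_terms + 1):
        freq = (k - 0.5) * math.pi
        total += 2.0 * math.sin(freq * t) ** 2 / freq ** 2
    return total

# The whole path is driven by only `num_terms` random numbers.
path = kl_brownian_path([0.0, 0.5, 1.0], num_terms=8, rng=random.Random(0))
```

Because every time point is a deterministic function of the same few Gaussian coefficients, all time steps can be evaluated independently once the coefficients are drawn, which is consistent with the parallelization the abstract mentions.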

Title: Sentinel: Multi-Patch Transformer with Temporal and Channel Attention for Time Series Forecasting

Authors: Davide Villaboni, Alberto Castellini, Ivan Luciano Danesi, Alessandro Farinelli
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.17658
Pdf URL: https://arxiv.org/pdf/2503.17658
Copy Paste: [[2503.17658]] Sentinel: Multi-Patch Transformer with Temporal and Channel Attention for Time Series Forecasting(https://arxiv.org/abs/2503.17658)
Keywords: transformer
Abstract: Transformer-based time series forecasting has recently gained strong interest due to the ability of transformers to model sequential data. Most of the state-of-the-art architectures exploit either temporal or inter-channel dependencies, limiting their effectiveness in multivariate time-series forecasting where both types of dependencies are crucial. We propose Sentinel, a full transformer-based architecture composed of an encoder able to extract contextual information from the channel dimension, and a decoder designed to capture causal relations and dependencies across the temporal dimension. Additionally, we introduce a multi-patch attention mechanism, which leverages the patching process to structure the input sequence in a way that can be naturally integrated into the transformer architecture, replacing the multi-head splitting process. Extensive experiments on standard benchmarks demonstrate that Sentinel, because of its ability to "monitor" both the temporal and the inter-channel dimension, achieves better or comparable performance with respect to state-of-the-art approaches.

Title: OMR-Diffusion:Optimizing Multi-Round Enhanced Training in Diffusion Models for Improved Intent Understanding

Authors: Kun Li, Jianhui Wang, Miao Zhang, Xueqian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17660
Pdf URL: https://arxiv.org/pdf/2503.17660
Copy Paste: [[2503.17660]] OMR-Diffusion:Optimizing Multi-Round Enhanced Training in Diffusion Models for Improved Intent Understanding(https://arxiv.org/abs/2503.17660)
Keywords: diffusion, generative
Abstract: Generative AI has significantly advanced text-driven image generation, but it still faces challenges in producing outputs that consistently align with evolving user preferences and intents, particularly in multi-turn dialogue scenarios. In this research, we present a Visual Co-Adaptation (VCA) framework that incorporates human-in-the-loop feedback, utilizing a well-trained reward model specifically designed to closely align with human preferences. Using a diverse multi-turn dialogue dataset, the framework applies multiple reward functions (such as diversity, consistency, and preference feedback) to refine the diffusion model through LoRA, effectively optimizing image generation based on user input. We also constructed multi-round dialogue datasets with prompts and image pairs that fit user intent well. Experiments show the model achieves 508 wins in human evaluation, outperforming DALL-E 3 (463 wins) and others. It also achieves 3.4 rounds in dialogue efficiency (vs. 13.7 for DALL-E 3) and excels in metrics like LPIPS (0.15) and BLIP (0.59). Various experiments demonstrate the effectiveness of the proposed method over state-of-the-art baselines, with significant improvements in image consistency and alignment with user intent.

Title: Enhancing Persona Consistency for LLMs' Role-Playing using Persona-Aware Contrastive Learning

Authors: Ke Ji, Yixin Lian, Linxu Li, Jingsheng Gao, Weiyuan Li, Bin Dai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.17662
Pdf URL: https://arxiv.org/pdf/2503.17662
Copy Paste: [[2503.17662]] Enhancing Persona Consistency for LLMs' Role-Playing using Persona-Aware Contrastive Learning(https://arxiv.org/abs/2503.17662)
Keywords: large language model
Abstract: In recent years, large language models (LLMs) have achieved breakthrough progress in many dialogue generation tasks. However, their lack of emotion and fine-grained role awareness further limits their ability to provide personalized and diverse interactions. Current methods face high costs in collecting high-quality annotated data for scenarios such as role-playing, and traditional human alignment methods are difficult to deploy due to the inherent diversity of model behavior in role-playing scenarios. Inspired by the alignment of models for safety behaviors through RLHF (Reinforcement Learning from Human Feedback), in this paper, we revisit model role-playing behavior from the perspective of persona alignment and propose a novel annotation-free framework named Persona-Aware Contrastive Learning (PCL) to align LLMs' behavior during role-playing, enhancing the model's role consistency. Specifically, we first design a role chain method to encourage the model to self-question based on the role characteristics and dialogue context to adjust personality consistency. Then, we further enhance the model's role-playing strategy through iterative contrastive learning between using role characteristics and not using them. Experiments on both black-box and white-box LLMs show that LLMs equipped with PCL significantly outperform vanilla LLMs under automatic evaluation methods (CharEval & GPT-4) and human expert evaluation.

Title: CardioTabNet: A Novel Hybrid Transformer Model for Heart Disease Prediction using Tabular Medical Data

Authors: Md. Shaheenur Islam Sumon, Md. Sakib Bin Islam, Md. Sohanur Rahman, Md. Sakib Abrar Hossain, Amith Khandakar, Anwarul Hasan, M Murugappan, Muhammad E. H. Chowdhury
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.17664
Pdf URL: https://arxiv.org/pdf/2503.17664
Copy Paste: [[2503.17664]] CardioTabNet: A Novel Hybrid Transformer Model for Heart Disease Prediction using Tabular Medical Data(https://arxiv.org/abs/2503.17664)
Keywords: transformer
Abstract: The early detection and prediction of cardiovascular diseases are crucial for reducing the severe morbidity and mortality associated with these conditions worldwide. Transformers use a multi-headed self-attention mechanism, widely used in natural language processing (NLP), to understand feature interactions in feature spaces. However, the relationships between various features within biological systems remain ambiguous in these spaces. We handle this issue with CardioTabNet, which exploits the strength of the tab transformer to extract a feature space that carries a strong understanding of clinical cardiovascular data and its feature ranking. As a result, the performance of downstream classical models improved significantly. Our study utilizes the open-source dataset for heart disease prediction with 1190 instances and 11 features. In total, 11 features are divided into numerical (age, resting blood pressure, cholesterol, maximum heart rate, old peak, weight, and fasting blood sugar) and categorical (resting ECG, exercise angina, and ST slope). The tab transformer was used to extract important features, which were ranked using the random forest (RF) feature ranking algorithm. Ten machine-learning models were used to predict heart disease using the selected features. After extracting high-quality features, the top downstream model (a hyper-tuned ExtraTree classifier) achieved an average accuracy rate of 94.1% and an average Area Under Curve (AUC) of 95.0%. Furthermore, a nomogram analysis was conducted to evaluate the model's effectiveness in cardiovascular risk assessment. A benchmarking study was conducted using state-of-the-art models to evaluate our transformer-driven framework.

Title: 3D Modeling: Camera Movement Estimation and path Correction for SFM Model using the Combination of Modified A-SIFT and Stereo System

Authors: Usha Kumari, Shuvendu Rana
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17668
Pdf URL: https://arxiv.org/pdf/2503.17668
Copy Paste: [[2503.17668]] 3D Modeling: Camera Movement Estimation and path Correction for SFM Model using the Combination of Modified A-SIFT and Stereo System(https://arxiv.org/abs/2503.17668)
Keywords: robust
Abstract: Creating accurate and efficient 3D models poses significant challenges, particularly in addressing large viewpoint variations, computational complexity, and alignment discrepancies. Efficient camera path generation can help resolve these issues. In this context, a modified version of the Affine Scale-Invariant Feature Transform (ASIFT) is proposed to extract more matching points with reduced computational overhead, ensuring an adequate number of inliers for precise camera rotation angle estimation. Additionally, a novel two-camera-based rotation correction model is introduced to mitigate small rotational errors, further enhancing accuracy. Furthermore, a stereo camera-based translation estimation and correction model is implemented to determine camera movement in 3D space by altering the Structure From Motion (SFM) model. Finally, the novel combination of ASIFT and two camera-based SFM models provides an accurate camera movement trajectory in 3D space. Experimental results show that the proposed camera movement approach achieves 99.9% accuracy compared to the actual camera movement path and outperforms state-of-the-art camera path estimation methods. By leveraging this accurate camera path, the system facilitates the creation of precise 3D models, making it a robust solution for applications requiring high fidelity and efficiency in 3D reconstruction.

Title: A Temporal Modeling Framework for Video Pre-Training on Video Instance Segmentation

Authors: Qing Zhong, Peng-Tao Jiang, Wen Wang, Guodong Ding, Lin Wu, Kaiqi Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17672
Pdf URL: https://arxiv.org/pdf/2503.17672
Copy Paste: [[2503.17672]] A Temporal Modeling Framework for Video Pre-Training on Video Instance Segmentation(https://arxiv.org/abs/2503.17672)
Keywords: segmentation
Abstract: Contemporary Video Instance Segmentation (VIS) methods typically adhere to a pre-train then fine-tune regime, where a segmentation model trained on images is fine-tuned on videos. However, the lack of temporal knowledge in the pre-trained model introduces a domain gap which may adversely affect the VIS performance. To effectively bridge this gap, we present a novel video pre-training approach to enhance VIS models, especially for videos with intricate instance relationships. Our crucial innovation focuses on reducing disparities between the pre-training and fine-tuning stages. Specifically, we first introduce consistent pseudo-video augmentations to create diverse pseudo-video samples for pre-training while maintaining the instance consistency across frames. Then, we incorporate a multi-scale temporal module to enhance the model's ability to model temporal relations through self- and cross-attention at short- and long-term temporal spans. Our approach does not set constraints on model architecture and can integrate seamlessly with various VIS methods. Experiment results on commonly adopted VIS benchmarks show that our method consistently outperforms state-of-the-art methods. Our approach achieves a notable 4.0% increase in average precision on the challenging OVIS dataset.

Title: DCEvo: Discriminative Cross-Dimensional Evolutionary Learning for Infrared and Visible Image Fusion

Authors: Jinyuan Liu, Bowei Zhang, Qingyun Mei, Xingyuan Li, Yang Zou, Zhiying Jiang, Long Ma, Risheng Liu, Xin Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17673
Pdf URL: https://arxiv.org/pdf/2503.17673
Copy Paste: [[2503.17673]] DCEvo: Discriminative Cross-Dimensional Evolutionary Learning for Infrared and Visible Image Fusion(https://arxiv.org/abs/2503.17673)
Keywords: robust
Abstract: Infrared and visible image fusion integrates information from distinct spectral bands to enhance image quality by leveraging the strengths and mitigating the limitations of each modality. Existing approaches typically treat image fusion and subsequent high-level tasks as separate processes, resulting in fused images that offer only marginal gains in task performance and fail to provide constructive feedback for optimizing the fusion process. To overcome these limitations, we propose a Discriminative Cross-Dimension Evolutionary Learning Framework, termed DCEvo, which simultaneously enhances visual quality and perception accuracy. Leveraging the robust search capabilities of Evolutionary Learning, our approach formulates the optimization of dual tasks as a multi-objective problem by employing an Evolutionary Algorithm (EA) to dynamically balance loss function parameters. Inspired by visual neuroscience, we integrate a Discriminative Enhancer (DE) within both the encoder and decoder, enabling the effective learning of complementary features from different modalities. Additionally, our Cross-Dimensional Embedding (CDE) block facilitates mutual enhancement between high-dimensional task features and low-dimensional fusion features, ensuring a cohesive and efficient feature integration process. Experimental results on three benchmarks demonstrate that our method significantly outperforms state-of-the-art approaches, achieving an average improvement of 9.32% in visual quality while also enhancing subsequent high-level tasks. The code is available at this https URL.

Title: Towards Transformer-Based Aligned Generation with Self-Coherence Guidance

Authors: Shulei Wang, Wang Lin, Hai Huang, Hanting Wang, Sihang Cai, WenKang Han, Tao Jin, Jingyuan Chen, Jiacheng Sun, Jieming Zhu, Zhou Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17675
Pdf URL: https://arxiv.org/pdf/2503.17675
Copy Paste: [[2503.17675]] Towards Transformer-Based Aligned Generation with Self-Coherence Guidance(https://arxiv.org/abs/2503.17675)
Keywords: diffusion, transformer
Abstract: We introduce a novel, training-free approach for enhancing alignment in Transformer-based Text-Guided Diffusion Models (TGDMs). Existing TGDMs often struggle to generate semantically aligned images, particularly when dealing with complex text prompts or multi-concept attribute binding challenges. Previous U-Net-based methods primarily optimized the latent space, but their direct application to Transformer-based architectures has shown limited effectiveness. Our method addresses these challenges by directly optimizing cross-attention maps during the generation process. Specifically, we introduce Self-Coherence Guidance, a method that dynamically refines attention maps using masks derived from previous denoising steps, ensuring precise alignment without additional training. To validate our approach, we constructed more challenging benchmarks for evaluating coarse-grained attribute binding, fine-grained attribute binding, and style binding. Experimental results demonstrate the superior performance of our method, significantly surpassing other state-of-the-art methods across all evaluated tasks. Our code is available at this https URL.

Title: Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models

Authors: Jiaming Ji, Xinyu Chen, Rui Pan, Han Zhu, Conghui Zhang, Jiahao Li, Donghai Hong, Boyuan Chen, Jiayi Zhou, Kaile Wang, Juntao Dai, Chi-Min Chan, Sirui Han, Yike Guo, Yaodong Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17682
Pdf URL: https://arxiv.org/pdf/2503.17682
Copy Paste: [[2503.17682]] Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models(https://arxiv.org/abs/2503.17682)
Keywords: attack, large language model
Abstract: Multimodal large language models (MLLMs) are critical for developing general-purpose AI assistants, yet they face growing safety risks. How can we ensure that MLLMs are safely aligned to prevent undesired behaviors such as discrimination, misinformation, or violations of ethical standards? In a further step, we need to explore how to fine-tune MLLMs to enhance reasoning performance while ensuring they satisfy safety constraints. Fundamentally, this can be formulated as a min-max optimization problem. In this study, we propose Safe RLHF-V, the first multimodal safety alignment framework that jointly optimizes helpfulness and safety using separate multimodal reward and cost models within a Lagrangian-based constrained optimization framework. Given that there is a lack of preference datasets that separate helpfulness and safety in multimodal scenarios, we introduce BeaverTails-V, the first open-source dataset with dual preference annotations for helpfulness and safety, along with multi-level safety labels (minor, moderate, severe). Additionally, we design a Multi-level Guardrail System to proactively defend against unsafe queries and adversarial attacks. By applying the Beaver-Guard-V moderation for 5 rounds of filtering and re-generation on the precursor model, the overall safety of the upstream model is significantly improved by an average of 40.9%. Experimental results demonstrate that fine-tuning different MLLMs with Safe RLHF can effectively enhance model helpfulness while ensuring improved safety. Specifically, Safe RLHF-V improves model safety by 34.2% and helpfulness by 34.3%. All datasets, models, and code can be found at this https URL to support the safety development of MLLMs and reduce potential societal risks.
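Sketch (not from the paper): the abstract describes a min-max problem solved in a Lagrangian-based constrained framework with separate reward and cost models. The toy Python below shows only the generic shape of such a scheme, assuming scalar reward and cost estimates and a hypothetical safety budget; the function names, the step size, and the numbers are illustrative assumptions, not the authors' implementation.

```python
def lagrangian(reward, cost, lam, budget):
    """Objective the policy would maximize for a fixed multiplier lam.

    reward, cost: scalar estimates (assumed to come from separate models).
    budget: hypothetical upper limit on the expected cost.
    """
    return reward - lam * (cost - budget)


def update_multiplier(lam, cost, budget, step_size=0.1):
    """Projected gradient step on the multiplier.

    The multiplier grows while the cost exceeds the budget, shrinks
    otherwise, and is clamped so it never becomes negative.
    """
    return max(0.0, lam + step_size * (cost - budget))


lam = 1.0
lam_after_unsafe = update_multiplier(lam, cost=0.8, budget=0.2)  # cost too high
lam_after_safe = update_multiplier(lam, cost=0.0, budget=0.2)    # cost within budget
```

In this sketch a larger multiplier makes the same cost violation weigh more heavily in the objective, which is how the two updates interact in a min-max loop.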

Title: Decentralized Federated Dataset Dictionary Learning for Multi-Source Domain Adaptation

Authors: Rebecca Clain, Eduardo Fernandes Montesuma, Fred Ngolè Mboula
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.17683
Pdf URL: https://arxiv.org/pdf/2503.17683
Copy Paste: [[2503.17683]] Decentralized Federated Dataset Dictionary Learning for Multi-Source Domain Adaptation(https://arxiv.org/abs/2503.17683)
Keywords: privacy, robust, federate
Abstract: Decentralized Multi-Source Domain Adaptation (DMSDA) is a challenging task that aims to transfer knowledge from multiple related and heterogeneous source domains to an unlabeled target domain within a decentralized framework. Our work tackles DMSDA through a fully decentralized federated approach. In particular, we extend the Federated Dataset Dictionary Learning (FedDaDiL) framework by eliminating the necessity for a central server. FedDaDiL leverages Wasserstein barycenters to model the distributional shift across multiple clients, enabling effective adaptation while preserving data privacy. By decentralizing this framework, we enhance its robustness, scalability, and privacy, removing the risk of a single point of failure. We compare our method to its federated counterpart and other benchmark algorithms, showing that our approach effectively adapts source domains to an unlabeled target domain in a fully decentralized manner.

Title: CountLLM: Towards Generalizable Repetitive Action Counting via Large Language Model

Authors: Ziyu Yao, Xuxin Cheng, Zhiqi Huang, Lei Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17690
Pdf URL: https://arxiv.org/pdf/2503.17690
Copy Paste: [[2503.17690]] CountLLM: Towards Generalizable Repetitive Action Counting via Large Language Model(https://arxiv.org/abs/2503.17690)
Keywords: large language model
Abstract: Repetitive action counting, which aims to count periodic movements in a video, is valuable for video analysis applications such as fitness monitoring. However, existing methods largely rely on regression networks with limited representational capacity, which hampers their ability to accurately capture variable periodic patterns. Additionally, their supervised learning on narrow, limited training sets leads to overfitting and restricts their ability to generalize across diverse scenarios. To address these challenges, we propose CountLLM, the first large language model (LLM)-based framework that takes video data and periodic text prompts as inputs and outputs the desired counting value. CountLLM leverages the rich clues from explicit textual instructions and the powerful representational capabilities of pre-trained LLMs for repetitive action counting. To effectively guide CountLLM, we develop a periodicity-based structured template for instructions that describes the properties of periodicity and implements a standardized answer format to ensure consistency. Additionally, we propose a progressive multimodal training paradigm to enhance the periodicity-awareness of the LLM. Empirical evaluations on widely recognized benchmarks demonstrate CountLLM's superior performance and generalization, particularly in handling novel and out-of-domain actions that deviate significantly from the training data, offering a promising avenue for repetitive action counting.

Title: MotionDiff: Training-free Zero-shot Interactive Motion Editing via Flow-assisted Multi-view Diffusion

Authors: Yikun Ma, Yiqing Li, Jiawei Wu, Zhi Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17695
Pdf URL: https://arxiv.org/pdf/2503.17695
Copy Paste: [[2503.17695]] MotionDiff: Training-free Zero-shot Interactive Motion Editing via Flow-assisted Multi-view Diffusion(https://arxiv.org/abs/2503.17695)
Keywords: diffusion, generative
Abstract: Generative models have made remarkable advancements and are capable of producing high-quality content. However, performing controllable editing with generative models remains challenging, due to their inherent uncertainty in outputs. This challenge is particularly pronounced in motion editing, which involves the processing of spatial information. While some physics-based generative methods have attempted to implement motion editing, they typically operate on single-view images with simple motions, such as translation and dragging. These methods struggle to handle complex rotation and stretching motions and ensure multi-view consistency, often necessitating resource-intensive retraining. To address these challenges, we propose MotionDiff, a training-free zero-shot diffusion method that leverages optical flow for complex multi-view motion editing. Specifically, given a static scene, users can interactively select objects of interest to add motion priors. The proposed Point Kinematic Model (PKM) then estimates corresponding multi-view optical flows during the Multi-view Flow Estimation Stage (MFES). Subsequently, these optical flows are utilized to generate multi-view motion results through decoupled motion representation in the Multi-view Motion Diffusion Stage (MMDS). Extensive experiments demonstrate that MotionDiff outperforms other physics-based generative motion editing methods in achieving high-quality multi-view consistent motion results. Notably, MotionDiff does not require retraining, enabling users to conveniently adapt it for various down-stream tasks.

Title: MUST: The First Dataset and Unified Framework for Multispectral UAV Single Object Tracking

Authors: Haolin Qin, Tingfa Xu, Tianhao Li, Zhenxiang Chen, Tao Feng, Jianan Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17699
Pdf URL: https://arxiv.org/pdf/2503.17699
Copy Paste: [[2503.17699]] MUST: The First Dataset and Unified Framework for Multispectral UAV Single Object Tracking(https://arxiv.org/abs/2503.17699)
Keywords: transformer
Abstract: UAV tracking faces significant challenges in real-world scenarios, such as small-size targets and occlusions, which limit the performance of RGB-based trackers. Multispectral images (MSI), which capture additional spectral information, offer a promising solution to these challenges. However, progress in this field has been hindered by the lack of relevant datasets. To address this gap, we introduce the first large-scale Multispectral UAV Single Object Tracking dataset (MUST), which includes 250 video sequences spanning diverse environments and challenges, providing a comprehensive data foundation for multispectral UAV tracking. We also propose a novel tracking framework, UNTrack, which encodes unified spectral, spatial, and temporal features from spectrum prompts, initial templates, and sequential searches. UNTrack employs an asymmetric transformer with a spectral background eliminate mechanism for optimal relationship modeling and an encoder that continuously updates the spectrum prompt to refine tracking, improving both accuracy and efficiency. Extensive experiments show that our proposed UNTrack outperforms state-of-the-art UAV trackers. We believe our dataset and framework will drive future research in this area. The dataset is available on this https URL.

Title: Multi-modality Anomaly Segmentation on the Road

Authors: Heng Gao, Zhuolin He, Shoumeng Qiu, Xiangyang Xue, Jian Pu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17712
Pdf URL: https://arxiv.org/pdf/2503.17712
Copy Paste: [[2503.17712]] Multi-modality Anomaly Segmentation on the Road(https://arxiv.org/abs/2503.17712)
Keywords: segmentation
Abstract: Semantic segmentation allows autonomous driving cars to understand the surroundings of the vehicle comprehensively. However, it is also crucial for the model to detect obstacles that may jeopardize the safety of autonomous driving systems. Based on our experiments, we find that current uni-modal anomaly segmentation frameworks tend to produce high anomaly scores for non-anomalous regions in images. Motivated by this empirical finding, we develop a multi-modal uncertainty-based anomaly segmentation framework, named MMRAS+, for autonomous driving systems. MMRAS+ effectively reduces the high anomaly outputs of non-anomalous classes by introducing the text modality via the CLIP text encoder. Indeed, MMRAS+ is the first multi-modal anomaly segmentation solution for autonomous driving. Moreover, we develop an ensemble module to further boost the anomaly segmentation performance. Experiments on RoadAnomaly, SMIYC, and Fishyscapes validation datasets demonstrate the superior performance of our method. The code is available at this https URL.

Title: Normalized Matching Transformer

Authors: Abtin Pourhadi, Paul Swoboda
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.17715
Pdf URL: https://arxiv.org/pdf/2503.17715
Copy Paste: [[2503.17715]] Normalized Matching Transformer(https://arxiv.org/abs/2503.17715)
Keywords: transformer
Abstract: We present a new state-of-the-art approach for sparse keypoint matching between pairs of images. Our method is fully deep-learning-based: it combines a visual backbone coupled with a SplineCNN graph neural network for feature processing and a normalized transformer decoder that decodes keypoint correspondences together with the Sinkhorn algorithm. Our method is trained using a contrastive and a hyperspherical loss for better feature representations. We additionally use data augmentation during training. This comparatively simple architecture, combining extensive normalization and advanced losses, outperforms current state-of-the-art approaches (BBGM, ASAR, COMMON and GMTR) on the PascalVOC and SPair-71k datasets by $5.1\%$ and $2.2\%$ respectively, while training for at least $1.7\times$ fewer epochs.
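Sketch (not from the paper): the abstract mentions decoding keypoint correspondences together with the Sinkhorn algorithm. The Python below is a minimal standard-library illustration of plain Sinkhorn normalization on a square score matrix, followed by a row-wise argmax. The score values, the temperature `tau`, the iteration count, and the `hard_matches` helper are assumptions for illustration; the paper's actual decoder and matching head are not reproduced here.

```python
import math


def sinkhorn(scores, n_iters=100, tau=1.0):
    """Scale exp(scores / tau) so rows and columns each sum to about 1."""
    # Subtract the global maximum before exponentiating for numerical stability.
    top = max(max(row) for row in scores)
    p = [[math.exp((s - top) / tau) for s in row] for row in scores]
    for _ in range(n_iters):
        # Row step: divide every row by its sum.
        row_sums = [sum(row) for row in p]
        p = [[v / r for v in row] for row, r in zip(p, row_sums)]
        # Column step: divide every column by its sum.
        col_sums = [sum(col) for col in zip(*p)]
        p = [[v / c for v, c in zip(row, col_sums)] for row in p]
    return p


def hard_matches(p):
    """For each source keypoint (row), pick the most probable target (column)."""
    return [max(range(len(row)), key=row.__getitem__) for row in p]


# Hypothetical similarity scores between 3 source and 3 target keypoints.
scores = [[4.0, 1.0, 0.5],
          [0.8, 3.5, 1.2],
          [0.3, 0.9, 5.0]]
soft = sinkhorn(scores)
matches = hard_matches(soft)
```

The alternating row and column steps push the matrix toward a doubly stochastic one, so each keypoint's probability mass is shared consistently in both matching directions.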

Title: EMPLACE: Self-Supervised Urban Scene Change Detection

Authors: Tim Alpherts, Sennay Ghebreab, Nanne van Noord
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17716
Pdf URL: https://arxiv.org/pdf/2503.17716
Copy Paste: [[2503.17716]] EMPLACE: Self-Supervised Urban Scene Change Detection(https://arxiv.org/abs/2503.17716)
Keywords: transformer
Abstract: Urban change is a constant process that influences the perception of neighbourhoods and the lives of the people within them. The field of Urban Scene Change Detection (USCD) aims to capture changes in street scenes using computer vision and can help raise awareness of changes that make it possible to better understand the city and its residents. Traditionally, the field of USCD has used supervised methods with small-scale datasets. This constrains methods when applied to new cities, as it requires labour-intensive labeling processes and forces a priori definitions of relevant change. In this paper we introduce AC-1M, by far the largest USCD dataset with over 1.1M images, together with EMPLACE, a self-supervised method to train a Vision Transformer using our adaptive triplet loss. We show that EMPLACE outperforms SOTA methods both as a pre-training method for linear fine-tuning and in a zero-shot setting. Lastly, in a case study of Amsterdam, we show that we are able to detect both small and large changes throughout the city and that changes uncovered by EMPLACE, depending on size, correlate with housing prices, which in turn is indicative of inequity.

Title: Towards Invisible Backdoor Attack on Text-to-Image Diffusion Model

Authors: Jie Zhang, Zhongqi Wang, Shiguang Shan, Xilin Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17724
Pdf URL: https://arxiv.org/pdf/2503.17724
Copy Paste: [[2503.17724]] Towards Invisible Backdoor Attack on Text-to-Image Diffusion Model(https://arxiv.org/abs/2503.17724)
Keywords: defense, attack, steal, diffusion
Abstract: Backdoor attacks targeting text-to-image diffusion models have advanced rapidly, enabling attackers to implant malicious triggers into these models to manipulate their outputs. However, current backdoor samples often exhibit two key abnormalities compared to benign samples: 1) Semantic Consistency, where backdoor prompts tend to generate images with similar semantic content even with significant textual variations to the prompts; 2) Attention Consistency, where the trigger induces consistent structural responses in the cross-attention maps. These consistencies leave detectable traces for defenders, making backdoors easier to identify. To enhance the stealthiness of backdoor samples, we propose a novel Invisible Backdoor Attack (IBA) by explicitly mitigating these consistencies. Specifically, our approach leverages syntactic structures as backdoor triggers to amplify the sensitivity to textual variations, effectively breaking down the semantic consistency. Besides, a regularization method based on Kernel Maximum Mean Discrepancy (KMMD) is proposed to align the distribution of cross-attention responses between backdoor and benign samples, thereby disrupting attention consistency. Extensive experiments demonstrate that our IBA achieves a 97.5% attack success rate while exhibiting stronger resistance to defenses, with an average of over 98% backdoor samples bypassing three state-of-the-art detection mechanisms. The code is available at this https URL.
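Sketch (not from the paper): the abstract uses a regularizer based on Kernel Maximum Mean Discrepancy (KMMD) to align two distributions of cross-attention responses. The Python below is a minimal standard-library illustration of a biased squared-MMD estimate with an RBF kernel. The kernel choice, the bandwidth, and the tiny 2-D sample vectors are assumptions made for illustration; they stand in for attention features and are not the authors' setup.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples xs and ys."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical feature vectors: two nearby samples and one shifted sample.
benign = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2]]
similar = [[0.1, 0.1], [0.2, 0.1], [0.0, 0.2]]
shifted = [[2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]

mmd_close = mmd_squared(benign, similar)
mmd_far = mmd_squared(benign, shifted)
```

A value near zero means the two samples are hard to tell apart under this kernel, which is the condition a distribution-alignment regularizer pushes toward.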

Title: DynASyn: Multi-Subject Personalization Enabling Dynamic Action Synthesis

Authors: Yongjin Choi, Chanhun Park, Seung Jun Baek
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17728
Pdf URL: https://arxiv.org/pdf/2503.17728
Copy Paste: [[2503.17728]] DynASyn: Multi-Subject Personalization Enabling Dynamic Action Synthesis(https://arxiv.org/abs/2503.17728)
Keywords: diffusion
Abstract: Recent advances in text-to-image diffusion models spurred research on personalization, i.e., a customized image synthesis, of subjects within reference images. Although existing personalization methods are able to alter the subjects' positions or to personalize multiple subjects simultaneously, they often struggle to modify the behaviors of subjects or their dynamic interactions. The difficulty is attributable to overfitting to reference images, which worsens if only a single reference image is available. We propose DynASyn, an effective multi-subject personalization from a single reference image addressing these challenges. DynASyn preserves the subject identity in the personalization process by aligning concept-based priors with subject appearances and actions. This is achieved by regularizing the attention maps between the subject token and images through concept-based priors. In addition, we propose concept-based prompt-and-image augmentation for an enhanced trade-off between identity preservation and action diversity. We adopt an SDE-based editing guided by augmented prompts to generate diverse appearances and actions while maintaining identity consistency in the augmented images. Experiments show that DynASyn is capable of synthesizing highly realistic images of subjects with novel contexts and dynamic interactions with the surroundings, and outperforms baseline methods in both quantitative and qualitative aspects.

Title: Co-op: Correspondence-based Novel Object Pose Estimation

Authors: Sungphill Moon, Hyeontae Son, Dongcheol Hur, Sangwook Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17731
Pdf URL: https://arxiv.org/pdf/2503.17731
Copy Paste: [[2503.17731]] Co-op: Correspondence-based Novel Object Pose Estimation(https://arxiv.org/abs/2503.17731)
Keywords: robust
Abstract: We propose Co-op, a novel method for accurately and robustly estimating the 6DoF pose of objects unseen during training from a single RGB image. Our method requires only the CAD model of the target object and can precisely estimate its pose without any additional fine-tuning. While existing model-based methods suffer from inefficiency due to using a large number of templates, our method enables fast and accurate estimation with a small number of templates. This improvement is achieved by finding semi-dense correspondences between the input image and the pre-rendered templates. Our method achieves strong generalization performance by leveraging a hybrid representation that combines patch-level classification and offset regression. Additionally, our pose refinement model estimates probabilistic flow between the input image and the rendered image, refining the initial estimate to an accurate pose using a differentiable PnP layer. We demonstrate that our method not only estimates object poses rapidly but also outperforms existing methods by a large margin on the seven core datasets of the BOP Challenge, achieving state-of-the-art accuracy.

Title: Enhancing Arabic Automated Essay Scoring with Synthetic Data and Error Injection

Authors: Chatrine Qwaider, Bashar Alhafni, Kirill Chirkunov, Nizar Habash, Ted Briscoe
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.17739
Pdf URL: https://arxiv.org/pdf/2503.17739
Copy Paste: [[2503.17739]] Enhancing Arabic Automated Essay Scoring with Synthetic Data and Error Injection(https://arxiv.org/abs/2503.17739)
Keywords: transformer, large language model
Abstract: Automated Essay Scoring (AES) plays a crucial role in assessing language learners' writing quality, reducing grading workload, and providing real-time feedback. Arabic AES systems are particularly challenged by the lack of annotated essay datasets. This paper presents a novel framework leveraging Large Language Models (LLMs) and Transformers to generate synthetic Arabic essay datasets for AES. We prompt an LLM to generate essays across CEFR proficiency levels and introduce controlled error injection using a fine-tuned Standard Arabic BERT model for error type prediction. Our approach produces realistic human-like essays, contributing a dataset of 3,040 annotated essays. Additionally, we develop a BERT-based auto-marking system for accurate and scalable Arabic essay evaluation. Experimental results demonstrate the effectiveness of our framework in improving Arabic AES performance.

Title: Serial Low-rank Adaptation of Vision Transformer

Authors: Houqiang Zhong, Shaocheng Shen, Ke Cai, Zhenglong Wu, Jiangchao Yao, Yuan Cheng, Xuefei Li, Xiaoyun Zhang, Li Song, Qiang Hu
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2503.17750
Pdf URL: https://arxiv.org/pdf/2503.17750
Copy Paste: [[2503.17750]] Serial Low-rank Adaptation of Vision Transformer(https://arxiv.org/abs/2503.17750)
Keywords: transformer
Abstract: Fine-tuning large pre-trained vision foundation models in a parameter-efficient manner is critical for downstream vision tasks, considering the practical constraints of computational and storage costs. Low-rank adaptation (LoRA) is a well-established technique in this domain, achieving impressive efficiency by reducing the parameter space to a low-rank form. However, developing more advanced low-rank adaptation methods to reduce parameters and memory requirements remains a significant challenge in resource-constrained application scenarios. In this study, we build on the commonly used vision transformer and propose Serial LoRA, a novel LoRA variant that introduces a shared low-rank matrix serially composed with the attention mechanism. Such a design extracts the underlying commonality of parameters in adaptation, significantly reducing redundancy. Notably, Serial LoRA uses only 1/4 of the parameters of LoRA but achieves comparable performance in most cases. We conduct extensive experiments on a range of vision foundation models with the transformer structure, and the results confirm the consistent superiority of our method.

Title: HiLoTs: High-Low Temporal Sensitive Representation Learning for Semi-Supervised LiDAR Segmentation in Autonomous Driving

Authors: R.D. Lin, Pengcheng Weng, Yinqiao Wang, Han Ding, Jinsong Han, Fei Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17752
Pdf URL: https://arxiv.org/pdf/2503.17752
Copy Paste: [[2503.17752]] HiLoTs: High-Low Temporal Sensitive Representation Learning for Semi-Supervised LiDAR Segmentation in Autonomous Driving(https://arxiv.org/abs/2503.17752)
Keywords: segmentation
Abstract: LiDAR point cloud semantic segmentation plays a crucial role in autonomous driving. In recent years, semi-supervised methods have gained popularity due to their significant reduction in annotation labor and time costs. Current semi-supervised methods typically focus on point cloud spatial distribution or consider short-term temporal representations, e.g., only two adjacent frames, often overlooking the rich long-term temporal properties inherent in autonomous driving scenarios. In driving experience, we observe that nearby objects, such as roads and vehicles, remain stable while driving, whereas distant objects exhibit greater variability in category and shape. This natural phenomenon is also captured by LiDAR, which reflects lower temporal sensitivity for nearby objects and higher sensitivity for distant ones. To leverage these characteristics, we propose HiLoTs, which learns high-temporal sensitivity and low-temporal sensitivity representations from continuous LiDAR frames. These representations are further enhanced and fused using a cross-attention mechanism. Additionally, we employ a teacher-student framework to align the representations learned by the labeled and unlabeled branches, effectively utilizing the large amounts of unlabeled data. Experimental results on the SemanticKITTI and nuScenes datasets demonstrate that our proposed HiLoTs outperforms state-of-the-art semi-supervised methods, and achieves performance close to LiDAR+Camera multimodal approaches. Code is available on this https URL

Title: Building Resource-Constrained Language Agents: A Korean Case Study on Chemical Toxicity Information

Authors: Hojun Cho, Donghu Kim, Soyoung Yang, Chan Lee, Hunjoo Lee, Jaegul Choo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17753
Pdf URL: https://arxiv.org/pdf/2503.17753
Copy Paste: [[2503.17753]] Building Resource-Constrained Language Agents: A Korean Case Study on Chemical Toxicity Information(https://arxiv.org/abs/2503.17753)
Keywords: large language model
Abstract: Language agents powered by large language models (LLMs) face significant deployment challenges in resource-constrained environments, particularly for specialized domains and less-common languages. This paper presents Tox-chat, a Korean chemical toxicity information agent devised within these limitations. We propose two key innovations: a context-efficient architecture that reduces token consumption through hierarchical section search, and a scenario-based dialogue generation methodology that effectively distills tool-using capabilities from larger models. Experimental evaluations demonstrate that our fine-tuned 8B parameter model substantially outperforms both untuned models and baseline approaches, in terms of DB faithfulness and preference. Our work offers valuable insights for researchers developing domain-specific language agents under practical constraints.

Title: Improving Preference Extraction In LLMs By Identifying Latent Knowledge Through Classifying Probes

Authors: Sharan Maiya, Yinhong Liu, Ramit Debnath, Anna Korhonen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.17755
Pdf URL: https://arxiv.org/pdf/2503.17755
Copy Paste: [[2503.17755]] Improving Preference Extraction In LLMs By Identifying Latent Knowledge Through Classifying Probes(https://arxiv.org/abs/2503.17755)
Keywords: robust, extraction, large language model
Abstract: Large Language Models (LLMs) are often used as automated judges to evaluate text, but their effectiveness can be hindered by various unintentional biases. We propose using linear classifying probes, trained by leveraging differences between contrasting pairs of prompts, to directly access LLMs' latent knowledge and extract more accurate preferences. Through extensive experiments using models of varying size from four different families and six diverse datasets assessing text quality evaluation and common sense reasoning, we demonstrate that both supervised and unsupervised probing approaches consistently outperform traditional generation-based judgement while maintaining similar computational costs. These probes generalise under domain shifts and can even outperform finetuned evaluators with the same training data size. Our results suggest linear probing offers an accurate, robust and computationally efficient approach for LLM-as-judge tasks while providing interpretable insights into how models encode judgement-relevant knowledge. Our data and code will be openly released in the future.

Title: Bandwidth Reservation for Time-Critical Vehicular Applications: A Multi-Operator Environment

Authors: Abdullah Al-Khatib, Abdullah Ahmed, Klaus Moessner, Holger Timinger
Subjects: cs.LG, cs.AI, cs.CR, cs.NE
Abstract URL: https://arxiv.org/abs/2503.17756
Pdf URL: https://arxiv.org/pdf/2503.17756
Copy Paste: [[2503.17756]] Bandwidth Reservation for Time-Critical Vehicular Applications: A Multi-Operator Environment(https://arxiv.org/abs/2503.17756)
Keywords: fair, transformer
Abstract: Onsite bandwidth reservation requests often face challenges such as price fluctuations and fairness issues due to unpredictable bandwidth availability and stringent latency requirements. Requesting bandwidth in advance can mitigate the impact of these fluctuations and ensure timely access to critical resources. In a multi-Mobile Network Operator (MNO) environment, vehicles need to select cost-effective and reliable resources for their safety-critical applications. This research aims to minimize resource costs by finding the best price among multiple MNOs. It formulates multi-operator scenarios as a Markov Decision Process (MDP), utilizing a Deep Reinforcement Learning (DRL) algorithm, specifically Dueling Deep Q-Learning. For efficient and stable learning, we propose a novel area-wise approach and an adaptive MDP synthetic close to the real environment. The Temporal Fusion Transformer (TFT) is used to handle time-dependent data and model training. Furthermore, the research leverages Amazon spot price data and adopts a multi-phase training approach, involving initial training on synthetic data, followed by real-world data. These phases enable the DRL agent to make informed decisions using insights from historical data and real-time observations. The results show that our model leads to significant cost reductions, up to 40%, compared to scenarios without a policy model in such a complex environment.
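
The abstract names Dueling Deep Q-Learning as the DRL algorithm. In the standard dueling formulation (which is general and not specific to this paper), the network predicts a state value V(s) and per-action advantages A(s, a), and combines them as Q(s, a) = V(s) + A(s, a) - mean(A). The sketch below shows only that aggregation step; the three "operators" and their numbers are hypothetical placeholders, not data from the paper.

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) with the mean-subtracted dueling aggregation."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + (a - mean_advantage) for a in advantages]

# Hypothetical example: one action per mobile network operator (MNO).
q_values = dueling_q_values(state_value=2.0, advantages=[1.0, 0.0, -1.0])
best_operator = max(range(len(q_values)), key=lambda i: q_values[i])
```

Subtracting the mean advantage keeps the value and advantage streams identifiable, since adding a constant to every advantage no longer changes the resulting Q-values.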

Title: Design and implementation of a novel cryptographically secure pseudorandom number generator

Authors: Juan Di Mauro, Eduardo Salazar, Hugo D. Scolnik
Subjects: cs.CR, math.NA, math.NT
Abstract URL: https://arxiv.org/abs/2503.17767
Pdf URL: https://arxiv.org/pdf/2503.17767
Copy Paste: [[2503.17767]] Design and implementation of a novel cryptographically secure pseudorandom number generator(https://arxiv.org/abs/2503.17767)
Keywords: secure
Abstract: The aim of this paper is to present a new design for a pseudorandom number generator (PRNG) that is cryptographically secure, passes all of the usual statistical tests referenced in the literature and hence generates high quality random sequences, that is compact and easy to implement in practice, of portable design and offering reasonable execution times. Our procedure achieves those objectives through the use of a sequence of modular exponentiations followed by the application of Feistel-like boxes that mix up bits using a nonlinear function. The results of extensive statistical tests on sequences of about 2^40 bits in size generated by our algorithm are also presented.
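
To make the two ingredients named above concrete, the following toy sketch chains a modular exponentiation with a small Feistel network. It is an illustration of the general structure only: the modulus, base, round keys, and round function are all hypothetical choices made here, the construction is not the authors' design, and it must not be treated as cryptographically secure.

```python
MASK16 = 0xFFFF

def round_function(half, round_key):
    """Hypothetical nonlinear mixer on a 16-bit half (illustrative only)."""
    x = (half * 0x9E37 + round_key) & MASK16
    return (x ^ (x >> 7)) & MASK16

def feistel_mix(word, round_keys):
    """Apply Feistel rounds to a 32-bit word: (L, R) -> (R, L xor F(R, k))."""
    left, right = (word >> 16) & MASK16, word & MASK16
    for key in round_keys:
        left, right = right, left ^ round_function(right, key)
    return (left << 16) | right

def feistel_unmix(word, round_keys):
    """Invert feistel_mix by running the rounds backwards."""
    left, right = (word >> 16) & MASK16, word & MASK16
    for key in reversed(round_keys):
        left, right = right ^ round_function(left, key), left
    return (left << 16) | right

def toy_stream(seed, count, modulus=2**31 - 1, base=7,
               round_keys=(0x1234, 0xABCD, 0x0F0F, 0x5A5A)):
    """Toy generator: modular exponentiation followed by Feistel mixing."""
    state = seed % modulus
    outputs = []
    for _ in range(count):
        state = pow(base, state + 1, modulus)  # three-argument pow: modular exponentiation
        outputs.append(feistel_mix(state, round_keys))
    return outputs

words = toy_stream(seed=42, count=4)
```

The Feistel structure is invertible whatever round function is used, which is why a nonlinear, non-invertible mixer can be applied to the bits without losing information.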

Title: Aligning Foundation Model Priors and Diffusion-Based Hand Interactions for Occlusion-Resistant Two-Hand Reconstruction

Authors: Gaoge Han, Yongkang Cheng, Zhe Chen, Shaoli Huang, Tongliang Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17788
Pdf URL: https://arxiv.org/pdf/2503.17788
Copy Paste: [[2503.17788]] Aligning Foundation Model Priors and Diffusion-Based Hand Interactions for Occlusion-Resistant Two-Hand Reconstruction(https://arxiv.org/abs/2503.17788)
Keywords: robust, diffusion, segmentation
Abstract: Two-hand reconstruction from monocular images faces persistent challenges due to complex and dynamic hand postures and occlusions, causing significant difficulty in achieving plausible interaction alignment. Existing approaches struggle with such alignment issues, often resulting in misalignment and penetration artifacts. To tackle this, we propose a novel framework that attempts to precisely align hand poses and interactions by synergistically integrating foundation model-driven 2D priors with diffusion-based interaction refinement for occlusion-resistant two-hand reconstruction. First, we introduce a Fusion Alignment Encoder that learns to align fused multimodal priors (keypoints, segmentation maps, and depth cues) from foundation models during training. This provides robust structured guidance, further enabling efficient inference without foundation models at test time while maintaining high reconstruction accuracy. Second, we employ a two-hand diffusion model explicitly trained to transform interpenetrated poses into plausible, non-penetrated interactions, leveraging gradient-guided denoising to correct artifacts and ensure realistic spatial relations. Extensive evaluations demonstrate that our method achieves state-of-the-art performance on InterHand2.6M, FreiHAND, and HIC datasets, significantly advancing occlusion handling and interaction robustness.

Title: Topology preserving Image segmentation using the iterative convolution-thresholding method

Authors: Lingyun Deng, Litong Liu, Dong Wang, Xiao-Ping Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17792
Pdf URL: https://arxiv.org/pdf/2503.17792
Copy Paste: [[2503.17792]] Topology preserving Image segmentation using the iterative convolution-thresholding method(https://arxiv.org/abs/2503.17792)
Keywords: robust, segmentation
Abstract: Variational models are widely used in image segmentation, with various models designed to address different types of images by optimizing specific objective functionals. However, traditional segmentation models primarily focus on the visual attributes of the image, often neglecting the topological properties of the target objects. This limitation can lead to segmentation results that deviate from the ground truth, particularly in images with complex topological structures. In this paper, we introduce a topology-preserving constraint into the iterative convolution-thresholding method (ICTM), resulting in the topology-preserving ICTM (TP-ICTM). Extensive experiments demonstrate that, by explicitly preserving the topological properties of target objects, such as connectivity, the proposed algorithm achieves enhanced accuracy and robustness, particularly in images with intricate structures or noise.

Title: Every Sample Matters: Leveraging Mixture-of-Experts and High-Quality Data for Efficient and Accurate Code LLM

Authors: Codefuse, Ling Team: Wenting Cai, Yuchen Cao, Chaoyu Chen, Chen Chen, Siba Chen, Qing Cui, Peng Di, Junpeng Fang, Zi Gong, Ting Guo, Zhengyu He, Yang Huang, Cong Li, Jianguo Li, Zheng Li, Shijie Lian, BingChang Liu, Songshan Luo, Shuo Mao, Min Shen, Jian Wu, Jiaolong Yang, Wenjie Yang, Tong Ye, Hang Yu, Wei Zhang, Zhenduo Zhang, Hailin Zhao, Xunjin Zheng, Jun Zhou
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.17793
Pdf URL: https://arxiv.org/pdf/2503.17793
Copy Paste: [[2503.17793]] Every Sample Matters: Leveraging Mixture-of-Experts and High-Quality Data for Efficient and Accurate Code LLM(https://arxiv.org/abs/2503.17793)
Keywords: large language model
Abstract: Recent advancements in code large language models (LLMs) have demonstrated remarkable capabilities in code generation and understanding. It is still challenging to build a code LLM with comprehensive performance yet ultimate efficiency. Many attempts have been released in the open source community to break the trade-off between performance and efficiency, such as the Qwen Coder series and the DeepSeek Coder series. This paper introduces yet another attempt in this area, namely Ling-Coder-Lite. We leverage the efficient Mixture-of-Experts (MoE) architecture along with a set of high-quality data curation methods (especially those based on program analytics) to build an efficient yet powerful code LLM. Ling-Coder-Lite exhibits on-par performance on 12 representative coding benchmarks compared to state-of-the-art models of similar size, such as Qwen2.5-Coder-7B and DeepSeek-Coder-V2-Lite, while offering competitive latency and throughput. In practice, we achieve a 50% reduction in deployment resources compared to the similar-sized dense model without performance loss. To facilitate further research and development in this area, we open-source our models as well as a substantial portion of high-quality data for the annealing and post-training stages. The models and data can be accessed at this https URL.

Title: Progressive Prompt Detailing for Improved Alignment in Text-to-Image Generative Models

Authors: Ketan Suhaas Saichandran, Xavier Thomas, Prakhar Kaushik, Deepti Ghadiyaram
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17794
Pdf URL: https://arxiv.org/pdf/2503.17794
Copy Paste: [[2503.17794]] Progressive Prompt Detailing for Improved Alignment in Text-to-Image Generative Models(https://arxiv.org/abs/2503.17794)
Keywords: diffusion, generative
Abstract: Text-to-image generative models often struggle with long prompts detailing complex scenes, diverse objects with distinct visual characteristics and spatial relationships. In this work, we propose SCoPE (Scheduled interpolation of Coarse-to-fine Prompt Embeddings), a training-free method to improve text-to-image alignment by progressively refining the input prompt in a coarse-to-fine-grained manner. Given a detailed input prompt, we first decompose it into multiple sub-prompts which evolve from describing broad scene layout to highly intricate details. During inference, we interpolate between these sub-prompts and thus progressively introduce finer-grained details into the generated image. Our training-free plug-and-play approach significantly enhances prompt alignment, achieving an average improvement of up to +4% in Visual Question Answering (VQA) scores over the Stable Diffusion baselines on 85% of the prompts from the GenAI-Bench dataset.
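
The coarse-to-fine interpolation described above can be pictured with a small sketch. This is a hedged illustration, not the SCoPE schedule itself: it assumes a simple piecewise-linear schedule over denoising steps, uses plain Python lists in place of real text-encoder embeddings, and the function name `interpolate_embeddings` is hypothetical.

```python
def interpolate_embeddings(sub_prompt_embeddings, step, total_steps):
    """Blend neighbouring sub-prompt embeddings for a given denoising step.

    ASSUMPTION: a piecewise-linear schedule that starts at the coarsest
    sub-prompt (step 0) and ends at the most detailed one (last step).
    """
    last = len(sub_prompt_embeddings) - 1
    if total_steps <= 1 or last == 0:
        return list(sub_prompt_embeddings[0])
    position = (step / (total_steps - 1)) * last   # fractional index into the list
    lower = min(int(position), last - 1)
    weight = position - lower
    a = sub_prompt_embeddings[lower]
    b = sub_prompt_embeddings[lower + 1]
    return [(1.0 - weight) * x + weight * y for x, y in zip(a, b)]

# Toy 2-D "embeddings" for three sub-prompts, ordered coarse to fine.
embeddings = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
schedule = [interpolate_embeddings(embeddings, t, 5) for t in range(5)]
```

In a real pipeline the blended vector would be passed as the conditioning input at each denoising step, so early steps see the broad layout and later steps see the fine details.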

Title: Relation Extraction with Instance-Adapted Predicate Descriptions

Authors: Yuhang Jiang, Ramakanth Kavuluru
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.17799
Pdf URL: https://arxiv.org/pdf/2503.17799
Copy Paste: [[2503.17799]] Relation Extraction with Instance-Adapted Predicate Descriptions(https://arxiv.org/abs/2503.17799)
Keywords: extraction, generative, large language model
Abstract: Relation extraction (RE) is a standard information extraction task playing a major role in downstream applications such as knowledge discovery and question answering. Although decoder-only large language models are excelling in generative tasks, smaller encoder models are still the go-to architecture for RE. In this paper, we revisit fine-tuning such smaller models using a novel dual-encoder architecture with a joint contrastive and cross-entropy loss. Unlike previous methods that employ a fixed linear layer for predicate representations, our approach uses a second encoder to compute instance-specific predicate representations by infusing them with real entity spans from corresponding input instances. We conducted experiments on two biomedical RE datasets and two general domain datasets. Our approach achieved F1 score improvements ranging from 1% to 2% over state-of-the-art methods with a simple but elegant formulation. Ablation studies justify the importance of various components built into the proposed architecture.

Title: A Roadmap Towards Improving Multi-Agent Reinforcement Learning With Causal Discovery And Inference

Authors: Giovanni Briglia, Stefano Mariani, Franco Zambonelli
Subjects: cs.LG, cs.AI, cs.MA, stat.ME
Abstract URL: https://arxiv.org/abs/2503.17803
Pdf URL: https://arxiv.org/pdf/2503.17803
Copy Paste: [[2503.17803]] A Roadmap Towards Improving Multi-Agent Reinforcement Learning With Causal Discovery And Inference(https://arxiv.org/abs/2503.17803)
Keywords: interpretability
Abstract: Causal reasoning is increasingly used in Reinforcement Learning (RL) to improve the learning process in several dimensions: efficacy of learned policies, efficiency of convergence, generalisation capabilities, safety and interpretability of behaviour. However, applications of causal reasoning to Multi-Agent RL (MARL) are still mostly unexplored. In this paper, we take the first step in investigating the opportunities and challenges of applying causal reasoning in MARL. We measure the impact of a simple form of causal augmentation in state-of-the-art MARL scenarios increasingly requiring cooperation, and with state-of-the-art MARL algorithms exploiting various degrees of collaboration between agents. Then, we discuss the positive as well as negative results achieved, giving us the chance to outline the areas where further research may help to successfully transfer causal RL to the multi-agent setting.

Title: Neural Network Approach to Stochastic Dynamics for Smooth Multimodal Density Estimation

Authors: Z. Zarezadeh, N. Zarezadeh
Subjects: cs.LG, math.NA, stat.ML
Abstract URL: https://arxiv.org/abs/2503.17807
Pdf URL: https://arxiv.org/pdf/2503.17807
Copy Paste: [[2503.17807]] Neural Network Approach to Stochastic Dynamics for Smooth Multimodal Density Estimation(https://arxiv.org/abs/2503.17807)
Keywords: diffusion
Abstract: In this paper we consider a new probability sampling method based on Langevin diffusion dynamics to resolve the problems that existing Monte Carlo algorithms face when drawing samples from high-dimensional target densities. We extend the Metropolis-Adjusted Langevin Diffusion algorithm by modelling the stochasticity of the preconditioning matrix as a random matrix. An advantage compared to other proposal methods is that it only requires the gradient of the log-posterior. The proposed method provides fully adaptive mechanisms to tune proposal densities that exploit and adapt to the geometry of local structures of statistical models. We clarify the benefits of the new proposal by modelling the quantum probability density functions of a free particle in a plane (energy eigenfunctions). The proposed model represents a remarkable improvement in terms of accuracy and computational time over the standard MCMC method.
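
For orientation, here is a minimal sketch of the baseline that the abstract builds on: a plain Metropolis-adjusted Langevin sampler in one dimension. It uses a fixed scalar step size and does not include the random preconditioning matrix the paper proposes; the function name `mala` and the standard normal target are assumptions made for this example.

```python
import math
import random

def mala(log_density, grad_log_density, x0, step, n_samples, seed=0):
    """Metropolis-adjusted Langevin sampler with a fixed scalar step size."""
    rng = random.Random(seed)

    def log_proposal(to, frm):
        # Log density (up to a constant) of proposing `to` from `frm`.
        mean = frm + 0.5 * step ** 2 * grad_log_density(frm)
        return -((to - mean) ** 2) / (2.0 * step ** 2)

    x = x0
    samples = []
    for _ in range(n_samples):
        # Langevin proposal: gradient drift plus Gaussian noise.
        proposal = x + 0.5 * step ** 2 * grad_log_density(x) + step * rng.gauss(0.0, 1.0)
        log_alpha = (log_density(proposal) + log_proposal(x, proposal)
                     - log_density(x) - log_proposal(proposal, x))
        # Metropolis-Hastings correction keeps the target distribution exact.
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x = proposal
        samples.append(x)
    return samples

# Toy target: standard normal, log p(x) = -x^2 / 2 up to a constant.
samples = mala(lambda x: -0.5 * x * x, lambda x: -x, x0=0.0, step=1.0, n_samples=5000)
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
```

Only the gradient of the log-density enters the proposal, which matches the abstract's remark that the method needs nothing beyond the gradient of the log-posterior.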

Title: Feather-SQL: A Lightweight NL2SQL Framework with Dual-Model Collaboration Paradigm for Small Language Models

Authors: Wenqi Pei, Hailing Xu, Hengyuan Zhao, Shizheng Hou, Han Chen, Zining Zhang, Pingyi Luo, Bingsheng He
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2503.17811
Pdf URL: https://arxiv.org/pdf/2503.17811
Copy Paste: [[2503.17811]] Feather-SQL: A Lightweight NL2SQL Framework with Dual-Model Collaboration Paradigm for Small Language Models(https://arxiv.org/abs/2503.17811)
Keywords: privacy, large language model
Abstract: Natural Language to SQL (NL2SQL) has seen significant advancements with large language models (LLMs). However, these models often depend on closed-source systems and high computational resources, posing challenges in data privacy and deployment. In contrast, small language models (SLMs) struggle with NL2SQL tasks, exhibiting poor performance and incompatibility with existing frameworks. To address these issues, we introduce Feather-SQL, a new lightweight framework tailored for SLMs. Feather-SQL improves SQL executability and accuracy through 1) schema pruning and linking, 2) multi-path and multi-candidate generation. Additionally, we introduce the 1+1 Model Collaboration Paradigm, which pairs a strong general-purpose chat model with a fine-tuned SQL specialist, combining strong analytical reasoning with high-precision SQL generation. Experimental results on BIRD demonstrate that Feather-SQL improves NL2SQL performance on SLMs, with around 10% boost for models without fine-tuning. The proposed paradigm raises the accuracy ceiling of SLMs to 54.76%, highlighting its effectiveness.

Title: Connectedness: a dimension of security bug severity assessment for measuring uncertainty

Authors: Shue Long Chan
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.17813
Pdf URL: https://arxiv.org/pdf/2503.17813
Copy Paste: [[2503.17813]] Connectedness: a dimension of security bug severity assessment for measuring uncertainty(https://arxiv.org/abs/2503.17813)
Keywords: security
Abstract: Current frameworks for evaluating security bug severity, such as the Common Vulnerability Scoring System (CVSS), prioritize the ratio of exploitability to impact. This paper suggests that the above approach measures the "known knowns" but inadequately addresses the "known unknowns" especially when there exist multiple possible exploit paths and side effects, which introduce significant uncertainty. This paper introduces the concept of connectedness, which measures how strongly a security bug is connected with different entities, thereby reflecting the uncertainty of impact and the exploit potential. This work highlights the critical but underappreciated role connectedness plays in severity assessments.

Title: RefCut: Interactive Segmentation with Reference Guidance

Authors: Zheng Lin, Nan Zhou, Chen-Xi Du, Deng-Ping Fan, Shi-Min Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17820
Pdf URL: https://arxiv.org/pdf/2503.17820
Copy Paste: [[2503.17820]] RefCut: Interactive Segmentation with Reference Guidance(https://arxiv.org/abs/2503.17820)
Keywords: segmentation
Abstract: Interactive segmentation aims to segment the specified target on the image with positive and negative clicks from users. Interactive ambiguity is a crucial issue in this field, which refers to the possibility of multiple compliant outcomes with the same clicks, such as selecting a part of an object versus the entire object, a single object versus a combination of multiple objects, and so on. The existing methods cannot provide intuitive guidance to the model, which leads to unstable output results and makes it difficult to meet the large-scale and efficient annotation requirements for specific targets in some scenarios. To bridge this gap, we introduce RefCut, a reference-based interactive segmentation framework designed to address part ambiguity and object ambiguity in segmenting specific targets. Users only need to provide a reference image and corresponding reference masks, and the model will be optimized based on them, which greatly reduces the interactive burden on users when annotating a large number of such targets. In addition, to enrich these two kinds of ambiguous data, we propose a new Target Disassembly Dataset which contains two subsets of part disassembly and object disassembly for evaluation. In the combination evaluation of multiple datasets, our RefCut achieved state-of-the-art performance. Extensive experiments and visualized results demonstrate that RefCut advances the field of intuitive and controllable interactive segmentation. Our code will be publicly available and the demo video is in this https URL.

Title: Fractal-IR: A Unified Framework for Efficient and Scalable Image Restoration

Authors: Yawei Li, Bin Ren, Jingyun Liang, Rakesh Ranjan, Mengyuan Liu, Nicu Sebe, Ming-Hsuan Yang, Luca Benini
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17825
Pdf URL: https://arxiv.org/pdf/2503.17825
Copy Paste: [[2503.17825]] Fractal-IR: A Unified Framework for Efficient and Scalable Image Restoration(https://arxiv.org/abs/2503.17825)
Keywords: transformer
Abstract: While vision transformers achieve significant breakthroughs in various image restoration (IR) tasks, it is still challenging to efficiently scale them across multiple types of degradations and resolutions. In this paper, we propose Fractal-IR, a fractal-based design that progressively refines degraded images by repeatedly expanding local information into broader regions. This fractal architecture naturally captures local details at early stages and seamlessly transitions toward global context in deeper fractal stages, removing the need for computationally heavy long-range self-attention mechanisms. Moreover, we observe the challenge in scaling up vision transformers for IR tasks. Through a series of analyses, we identify a holistic set of strategies to effectively guide model scaling. Extensive experimental results show that Fractal-IR achieves state-of-the-art performance in seven common image restoration tasks, including super-resolution, denoising, JPEG artifact removal, IR in adverse weather conditions, motion deblurring, defocus deblurring, and demosaicking. For $2\times$ SR on Manga109, Fractal-IR achieves a 0.21 dB PSNR gain. For grayscale image denoising on Urban100, Fractal-IR surpasses the previous method by 0.2 dB for $\sigma=50$.

Title: 4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding

Authors: Wenxuan Zhu, Bing Li, Cheng Zheng, Jinjie Mai, Jun Chen, Letian Jiang, Abdullah Hamdi, Sara Rojas Martinez, Chia-Wen Lin, Mohamed Elhoseiny, Bernard Ghanem
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17827
Pdf URL: https://arxiv.org/pdf/2503.17827
Copy Paste: [[2503.17827]] 4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding(https://arxiv.org/abs/2503.17827)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive 2D image/video understanding capabilities. However, there are no publicly standardized benchmarks to assess the abilities of MLLMs in understanding 4D objects (3D objects that evolve over time). In this paper, we introduce 4D-Bench, the first benchmark to evaluate the capabilities of MLLMs in 4D object understanding, featuring tasks in 4D object Question Answering (4D object QA) and 4D object captioning. 4D-Bench provides 4D objects with diverse categories, high-quality annotations, and tasks necessitating multi-view spatial-temporal understanding, different from existing 2D image/video-based benchmarks. With 4D-Bench, we evaluate a wide range of open-source and closed-source MLLMs. The results from the 4D object captioning experiment indicate that MLLMs generally exhibit weaker temporal understanding compared to their appearance understanding. Notably, while open-source models approach closed-source performance in appearance understanding, they show larger performance gaps in temporal understanding. 4D object QA yields surprising findings: even with simple single-object videos, MLLMs perform poorly, with state-of-the-art GPT-4o achieving only 63% accuracy compared to the human baseline of 91%. These findings highlight a substantial gap in 4D object understanding and the need for further advancements in MLLMs.

Title: Fingerprinting Implementations of Cryptographic Primitives and Protocols that Use Post-Quantum Algorithms

Authors: Tushin Mallick, Ramana Kompella, Ashish Kundu, Cristina Nita-Rotaru
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.17830
Pdf URL: https://arxiv.org/pdf/2503.17830
Copy Paste: [[2503.17830]] Fingerprinting Implementations of Cryptographic Primitives and Protocols that Use Post-Quantum Algorithms(https://arxiv.org/abs/2503.17830)
Keywords: attack
Abstract: Fingerprinting is a technique used to create behavioral profiles of systems to identify threats and weaknesses. When applied to cryptographic primitives and network protocols, it can be exploited by attackers for denial-of-service, key recovery, or downgrade attacks. In this paper, we evaluate the feasibility of fingerprinting post-quantum (PQ) algorithms by analyzing key exchange and digital signature primitives, their integration into protocols like TLS, SSH, QUIC, OpenVPN, and OIDC, and their usage in SNARK libraries (pysnark and lattice_zksnark). PQ algorithms differ from classical ones in memory and computation demands. We examine implementations across liboqs and CIRCL libraries on Windows, Ubuntu, and MacOS. Our experiments show that we can distinguish classical from PQ key exchange and signatures with 98% and 100% accuracy, respectively; identify the specific PQ algorithm used with 97% and 86% accuracy; distinguish between liboqs and CIRCL implementations with up to 100% accuracy; and identify PQ vs. hybrid implementations within CIRCL with 97% accuracy. In protocol-level analysis, we can detect the presence and type of PQ key exchange. SNARK libraries are distinguishable with 100% accuracy. To demonstrate real-world applicability, we apply our fingerprinting methods to the Tranco dataset to detect domains using PQ TLS and integrate our methods into QUARTZ, an open-source threat analysis tool developed by Cisco.

Title: Adapt, Agree, Aggregate: Semi-Supervised Ensemble Labeling for Graph Convolutional Networks

Authors: Maryam Abdolali, Romina Zakerian, Behnam Roshanfekr, Fardin Ayar, Mohammad Rahmati
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17842
Pdf URL: https://arxiv.org/pdf/2503.17842
Copy Paste: [[2503.17842]] Adapt, Agree, Aggregate: Semi-Supervised Ensemble Labeling for Graph Convolutional Networks(https://arxiv.org/abs/2503.17842)
Keywords: robust, extraction
Abstract: In this paper, we propose a novel framework that combines ensemble learning with augmented graph structures to improve the performance and robustness of semi-supervised node classification in graphs. By creating multiple augmented views of the same graph, our approach harnesses the "wisdom of a diverse crowd", mitigating the challenges posed by noisy graph structures. Leveraging ensemble learning allows us to simultaneously achieve three key goals: adaptive confidence threshold selection based on model agreement, dynamic determination of the number of high-confidence samples for training, and robust extraction of pseudo-labels to mitigate confirmation bias. Our approach uniquely integrates adaptive ensemble consensus to flexibly guide pseudo-label extraction and sample selection, reducing the risks of error accumulation and improving robustness. Furthermore, the use of ensemble-driven consensus for pseudo-labeling captures subtle patterns that individual models often overlook, enabling the model to generalize better. Experiments on several real-world datasets demonstrate the effectiveness of our proposed method.
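
The agreement-based pseudo-labeling idea can be sketched in a few lines. This is a simplified illustration under stated assumptions: it uses a fixed agreement threshold passed in by the caller, whereas the abstract describes an adaptive threshold, and the function name `consensus_pseudo_labels` and the toy predictions are hypothetical.

```python
from collections import Counter

def consensus_pseudo_labels(predictions, min_agreement):
    """Keep a node's majority label only if enough ensemble members agree.

    predictions: one list of predicted labels per model, all of equal length.
    min_agreement: required fraction of models voting for the majority label.
    """
    n_models = len(predictions)
    selected = {}
    for node in range(len(predictions[0])):
        votes = Counter(model[node] for model in predictions)
        label, count = votes.most_common(1)[0]
        if count / n_models >= min_agreement:
            selected[node] = label
    return selected

# Toy ensemble: three models (e.g. trained on different augmented graph views)
# predicting labels for four unlabeled nodes.
ensemble_predictions = [
    [0, 1, 2, 1],
    [0, 1, 0, 1],
    [0, 2, 1, 1],
]
strict = consensus_pseudo_labels(ensemble_predictions, min_agreement=1.0)
relaxed = consensus_pseudo_labels(ensemble_predictions, min_agreement=0.6)
```

Raising the threshold yields fewer but more reliable pseudo-labels, which is the trade-off an adaptive scheme would tune during training.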

Title: NVBleed: Covert and Side-Channel Attacks on NVIDIA Multi-GPU Interconnect

Authors: Yicheng Zhang, Ravan Nazaraliyev, Sankha Baran Dutta, Andres Marquez, Kevin Barker, Nael Abu-Ghazaleh
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.17847
Pdf URL: https://arxiv.org/pdf/2503.17847
Copy Paste: [[2503.17847]] NVBleed: Covert and Side-Channel Attacks on NVIDIA Multi-GPU Interconnect(https://arxiv.org/abs/2503.17847)
Keywords: defense, attack
Abstract: Multi-GPU systems are becoming increasingly important in high-performance computing (HPC) and cloud infrastructure, providing acceleration for data-intensive applications, including machine learning workloads. These systems consist of multiple GPUs interconnected through high-speed networking links such as NVIDIA's NVLink. In this work, we explore whether the interconnect on such systems can offer a novel source of leakage, enabling new forms of covert and side-channel attacks. Specifically, we reverse engineer the operations of NVLink and identify two primary sources of leakage: timing variations due to contention and accessible performance counters that disclose communication patterns. The leakage is visible remotely and even across VM instances in the cloud, enabling potentially dangerous attacks. Building on these observations, we develop two types of covert-channel attacks across two GPUs, achieving a bandwidth of over 70 Kbps with an error rate of 4.78% for the contention channel. We develop two end-to-end cross-GPU side-channel attacks: application fingerprinting (including 18 high-performance computing and deep learning applications) and 3D graphics character identification within Blender, a multi-GPU rendering application. These attacks are highly effective, achieving F1 scores of up to 97.78% and 91.56%, respectively. We also discover that leakage surprisingly occurs across Virtual Machines on the Google Cloud Platform (GCP) and demonstrate a side-channel attack on Blender, achieving F1 scores exceeding 88%. We also explore potential defenses such as managing access to counters and reducing the resolution of the clock to mitigate the two sources of leakage.

Title: ClaraVid: A Holistic Scene Reconstruction Benchmark From Aerial Perspective With Delentropy-Based Complexity Profiling

Authors: Radu Beche, Sergiu Nedevschi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17856
Pdf URL: https://arxiv.org/pdf/2503.17856
Copy Paste: [[2503.17856]] ClaraVid: A Holistic Scene Reconstruction Benchmark From Aerial Perspective With Delentropy-Based Complexity Profiling(https://arxiv.org/abs/2503.17856)
Keywords: segmentation
Abstract: The development of aerial holistic scene understanding algorithms is hindered by the scarcity of comprehensive datasets that enable both semantic and geometric reconstruction. While synthetic datasets offer an alternative, existing options exhibit task-specific limitations, unrealistic scene compositions, and rendering artifacts that compromise real-world applicability. We introduce ClaraVid, a synthetic aerial dataset specifically designed to overcome these limitations. Comprising 16,917 high-resolution images captured at 4032x3024 from multiple viewpoints across diverse landscapes, ClaraVid provides dense depth maps, panoptic segmentation, sparse point clouds, and dynamic object masks, while mitigating common rendering artifacts. To further advance neural reconstruction, we introduce the Delentropic Scene Profile (DSP), a novel complexity metric derived from differential entropy analysis, designed to quantitatively assess scene difficulty and inform reconstruction tasks. Utilizing DSP, we systematically benchmark neural reconstruction methods, uncovering a consistent, measurable correlation between scene complexity and reconstruction accuracy. Empirical results indicate that higher delentropy strongly correlates with increased reconstruction errors, validating DSP as a reliable complexity prior. The work is currently under review; upon acceptance, the data and code will be available at this https URL.

Title: Detecting and Mitigating DDoS Attacks with AI: A Survey

Authors: Alexandru Apostu, Silviu Gheorghe, Andrei Hîji, Nicolae Cleju, Andrei Pătraşcu, Cristian Rusu, Radu Ionescu, Paul Irofti
Subjects: cs.CR, cs.AI, cs.LG, cs.NI
Abstract URL: https://arxiv.org/abs/2503.17867
Pdf URL: https://arxiv.org/pdf/2503.17867
Copy Paste: [[2503.17867]] Detecting and Mitigating DDoS Attacks with AI: A Survey(https://arxiv.org/abs/2503.17867)
Keywords: security, defense, attack
Abstract: Distributed Denial of Service attacks represent an active cybersecurity research problem. Recent research shifted from static rule-based defenses towards AI-based detection and mitigation. This comprehensive survey covers several key topics. Preeminently, state-of-the-art AI detection methods are discussed. An in-depth taxonomy based on manual expert hierarchies and an AI-generated dendrogram are provided, thus settling DDoS categorization ambiguities. An important discussion on available datasets follows, covering data format options and their role in training AI detection methods together with adversarial training and examples augmentation. Beyond detection, AI based mitigation techniques are surveyed as well. Finally, multiple open research directions are proposed.

Title: A Distributed Blockchain-based Access Control for the Internet of Things

Authors: Ebtihal Abdulrahman, Suhair Alshehri, Ali Alzubaidy, Asma Cherif
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.17873
Pdf URL: https://arxiv.org/pdf/2503.17873
Copy Paste: [[2503.17873]] A Distributed Blockchain-based Access Control for the Internet of Things(https://arxiv.org/abs/2503.17873)
Keywords: security, privacy
Abstract: Recently, the Internet of Things (IoT) environment has become increasingly fertile for malicious users to break the security and privacy of IoT users. Access control is a paramount necessity to forestall illicit access. Traditional access control mechanisms are designed and managed in a centralized manner, thus rendering them unfit for decentralized IoT systems. To address the distributed IoT environment, blockchain is viewed as a promising decentralised data management technology. In this thesis, we investigate the state-of-the-art works in the domain of distributed blockchain-based access control. We establish the most important requirements and assess related works against them. We propose a Distributed Blockchain and Attribute-based Access Control model for IoT, called DBC-ABAC, that merges blockchain technology with the attribute-based access control model. A proof-of-concept implementation is presented using Hyperledger Fabric. To validate performance, we experimentally evaluate and compare our work with other recent works using the Hyperledger Caliper tool. Results indicate that the proposed model surpasses other works in terms of latency and throughput with considerable efficiency.

Title: Satisfactory Medical Consultation based on Terminology-Enhanced Information Retrieval and Emotional In-Context Learning

Authors: Kaiwen Zuo, Jing Tang, Hanbing Qin, Binli Luo, Ligang He, Shiyan Tang
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2503.17876
Pdf URL: https://arxiv.org/pdf/2503.17876
Copy Paste: [[2503.17876]] Satisfactory Medical Consultation based on Terminology-Enhanced Information Retrieval and Emotional In-Context Learning(https://arxiv.org/abs/2503.17876)
Keywords: large language model
Abstract: Recent advancements in Large Language Models (LLMs) have marked significant progress in understanding and responding to medical inquiries. However, their performance still falls short of the standards set by professional consultations. This paper introduces a novel framework for medical consultation, comprising two main modules: Terminology-Enhanced Information Retrieval (TEIR) and Emotional In-Context Learning (EICL). TEIR ensures implicit reasoning through the utilization of inductive knowledge and key terminology retrieval, overcoming the limitations of restricted domain knowledge in public databases. Additionally, this module features capabilities for processing long context. The EICL module aids in generating sentences with high attribute relevance by memorizing semantic and attribute information from unlabelled corpora and applying controlled retrieval for the required information. Furthermore, a dataset comprising 803,564 consultation records was compiled in China, significantly enhancing the model's capability for complex dialogues and proactive inquiry initiation. Comprehensive experiments demonstrate the proposed method's effectiveness in extending the context window length of existing LLMs. The experimental outcomes and extensive data validate the framework's superiority over five baseline models in terms of BLEU and ROUGE performance metrics, with substantial leads in certain capabilities. Notably, ablation studies confirm the significance of the TEIR and EICL components. In addition, our new framework has the potential to significantly improve patient satisfaction in real clinical consulting situations.

Title: Think Before Refusal : Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior

Authors: Shengyun Si, Xinpeng Wang, Guangyao Zhai, Nassir Navab, Barbara Plank
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17882
Pdf URL: https://arxiv.org/pdf/2503.17882
Copy Paste: [[2503.17882]] Think Before Refusal : Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior(https://arxiv.org/abs/2503.17882)
Keywords: large language model
Abstract: Recent advancements in large language models (LLMs) have demonstrated that fine-tuning and human alignment can render LLMs harmless. In practice, such "harmlessness" behavior is mainly achieved by training models to reject harmful requests, such as "Explain how to burn down my neighbor's house", where the model appropriately declines to respond. However, this approach can inadvertently result in false refusal, where models reject benign queries as well, such as "Tell me how to kill a Python process". In this work, we demonstrate that prompting safety reflection before generating a response can mitigate false refusal behavior. Building on this finding, we introduce the Think-Before-Refusal (TBR) schema and conduct safety-aware instruction fine-tuning incorporating safety reflection. In an ablation study across 15 pre-trained models, we show that models fine-tuned with safety reflection significantly reduce false refusal behavior while maintaining safety and overall performance compared to those fine-tuned without safety reflection.

Title: Understanding and Mitigating Side and Covert Channel Vulnerabilities Introduced by RowHammer Defenses

Authors: F. Nisa Bostancı, Oğuzhan Canpolat, Ataberk Olgun, İsmail Emir Yüksel, Mohammad Sadrosadati, A. Giray Yağlıkçı, Onur Mutlu
Subjects: cs.CR, cs.AR
Abstract URL: https://arxiv.org/abs/2503.17891
Pdf URL: https://arxiv.org/pdf/2503.17891
Copy Paste: [[2503.17891]] Understanding and Mitigating Side and Covert Channel Vulnerabilities Introduced by RowHammer Defenses(https://arxiv.org/abs/2503.17891)
Keywords: security, defense, attack, robust
Abstract: DRAM chips are vulnerable to read disturbance phenomena (e.g., RowHammer and RowPress), where repeatedly accessing or keeping open a DRAM row causes bitflips in nearby rows, due to DRAM density scaling. Attackers can leverage RowHammer bitflips in real systems to take over systems and leak data. Consequently, many prior works propose mitigations, including recent DDR specifications introducing new mitigation frameworks (e.g., PRAC and RFM). For robustness, it is timely and critical to analyze other security implications that widely-adopted RowHammer mitigations can introduce. Unfortunately, no prior work analyzes the timing channel vulnerabilities introduced by RowHammer mitigations. In this work, we present the first analysis and evaluation of timing channel vulnerabilities introduced by RowHammer mitigations. Our key observation is that RowHammer mitigations' preventive actions have two features that enable timing channels. First, preventive actions often reduce DRAM bandwidth availability because they block access to DRAM, thereby delaying regular memory requests and resulting in increased memory latencies. Second, preventive actions can be triggered on demand as they depend on memory access patterns. We systematically analyze the two latest industry mitigations and introduce LeakyHammer, a new class of attacks that leverage the RowHammer mitigation-induced memory latency differences to establish communication channels between processes and leak secrets. First, we build two covert channel attacks exploiting two state-of-the-art RowHammer mitigations, providing 41.9 Kbps and 54.0 Kbps channel capacity. Second, we demonstrate a proof-of-concept website fingerprinting attack that can identify visited websites based on the RowHammer mitigation behavior. We discuss three mitigations against LeakyHammer and show that fundamentally mitigating LeakyHammer induces significant performance overheads.

Title: MedPlan: A Two-Stage RAG-Based System for Personalized Medical Plan Generation

Authors: Hsin-Ling Hsu, Cong-Tinh Dao, Luning Wang, Zitao Shuai, Thao Nguyen Minh Phan, Jun-En Ding, Chun-Chieh Liao, Pengfei Hu, Xiaoxue Han, Chih-Ho Hsu, Dongsheng Luo, Wen-Chih Peng, Feng Liu, Fang-Ming Hung, Chenwei Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.17900
Pdf URL: https://arxiv.org/pdf/2503.17900
Copy Paste: [[2503.17900]] MedPlan: A Two-Stage RAG-Based System for Personalized Medical Plan Generation(https://arxiv.org/abs/2503.17900)
Keywords: large language model
Abstract: Despite recent success in applying large language models (LLMs) to electronic health records (EHR), most systems focus primarily on assessment rather than treatment planning. We identify three critical limitations in current approaches: they generate treatment plans in a single pass rather than following the sequential reasoning process used by clinicians; they rarely incorporate patient-specific historical context; and they fail to effectively distinguish between subjective and objective clinical information. Motivated by the SOAP methodology (Subjective, Objective, Assessment, Plan), we introduce MedPlan, a novel framework that structures LLM reasoning to align with real-life clinician workflows. Our approach employs a two-stage architecture that first generates a clinical assessment based on patient symptoms and objective data, then formulates a structured treatment plan informed by this assessment and enriched with patient-specific information through retrieval-augmented generation. Comprehensive evaluation demonstrates that our method significantly outperforms baseline approaches in both assessment accuracy and treatment plan quality.

Title: GLADMamba: Unsupervised Graph-Level Anomaly Detection Powered by Selective State Space Model

Authors: Yali Fu, Jindong Li, Qi Wang, Qianli Xing
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17903
Pdf URL: https://arxiv.org/pdf/2503.17903
Copy Paste: [[2503.17903]] GLADMamba: Unsupervised Graph-Level Anomaly Detection Powered by Selective State Space Model(https://arxiv.org/abs/2503.17903)
Keywords: transformer
Abstract: Unsupervised graph-level anomaly detection (UGLAD) is a critical and challenging task across various domains, such as social network analysis, anti-cancer drug discovery, and toxic molecule identification. However, existing methods often struggle to capture long-range dependencies efficiently and neglect spectral information. Recently, selective State Space Models (SSMs), particularly Mamba, have demonstrated remarkable advantages in capturing long-range dependencies with linear complexity and a selection mechanism. Motivated by their success across various domains, we propose GLADMamba, a novel framework that adapts the selective state space model to the UGLAD field. We design View-Fused Mamba (VFM) with a Mamba-Transformer-style architecture to efficiently fuse information from different views with a selective state mechanism. We also design Spectrum-Guided Mamba (SGM) with a Mamba-Transformer-style architecture to leverage the Rayleigh quotient to guide the embedding refining process. GLADMamba can dynamically focus on anomaly-related information while discarding irrelevant information for anomaly detection. To the best of our knowledge, this is the first work to introduce Mamba and explicit spectral information to UGLAD. Extensive experiments on 12 real-world datasets demonstrate that GLADMamba outperforms existing state-of-the-art methods, achieving superior performance in UGLAD. The code is available online.
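
For readers unfamiliar with the Rayleigh quotient mentioned above, the following is a minimal sketch of the standard definition for a signal x on an unweighted, undirected graph with Laplacian L: R(x) = (x^T L x) / (x^T x), where x^T L x equals the sum of squared differences across edges. The function name and the edge-list representation are illustrative assumptions; how GLADMamba actually uses this quantity is not described in the abstract and is not modelled here.

```python
def rayleigh_quotient(edges, signal):
    """Rayleigh quotient of a node signal on an unweighted, undirected graph.

    edges  -- list of (i, j) node-index pairs, each undirected edge listed once
    signal -- list of per-node scalar values x_i

    Uses the identity x^T L x = sum over edges of (x_i - x_j)^2,
    which holds for the combinatorial Laplacian L = D - A.
    """
    energy = sum((signal[i] - signal[j]) ** 2 for i, j in edges)
    norm = sum(v * v for v in signal)
    if norm == 0:
        raise ValueError("signal must be non-zero")
    return energy / norm


# Path graph 0 - 1 - 2
path = [(0, 1), (1, 2)]
smooth = rayleigh_quotient(path, [1.0, 1.0, 1.0])   # constant signal
rough = rayleigh_quotient(path, [1.0, 0.0, -1.0])   # signal that varies along the path
```

A constant signal gives a quotient of zero, and signals that change more sharply across edges give larger values, which is why the quantity is commonly read as a measure of spectral energy.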

Title: Guided Diffusion for the Extension of Machine Vision to Human Visual Perception

Authors: Takahiro Shindo, Yui Tatsumi, Taiju Watanabe, Hiroshi Watanabe
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.17907
Pdf URL: https://arxiv.org/pdf/2503.17907
Copy Paste: [[2503.17907]] Guided Diffusion for the Extension of Machine Vision to Human Visual Perception(https://arxiv.org/abs/2503.17907)
Keywords: diffusion
Abstract: Image compression technology eliminates redundant information to enable efficient transmission and storage of images, serving both machine vision and human visual perception. For years, image coding focused on human perception has been well-studied, leading to the development of various image compression standards. On the other hand, with the rapid advancements in image recognition models, image compression for AI tasks, known as Image Coding for Machines (ICM), has gained significant importance. Therefore, scalable image coding techniques that address the needs of both machines and humans have become a key area of interest. Additionally, there is increasing demand for research applying the diffusion model, which can generate human-viewable images from a small amount of data, to image compression methods for human vision. Image compression methods that use diffusion models can partially reconstruct the target image by guiding the generation process with a small amount of conditioning information. Inspired by the diffusion model's potential, we propose a method for extending machine vision to human visual perception using guided diffusion. Utilizing the diffusion model guided by the output of the ICM method, we generate images for human perception from random noise. Guided diffusion acts as a bridge between machine vision and human vision, enabling transitions between them without any additional bitrate overhead. The generated images are then evaluated based on bitrate and image quality, and we compare their compression performance with other scalable image coding methods for humans and machines.

Title: WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference

Authors: Youhui Zuo, Sibo Wei, Chen Zhang, Zhuorui Liu, Wenpeng Lu, Dawei Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.17922
Pdf URL: https://arxiv.org/pdf/2503.17922
Copy Paste: [[2503.17922]] WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference(https://arxiv.org/abs/2503.17922)
Keywords: robust, large language model
Abstract: With the advancements in long-context inference capabilities of large language models (LLMs), the KV cache has become one of the foundational components. However, its substantial GPU memory consumption makes KV cache compression a key technique for enabling efficient LLM inference in industrial scenarios. While recent studies have focused on optimizing the memory occupied by the KV cache, they overlook two critical factors: preserving semantic coherence and considering task-specific characteristics during compression. To address these limitations, we propose a novel task-adaptive KV cache window selection method, WindowKV. WindowKV dynamically selects local semantic windows consisting of consecutive tokens, according to task-specific characteristics, ensuring the retained KV cache captures continuous, essential context. Additionally, we introduce an intra-group layer KV cache indices sharing strategy to reduce computational overhead, achieving a balance between performance and efficiency. We rigorously evaluate WindowKV on the LongBench benchmark, and the results demonstrate that it maintains a performance comparable to full KV cache retention while using only 12% of the original KV cache, significantly reducing memory requirements. Furthermore, our method also achieves state-of-the-art results in the Needle-in-a-Haystack evaluation, highlighting its effectiveness and robustness.
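
To illustrate the general idea of keeping whole windows of consecutive tokens rather than isolated tokens, here is a minimal sketch under stated assumptions. It is not the WindowKV algorithm: the function name `select_windows`, the fixed non-overlapping windows, the mean-score ranking, and the greedy budget fill are all hypothetical choices, and the task-adaptive and group-wise sharing parts of the method are not modelled.

```python
def select_windows(scores, window_size, budget):
    """Pick token positions to keep by retaining whole windows.

    scores      -- per-token importance scores (hypothetical, e.g. attention mass)
    window_size -- number of consecutive tokens per window
    budget      -- maximum number of token positions to keep

    Returns the kept token indices in ascending order.
    """
    windows = []
    for start in range(0, len(scores), window_size):
        idx = list(range(start, min(start + window_size, len(scores))))
        mean_score = sum(scores[i] for i in idx) / len(idx)
        windows.append((mean_score, start, idx))

    # Highest-scoring windows first; ties broken by earlier position.
    windows.sort(key=lambda w: (-w[0], w[1]))

    kept = []
    for _, _, idx in windows:
        if len(kept) + len(idx) > budget:
            continue  # window does not fit in the remaining budget
        kept.extend(idx)
    return sorted(kept)


scores = [0.1, 0.1, 0.9, 0.8, 0.2, 0.1, 0.7, 0.9]
kept = select_windows(scores, window_size=2, budget=4)
```

With these toy scores the two highest-scoring windows are retained, so the kept positions form two contiguous runs instead of four scattered tokens.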

Title: Debiasing Multimodal Large Language Models via Noise-Aware Preference Optimization

Authors: Zefeng Zhang, Hengzhu Tang, Jiawei Sheng, Zhenyu Zhang, Yiming Ren, Zhenyang Li, Dawei Yin, Duohe Ma, Tingwen Liu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2503.17928
Pdf URL: https://arxiv.org/pdf/2503.17928
Copy Paste: [[2503.17928]] Debiasing Multimodal Large Language Models via Noise-Aware Preference Optimization(https://arxiv.org/abs/2503.17928)
Keywords: robust, large language model
Abstract: Multimodal Large Language Models excel in various tasks, yet often struggle with modality bias, where the model tends to rely heavily on a single modality and overlook critical information in other modalities, which leads to incorrect focus and irrelevant responses. In this paper, we propose using the paradigm of preference optimization to solve the modality bias problem, including RLAIFVBias, a debiased preference optimization dataset, and a Noise-Aware Preference Optimization algorithm. Specifically, we first construct the dataset by introducing perturbations to reduce the informational content of certain modalities, compelling the model to rely on a specific modality when generating negative responses. To address the inevitable noise in automatically constructed data, we combine the noise-robust Mean Absolute Error with the Binary Cross Entropy in Direct Preference Optimization by a negative Box-Cox transformation, and dynamically adjust the algorithm's noise robustness based on the evaluated noise levels in the data. Extensive experiments validate our approach, demonstrating not only its effectiveness in mitigating modality bias but also its significant role in minimizing hallucinations.
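
The combination of Mean Absolute Error and Binary Cross Entropy through a negative Box-Cox transformation can be sketched as follows. The sketch assumes the standard form L_q(p) = (1 - p^q) / q, which tends to -log(p) (cross entropy) as q approaches 0 and equals 1 - p (MAE up to a constant factor) at q = 1. Whether the paper uses exactly this form, and how it sets q from the estimated noise level, is an assumption not confirmed by the abstract; the function names and the default `beta` are illustrative.

```python
import math


def dpo_preference_prob(margin, beta=0.1):
    """Probability that the chosen response is preferred, as in DPO.

    margin -- difference of policy-vs-reference log-ratios (chosen minus rejected)
    beta   -- DPO temperature (value here is an arbitrary illustration)
    """
    return 1.0 / (1.0 + math.exp(-beta * margin))


def box_cox_loss(p, q):
    """Negative Box-Cox transform of a probability p in (0, 1].

    q = 0 recovers binary cross entropy, -log(p).
    q = 1 gives 1 - p, an MAE-style loss that is more tolerant of label noise.
    Values in between interpolate between the two.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    if q == 0.0:
        return -math.log(p)
    return (1.0 - p ** q) / q


p = dpo_preference_prob(margin=0.0)   # no preference signal yet
ce_like = box_cox_loss(p, q=0.0)      # cross-entropy end of the range
mae_like = box_cox_loss(p, q=1.0)     # MAE end of the range
```

A larger q flattens the penalty on confidently wrong pairs, which is the usual motivation for using it when some preference labels are suspected to be noisy.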

Title: STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models

Authors: Xunguang Wang, Wenxuan Wang, Zhenlan Ji, Zongjie Li, Pingchuan Ma, Daoyuan Wu, Shuai Wang
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2503.17932
Pdf URL: https://arxiv.org/pdf/2503.17932
Copy Paste: [[2503.17932]] STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models(https://arxiv.org/abs/2503.17932)
Keywords: defense, attack, robust, large language model
Abstract: Large Language Models (LLMs) have become increasingly vulnerable to jailbreak attacks that circumvent their safety mechanisms. While existing defense methods either suffer from adaptive attacks or require computationally expensive auxiliary models, we present STShield, a lightweight framework for real-time jailbroken judgement. STShield introduces a novel single-token sentinel mechanism that appends a binary safety indicator to the model's response sequence, leveraging the LLM's own alignment capabilities for detection. Our framework combines supervised fine-tuning on normal prompts with adversarial training using embedding-space perturbations, achieving robust detection while preserving model utility. Extensive experiments demonstrate that STShield successfully defends against various jailbreak attacks, while maintaining the model's performance on legitimate queries. Compared to existing approaches, STShield achieves superior defense performance with minimal computational overhead, making it a practical solution for real-world LLM deployment.

Title: Experience Retrieval-Augmentation with Electronic Health Records Enables Accurate Discharge QA

Authors: Justice Ou, Tinglin Huang, Yilun Zhao, Ziyang Yu, Peiqing Lu, Rex Ying
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2503.17933
Pdf URL: https://arxiv.org/pdf/2503.17933
Copy Paste: [[2503.17933]] Experience Retrieval-Augmentation with Electronic Health Records Enables Accurate Discharge QA(https://arxiv.org/abs/2503.17933)
Keywords: large language model
Abstract: To improve the reliability of Large Language Models (LLMs) in clinical applications, retrieval-augmented generation (RAG) is extensively applied to provide factual medical knowledge. However, beyond general medical knowledge from open-ended datasets, clinical case-based knowledge is also critical for effective medical reasoning, as it provides context grounded in real-world patient experiences. Motivated by this, we propose the Experience Retrieval-Augmentation (ExpRAG) framework, based on Electronic Health Records (EHR), which aims to supply relevant context from other patients' discharge reports. ExpRAG performs retrieval through a coarse-to-fine process, utilizing an EHR-based report ranker to efficiently identify similar patients, followed by an experience retriever to extract task-relevant content for enhanced medical reasoning. To evaluate ExpRAG, we introduce DischargeQA, a clinical QA dataset with 1,280 discharge-related questions across diagnosis, medication, and instruction tasks. Each problem is generated using EHR data to ensure realistic and challenging scenarios. Experimental results demonstrate that ExpRAG consistently outperforms a text-based ranker, achieving an average relative improvement of 5.2%, highlighting the importance of case-based knowledge for medical reasoning.

Title: TransAnimate: Taming Layer Diffusion to Generate RGBA Video

Authors: Xuewei Chen, Zhimin Chen, Yiren Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17934
Pdf URL: https://arxiv.org/pdf/2503.17934
Copy Paste: [[2503.17934]] TransAnimate: Taming Layer Diffusion to Generate RGBA Video(https://arxiv.org/abs/2503.17934)
Keywords: diffusion, generative
Abstract: Text-to-video generative models have made remarkable advancements in recent years. However, generating RGBA videos with alpha channels for transparency and visual effects remains a significant challenge due to the scarcity of suitable datasets and the complexity of adapting existing models for this purpose. To address these limitations, we present TransAnimate, an innovative framework that integrates RGBA image generation techniques with video generation modules, enabling the creation of dynamic and transparent videos. TransAnimate efficiently leverages pre-trained text-to-transparent image model weights and combines them with temporal models and controllability plugins trained on RGB videos, adapting them for controllable RGBA video generation tasks. Additionally, we introduce an interactive motion-guided control mechanism, where directional arrows define movement and colors adjust scaling, offering precise and intuitive control for designing game effects. To further alleviate data scarcity, we have developed a pipeline for creating an RGBA video dataset, incorporating high-quality game effect videos, extracted foreground objects, and synthetic transparent videos. Comprehensive experiments demonstrate that TransAnimate generates high-quality RGBA videos, establishing it as a practical and effective tool for applications in gaming and visual effects.

Title: An Empirical Study of the Role of Incompleteness and Ambiguity in Interactions with Large Language Models

Authors: Riya Naik, Ashwin Srinivasan, Estrid He, Swati Agarwal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17936
Pdf URL: https://arxiv.org/pdf/2503.17936
Copy Paste: [[2503.17936]] An Empirical Study of the Role of Incompleteness and Ambiguity in Interactions with Large Language Models(https://arxiv.org/abs/2503.17936)
Keywords: large language model
Abstract: Natural language as a medium for human-computer interaction, long anticipated, has been undergoing a sea change with the advent of Large Language Models (LLMs) with startling capacities for processing and generating language. Many of us now treat LLMs as modern-day oracles, asking them almost any kind of question. Unlike consulting their Delphic predecessor, consulting an LLM does not have to be a single-turn activity (ask a question, receive an answer, leave); and -- also unlike the Pythia -- it is widely acknowledged that answers from LLMs can be improved with additional context. In this paper, we aim to study when we need multi-turn interactions with LLMs to successfully get a question answered, or to conclude that a question is unanswerable. We present a neural symbolic framework that models the interactions between human and LLM agents. Through the proposed framework, we define incompleteness and ambiguity in the questions as properties deducible from the messages exchanged in the interaction, and provide results from benchmark problems, in which the answer-correctness is shown to depend on whether or not questions demonstrate the presence of incompleteness or ambiguity (according to the properties we identify). Our results show that multi-turn interactions are usually required for datasets which have a high proportion of incomplete or ambiguous questions, and that increasing interaction length has the effect of reducing incompleteness or ambiguity. The results also suggest that our measures of incompleteness and ambiguity can be useful tools for characterising interactions with an LLM on question-answering problems.

Title: FisherTune: Fisher-Guided Robust Tuning of Vision Foundation Models for Domain Generalized Segmentation

Authors: Dong Zhao, Jinlong Li, Shuang Wang, Mengyao Wu, Qi Zang, Nicu Sebe, Zhun Zhong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17940
Pdf URL: https://arxiv.org/pdf/2503.17940
Copy Paste: [[2503.17940]] FisherTune: Fisher-Guided Robust Tuning of Vision Foundation Models for Domain Generalized Segmentation(https://arxiv.org/abs/2503.17940)
Keywords: robust, segmentation
Abstract: Vision Foundation Models (VFMs) excel in generalization due to large-scale pretraining, but fine-tuning them for Domain Generalized Semantic Segmentation (DGSS) while maintaining this ability remains challenging. Existing approaches either selectively fine-tune parameters or freeze the VFMs and update only the adapters, both of which may underutilize the VFMs' full potential in DGSS tasks. We observe that domain-sensitive parameters in VFMs, arising from task and distribution differences, can hinder generalization. To address this, we propose FisherTune, a robust fine-tuning method guided by the Domain-Related Fisher Information Matrix (DR-FIM). DR-FIM measures parameter sensitivity across tasks and domains, enabling selective updates that preserve generalization and enhance DGSS adaptability. FisherTune incorporates variational inference to stabilize DR-FIM estimation, treating parameters as Gaussian-distributed variables and leveraging pre-trained priors. Extensive experiments show that FisherTune achieves superior cross-domain segmentation while maintaining generalization, outperforming selective-parameter and adapter-based methods.

Title: SLIDE: Sliding Localized Information for Document Extraction

Authors: Divyansh Singh, Manuel Nunez Martinez, Bonnie J. Dorr, Sonja Schmer Galunder
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.17952
Pdf URL: https://arxiv.org/pdf/2503.17952
Copy Paste: [[2503.17952]] SLIDE: Sliding Localized Information for Document Extraction(https://arxiv.org/abs/2503.17952)
Keywords: extraction, large language model
Abstract: Constructing accurate knowledge graphs from long texts and low-resource languages is challenging, as large language models (LLMs) experience degraded performance with longer input chunks. This problem is amplified in low-resource settings where data scarcity hinders accurate entity and relationship extraction. Contextual retrieval methods, while improving retrieval accuracy, struggle with long documents. They truncate critical information in texts exceeding maximum context lengths of LLMs, significantly limiting knowledge graph construction. We introduce SLIDE (Sliding Localized Information for Document Extraction), a chunking method that processes long documents by generating local context through overlapping windows. SLIDE ensures that essential contextual information is retained, enhancing knowledge graph extraction from documents exceeding LLM context limits. It significantly improves GraphRAG performance, achieving a 24% increase in entity extraction and a 39% improvement in relationship extraction for English. For Afrikaans, a low-resource language, SLIDE achieves a 49% increase in entity extraction and an 82% improvement in relationship extraction. Furthermore, it improves upon state-of-the-art in question-answering metrics such as comprehensiveness, diversity and empowerment, demonstrating its effectiveness in multilingual and resource-constrained settings.
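
The overlapping-window chunking described above can be sketched in a few lines. This is a generic illustration, not the SLIDE implementation: the function name `sliding_windows`, the sentence-level granularity, and the default `window_size` and `overlap` values are assumptions chosen for the example.

```python
def sliding_windows(sentences, window_size=3, overlap=1):
    """Split a list of sentences into overlapping windows.

    Consecutive windows share `overlap` sentences, so each chunk carries
    some local context from its neighbour.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")

    step = window_size - overlap
    windows = []
    for start in range(0, len(sentences), step):
        windows.append(sentences[start:start + window_size])
        if start + window_size >= len(sentences):
            break  # the last window already reaches the end of the document
    return windows


doc = ["s1", "s2", "s3", "s4", "s5", "s6", "s7"]
chunks = sliding_windows(doc, window_size=3, overlap=1)
```

Here each chunk repeats the final sentence of the previous one, so an entity introduced at a chunk boundary still appears alongside the text that follows it.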

Title: On the Origins of Sampling Bias: Implications on Fairness Measurement and Mitigation

Authors: Sami Zhioua, Ruta Binkyte, Ayoub Ouni, Farah Barika Ktata
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.17956
Pdf URL: https://arxiv.org/pdf/2503.17956
Copy Paste: [[2503.17956]] On the Origins of Sampling Bias: Implications on Fairness Measurement and Mitigation(https://arxiv.org/abs/2503.17956)
Keywords: fair
Abstract: Accurately measuring discrimination is crucial to faithfully assessing fairness of trained machine learning (ML) models. Any bias in measuring discrimination leads to either amplification or underestimation of the existing disparity. Several sources of bias exist and it is assumed that bias resulting from machine learning is borne equally by different groups (e.g. females vs males, whites vs blacks, etc.). If, however, bias is borne differently by different groups, it may exacerbate discrimination against specific sub-populations. Sampling bias, in particular, is inconsistently used in the literature to describe bias due to the sampling procedure. In this paper, we attempt to disambiguate this term by introducing clearly defined variants of sampling bias, namely, sample size bias (SSB) and underrepresentation bias (URB). Through an extensive set of experiments on benchmark datasets and using mainstream learning algorithms, we expose relevant observations in several model training scenarios. The observations are finally framed as actionable recommendations for practitioners.

Title: Won: Establishing Best Practices for Korean Financial NLP

Authors: Guijin Son, Hyunwoo Ko, Haneral Jung, Chami Hwang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.17963
Pdf URL: https://arxiv.org/pdf/2503.17963
Copy Paste: [[2503.17963]] Won: Establishing Best Practices for Korean Financial NLP(https://arxiv.org/abs/2503.17963)
Keywords: large language model
Abstract: In this work, we present the first open leaderboard for evaluating Korean large language models focused on finance. Operated for about eight weeks, the leaderboard evaluated 1,119 submissions on a closed benchmark covering five MCQA categories (finance and accounting, stock price prediction, domestic company analysis, financial markets, and financial agent tasks) and one open-ended QA task. Building on insights from these evaluations, we release an open instruction dataset of 80k instances and summarize widely used training strategies observed among top-performing models. Finally, we introduce Won, a fully open and transparent LLM built using these best practices. We hope our contributions help advance the development of better and safer financial LLMs for Korean and other languages.

Title: Understanding the Effects of RLHF on the Quality and Detectability of LLM-Generated Texts

Authors: Beining Xu, Arkaitz Zubiaga
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17965
Pdf URL: https://arxiv.org/pdf/2503.17965
Copy Paste: [[2503.17965]] Understanding the Effects of RLHF on the Quality and Detectability of LLM-Generated Texts(https://arxiv.org/abs/2503.17965)
Keywords: robust, large language model
Abstract: Large Language Models (LLMs) have demonstrated exceptional performance on a range of downstream NLP tasks by generating text that closely resembles human writing. However, the ease of achieving this similarity raises concerns about potential malicious use at scale by bad actors, as LLM-generated text becomes increasingly difficult to discern from human text. Although detection methods have been developed to address this issue, bad actors can further manipulate LLM-generated texts to make them less detectable. In this work, we study how further editing texts with Reinforcement Learning from Human Feedback (RLHF), which aligns model outputs with human preferences, affects (a) the quality of generated texts for two tasks, and (b) the performance of LLM-generated text detectors, looking at both training-based and zero-shot detection methods. Although RLHF improves the quality of LLM-generated texts, we find that it also tends to produce more detectable, lengthy, and repetitive outputs. Additionally, we observe that training-based detectors are vulnerable to short texts and to texts that incorporate code, whereas zero-shot detectors exhibit greater robustness.

Title: Real-World Remote Sensing Image Dehazing: Benchmark and Baseline

Authors: Zeng-Hui Zhu, Wei Lu, Si-Bao Chen, Chris H. Q. Ding, Jin Tang, Bin Luo
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.17966
Pdf URL: https://arxiv.org/pdf/2503.17966
Copy Paste: [[2503.17966]] Real-World Remote Sensing Image Dehazing: Benchmark and Baseline(https://arxiv.org/abs/2503.17966)
Keywords: robust, extraction
Abstract: Remote Sensing Image Dehazing (RSID) poses significant challenges in real-world scenarios due to the complex atmospheric conditions and severe color distortions that degrade image quality. The scarcity of real-world remote sensing hazy image pairs has compelled existing methods to rely primarily on synthetic datasets. However, these methods struggle with real-world applications due to the inherent domain gap between synthetic and real data. To address this, we introduce Real-World Remote Sensing Hazy Image Dataset (RRSHID), the first large-scale dataset featuring real-world hazy and dehazed image pairs across diverse atmospheric conditions. Based on this, we propose MCAF-Net, a novel framework tailored for real-world RSID. Its effectiveness arises from three innovative components: Multi-branch Feature Integration Block Aggregator (MFIBA), which enables robust feature extraction through cascaded integration blocks and parallel multi-branch processing; Color-Calibrated Self-Supervised Attention Module (CSAM), which mitigates complex color distortions via self-supervised learning and attention-guided refinement; and Multi-Scale Feature Adaptive Fusion Module (MFAFM), which integrates features effectively while preserving local details and global context. Extensive experiments validate that MCAF-Net demonstrates state-of-the-art performance in real-world RSID, while maintaining competitive performance on synthetic datasets. The introduction of RRSHID and MCAF-Net sets new benchmarks for real-world RSID research, advancing practical solutions for this complex task. The code and dataset are publicly available.

Title: PhysTwin: Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos

Authors: Hanxiao Jiang, Hao-Yu Hsu, Kaifeng Zhang, Hsin-Ni Yu, Shenlong Wang, Yunzhu Li
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2503.17973
Pdf URL: https://arxiv.org/pdf/2503.17973
Copy Paste: [[2503.17973]] PhysTwin: Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos(https://arxiv.org/abs/2503.17973)
Keywords: generative
Abstract: Creating a physical digital twin of a real-world object has immense potential in robotics, content creation, and XR. In this paper, we present PhysTwin, a novel framework that uses sparse videos of dynamic objects under interaction to produce a photo- and physically realistic, real-time interactive virtual replica. Our approach centers on two key components: (1) a physics-informed representation that combines spring-mass models for realistic physical simulation, generative shape models for geometry, and Gaussian splats for rendering; and (2) a novel multi-stage, optimization-based inverse modeling framework that reconstructs complete geometry, infers dense physical properties, and replicates realistic appearance from videos. Our method integrates an inverse physics framework with visual perception cues, enabling high-fidelity reconstruction even from partial, occluded, and limited viewpoints. PhysTwin supports modeling various deformable objects, including ropes, stuffed animals, cloth, and delivery packages. Experiments show that PhysTwin outperforms competing methods in reconstruction, rendering, future prediction, and simulation under novel interactions. We further demonstrate its applications in interactive real-time simulation and model-based robotic motion planning.

Title: Co-SemDepth: Fast Joint Semantic Segmentation and Depth Estimation on Aerial Images

Authors: Yara AlaaEldin, Francesca Odone
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17982
Pdf URL: https://arxiv.org/pdf/2503.17982
Copy Paste: [[2503.17982]] Co-SemDepth: Fast Joint Semantic Segmentation and Depth Estimation on Aerial Images(https://arxiv.org/abs/2503.17982)
Keywords: segmentation
Abstract: Understanding the geometric and semantic properties of the scene is crucial in autonomous navigation and particularly challenging in the case of Unmanned Aerial Vehicle (UAV) navigation. Such information may be obtained by estimating depth and semantic segmentation maps of the surrounding environment, and for practical use in autonomous navigation, the procedure must be performed as close to real-time as possible. In this paper, we leverage monocular cameras on aerial robots to predict depth and semantic maps in low-altitude unstructured environments. We propose a joint deep-learning architecture that can perform the two tasks accurately and rapidly, and validate its effectiveness on the MidAir and Aeroscapes benchmark datasets. Our joint architecture proves to be competitive with or superior to other single and joint architecture methods while running fast, predicting at 20.2 FPS on a single NVIDIA Quadro P5000 GPU, and it has a low memory footprint. All code for training and prediction can be found at this link: this https URL

Title: Metaphor-based Jailbreaking Attacks on Text-to-Image Models

Authors: Chenyu Zhang, Yiwen Ma, Lanjun Wang, Wenhui Li, Yi Tu, An-An Liu
Subjects: cs.CR, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.17987
Pdf URL: https://arxiv.org/pdf/2503.17987
Copy Paste: [[2503.17987]] Metaphor-based Jailbreaking Attacks on Text-to-Image Models(https://arxiv.org/abs/2503.17987)
Keywords: attack
Abstract: To mitigate misuse, text-to-image (T2I) models commonly incorporate safety filters to prevent the generation of sensitive images. Unfortunately, recent jailbreaking attack methods use LLMs to generate adversarial prompts that effectively bypass safety filters while generating sensitive images, revealing the safety vulnerabilities within the T2I model. However, existing LLM-based attack methods lack explicit guidance, relying on substantial queries to achieve a successful attack, which limits their practicality in real-world scenarios. In this work, we introduce MJA, a metaphor-based jailbreaking attack method inspired by the Taboo game, aiming to balance the attack effectiveness and query efficiency by generating metaphor-based adversarial prompts. Specifically, MJA consists of two modules: an LLM-based multi-agent generation module (MLAG) and an adversarial prompt optimization module (APO). MLAG decomposes the generation of metaphor-based adversarial prompts into three subtasks: metaphor retrieval, context matching, and adversarial prompt generation. Subsequently, MLAG coordinates three LLM-based agents to generate diverse adversarial prompts by exploring various metaphors and contexts. To enhance the attack efficiency, APO first trains a surrogate model to predict the attack results of adversarial prompts and then designs an acquisition strategy to adaptively identify optimal adversarial prompts. Experiments demonstrate that MJA achieves better attack effectiveness while requiring fewer queries compared to baseline methods. Moreover, our adversarial prompts exhibit strong transferability across various open-source and commercial T2I models. This paper includes model-generated content that may contain offensive or distressing material.

Title: Geometric Constrained Non-Line-of-Sight Imaging

Authors: Xueying Liu, Lianfang Wang, Jun Liu, Yong Wang, Yuping Duan
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.17992
Pdf URL: https://arxiv.org/pdf/2503.17992
Copy Paste: [[2503.17992]] Geometric Constrained Non-Line-of-Sight Imaging(https://arxiv.org/abs/2503.17992)
Keywords: robust
Abstract: Normal reconstruction is crucial in non-line-of-sight (NLOS) imaging, as it provides key geometric and lighting information about hidden objects, which significantly improves reconstruction accuracy and scene understanding. However, jointly estimating normals and albedo expands the problem from matrix-valued functions to tensor-valued functions, which substantially increases complexity and computational difficulty. In this paper, we propose a novel joint albedo-surface reconstruction method, which utilizes the Frobenius norm of the shape operator to control the variation rate of the normal field. It is the first attempt to apply regularization methods to the reconstruction of surface normals for hidden objects. By improving the accuracy of the normal field, it enhances detail representation and achieves high-precision reconstruction of hidden object geometry. The proposed method demonstrates robustness and effectiveness on both synthetic and experimental datasets. On transient data captured within 15 seconds, our surface normal-regularized reconstruction model produces more accurate surfaces than recently proposed methods and is 30 times faster than the existing surface reconstruction approach.

Title: Instructing the Architecture Search for Spatial-temporal Sequence Forecasting with LLM

Authors: Xin Xue, Haoyi Zhou, Tianyu Chen, Shuai Zhang, Yizhou Long, Jianxin Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17994
Pdf URL: https://arxiv.org/pdf/2503.17994
Copy Paste: [[2503.17994]] Instructing the Architecture Search for Spatial-temporal Sequence Forecasting with LLM(https://arxiv.org/abs/2503.17994)
Keywords: large language model
Abstract: Spatial-temporal sequence forecasting (STSF) is a long-standing research problem with widespread real-world applications. Neural architecture search (NAS), which automates neural network design, has been shown to be effective in tackling the STSF problem. However, the existing NAS methods for STSF focus on generating architectures in a time-consuming data-driven fashion, which heavily limits their ability to use background knowledge and explore the complicated search trajectory. Large language models (LLMs) have shown remarkable ability in decision-making with comprehensive internal world knowledge, but how they could benefit NAS for STSF remains unexplored. In this paper, we propose a novel NAS method for STSF based on LLMs. Instead of directly generating architectures with the LLM, we stimulate the LLM's capability with a multi-level enhancement mechanism. Specifically, on the step-level, we decompose the generation task into decision steps with powerful prompt engineering and prompt the LLM to serve as an instructor for architecture search based on its internal knowledge. On the instance-level, we utilize a one-step tuning framework to quickly evaluate the architecture instance and a memory bank to accumulate knowledge to improve the LLM's search ability. On the task-level, we propose a two-stage architecture search, balancing the exploration stage and optimization stage, to reduce the possibility of being trapped in local optima. Extensive experimental results demonstrate that our method can achieve competitive effectiveness with superior efficiency against existing NAS methods for STSF.

Title: SymmCompletion: High-Fidelity and High-Consistency Point Cloud Completion with Symmetry Guidance

Authors: Hongyu Yan, Zijun Li, Kunming Luo, Li Lu, Ping Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18007
Pdf URL: https://arxiv.org/pdf/2503.18007
Copy Paste: [[2503.18007]] SymmCompletion: High-Fidelity and High-Consistency Point Cloud Completion with Symmetry Guidance(https://arxiv.org/abs/2503.18007)
Keywords: transformer
Abstract: Point cloud completion aims to recover a complete point shape from a partial point cloud. Although existing methods can form satisfactory point clouds in terms of global completeness, they often lose the original geometric details and face the problem of geometric inconsistency between existing point clouds and reconstructed missing parts. To tackle this problem, we introduce SymmCompletion, a highly effective completion method based on symmetry guidance. Our method comprises two primary components: a Local Symmetry Transformation Network (LSTNet) and a Symmetry-Guidance Transformer (SGFormer). First, LSTNet efficiently estimates point-wise local symmetry transformations to transform key geometries of partial inputs into missing regions, thereby generating geometry-aligned partial-missing pairs and initial point clouds. Second, SGFormer leverages the geometric features of partial-missing pairs as explicit symmetric guidance that constrains the refinement process for initial point clouds. As a result, SGFormer can exploit the provided priors to form high-fidelity and geometry-consistent final point clouds. Qualitative and quantitative evaluations on several benchmark datasets demonstrate that our method outperforms state-of-the-art completion networks.

Title: Personalized Language Models via Privacy-Preserving Evolutionary Model Merging

Authors: Kyuyoung Kim, Jinwoo Shin, Jaehyung Kim
Subjects: cs.CL, cs.NE
Abstract URL: https://arxiv.org/abs/2503.18008
Pdf URL: https://arxiv.org/pdf/2503.18008
Copy Paste: [[2503.18008]] Personalized Language Models via Privacy-Preserving Evolutionary Model Merging(https://arxiv.org/abs/2503.18008)
Keywords: privacy, large language model
Abstract: Personalization in large language models (LLMs) seeks to tailor models to individual user or user group preferences. Prompt-based methods augment queries with user preference information, whereas training-based methods directly encode preferences into model parameters for more effective personalization. Despite achieving some success in personalizing LLMs, prior methods often fail to directly optimize task-specific metrics and lack explicit privacy-preservation mechanisms. To address these limitations, we propose Privacy-Preserving Model Merging via Evolutionary Algorithms (PriME), a novel approach to personalization that employs gradient-free methods to directly optimize task-specific metrics while preserving user privacy. By incorporating privacy preservation into optimization, PriME produces a personalized module that effectively captures the target user's preferences while minimizing the privacy risks for the users sharing their private information. Experiments on the LaMP benchmark show that PriME outperforms both prompt-based and training-based methods, achieving up to a 45% performance improvement over the prior art. Further analysis shows that PriME achieves a significantly better privacy-utility trade-off, highlighting the potential of evolutionary approaches for privacy-preserving LLM personalization.

Title: Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning

Authors: Yufei Zhan, Yousong Zhu, Shurong Zheng, Hongyin Zhao, Fan Yang, Ming Tang, Jinqiao Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18013
Pdf URL: https://arxiv.org/pdf/2503.18013
Copy Paste: [[2503.18013]] Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning(https://arxiv.org/abs/2503.18013)
Keywords: robust
Abstract: Large Vision-Language Models (LVLMs) typically follow a two-stage training paradigm: pretraining and supervised fine-tuning. Recently, preference optimization, derived from the language domain, has emerged as an effective post-training reinforcement strategy to enhance capabilities of LVLMs. However, constructing high-quality human-annotated preference data and developing robust reward models to mimic these preferences are both costly and challenging. Motivated by this observation, we propose Vision-R1, a novel vision-guided R1-like reinforcement learning algorithm for LVLMs that rewards models with definitive vision feedback. It only leverages curated instruction data, eliminating the need for specialized reward models and handcrafted preference datasets. We incorporate a criterion-driven reward function that further integrates multi-dimensional feedback to evaluate model completions comprehensively based on the vision task logic. Furthermore, we introduce a progressive rule refinement strategy that dynamically adjusts the reward criteria during training, enabling continuous model improvement and mitigating reward hacking. Extensive experiments on both in-distribution and out-of-distribution benchmarks demonstrate that fine-tuning 7B LVLMs with Vision-R1 achieves consistent performance gains, with improvements of up to 50%, even surpassing the state-of-the-art model 10x its size.

Title: Retrieval Augmented Generation and Understanding in Vision: A Survey and New Outlook

Authors: Xu Zheng, Ziqiao Weng, Yuanhuiyi Lyu, Lutao Jiang, Haiwei Xue, Bin Ren, Danda Paudel, Nicu Sebe, Luc Van Gool, Xuming Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18016
Pdf URL: https://arxiv.org/pdf/2503.18016
Copy Paste: [[2503.18016]] Retrieval Augmented Generation and Understanding in Vision: A Survey and New Outlook(https://arxiv.org/abs/2503.18016)
Keywords: large language model
Abstract: Retrieval-augmented generation (RAG) has emerged as a pivotal technique in artificial intelligence (AI), particularly in enhancing the capabilities of large language models (LLMs) by enabling access to external, reliable, and up-to-date knowledge sources. In the context of AI-Generated Content (AIGC), RAG has proven invaluable by augmenting model outputs with supplementary, relevant information, thus improving their quality. Recently, the potential of RAG has extended beyond natural language processing, with emerging methods integrating retrieval-augmented strategies into the computer vision (CV) domain. These approaches aim to address the limitations of relying solely on internal model knowledge by incorporating authoritative external knowledge bases, thereby improving both the understanding and generation capabilities of vision models. This survey provides a comprehensive review of the current state of retrieval-augmented techniques in CV, focusing on two main areas: (I) visual understanding and (II) visual generation. In the realm of visual understanding, we systematically review tasks ranging from basic image recognition to complex applications such as medical report generation and multimodal question answering. For visual content generation, we examine the application of RAG in tasks related to image, video, and 3D generation. Furthermore, we explore recent advancements in RAG for embodied AI, with a particular focus on applications in planning, task execution, multimodal perception, interaction, and specialized domains. Given that the integration of retrieval-augmented techniques in CV is still in its early stages, we also highlight the key limitations of current approaches and propose future research directions to drive the development of this promising area.

Title: OmnimatteZero: Training-free Real-time Omnimatte with Pre-trained Video Diffusion Models

Authors: Dvir Samuel, Matan Levy, Nir Darshan, Gal Chechik, Rami Ben-Ari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18033
Pdf URL: https://arxiv.org/pdf/2503.18033
Copy Paste: [[2503.18033]] OmnimatteZero: Training-free Real-time Omnimatte with Pre-trained Video Diffusion Models(https://arxiv.org/abs/2503.18033)
Keywords: diffusion
Abstract: Omnimatte aims to decompose a given video into semantically meaningful layers, including the background and individual objects along with their associated effects, such as shadows and reflections. Existing methods often require extensive training or costly self-supervised optimization. In this paper, we present OmnimatteZero, a training-free approach that leverages off-the-shelf pre-trained video diffusion models for omnimatte. It can remove objects from videos, extract individual object layers along with their effects, and composite those objects onto new videos. We accomplish this by adapting zero-shot image inpainting techniques for video object removal, a task they fail to handle effectively out-of-the-box. We then show that self-attention maps capture information about the object and its footprints and use them to inpaint the object's effects, leaving a clean background. Additionally, through simple latent arithmetic, object layers can be isolated and recombined seamlessly with new video layers to produce new videos. Evaluations show that OmnimatteZero not only achieves superior performance in terms of background reconstruction but also sets a new record for the fastest Omnimatte approach, achieving real-time performance with minimal frame runtime.

Title: Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models

Authors: Qiao Liang, Yanjiang Liu, Ben He, Yaojie Lu, Hongyu Lin, Jia Zheng, Xianpei Han, Le Sun, Yingfei Sun
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2503.18034
Pdf URL: https://arxiv.org/pdf/2503.18034
Copy Paste: [[2503.18034]] Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models(https://arxiv.org/abs/2503.18034)
Keywords: large language model
Abstract: Does the prior knowledge of the vision encoder constrain the capability boundary of Multi-modal Large Language Models (MLLMs)? While most existing research treats MLLMs as unified systems optimized through end-to-end training, the impact of the vision encoder's prior knowledge is seldom investigated. In this work, we introduce a novel metric, $Rank_e$, to quantify the effect of the vision encoder's prior knowledge on MLLM performance. Our analysis reveals a positive correlation between prior knowledge and MLLM performance. Moreover, we find that domain-specific fine-tuning using solely end-to-end visual question answering (VQA) data is insufficient, particularly for entities with low inherent visual prior knowledge. To address this issue, we propose VisPRE (Vision Prior Remediation), a two-stage training framework that explicitly incorporates prior knowledge at the vision encoder level. Experimental results demonstrate that augmenting the vision encoder's prior knowledge substantially boosts the visual understanding capabilities of MLLMs, offering a novel and effective strategy for improving performance, especially in scenarios involving uncommon visual entities.

Title: DualCP: Rehearsal-Free Domain-Incremental Learning via Dual-Level Concept Prototype

Authors: Qiang Wang, Yuhang He, SongLin Dong, Xiang Song, Jizhou Han, Haoyu Luo, Yihong Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18042
Pdf URL: https://arxiv.org/pdf/2503.18042
Copy Paste: [[2503.18042]] DualCP: Rehearsal-Free Domain-Incremental Learning via Dual-Level Concept Prototype(https://arxiv.org/abs/2503.18042)
Keywords: privacy
Abstract: Domain-Incremental Learning (DIL) enables vision models to adapt to changing conditions in real-world environments while maintaining the knowledge acquired from previous domains. Given privacy concerns and training time, Rehearsal-Free DIL (RFDIL) is more practical. Inspired by the incremental cognitive process of the human brain, we design Dual-level Concept Prototypes (DualCP) for each class to address the conflict between learning new knowledge and retaining old knowledge in RFDIL. To construct DualCP, we propose a Concept Prototype Generator (CPG) that generates both coarse-grained and fine-grained prototypes for each class. Additionally, we introduce a Coarse-to-Fine calibrator (C2F) to align image features with DualCP. Finally, we propose a Dual Dot-Regression (DDR) loss function to optimize our C2F module. Extensive experiments on the DomainNet, CDDB, and CORe50 datasets demonstrate the effectiveness of our method.

Title: BERTDetect: A Neural Topic Modelling Approach for Android Malware Detection

Authors: Nishavi Ranaweera, Jiarui Xu, Suranga Seneviratne, Aruna Seneviratne
Subjects: cs.CR, cs.IR
Abstract URL: https://arxiv.org/abs/2503.18043
Pdf URL: https://arxiv.org/pdf/2503.18043
Copy Paste: [[2503.18043]] BERTDetect: A Neural Topic Modelling Approach for Android Malware Detection(https://arxiv.org/abs/2503.18043)
Keywords: protect, attack
Abstract: Web access today occurs predominantly through mobile devices, with Android representing a significant share of the mobile device market. This widespread usage makes Android a prime target for malicious attacks. Despite efforts to combat malicious attacks through tools like Google Play Protect and antivirus software, new and evolved malware continues to infiltrate Android devices. Source code analysis is effective but limited, as attackers quickly abandon old malware for new variants to evade detection. Therefore, there is a need for alternative methods that complement source code analysis. Prior research investigated clustering applications based on their descriptions and identified outliers in these clusters by API usage as malware. However, these works often used traditional techniques such as Latent Dirichlet Allocation (LDA) and k-means clustering, which do not capture the nuanced semantic structures present in app descriptions. To this end, in this paper, we propose BERTDetect, which leverages BERTopic neural topic modelling to effectively capture the latent topics in app descriptions. The resulting topic clusters are more coherent than those of previous methods and represent the app functionalities well. Our results demonstrate that BERTDetect outperforms other baselines, achieving ~10% relative improvement in F1 score.

Title: Interpretable Feature Interaction via Statistical Self-supervised Learning on Tabular Data

Authors: Xiaochen Zhang, Haoyi Xiong
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.18048
Pdf URL: https://arxiv.org/pdf/2503.18048
Copy Paste: [[2503.18048]] Interpretable Feature Interaction via Statistical Self-supervised Learning on Tabular Data(https://arxiv.org/abs/2503.18048)
Keywords: robust, extraction, interpretability
Abstract: In high-dimensional and high-stakes contexts, ensuring both rigorous statistical guarantees and interpretability in feature extraction from complex tabular data remains a formidable challenge. Traditional methods such as Principal Component Analysis (PCA) reduce dimensionality and identify key features that explain the most variance, but are constrained by their reliance on linear assumptions. In contrast, neural networks offer assumption-free feature extraction through self-supervised learning techniques such as autoencoders, though their interpretability remains a challenge in fields requiring transparency. To address this gap, this paper introduces Spofe, a novel self-supervised machine learning pipeline that marries the power of kernel principal components for capturing nonlinear dependencies with a sparse and principled polynomial representation to achieve clear interpretability with statistical rigor. Underpinning our approach is a robust theoretical framework that delivers precise error bounds and rigorous false discovery rate (FDR) control via a multi-objective knockoff selection procedure; it effectively bridges the gap between data-driven complexity and statistical reliability via three stages: (1) generating self-supervised signals using kernel principal components to model complex patterns, (2) distilling these signals into sparse polynomial functions for improved interpretability, and (3) applying a multi-objective knockoff selection procedure with significance testing to rigorously identify important features. Extensive experiments on diverse real-world datasets demonstrate the effectiveness of Spofe, consistently surpassing KPCA, SKPCA, and other methods in feature selection for regression and classification tasks. Visualization and case studies highlight its ability to uncover key insights, enhancing interpretability and practical utility.

Title: PolarFree: Polarization-based Reflection-free Imaging

Authors: Mingde Yao, Menglu Wang, King-Man Tam, Lingen Li, Tianfan Xue, Jinwei Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18055
Pdf URL: https://arxiv.org/pdf/2503.18055
Copy Paste: [[2503.18055]] PolarFree: Polarization-based Reflection-free Imaging(https://arxiv.org/abs/2503.18055)
Keywords: diffusion
Abstract: Reflection removal is challenging due to complex light interactions, where reflections obscure important details and hinder scene understanding. Polarization naturally provides a powerful cue to distinguish between reflected and transmitted light, enabling more accurate reflection removal. However, existing methods often rely on small-scale or synthetic datasets, which fail to capture the diversity and complexity of real-world scenarios. To this end, we construct a large-scale dataset, PolaRGB, for Polarization-based reflection removal of RGB images, which enables us to train models that generalize effectively across a wide range of real-world scenarios. The PolaRGB dataset contains 6,500 well-aligned mixed-transmission image pairs, 8x larger than existing polarization datasets, and is the first to include both RGB and polarization images captured across diverse indoor and outdoor environments with varying lighting conditions. In addition, to fully exploit the potential of polarization cues for reflection removal, we introduce PolarFree, which leverages a diffusion process to generate reflection-free cues for accurate reflection removal. Extensive experiments show that PolarFree significantly enhances image clarity in challenging reflective scenarios, setting a new benchmark for polarized imaging and reflection removal. Code and dataset are available at this https URL.

Title: Investigating Recent Large Language Models for Vietnamese Machine Reading Comprehension

Authors: Anh Duc Nguyen, Hieu Minh Phi, Anh Viet Ngo, Long Hai Trieu, Thai Phuong Nguyen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.18062
Pdf URL: https://arxiv.org/pdf/2503.18062
Copy Paste: [[2503.18062]] Investigating Recent Large Language Models for Vietnamese Machine Reading Comprehension(https://arxiv.org/abs/2503.18062)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown remarkable proficiency in Machine Reading Comprehension (MRC) tasks; however, their effectiveness for low-resource languages like Vietnamese remains largely unexplored. In this paper, we fine-tune and evaluate two state-of-the-art LLMs: Llama 3 (8B parameters) and Gemma (7B parameters), on ViMMRC, a Vietnamese MRC dataset. By utilizing Quantized Low-Rank Adaptation (QLoRA), we efficiently fine-tune these models and compare their performance against powerful LLM-based baselines. Although our fine-tuned models are smaller than GPT-3 and GPT-3.5, they outperform both traditional BERT-based approaches and these larger models. This demonstrates the effectiveness of our fine-tuning process, showcasing how modern LLMs can surpass the capabilities of older models like BERT while still being suitable for deployment in resource-constrained environments. Through intensive analyses, we explore various aspects of model performance, providing valuable insights into adapting LLMs for low-resource languages like Vietnamese. Our study contributes to the advancement of natural language processing in low-resource languages, and we make our fine-tuned models publicly available at: this https URL.

Title: Dynamic Allocation Hypernetwork with Adaptive Model Recalibration for FCL

Authors: Xiaoming Qi, Jingyang Zhang, Huazhu Fu, Guanyu Yang, Shuo Li, Yueming Jin
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.18064
Pdf URL: https://arxiv.org/pdf/2503.18064
Copy Paste: [[2503.18064]] Dynamic Allocation Hypernetwork with Adaptive Model Recalibration for FCL(https://arxiv.org/abs/2503.18064)
Keywords: federate
Abstract: Federated continual learning (FCL) offers an emerging pattern to facilitate the applicability of federated learning (FL) in real-world scenarios, where tasks evolve dynamically and asynchronously across clients, especially in medical scenarios. Existing server-side FCL methods in the natural domain construct a continually learnable server model by client aggregation on all involved tasks. However, they are challenged by: (1) Catastrophic forgetting of previously learned tasks, leading to error accumulation in the server model and making it difficult to sustain comprehensive knowledge across all tasks. (2) Biased optimization due to asynchronous tasks handled across different clients, leading to the collision of optimization targets of different clients at the same time steps. In this work, we take the first step and propose a novel server-side FCL pattern in the medical domain, Dynamic Allocation Hypernetwork with adaptive model recalibration (FedDAH). It facilitates collaborative learning under the distinct and dynamic task streams across clients. To alleviate catastrophic forgetting, we propose a dynamic allocation hypernetwork (DAHyper) where a continually updated hypernetwork is designed to manage the mapping between task identities and their associated model parameters, enabling the dynamic allocation of the model across clients. For the biased optimization, we introduce a novel adaptive model recalibration (AMR) to incorporate the candidate changes of historical models into current server updates, and assign weights to identical tasks across different time steps based on their similarity for continual optimization. Extensive experiments on the AMOS dataset demonstrate the superiority of our FedDAH over other FCL methods on sites with different task streams. The code is available at this https URL.

Title: Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation

Authors: Ziming Wei, Bingqian Lin, Yunshuang Nie, Jiaqi Chen, Shikui Ma, Hang Xu, Xiaodan Liang
Subjects: cs.CV, cs.AI, cs.CL, cs.RO
Abstract URL: https://arxiv.org/abs/2503.18065
Pdf URL: https://arxiv.org/pdf/2503.18065
Copy Paste: [[2503.18065]] Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation(https://arxiv.org/abs/2503.18065)
Keywords: large language model
Abstract: Data scarcity is a long-standing challenge in the Vision-Language Navigation (VLN) field, which severely hinders the generalization of agents to unseen environments. Previous works primarily rely on additional simulator data or web-collected images/videos to improve the generalization. However, the simulator environments still face limited diversity, and the web-collected data often requires extensive labor to remove the noise. In this paper, we propose a Rewriting-driven AugMentation (RAM) paradigm for VLN, which directly creates the unseen observation-instruction pairs via rewriting human-annotated training data. Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in both simulator-free and labor-saving manners to promote generalization. Specifically, we first introduce Object-Enriched Observation Rewriting, where we combine Vision-Language Models (VLMs) and Large Language Models (LLMs) to derive rewritten object-enriched scene descriptions, enabling observation synthesis with diverse objects and spatial layouts via Text-to-Image Generation Models (T2IMs). Then, we propose Observation-Contrast Instruction Rewriting, which generates observation-aligned rewritten instructions by requiring LLMs to reason about the difference between original and new observations. We further develop a mixing-then-focusing training strategy with a random observation cropping scheme, effectively enhancing data distribution diversity while suppressing augmentation data noise during training. Experiments on both the discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method. Code is available at this https URL.

Title: Self-Explaining Neural Networks for Business Process Monitoring

Authors: Shahaf Bassan, Shlomit Gur, Sergey Zeltyn, Konstantinos Mavrogiorgos, Ron Eliav, Dimosthenis Kyriazis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.18067
Pdf URL: https://arxiv.org/pdf/2503.18067
Copy Paste: [[2503.18067]] Self-Explaining Neural Networks for Business Process Monitoring(https://arxiv.org/abs/2503.18067)
Keywords: explainability
Abstract: Tasks in Predictive Business Process Monitoring (PBPM), such as Next Activity Prediction, focus on generating useful business predictions from historical case logs. Recently, Deep Learning methods, particularly sequence-to-sequence models like Long Short-Term Memory (LSTM), have become a dominant approach for tackling these tasks. However, to enhance model transparency, build trust in the predictions, and gain a deeper understanding of business processes, it is crucial to explain the decisions made by these models. Existing explainability methods for PBPM decisions are typically *post-hoc*, meaning they provide explanations only after the model has been trained. Unfortunately, these post-hoc approaches have been shown to face various challenges, including lack of faithfulness, high computational costs and a significant sensitivity to out-of-distribution samples. In this work, we introduce, to the best of our knowledge, the first *self-explaining neural network* architecture for predictive process monitoring. Our framework trains an LSTM model that not only provides predictions but also outputs a concise explanation for each prediction, while adapting the optimization objective to improve the reliability of the explanation. We first demonstrate that incorporating explainability into the training process does not hurt model performance, and in some cases, actually improves it. Additionally, we show that our method outperforms post-hoc approaches in terms of both the faithfulness of the generated explanations and substantial improvements in efficiency.

Title: PanopticSplatting: End-to-End Panoptic Gaussian Splatting

Authors: Yuxuan Xie, Xuan Yu, Changjian Jiang, Sitong Mao, Shunbo Zhou, Rui Fan, Rong Xiong, Yue Wang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.18073
Pdf URL: https://arxiv.org/pdf/2503.18073
Copy Paste: [[2503.18073]] PanopticSplatting: End-to-End Panoptic Gaussian Splatting(https://arxiv.org/abs/2503.18073)
Keywords: robust, segmentation
Abstract: Open-vocabulary panoptic reconstruction is a challenging task for simultaneous scene reconstruction and understanding. Recently, methods have been proposed for 3D scene understanding based on Gaussian splatting. However, these methods are multi-staged, suffering from accumulated errors and dependence on hand-designed components. To streamline the pipeline and achieve global optimization, we propose PanopticSplatting, an end-to-end system for open-vocabulary panoptic reconstruction. Our method introduces query-guided Gaussian segmentation with local cross attention, lifting 2D instance masks without cross-frame association in an end-to-end way. The local cross attention within view frustum effectively reduces the training memory, making our model more accessible to large scenes with more Gaussians and objects. In addition, to address the challenge of noisy labels in 2D pseudo masks, we propose label blending to promote consistent 3D segmentation with less noisy floaters, as well as label warping on 2D predictions which enhances multi-view coherence and segmentation accuracy. Our method demonstrates strong performance in 3D scene panoptic reconstruction on the ScanNet-V2 and ScanNet++ datasets, compared with both NeRF-based and Gaussian-based panoptic reconstruction methods. Moreover, PanopticSplatting can be easily generalized to numerous variants of Gaussian splatting, and we demonstrate its robustness on different Gaussian base models.

Title: A Multi-Model Adaptation of Speculative Decoding for Classification

Authors: Somnath Roy, Padharthi Sreekar, Srivatsa Narasimha, Anubhav Anand
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.18076
Pdf URL: https://arxiv.org/pdf/2503.18076
Copy Paste: [[2503.18076]] A Multi-Model Adaptation of Speculative Decoding for Classification(https://arxiv.org/abs/2503.18076)
Keywords: robust
Abstract: The current study introduces a novel adaptation of speculative decoding, repurposed from generation to classification tasks. We propose a multi-model framework employing up to three lightweight worker models and a single, more robust judge model analogous to draft models and target model, respectively, in speculative decoding. The worker models, tasked with the bulk of the computation, independently predict discrete class labels for a given input. When a majority of worker models agree on a label, it is accepted as the final label, optimizing efficiency by bypassing the computationally expensive judge model. In cases of disagreement, the judge model intervenes to resolve the label. This approach minimizes redundant computation, leverages the redundancy of multiple workers for confidence, and confines the judge model's role to challenging cases, offering a practical balance of efficiency and accuracy. Our analysis suggests that smaller out of the box instruction/chat finetuned worker models with 3 billion parameters (hereafter, 3B) demonstrate a level of alignment with judge models comparable to that of larger finetuned worker models with 7 billion parameters (hereafter, 7B) across both simple and higher order reasoning tasks. The top performing 3B worker model pair achieves an agreement rate of approximately 80-83% for sentiment and around 50-80% for similar ticket when compared to judge models. Additionally, 3B worker models provide a speedup ranging from 2.8x to 9x relative to the judge models, while 7B worker model combinations achieve a speedup ranging from 1.28x to 0.28x.
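The worker/judge routing described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Classifier` interface, the function name `classify_with_workers`, the strict-majority rule, and the toy keyword-based models are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

# Hypothetical interface: a classifier maps an input text to a class label.
Classifier = Callable[[str], str]


def classify_with_workers(
    text: str,
    workers: Sequence[Classifier],
    judge: Classifier,
) -> Tuple[str, bool]:
    """Return (label, judge_used) for one input."""
    if not workers:
        # No workers available: the judge decides directly.
        return judge(text), True
    # Each worker predicts a label independently.
    votes = Counter(worker(text) for worker in workers)
    label, count = votes.most_common(1)[0]
    # Assumed rule: accept only a strict majority (more than half the workers).
    if count * 2 > len(workers):
        return label, False
    # Disagreement: the more expensive judge resolves the label.
    return judge(text), True


# Toy stand-ins for real models (illustration only).
def worker_a(text: str) -> str:
    return "positive" if "good" in text else "negative"


def worker_b(text: str) -> str:
    return "positive" if ("good" in text or "great" in text) else "negative"


def worker_c(text: str) -> str:
    return "negative" if "not" in text else "positive"


def judge(text: str) -> str:
    return "negative" if "not" in text else "positive"


if __name__ == "__main__":
    print(classify_with_workers("good service", [worker_a, worker_b, worker_c], judge))
    print(classify_with_workers("not good", [worker_a, worker_c], judge))
```

Returning the `judge_used` flag alongside the label makes it easy to measure how often the judge is bypassed, which is the source of any speedup in a scheme like this.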

Title: Model-Guardian: Protecting against Data-Free Model Stealing Using Gradient Representations and Deceptive Predictions

Authors: Yunfei Yang, Xiaojun Chen, Yuexin Xuan, Zhendong Zhao
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.18081
Pdf URL: https://arxiv.org/pdf/2503.18081
Copy Paste: [[2503.18081]] Model-Guardian: Protecting against Data-Free Model Stealing Using Gradient Representations and Deceptive Predictions(https://arxiv.org/abs/2503.18081)
Keywords: protect, defense, attack, steal, diffusion, data-free
Abstract: Model stealing attacks are increasingly threatening the confidentiality of machine learning models deployed in the cloud. Recent studies reveal that adversaries can exploit data synthesis techniques to steal machine learning models even in scenarios devoid of real data, leading to data-free model stealing attacks. Existing defenses against such attacks suffer from limitations, including poor effectiveness, insufficient generalization ability, and low comprehensiveness. In response, this paper introduces a novel defense framework named Model-Guardian. Comprising two components, Data-Free Model Stealing Detector (DFMS-Detector) and Deceptive Predictions (DPreds), Model-Guardian is designed to address the shortcomings of current defenses with the help of the artifact properties of synthetic samples and gradient representations of samples. Extensive experiments on seven prevalent data-free model stealing attacks showcase the effectiveness and superior generalization ability of Model-Guardian, outperforming eleven defense methods and establishing a new state-of-the-art performance. Notably, this work pioneers the utilization of various GANs and diffusion models for generating highly realistic query samples in attacks, with Model-Guardian demonstrating accurate detection capabilities.

Title: Vehicular Road Crack Detection with Deep Learning: A New Online Benchmark for Comprehensive Evaluation of Existing Algorithms

Authors: Nachuan Ma, Zhengfei Song, Qiang Hu, Chuang-Wei Liu, Yu Han, Yanting Zhang, Rui Fan, Lihua Xie
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.18082
Pdf URL: https://arxiv.org/pdf/2503.18082
Copy Paste: [[2503.18082]] Vehicular Road Crack Detection with Deep Learning: A New Online Benchmark for Comprehensive Evaluation of Existing Algorithms(https://arxiv.org/abs/2503.18082)
Keywords: large language model
Abstract: In the emerging field of urban digital twins (UDTs), advancing intelligent road inspection (IRI) vehicles with automatic road crack detection systems is essential for maintaining civil infrastructure. Over the past decade, deep learning-based road crack detection methods have been developed to detect cracks more efficiently, accurately, and objectively, with the goal of replacing manual visual inspection. Nonetheless, there is a lack of systematic reviews on state-of-the-art (SoTA) deep learning techniques, especially data-fusion and label-efficient algorithms for this task. This paper thoroughly reviews the SoTA deep learning-based algorithms, including (1) supervised, (2) unsupervised, (3) semi-supervised, and (4) weakly-supervised methods developed for road crack detection. Also, we create a dataset called UDTIRI-Crack, comprising $2,500$ high-quality images from seven public annotated sources, as the first extensive online benchmark in this field. Comprehensive experiments are conducted to compare the detection performance, computational efficiency, and generalizability of public SoTA deep learning-based algorithms for road crack detection. In addition, the feasibility of foundation models and large language models (LLMs) for road crack detection is explored. Afterwards, the existing challenges and future development trends of deep learning-based road crack detection algorithms are discussed. We believe this review can serve as practical guidance for developing intelligent road detection vehicles with the next-generation road condition assessment systems. The released benchmark UDTIRI-Crack is available at this https URL.

Title: Unified Geometry and Color Compression Framework for Point Clouds via Generative Diffusion Priors

Authors: Tianxin Huang, Gim Hee Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18083
Pdf URL: https://arxiv.org/pdf/2503.18083
Copy Paste: [[2503.18083]] Unified Geometry and Color Compression Framework for Point Clouds via Generative Diffusion Priors(https://arxiv.org/abs/2503.18083)
Keywords: diffusion, generative
Abstract: With the growth of 3D applications and the rapid increase in sensor-collected 3D point cloud data, there is a rising demand for efficient compression algorithms. Most existing learning-based compression methods handle geometry and color attributes separately, treating them as distinct tasks, making these methods challenging to apply directly to point clouds with colors. Besides, the limited capacities of training datasets also limit their generalizability across points with different distributions. In this work, we introduce a test-time unified geometry and color compression framework of 3D point clouds. Instead of training a compression model based on specific datasets, we adapt a pre-trained generative diffusion model to compress original colored point clouds into sparse sets, termed 'seeds', using prompt tuning. Decompression is then achieved through multiple denoising steps with separate sampling processes. Experiments on objects and indoor scenes demonstrate that our method has superior performances compared to existing baselines for the compression of geometry and color.

Title: Temporal Relation Extraction in Clinical Texts: A Span-based Graph Transformer Approach

Authors: Rochana Chaturvedi, Peyman Baghershahi, Sourav Medya, Barbara Di Eugenio
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.18085
Pdf URL: https://arxiv.org/pdf/2503.18085
Copy Paste: [[2503.18085]] Temporal Relation Extraction in Clinical Texts: A Span-based Graph Transformer Approach(https://arxiv.org/abs/2503.18085)
Keywords: extraction, transformer
Abstract: Temporal information extraction from unstructured text is essential for contextualizing events and deriving actionable insights, particularly in the medical domain. We address the task of extracting clinical events and their temporal relations using the well-studied I2B2 2012 Temporal Relations Challenge corpus. This task is inherently challenging due to complex clinical language, long documents, and sparse annotations. We introduce GRAPHTREX, a novel method integrating span-based entity-relation extraction, clinical large pre-trained language models (LPLMs), and Heterogeneous Graph Transformers (HGT) to capture local and global dependencies. Our HGT component facilitates information propagation across the document through innovative global landmarks that bridge distant entities. Our method improves the state-of-the-art with 5.5% improvement in the tempeval $F_1$ score over the previous best and up to 8.9% improvement on long-range relations, which presents a formidable challenge. This work not only advances temporal information extraction but also lays the groundwork for improved diagnostic and prognostic models through enhanced temporal reasoning.

Title: $D^2LoRA$: Data-Driven LoRA Initialization for Low Resource Tasks

Authors: Javad SeraJ, Mohammad Mahdi Mohajeri, Mohammad Javad Dousti
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.18089
Pdf URL: https://arxiv.org/pdf/2503.18089
Copy Paste: [[2503.18089]] $D^2LoRA$: Data-Driven LoRA Initialization for Low Resource Tasks(https://arxiv.org/abs/2503.18089)
Keywords: large language model
Abstract: Tuning large language models is essential for optimizing their performance across diverse applications, particularly in scenarios with limited data availability. Tuning large language models in scarce data scenarios is crucial, particularly given that the convergence speed of the LoRA method is lower than that of full fine-tuning. In this paper, we present an analysis of post-training methods including Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Odds Ratio Preference Optimization (ORPO) within the context of task-specific learning using the LoRA method. Next we introduce $D^2LoRA$, a data-driven approach for initializing LoRA metrics that enhances training efficiency, especially in limited-data settings. Our experiments compare $D^2LoRA$ with vanilla LoRA in terms of performance and catastrophic forgetting under extremely data-constrained conditions. The results demonstrate that $D^2LoRA$ achieves a 1% improvement on the GSM8K benchmark and a 2-point improvement in ROUGE score in title generation tasks. $D^2LoRA$ facilitates the adaptation of LLMs to multiple tasks even when task-specific data is scarce, thereby reducing training expenses and offering data cost.

Title: M3Net: Multimodal Multi-task Learning for 3D Detection, Segmentation, and Occupancy Prediction in Autonomous Driving

Authors: Xuesong Chen, Shaoshuai Shi, Tao Ma, Jingqiu Zhou, Simon See, Ka Chun Cheung, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18100
Pdf URL: https://arxiv.org/pdf/2503.18100
Copy Paste: [[2503.18100]] M3Net: Multimodal Multi-task Learning for 3D Detection, Segmentation, and Occupancy Prediction in Autonomous Driving(https://arxiv.org/abs/2503.18100)
Keywords: transformer, segmentation
Abstract: The perception system for autonomous driving generally needs to handle multiple diverse sub-tasks. However, current algorithms typically tackle individual sub-tasks separately, which leads to low efficiency when aiming at obtaining full-perception results. Some multi-task learning methods try to unify multiple tasks with one model, but do not solve the conflicts in multi-task learning. In this paper, we introduce M3Net, a novel multimodal and multi-task network that simultaneously tackles detection, segmentation, and 3D occupancy prediction for autonomous driving and achieves superior performance to single-task models. M3Net takes multimodal data as input and performs multiple tasks via query-token interactions. To enhance the integration of multi-modal features for multi-task learning, we first propose the Modality-Adaptive Feature Integration (MAFI) module, which enables single-modality features to predict channel-wise attention weights for their high-performing tasks, respectively. Based on integrated features, we then develop task-specific query initialization strategies to accommodate the needs of detection/segmentation and 3D occupancy prediction. Leveraging the properly initialized queries, a shared decoder transforms queries and BEV features layer-wise, facilitating multi-task learning. Furthermore, we propose a Task-oriented Channel Scaling (TCS) module in the decoder to mitigate conflicts between optimizing for different tasks. Additionally, our proposed multi-task querying and TCS module support both Transformer-based decoder and Mamba-based decoder, demonstrating its flexibility to different architectures. M3Net achieves state-of-the-art multi-task learning performance on the nuScenes benchmarks.

Title: PanoGS: Gaussian-based Panoptic Segmentation for 3D Open Vocabulary Scene Understanding

Authors: Hongjia Zhai, Hai Li, Zhenzhe Li, Xiaokun Pan, Yijia He, Guofeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18107
Pdf URL: https://arxiv.org/pdf/2503.18107
Copy Paste: [[2503.18107]] PanoGS: Gaussian-based Panoptic Segmentation for 3D Open Vocabulary Scene Understanding(https://arxiv.org/abs/2503.18107)
Keywords: segmentation
Abstract: Recently, 3D Gaussian Splatting (3DGS) has shown encouraging performance for open vocabulary scene understanding tasks. However, previous methods cannot distinguish 3D instance-level information, as they usually predict a heatmap between the scene feature and text query. In this paper, we propose PanoGS, a novel and effective 3D panoptic open vocabulary scene understanding approach. Technically, to learn accurate 3D language features that can scale to large indoor scenarios, we adopt the pyramid tri-plane to model the latent continuous parametric feature space and use a 3D feature decoder to regress the multi-view fused 2D feature cloud. Besides, we propose language-guided graph cuts that synergistically leverage reconstructed geometry and learned language cues to group 3D Gaussian primitives into a set of super-primitives. To obtain 3D consistent instances, we perform graph clustering based segmentation with SAM-guided edge affinity computation between different super-primitives. Extensive experiments on widely used datasets show better or more competitive performance on 3D panoptic open vocabulary scene understanding. Project page: this https URL.

Title: Detection of Somali-written Fake News and Toxic Messages on the Social Media Using Transformer-based Language Models

Authors: Muhidin A. Mohamed, Shuab D. Ahmed, Yahye A. Isse, Hanad M. Mohamed, Fuad M. Hassan, Houssein A. Assowe
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.18117
Pdf URL: https://arxiv.org/pdf/2503.18117
Copy Paste: [[2503.18117]] Detection of Somali-written Fake News and Toxic Messages on the Social Media Using Transformer-based Language Models(https://arxiv.org/abs/2503.18117)
Keywords: transformer
Abstract: The fact that everyone with a social media account can create and share content, and the increasing public reliance on social media platforms as a news and information source bring about significant challenges such as misinformation, fake news, harmful content, etc. Although human content moderation may be useful to an extent and used by these platforms to flag posted materials, the use of AI models provides a more sustainable, scalable, and effective way to mitigate these harmful contents. However, low-resourced languages such as the Somali language face limitations in AI automation, including scarce annotated training datasets and lack of language models tailored to their unique linguistic characteristics. This paper presents part of our ongoing research work to bridge some of these gaps for the Somali language. In particular, we created two human-annotated social-media-sourced Somali datasets for two downstream applications, fake news and toxicity classification, and developed a transformer-based monolingual Somali language model (named SomBERTa) -- the first of its kind to the best of our knowledge. SomBERTa is then fine-tuned and evaluated on toxic content, fake news and news topic classification datasets. Comparative evaluation analysis of the proposed model against related multilingual models (e.g., AfriBERTa, AfroXLMR, etc) demonstrated that SomBERTa consistently outperformed these comparators in both fake news and toxic content classification tasks while achieving the best average accuracy (87.99%) across all tasks. This research contributes to Somali NLP by offering a foundational language model and a replicable framework for other low-resource languages, promoting digital and AI inclusivity and linguistic diversity.

Title: End-to-End Implicit Neural Representations for Classification

Authors: Alexander Gielisse, Jan van Gemert
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18123
Pdf URL: https://arxiv.org/pdf/2503.18123
Copy Paste: [[2503.18123]] End-to-End Implicit Neural Representations for Classification(https://arxiv.org/abs/2503.18123)
Keywords: transformer
Abstract: Implicit neural representations (INRs) such as NeRF and SIREN encode a signal in neural network parameters and show excellent results for signal reconstruction. Using INRs for downstream tasks, such as classification, is however not straightforward. Inherent symmetries in the parameters pose challenges and current works primarily focus on designing architectures that are equivariant to these symmetries. However, INR-based classification still significantly under-performs compared to pixel-based methods like CNNs. This work presents an end-to-end strategy for initializing SIRENs together with a learned learning-rate scheme, to yield representations that improve classification accuracy. We show that a simple, straightforward, Transformer model applied to a meta-learned SIREN, without incorporating explicit symmetry equivariances, outperforms the current state-of-the-art. On the CIFAR-10 SIREN classification task, we improve the state-of-the-art without augmentations from 38.8% to 59.6%, and from 63.4% to 64.7% with augmentations. We demonstrate scalability on the high-resolution Imagenette dataset achieving reasonable reconstruction quality with a classification accuracy of 60.8% and are the first to do INR classification on the full ImageNet-1K dataset where we achieve a SIREN classification performance of 23.6%. To the best of our knowledge, no other SIREN classification approach has managed to set a classification baseline for any high-resolution image dataset. Our code is available at this https URL
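To make concrete what "encoding a signal in neural network parameters" means for a SIREN, here is a minimal pure-Python sketch of a SIREN forward pass and of flattening its weights into a single vector. It assumes the standard SIREN formulation (sine activations with frequency factor w0 = 30 and the usual uniform initialization bounds). The function names, layer sizes, and zero-bias simplification are choices made for this example; the paper's meta-learned initialization, learned learning-rate scheme, and Transformer classifier are not shown.

```python
import math
import random


def init_siren(sizes, w0=30.0, seed=0):
    """Create SIREN layers as (weights, biases) pairs with the usual init bounds."""
    rng = random.Random(seed)
    layers = []
    for i, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:])):
        # First layer: U(-1/n_in, 1/n_in); later layers: U(-sqrt(6/n_in)/w0, +...).
        bound = 1.0 / n_in if i == 0 else math.sqrt(6.0 / n_in) / w0
        weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out  # simplification: zero biases
        layers.append((weights, biases))
    return layers


def siren_forward(layers, coords, w0=30.0):
    """Map one coordinate (e.g. a pixel position in [-1, 1]^2) to signal values."""
    h = list(coords)
    for idx, (weights, biases) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, h)) + b
             for row, b in zip(weights, biases)]
        is_last = idx == len(layers) - 1
        # Sine activation on hidden layers, linear output layer.
        h = z if is_last else [math.sin(w0 * zi) for zi in z]
    return h


def flatten_params(layers):
    """Concatenate all weights and biases: the vector a downstream model would see."""
    flat = []
    for weights, biases in layers:
        for row in weights:
            flat.extend(row)
        flat.extend(biases)
    return flat


if __name__ == "__main__":
    siren = init_siren([2, 16, 16, 3])  # (x, y) -> (r, g, b)
    print(siren_forward(siren, (0.5, -0.5)))
    print(len(flatten_params(siren)))
```

In an INR classification setting, one such network would be fitted per image, and the classifier would operate on the fitted parameters rather than on pixels; how the parameters are tokenized for the Transformer is not specified in the abstract.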

Title: GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks

Authors: Varvara Krechetova, Denis Kochedykov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18129
Pdf URL: https://arxiv.org/pdf/2503.18129
Copy Paste: [[2503.18129]] GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks(https://arxiv.org/abs/2503.18129)
Keywords: large language model
Abstract: In this paper, we establish a benchmark for evaluating large language models (LLMs) on multi-step geospatial tasks relevant to commercial GIS practitioners. We assess seven leading commercial LLMs (Sonnet 3.5 and 3.7, Haiku 3.5, Gemini 2.0, GPT-4o, GPT-4o mini, and o3-mini) using a simple tool-calling agent equipped with 23 geospatial functions. Our benchmark comprises tasks across four categories of increasing complexity, with both solvable and intentionally unsolvable tasks to test hallucination rejection. We develop an LLM-as-Judge evaluation framework to compare agent solutions against reference implementations. Results show Sonnet 3.5 and GPT-4o achieve the best overall performance, with Claude models excelling on solvable tasks while OpenAI models better identify unsolvable scenarios. We observe significant differences in token usage, with Anthropic models consuming substantially more tokens than competitors. Common errors include misunderstanding geometrical relationships, relying on outdated knowledge, and inefficient data manipulation. The resulting benchmark set, evaluation framework, and data generation pipeline are released as open-source resources, providing one more standardized method for ongoing evaluation of LLMs for GeoAI.

Title: Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization

Authors: Juntao Dai, Taiye Chen, Yaodong Yang, Qian Zheng, Gang Pan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18130
Pdf URL: https://arxiv.org/pdf/2503.18130
Copy Paste: [[2503.18130]] Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization(https://arxiv.org/abs/2503.18130)
Keywords: large language model
Abstract: Reinforcement learning from human feedback (RLHF) is an effective method for aligning large language models (LLMs) with human values. However, reward over-optimization remains an open challenge leading to discrepancies between the performance of LLMs under the reward model and the true human objectives. A primary contributor to reward over-optimization is the extrapolation error that arises when the reward model evaluates out-of-distribution (OOD) responses. However, current methods still fail to prevent the increasing frequency of OOD response generation during the reinforcement learning (RL) process and are not effective at handling extrapolation errors from OOD responses. In this work, we propose the Behavior-Supported Policy Optimization (BSPO) method to mitigate the reward over-optimization issue. Specifically, we define behavior policy as the next token distribution of the reward training dataset to model the in-distribution (ID) region of the reward model. Building on this, we introduce the behavior-supported Bellman operator to regularize the value function, penalizing all OOD values without impacting the ID ones. Consequently, BSPO reduces the generation of OOD responses during the RL process, thereby avoiding overestimation caused by the reward model's extrapolation errors. Theoretically, we prove that BSPO guarantees a monotonic improvement of the supported policy until convergence to the optimal behavior-supported policy. Empirical results from extensive experiments show that BSPO outperforms baselines in preventing reward over-optimization due to OOD evaluation and finding the optimal ID policy.

Title: MathAgent: Leveraging a Mixture-of-Math-Agent Framework for Real-World Multimodal Mathematical Error Detection

Authors: Yibo Yan, Shen Wang, Jiahao Huo, Philip S. Yu, Xuming Hu, Qingsong Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.18132
Pdf URL: https://arxiv.org/pdf/2503.18132
Copy Paste: [[2503.18132]] MathAgent: Leveraging a Mixture-of-Math-Agent Framework for Real-World Multimodal Mathematical Error Detection(https://arxiv.org/abs/2503.18132)
Keywords: large language model
Abstract: Mathematical error detection in educational settings presents a significant challenge for Multimodal Large Language Models (MLLMs), requiring a sophisticated understanding of both visual and textual mathematical content along with complex reasoning capabilities. Though effective in mathematical problem-solving, MLLMs often struggle with the nuanced task of identifying and categorizing student errors in multimodal mathematical contexts. Therefore, we introduce MathAgent, a novel Mixture-of-Math-Agent framework designed specifically to address these challenges. Our approach decomposes error detection into three phases, each handled by a specialized agent: an image-text consistency validator, a visual semantic interpreter, and an integrative error analyzer. This architecture enables more accurate processing of mathematical content by explicitly modeling relationships between multimodal problems and student solution steps. We evaluate MathAgent on real-world educational data, demonstrating approximately 5% higher accuracy in error step identification and 3% improvement in error categorization compared to baseline models. Besides, MathAgent has been successfully deployed in an educational platform that has served over one million K-12 students, achieving nearly 90% student satisfaction while generating significant cost savings by reducing manual error detection.

Title: An Image-like Diffusion Method for Human-Object Interaction Detection

Authors: Xiaofei Hui, Haoxuan Qu, Hossein Rahmani, Jun Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18134
Pdf URL: https://arxiv.org/pdf/2503.18134
Copy Paste: [[2503.18134]] An Image-like Diffusion Method for Human-Object Interaction Detection(https://arxiv.org/abs/2503.18134)
Keywords: diffusion
Abstract: Human-object interaction (HOI) detection often faces high levels of ambiguity and indeterminacy, as the same interaction can appear vastly different across different human-object pairs. Additionally, the indeterminacy can be further exacerbated by issues such as occlusions and cluttered backgrounds. To handle such a challenging task, in this work, we begin with a key observation: the output of HOI detection for each human-object pair can be recast as an image. Thus, inspired by the strong image generation capabilities of image diffusion models, we propose a new framework, HOI-IDiff. In HOI-IDiff, we tackle HOI detection from a novel perspective, using an Image-like Diffusion process to generate HOI detection outputs as images. Furthermore, recognizing that our recast images differ in certain properties from natural images, we enhance our framework with a customized HOI diffusion process and a slice patchification model architecture, which are specifically tailored to generate our recast "HOI images". Extensive experiments demonstrate the efficacy of our framework.

Title: MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation

Authors: Jiaxin Huang, Runnan Chen, Ziwen Li, Zhengqing Gao, Xiao He, Yandong Guo, Mingming Gong, Tongliang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18135
Pdf URL: https://arxiv.org/pdf/2503.18135
Copy Paste: [[2503.18135]] MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation(https://arxiv.org/abs/2503.18135)
Keywords: large language model, segmentation
Abstract: Reasoning segmentation aims to segment target objects in complex scenes based on human intent and spatial reasoning. While recent multimodal large language models (MLLMs) have demonstrated impressive 2D image reasoning segmentation, adapting these capabilities to 3D scenes remains underexplored. In this paper, we introduce MLLM-For3D, a simple yet effective framework that transfers knowledge from 2D MLLMs to 3D scene understanding. Specifically, we utilize MLLMs to generate multi-view pseudo segmentation masks and corresponding text embeddings, then unproject 2D masks into 3D space and align them with the text embeddings. The primary challenge lies in the absence of 3D context and spatial consistency across multiple views, causing the model to hallucinate objects that do not exist and fail to target objects consistently. Training the 3D model with such irrelevant objects leads to performance degradation. To address this, we introduce a spatial consistency strategy to enforce that segmentation masks remain coherent in the 3D space, effectively capturing the geometry of the scene. Moreover, we develop a Token-for-Query approach for multimodal semantic alignment, enabling consistent identification of the same object across different views. Extensive evaluations on various challenging indoor scene benchmarks demonstrate that, even without any labeled 3D training data, MLLM-For3D outperforms existing 3D reasoning segmentation methods, effectively interpreting user intent, understanding 3D scenes, and reasoning about spatial relationships.

Title: TCFG: Tangential Damping Classifier-free Guidance

Authors: Mingi Kwon, Shin seong Kim, Jaeseok Jeong, Yi Ting Hsiao, Youngjung Uh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18137
Pdf URL: https://arxiv.org/pdf/2503.18137
Copy Paste: [[2503.18137]] TCFG: Tangential Damping Classifier-free Guidance(https://arxiv.org/abs/2503.18137)
Keywords: diffusion
Abstract: Diffusion models have achieved remarkable success in text-to-image synthesis, largely attributed to the use of classifier-free guidance (CFG), which enables high-quality, condition-aligned image generation. CFG combines the conditional score (e.g., text-conditioned) with the unconditional score to control the output. However, the unconditional score is in charge of estimating the transition between manifolds of adjacent timesteps from $x_t$ to $x_{t-1}$, which may inadvertently interfere with the trajectory toward the specific condition. In this work, we introduce a novel approach that leverages a geometric perspective on the unconditional score to enhance CFG performance when conditional scores are available. Specifically, we propose a method that filters the singular vectors of both conditional and unconditional scores using singular value decomposition. This filtering process aligns the unconditional score with the conditional score, thereby refining the sampling trajectory to stay closer to the manifold. Our approach improves image quality with negligible additional computation. We provide deeper insights into the score function behavior in diffusion models and present a practical technique for achieving more accurate and contextually coherent image synthesis.
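Note: For reference, the CFG combination mentioned in this abstract is commonly written in the form below. This is a sketch of the standard formulation, not the TCFG method itself; the symbols $\epsilon_\theta$ (the model's noise or score estimate), $c$ (the condition), $\varnothing$ (the null condition), and $w$ (the guidance scale) are notation assumed here and do not appear in the abstract.

```latex
\tilde{\epsilon}_\theta(x_t, c)
  = \epsilon_\theta(x_t, \varnothing)
  + w \,\bigl( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \bigr)
```

Under this assumed notation, $w = 1$ recovers the purely conditional estimate, and $w > 1$ extrapolates away from the unconditional estimate toward the condition.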

Title: AGIR: Assessing 3D Gait Impairment with Reasoning based on LLMs

Authors: Diwei Wang, Cédric Bobenrieth, Hyewon Seo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18141
Pdf URL: https://arxiv.org/pdf/2503.18141
Copy Paste: [[2503.18141]] AGIR: Assessing 3D Gait Impairment with Reasoning based on LLMs(https://arxiv.org/abs/2503.18141)
Keywords: robust, interpretability, generative, large language model
Abstract: Assessing gait impairment plays an important role in early diagnosis, disease monitoring, and treatment evaluation for neurodegenerative diseases. Despite its widespread use in clinical practice, it is limited by subjectivity and a lack of precision. While recent deep learning-based approaches have consistently improved classification accuracies, they often lack interpretability, hindering their utility in clinical decision-making. To overcome these challenges, we introduce AGIR, a novel pipeline consisting of a pre-trained VQ-VAE motion tokenizer and a subsequent Large Language Model (LLM) fine-tuned over pairs of motion tokens and Chain-of-Thought (CoT) reasonings. To fine-tune an LLM for pathological gait analysis, we first introduce a multimodal dataset by adding rationales dedicated to MDS-UPDRS gait score assessment to an existing PD gait dataset. We then introduce a two-stage supervised fine-tuning (SFT) strategy to enhance the LLM's motion comprehension with pathology-specific knowledge. This strategy includes: 1) a generative stage that aligns gait motions with analytic descriptions through bidirectional motion-description generation, 2) a reasoning stage that integrates logical Chain-of-Thought (CoT) reasoning for impairment assessment with UPDRS gait score. Validation on an existing dataset and comparisons with state-of-the-art methods confirm the robustness and accuracy of our pipeline, demonstrating its ability to assign gait impairment scores from motion input with clinically meaningful rationales.

Title: LocDiffusion: Identifying Locations on Earth by Diffusing in the Hilbert Space

Authors: Zhangyu Wang, Jielu Zhang, Zhongliang Zhou, Qian Cao, Nemin Wu, Zeping Liu, Lan Mu, Yang Song, Yiqun Xie, Ni Lao, Gengchen Mai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18142
Pdf URL: https://arxiv.org/pdf/2503.18142
Copy Paste: [[2503.18142]] LocDiffusion: Identifying Locations on Earth by Diffusing in the Hilbert Space(https://arxiv.org/abs/2503.18142)
Keywords: diffusion, generative
Abstract: Image geolocalization is a fundamental yet challenging task, aiming at inferring the geolocation on Earth where an image is taken. Existing methods approach it either via grid-based classification or via image retrieval. Their performance significantly suffers when the spatial distribution of test images does not align with such choices. To address these limitations, we propose to leverage diffusion as a mechanism for image geolocalization. To avoid the problematic manifold reprojection step in diffusion, we developed a novel spherical positional encoding-decoding framework, which encodes points on a spherical surface (e.g., geolocations on Earth) into a Hilbert space of Spherical Harmonics coefficients and decodes points (geolocations) by mode-seeking. We call this type of position encoding Spherical Harmonics Dirac Delta (SHDD) Representation. We also propose a novel SirenNet-based architecture called CS-UNet to learn the conditional backward process in the latent SHDD space by minimizing a latent KL-divergence loss. We train a conditional latent diffusion model called LocDiffusion that generates geolocations under the guidance of images -- to the best of our knowledge, the first generative model for image geolocalization by diffusing geolocation information in a hidden location embedding space. We evaluate our method against SOTA image geolocalization baselines. LocDiffusion achieves competitive geolocalization performance and demonstrates significantly stronger generalizability to unseen geolocations.

Title: LongDiff: Training-Free Long Video Generation in One Go

Authors: Zhuoling Li, Hossein Rahmani, Qiuhong Ke, Jun Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18150
Pdf URL: https://arxiv.org/pdf/2503.18150
Copy Paste: [[2503.18150]] LongDiff: Training-Free Long Video Generation in One Go(https://arxiv.org/abs/2503.18150)
Keywords: diffusion
Abstract: Video diffusion models have recently achieved remarkable results in video generation. Despite their encouraging performance, most of these models are mainly designed and trained for short video generation, leading to challenges in maintaining temporal consistency and visual details in long video generation. In this paper, we propose LongDiff, a novel training-free method consisting of two carefully designed components, Position Mapping (PM) and Informative Frame Selection (IFS), to tackle two key challenges that hinder short-to-long video generation generalization: temporal position ambiguity and information dilution. Our LongDiff unlocks the potential of off-the-shelf video diffusion models to achieve high-quality long video generation in one go. Extensive experiments demonstrate the efficacy of our method.

Title: Decorum: A Language-Based Approach For Style-Conditioned Synthesis of Indoor 3D Scenes

Authors: Kelly O. Marshall, Omid Poursaeed, Sergiu Oprea, Amit Kumar, Anushrut Jignasu, Chinmay Hegde, Yilei Li, Rakesh Ranjan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18155
Pdf URL: https://arxiv.org/pdf/2503.18155
Copy Paste: [[2503.18155]] Decorum: A Language-Based Approach For Style-Conditioned Synthesis of Indoor 3D Scenes(https://arxiv.org/abs/2503.18155)
Keywords: large language model
Abstract: 3D indoor scene generation is an important problem for the design of digital and real-world environments. To automate this process, a scene generation model should be able to not only generate plausible scene layouts, but also take into consideration visual features and style preferences. Existing methods for this task exhibit very limited control over these attributes, only allowing text inputs in the form of simple object-level descriptions or pairwise spatial relationships. Our proposed method Decorum enables users to control the scene generation process with natural language by adopting language-based representations at each stage. This enables us to harness recent advancements in Large Language Models (LLMs) to model language-to-language mappings. In addition, we show that using a text-based representation allows us to select furniture for our scenes using a novel object retrieval method based on multimodal LLMs. Evaluations on the benchmark 3D-FRONT dataset show that our methods achieve improvements over existing work in text-conditioned scene synthesis and object retrieval.

Title: DiffusionTalker: Efficient and Compact Speech-Driven 3D Talking Head via Personalizer-Guided Distillation

Authors: Peng Chen, Xiaobao Wei, Ming Lu, Hui Chen, Feng Tian
Subjects: cs.CV, cs.AI, cs.SD
Abstract URL: https://arxiv.org/abs/2503.18159
Pdf URL: https://arxiv.org/pdf/2503.18159
Copy Paste: [[2503.18159]] DiffusionTalker: Efficient and Compact Speech-Driven 3D Talking Head via Personalizer-Guided Distillation(https://arxiv.org/abs/2503.18159)
Keywords: diffusion
Abstract: Real-time speech-driven 3D facial animation has attracted attention in academia and industry. Traditional methods mainly focus on learning a deterministic mapping from speech to animation. Recent approaches have started to consider the nondeterministic nature of speech-driven 3D face animation and employ diffusion models for the task. Existing diffusion-based methods can improve the diversity of facial animation. However, personalized speaking styles that convey accurate lip language are still lacking, and efficiency and compactness still need to be improved. In this work, we propose DiffusionTalker to address the above limitations via personalizer-guided distillation. In terms of personalization, we introduce a contrastive personalizer that learns identity and emotion embeddings to capture speaking styles from audio. We further propose a personalizer enhancer during distillation to enhance the influence of embeddings on facial animation. For efficiency, we use iterative distillation to reduce the steps required for animation generation and achieve more than 8x speedup in inference. To achieve compactness, we distill the large teacher model into a smaller student model, reducing our model's storage by 86.4% while minimizing performance loss. After distillation, users can derive their identity and emotion embeddings from audio to quickly create personalized animations that reflect specific speaking styles. Extensive experiments are conducted to demonstrate that our method outperforms state-of-the-art methods. The code will be released at: this https URL.

Title: Evaluating Negative Sampling Approaches for Neural Topic Models

Authors: Suman Adhya, Avishek Lahiri, Debarshi Kumar Sanyal, Partha Pratim Das
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.18167
Pdf URL: https://arxiv.org/pdf/2503.18167
Copy Paste: [[2503.18167]] Evaluating Negative Sampling Approaches for Neural Topic Models(https://arxiv.org/abs/2503.18167)
Keywords: robust
Abstract: Negative sampling has emerged as an effective technique that enables deep learning models to learn better representations by introducing the paradigm of learn-to-compare. The goal of this approach is to add robustness to deep learning models to learn better representation by comparing the positive samples against the negative ones. Despite its numerous demonstrations in various areas of computer vision and natural language processing, a comprehensive study of the effect of negative sampling in an unsupervised domain like topic modeling has not been well explored. In this paper, we present a comprehensive analysis of the impact of different negative sampling strategies on neural topic models. We compare the performance of several popular neural topic models by incorporating a negative sampling technique in the decoder of variational autoencoder-based neural topic models. Experiments on four publicly available datasets demonstrate that integrating negative sampling into topic models results in significant enhancements across multiple aspects, including improved topic coherence, richer topic diversity, and more accurate document classification. Manual evaluations also indicate that the inclusion of negative sampling into neural topic models enhances the quality of the generated topics. These findings highlight the potential of negative sampling as a valuable tool for advancing the effectiveness of neural topic models.

Title: Self-Attention Diffusion Models for Zero-Shot Biomedical Image Segmentation: Unlocking New Frontiers in Medical Imaging

Authors: Abderrachid Hamrani, Anuradha Godavarty
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18170
Pdf URL: https://arxiv.org/pdf/2503.18170
Copy Paste: [[2503.18170]] Self-Attention Diffusion Models for Zero-Shot Biomedical Image Segmentation: Unlocking New Frontiers in Medical Imaging(https://arxiv.org/abs/2503.18170)
Keywords: diffusion, generative, segmentation
Abstract: Producing high-quality segmentation masks for medical images is a fundamental challenge in biomedical image analysis. Recent research has explored large-scale supervised training to enable segmentation across various medical imaging modalities and unsupervised training to facilitate segmentation without dense annotations. However, constructing a model capable of segmenting diverse medical images in a zero-shot manner without any annotations remains a significant hurdle. This paper introduces the Attention Diffusion Zero-shot Unsupervised System (ADZUS), a novel approach that leverages self-attention diffusion models for zero-shot biomedical image segmentation. ADZUS harnesses the intrinsic capabilities of pre-trained diffusion models, utilizing their generative and discriminative potentials to segment medical images without requiring annotated training data or prior domain-specific knowledge. The ADZUS architecture is detailed, with its integration of self-attention mechanisms that facilitate context-aware and detail-sensitive segmentations being highlighted. Experimental results across various medical imaging datasets, including skin lesion segmentation, chest X-ray infection segmentation, and white blood cell segmentation, reveal that ADZUS achieves state-of-the-art performance. Notably, ADZUS reached Dice scores ranging from 88.7% to 92.9% and IoU scores from 66.3% to 93.3% across different segmentation tasks, demonstrating significant improvements in handling novel, unseen medical imagery. It is noteworthy that while ADZUS demonstrates high effectiveness, it demands substantial computational resources and extended processing times. The model's efficacy in zero-shot settings underscores its potential to reduce reliance on costly annotations and seamlessly adapt to new medical imaging tasks, thereby expanding the diagnostic capabilities of AI-driven medical imaging technologies.
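Note: The Dice and IoU scores quoted in this abstract are standard overlap metrics for binary masks. The sketch below shows how they are usually computed. It is an illustration only, not the ADZUS evaluation code; the function name `dice_and_iou` and the toy masks are hypothetical, and masks are assumed to be flat lists of 0/1 values of equal length.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two flat binary masks of equal length."""
    # Pixels labeled 1 in both masks.
    intersection = sum(p & t for p, t in zip(pred, target))
    pred_area = sum(pred)
    target_area = sum(target)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks are empty: treat this as perfect agreement (a convention).
        return 1.0, 1.0
    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Hypothetical toy masks: 2 overlapping pixels, 3 pixels set in each mask.
pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
dice, iou = dice_and_iou(pred, target)
```

For a single pair of masks the two metrics are tied by Dice = 2·IoU / (1 + IoU), so Dice is never below IoU for that pair. Ranges reported across different tasks, as in the abstract, need not follow that ordering.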

Title: Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering

Authors: Zixin Chen, Sicheng Song, Kashun Shum, Yanna Lin, Rui Sheng, Huamin Qu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18172
Pdf URL: https://arxiv.org/pdf/2503.18172
Copy Paste: [[2503.18172]] Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering(https://arxiv.org/abs/2503.18172)
Keywords: large language model
Abstract: Misleading chart visualizations, which intentionally manipulate data representations to support specific claims, can distort perceptions and lead to incorrect conclusions. Despite decades of research, misleading visualizations remain a widespread and pressing issue. Recent advances in multimodal large language models (MLLMs) have demonstrated strong chart comprehension capabilities, yet no existing work has systematically evaluated their ability to detect and interpret misleading charts. This paper introduces the Misleading Chart Question Answering (Misleading ChartQA) Benchmark, a large-scale multimodal dataset designed to assess MLLMs in identifying and reasoning about misleading charts. It contains over 3,000 curated examples, covering 21 types of misleaders and 10 chart types. Each example includes standardized chart code, CSV data, and multiple-choice questions with labeled explanations, validated through multi-round MLLM checks and exhaustive expert human review. We benchmark 16 state-of-the-art MLLMs on our dataset, revealing their limitations in identifying visually deceptive practices. We also propose a novel pipeline that detects and localizes misleaders, enhancing MLLMs' accuracy in misleading chart interpretation. Our work establishes a foundation for advancing MLLM-driven misleading chart comprehension. We publicly release the sample dataset to support further research in this critical area.

Title: Literature Review: Cyber Security Monitoring in Maritime

Authors: Risto Vaarandi, Leonidas Tsiopoulos, Gabor Visky, Muaan Ur Rehman, Hayretdin Bahsi
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.18173
Pdf URL: https://arxiv.org/pdf/2503.18173
Copy Paste: [[2503.18173]] Literature Review: Cyber Security Monitoring in Maritime(https://arxiv.org/abs/2503.18173)
Keywords: security, attack
Abstract: In recent years, many cyber incidents have happened in the maritime sector, targeting the information technology (IT) and operational technology (OT) infrastructure. Although several systematization-of-knowledge papers have been published in the maritime field, none of the previous studies has focused on cyber security monitoring, which aims at timely detection of cyber attacks with automated methods. The current article addresses this research gap and surveys the methods, algorithms, tools and architectures used for cyber security monitoring in the maritime sector. For the survey, a systematic literature review of cyber security monitoring studies is conducted in this article, following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) protocol. The first contribution of this article is the bibliometric analysis of related literature and the identification of the main research themes in previous works. For that purpose, our article presents a taxonomy for existing studies which highlights the main properties of maritime cyber security monitoring research. The second contribution of this article is an in-depth analysis of previous works and the identification of research gaps and limitations in existing literature. Based on our findings, we outline future research directions for cyber security monitoring in the maritime field.

Title: Training A Neural Network For Partially Occluded Road Sign Identification In The Context Of Autonomous Vehicles

Authors: Gulnaz Gimaletdinova, Dim Shaiakhmetov, Madina Akpaeva, Mukhammadmuso Abduzhabbarov, Kadyrmamat Momunov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18177
Pdf URL: https://arxiv.org/pdf/2503.18177
Copy Paste: [[2503.18177]] Training A Neural Network For Partially Occluded Road Sign Identification In The Context Of Autonomous Vehicles(https://arxiv.org/abs/2503.18177)
Keywords: robust
Abstract: The increasing number of autonomous vehicles and the rapid development of computer vision technologies underscore the particular importance of conducting research on the accuracy of traffic sign recognition. Numerous studies in this field have already achieved significant results, demonstrating high effectiveness in addressing traffic sign recognition tasks. However, the task becomes considerably more complex when a sign is partially obscured by surrounding objects, such as tree branches, billboards, or other elements of the urban environment. In our study, we investigated how partial occlusion of traffic signs affects their recognition. For this purpose, we collected a dataset comprising 5,746 images, including both fully visible and partially occluded signs, and made it publicly available. Using this dataset, we compared the performance of our custom convolutional neural network (CNN), which achieved 96% accuracy, with models trained using transfer learning. The best result was obtained by VGG16 with full layer unfreezing, reaching 99% accuracy. Additional experiments revealed that models trained solely on fully visible signs lose effectiveness when recognizing occluded signs. This highlights the critical importance of incorporating real-world data with partial occlusion into training sets to ensure robust model performance in complex practical scenarios and to enhance the safety of autonomous driving.

Title: Causality-Aware Next Location Prediction Framework based on Human Mobility Stratification

Authors: Xiaojie Yang, Zipei Fan, Hangli Ge, Takashi Michikata, Ryosuke Shibasaki, Noboru Koshizuka
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2503.18179
Pdf URL: https://arxiv.org/pdf/2503.18179
Copy Paste: [[2503.18179]] Causality-Aware Next Location Prediction Framework based on Human Mobility Stratification(https://arxiv.org/abs/2503.18179)
Keywords: interpretability
Abstract: Human mobility data are fused with multiple travel patterns and hidden spatiotemporal patterns are extracted by integrating user, location, and time information to improve next location prediction accuracy. In existing next location prediction methods, different causal relationships that result from patterns in human mobility data are ignored, which leads to confounding information that can have a negative effect on predictions. Therefore, this study introduces a causality-aware framework for next location prediction, focusing on human mobility stratification for travel patterns. In our research, a novel causal graph is developed that describes the relationships between various input variables. We use counterfactuals to enhance the indirect effects in our causal graph for specific travel patterns: non-anchor targeted travels. The proposed framework is designed as a plug-and-play module that integrates multiple next location prediction paradigms. We tested our proposed framework using several state-of-the-art models and human mobility datasets, and the results reveal that the proposed module improves the prediction performance. In addition, we provide results from the ablation study and quantitative study to demonstrate the soundness of our causal graph and its ability to further enhance the interpretability of the current next location prediction models.

Title: Exploring Topic Trends in COVID-19 Research Literature using Non-Negative Matrix Factorization

Authors: Divya Patel, Vansh Parikh, Om Patel, Agam Shah, Bhaskar Chaudhury
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.18182
Pdf URL: https://arxiv.org/pdf/2503.18182
Copy Paste: [[2503.18182]] Exploring Topic Trends in COVID-19 Research Literature using Non-Negative Matrix Factorization(https://arxiv.org/abs/2503.18182)
Keywords: robust, extraction
Abstract: In this work, we apply topic modeling using Non-Negative Matrix Factorization (NMF) on the COVID-19 Open Research Dataset (CORD-19) to uncover the underlying thematic structure and its evolution within the extensive body of COVID-19 research literature. NMF factorizes the document-term matrix into two non-negative matrices, effectively representing the topics and their distribution across the documents. This helps us see how strongly documents relate to topics and how topics relate to words. We describe the complete methodology which involves a series of rigorous pre-processing steps to standardize the available text data while preserving the context of phrases, and subsequently feature extraction using the term frequency-inverse document frequency (tf-idf), which assigns weights to words based on their frequency and rarity in the dataset. To ensure the robustness of our topic model, we conduct a stability analysis. This process assesses the stability scores of the NMF topic model for different numbers of topics, enabling us to select the optimal number of topics for our analysis. Through our analysis, we track the evolution of topics over time within the CORD-19 dataset. Our findings contribute to the understanding of the knowledge structure of the COVID-19 research landscape, providing a valuable resource for future research in this field.
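Note: The tf-idf weighting described in this abstract can be sketched in a few lines. This is one common variant (relative term frequency times the natural log of inverse document frequency), not necessarily the one the authors used; the function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # Frequent-in-document terms score high; common-in-corpus terms score low.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, already tokenized.
docs = [
    ["virus", "vaccine", "trial"],
    ["virus", "transmission", "mask"],
    ["virus", "vaccine", "dose"],
]
w = tfidf(docs)
```

In this toy corpus "virus" occurs in every document and therefore gets weight 0, while the rarer "trial" outweighs "vaccine" in the first document. The resulting weights would form the rows of the document-term matrix that NMF factorizes.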

Title: FROG: Fair Removal on Graphs

Authors: Ziheng Chen, Jiali Cheng, Gabriele Tolomei, Sijia Liu, Hadi Amiri, Yu Wang, Kaushiki Nag, Lu Lin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18197
Pdf URL: https://arxiv.org/pdf/2503.18197
Copy Paste: [[2503.18197]] FROG: Fair Removal on Graphs(https://arxiv.org/abs/2503.18197)
Keywords: privacy, fair
Abstract: As compliance with privacy regulations becomes increasingly critical, the growing demand for data privacy has highlighted the significance of machine unlearning in many real-world applications, such as social networks and recommender systems, many of which can be represented as graph-structured data. However, existing graph unlearning algorithms indiscriminately modify edges or nodes from well-trained models without considering the potential impact of such structural modifications on fairness. For example, forgetting links between nodes with different genders in a social network may exacerbate group disparities, leading to significant fairness concerns. To address these challenges, we propose a novel approach that jointly optimizes the graph structure and the corresponding model for fair unlearning tasks. Specifically, our approach rewires the graph to enhance unlearning efficiency by removing redundant edges that hinder forgetting while preserving fairness through targeted edge augmentation. Additionally, we introduce a worst-case evaluation mechanism to assess the reliability of fair unlearning performance. Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed approach in achieving superior unlearning outcomes.

Title: SimMotionEdit: Text-Based Human Motion Editing with Motion Similarity Prediction

Authors: Zhengyuan Li, Kai Cheng, Anindita Ghosh, Uttaran Bhattacharya, Liangyan Gui, Aniket Bera
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18211
Pdf URL: https://arxiv.org/pdf/2503.18211
Copy Paste: [[2503.18211]] SimMotionEdit: Text-Based Human Motion Editing with Motion Similarity Prediction(https://arxiv.org/abs/2503.18211)
Keywords: diffusion, transformer
Abstract: Text-based 3D human motion editing is a critical yet challenging task in computer vision and graphics. While training-free approaches have been explored, the recent release of the MotionFix dataset, which includes source-text-motion triplets, has opened new avenues for training, yielding promising results. However, existing methods struggle with precise control, often leading to misalignment between motion semantics and language instructions. In this paper, we introduce a related task, motion similarity prediction, and propose a multi-task training paradigm, where we train the model jointly on motion editing and motion similarity prediction to foster the learning of semantically meaningful representations. To complement this task, we design an advanced Diffusion-Transformer-based architecture that separately handles motion similarity prediction and motion editing. Extensive experiments demonstrate the state-of-the-art performance of our approach in both editing alignment and fidelity.

Title: LakotaBERT: A Transformer-based Model for Low Resource Lakota Language

Authors: Kanishka Parankusham, Rodrigue Rizk, KC Santosh
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.18212
Pdf URL: https://arxiv.org/pdf/2503.18212
Copy Paste: [[2503.18212]] LakotaBERT: A Transformer-based Model for Low Resource Lakota Language(https://arxiv.org/abs/2503.18212)
Keywords: transformer, large language model
Abstract: Lakota, a critically endangered language of the Sioux people in North America, faces significant challenges due to declining fluency among younger generations. This paper introduces LakotaBERT, the first large language model (LLM) tailored for Lakota, aiming to support language revitalization efforts. Our research has two primary objectives: (1) to create a comprehensive Lakota language corpus and (2) to develop a customized LLM for Lakota. We compiled a diverse corpus of 105K sentences in Lakota, English, and parallel texts from various sources, such as books and websites, emphasizing the cultural significance and historical context of the Lakota language. Utilizing the RoBERTa architecture, we pre-trained our model and conducted comparative evaluations against established models such as RoBERTa, BERT, and multilingual BERT. Initial results demonstrate a masked language modeling accuracy of 51% with a single ground truth assumption, showcasing performance comparable to that of English-based models. We also evaluated the model using additional metrics, such as precision and F1 score, to provide a comprehensive assessment of its capabilities. By integrating AI and linguistic methodologies, we aspire to enhance linguistic diversity and cultural resilience, setting a valuable precedent for leveraging technology in the revitalization of other endangered indigenous languages.

Title: Adaptive Rank Allocation: Speeding Up Modern Transformers with RaNA Adapters

Authors: Roberto Garcia, Jerry Liu, Daniel Sorvisto, Sabri Eyuboglu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.18216
Pdf URL: https://arxiv.org/pdf/2503.18216
Copy Paste: [[2503.18216]] Adaptive Rank Allocation: Speeding Up Modern Transformers with RaNA Adapters(https://arxiv.org/abs/2503.18216)
Keywords: robust, transformer, large language model
Abstract: Large Language Models (LLMs) are computationally intensive, particularly during inference. Neuron-adaptive techniques, which selectively activate neurons in Multi-Layer Perceptron (MLP) layers, offer some speedups but suffer from limitations in modern Transformers. These include reliance on sparse activations, incompatibility with attention layers, and the use of costly neuron masking techniques. To address these issues, we propose the Adaptive Rank Allocation framework and introduce the Rank and Neuron Allocator (RaNA) adapter. RaNA adapters leverage rank adapters, which operate on linear layers by applying both low-rank matrix decompositions and adaptive masking to efficiently allocate compute without depending on activation sparsity. This enables RaNA to be generally applied to MLPs and linear components of attention modules, while eliminating the need for expensive maskers found in neuron-adaptive methods. Notably, when compared to neuron adapters, RaNA improves perplexity by up to 7 points and increases accuracy by up to 8 percentage points when reducing FLOPs by approximately 44% in state-of-the-art Transformer architectures. These results position RaNA as a robust solution for improving inference efficiency in modern Transformer architectures.

Title: MammAlps: A multi-view video behavior monitoring dataset of wild mammals in the Swiss Alps

Authors: Valentin Gabeff, Haozhe Qi, Brendan Flaherty, Gencer Sumbül, Alexander Mathis, Devis Tuia
Subjects: cs.CV, cs.IR, q-bio.NC, q-bio.QM
Abstract URL: https://arxiv.org/abs/2503.18223
Pdf URL: https://arxiv.org/pdf/2503.18223
Copy Paste: [[2503.18223]] MammAlps: A multi-view video behavior monitoring dataset of wild mammals in the Swiss Alps(https://arxiv.org/abs/2503.18223)
Keywords: segmentation
Abstract: Monitoring wildlife is essential for ecology and ethology, especially in light of the increasing human impact on ecosystems. Camera traps have emerged as habitat-centric sensors enabling the study of wildlife populations at scale with minimal disturbance. However, the lack of annotated video datasets limits the development of powerful video understanding models needed to process the vast amount of fieldwork data collected. To advance research in wild animal behavior monitoring we present MammAlps, a multimodal and multi-view dataset of wildlife behavior monitoring from 9 camera-traps in the Swiss National Park. MammAlps contains over 14 hours of video with audio, 2D segmentation maps and 8.5 hours of individual tracks densely labeled for species and behavior. Based on 6135 single animal clips, we propose the first hierarchical and multimodal animal behavior recognition benchmark using audio, video and reference scene segmentation maps as inputs. Furthermore, we also propose a second ecology-oriented benchmark aiming at identifying activities, species, number of individuals and meteorological conditions from 397 multi-view and long-term ecological events, including false positive triggers. We advocate that both tasks are complementary and contribute to bridging the gap between machine learning and ecology. Code and data are available at: this https URL

Title: A Framework for Finding Local Saddle Points in Two-Player Zero-Sum Black-Box Games

Authors: Shubhankar Agarwal, Hamzah I. Khan, Sandeep P. Chinchali, David Fridovich-Keil
Subjects: cs.LG, cs.GT
Abstract URL: https://arxiv.org/abs/2503.18224
Pdf URL: https://arxiv.org/pdf/2503.18224
Copy Paste: [[2503.18224]] A Framework for Finding Local Saddle Points in Two-Player Zero-Sum Black-Box Games(https://arxiv.org/abs/2503.18224)
Keywords: generative
Abstract: Saddle point optimization is a critical problem employed in numerous real-world applications, including portfolio optimization, generative adversarial networks, and robotics. It has been extensively studied in cases where the objective function is known and differentiable. Existing work in black-box settings with unknown objectives that can only be sampled either assumes convexity-concavity in the objective to simplify the problem or operates with noisy gradient estimators. In contrast, we introduce a framework inspired by Bayesian optimization which utilizes Gaussian processes to model the unknown (potentially nonconvex-nonconcave) objective and requires only zeroth-order samples. Our approach frames the saddle point optimization problem as a two-level process which can flexibly integrate existing and novel approaches to this problem. The upper level of our framework produces a model of the objective function by sampling in promising locations, and the lower level of our framework uses the existing model to frame and solve a general-sum game to identify locations to sample. This lower level procedure can be designed in complementary ways, and we demonstrate the flexibility of our approach by introducing variants which appropriately trade off between factors like runtime, the cost of function evaluations, and the number of available initial samples. We experimentally demonstrate these algorithms on synthetic and realistic datasets in black-box nonconvex-nonconcave settings, showcasing their ability to efficiently locate local saddle points in these contexts.

Title: Decoupling Angles and Strength in Low-rank Adaptation

Authors: Massimo Bini, Leander Girrbach, Zeynep Akata
Subjects: cs.LG, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2503.18225
Pdf URL: https://arxiv.org/pdf/2503.18225
Copy Paste: [[2503.18225]] Decoupling Angles and Strength in Low-rank Adaptation(https://arxiv.org/abs/2503.18225)
Keywords: robust
Abstract: Parameter-Efficient FineTuning (PEFT) methods have recently gained significant popularity thanks to the widespread availability of large-scale pretrained models. These methods allow for quick adaptation to downstream tasks with minimal computational cost. However, popular finetuning methods such as LoRA exhibit limited robustness when it comes to hyperparameter choices or extended training regimes, preventing optimal out-of-the-box performance. In contrast, bounded approaches, such as ETHER, provide greater robustness but are limited to extremely low-rank adaptations and fixed-strength transformations, reducing their adaptation expressive power. In this work, we propose Decoupled Low-rank Adaptation (DeLoRA), a novel finetuning method that normalizes and scales learnable low-rank matrices. By bounding the distance of the transformation, DeLoRA effectively decouples the angular learning from the adaptation strength, enhancing robustness without compromising performance. Through evaluations on subject-driven image generation, natural language understanding, and instruction tuning, we show that DeLoRA matches or surpasses performance of competing PEFT methods, while exhibiting stronger robustness. Code is available at this https URL.

Title: PG-SAM: Prior-Guided SAM with Medical for Multi-organ Segmentation

Authors: Yiheng Zhong, Zihong Luo, Chengzhi Liu, Feilong Tang, Zelin Peng, Ming Hu, Yingzhen Hu, Jionglong Su, Zongyuan Ge, Imran Razzak
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18227
Pdf URL: https://arxiv.org/pdf/2503.18227
Copy Paste: [[2503.18227]] PG-SAM: Prior-Guided SAM with Medical for Multi-organ Segmentation(https://arxiv.org/abs/2503.18227)
Keywords: robust, segmentation
Abstract: Segment Anything Model (SAM) demonstrates powerful zero-shot capabilities; however, its accuracy and robustness significantly decrease when applied to medical image segmentation. Existing methods address this issue through modality fusion, integrating textual and image information to provide more detailed priors. In this study, we argue that the granularity of text and the domain gap affect the accuracy of the priors. Furthermore, the discrepancy between high-level abstract semantics and pixel-level boundary details in images can introduce noise into the fusion process. To address this, we propose Prior-Guided SAM (PG-SAM), which employs a fine-grained modality prior aligner to leverage specialized medical knowledge for better modality alignment. The core of our method lies in efficiently addressing the domain gap with fine-grained text from a medical LLM. Meanwhile, it also enhances the priors' quality after modality alignment, ensuring more accurate segmentation. In addition, our decoder enhances the model's expressive capabilities through multi-level feature fusion and iterative mask optimizer operations, supporting unprompted learning. We also propose a unified pipeline that effectively supplies high-quality semantic information to SAM. Extensive experiments on the Synapse dataset demonstrate that the proposed PG-SAM achieves state-of-the-art performance. Our anonymous code is released at this https URL.

Title: KEA: Keeping Exploration Alive by Proactively Coordinating Exploration Strategies

Authors: Shih-Min Yang, Martin Magnusson, Johannes A. Stork, Todor Stoyanov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.18234
Pdf URL: https://arxiv.org/pdf/2503.18234
Copy Paste: [[2503.18234]] KEA: Keeping Exploration Alive by Proactively Coordinating Exploration Strategies(https://arxiv.org/abs/2503.18234)
Keywords: robust
Abstract: Soft Actor-Critic (SAC) has achieved notable success in continuous control tasks but struggles in sparse reward settings, where infrequent rewards make efficient exploration challenging. While novelty-based exploration methods address this issue by encouraging the agent to explore novel states, they are not trivial to apply to SAC. In particular, managing the interaction between novelty-based exploration and SAC's stochastic policy can lead to inefficient exploration and redundant sample collection. In this paper, we propose KEA (Keeping Exploration Alive) which tackles the inefficiencies in balancing exploration strategies when combining SAC with novelty-based exploration. KEA introduces an additional co-behavior agent that works alongside SAC and a switching mechanism to facilitate proactive coordination between exploration strategies from novelty-based exploration and stochastic policy. This coordination allows the agent to maintain stochasticity in high-novelty regions, enhancing exploration efficiency and reducing repeated sample collection. We first analyze this potential issue in a 2D navigation task and then evaluate KEA on sparse reward control tasks from the DeepMind Control Suite. Compared to state-of-the-art novelty-based exploration baselines, our experiments show that KEA significantly improves learning efficiency and robustness in sparse reward setups.

Title: ShED-HD: A Shannon Entropy Distribution Framework for Lightweight Hallucination Detection on Edge Devices

Authors: Aneesh Vathul, Daniel Lee, Sheryl Chen, Arthi Tasmia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18242
Pdf URL: https://arxiv.org/pdf/2503.18242
Copy Paste: [[2503.18242]] ShED-HD: A Shannon Entropy Distribution Framework for Lightweight Hallucination Detection on Edge Devices(https://arxiv.org/abs/2503.18242)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities on a broad array of NLP tasks, but their tendency to produce hallucinations – plausible-sounding but factually incorrect content – poses severe challenges in high-stakes domains. Existing hallucination detection methods either bear the computational cost of multiple inference passes or sacrifice accuracy for efficiency with single-pass approaches, neither of which is ideal in resource-constrained environments such as edge devices. We propose the Shannon Entropy Distribution Hallucination Detector (ShED-HD), a novel hallucination detection framework that bridges this gap by classifying sequence-level entropy patterns using a lightweight BiLSTM architecture with single-headed attention. In contrast to prior approaches, ShED-HD efficiently detects distinctive uncertainty patterns across entire output sequences, preserving contextual awareness. Through in-depth evaluation on three datasets (BioASQ, TriviaQA, and Jeopardy Questions), we show that ShED-HD significantly outperforms other computationally efficient approaches in the out-of-distribution setting, while achieving comparable performance in the in-distribution setting. ShED-HD facilitates hallucination detection that is low-cost, accurate, and generalizable, improving the credibility of content generated by LLMs in resource-constrained environments where trustworthy AI functionality is crucial.

Title: DiffGED: Computing Graph Edit Distance via Diffusion-based Graph Matching

Authors: Wei Huang, Hanchen Wang, Dong Wen, Wenjie Zhang, Ying Zhang, Xuemin Lin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.18245
Pdf URL: https://arxiv.org/pdf/2503.18245
Copy Paste: [[2503.18245]] DiffGED: Computing Graph Edit Distance via Diffusion-based Graph Matching(https://arxiv.org/abs/2503.18245)
Keywords: diffusion, generative
Abstract: The Graph Edit Distance (GED) problem, which aims to compute the minimum number of edit operations required to transform one graph into another, is a fundamental challenge in graph analysis with wide-ranging applications. However, due to its NP-hard nature, traditional A* approaches often suffer from scalability issues, making them computationally intractable for large graphs. Many recent deep learning frameworks address GED by formulating it as a regression task, which, while efficient, fails to recover the edit path -- a central interest in GED. Furthermore, recent hybrid approaches that combine deep learning with traditional methods to recover the edit path often yield poor solution quality. These methods also struggle to generate candidate solutions in parallel, resulting in increased running time. In this paper, we present a novel approach, DiffGED, that leverages a generative diffusion model to solve GED and recover the corresponding edit path. Specifically, we first generate multiple diverse node matching matrices in parallel through a diffusion-based graph matching model. Next, node mappings are extracted from each generated matching matrix in parallel, and each extracted node mapping can be simply transformed into an edit path. Benefiting from the generative diversity provided by the diffusion model, DiffGED is less likely to fall into local sub-optimal solutions, thereby achieving superior overall solution quality close to the exact solution. Experimental results on real-world datasets demonstrate that DiffGED can generate multiple diverse edit paths with exceptionally high accuracy comparable to exact solutions while maintaining a running time shorter than most hybrid approaches.

Title: Enhancing Multi-Label Emotion Analysis and Corresponding Intensities for Ethiopian Languages

Authors: Tadesse Destaw Belay, Dawit Ketema Gete, Abinew Ali Ayele, Olga Kolesnikova, Grigori Sidorov, Seid Muhie Yimam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.18253
Pdf URL: https://arxiv.org/pdf/2503.18253
Copy Paste: [[2503.18253]] Enhancing Multi-Label Emotion Analysis and Corresponding Intensities for Ethiopian Languages(https://arxiv.org/abs/2503.18253)
Keywords: large language model
Abstract: In this digital world, people freely express their emotions using different social media platforms. As a result, modeling and integrating emotion-understanding models are vital for various human-computer interaction tasks such as decision-making, product and customer feedback analysis, political promotions, marketing research, and social media monitoring. As users express different emotions simultaneously in a single instance, annotating emotions in a multilabel setting such as the EthioEmo (Belay et al., 2025) dataset effectively captures this dynamic. Additionally, incorporating intensity, or the degree of emotion, is crucial, as emotions can significantly differ in their expressive strength and impact. This intensity is significant for assessing whether further action is necessary in decision-making processes, especially concerning negative emotions in applications such as healthcare and mental health studies. To enhance the EthioEmo dataset, we include annotations for the intensity of each labeled emotion. Furthermore, we evaluate various state-of-the-art encoder-only Pretrained Language Models (PLMs) and decoder-only Large Language Models (LLMs) to provide comprehensive benchmarking.

Title: Surface-Aware Distilled 3D Semantic Features

Authors: Lukas Uzolas, Elmar Eisemann, Petr Kellnhofer
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2503.18254
Pdf URL: https://arxiv.org/pdf/2503.18254
Copy Paste: [[2503.18254]] Surface-Aware Distilled 3D Semantic Features(https://arxiv.org/abs/2503.18254)
Keywords: robust, segmentation
Abstract: Many 3D tasks such as pose alignment, animation, motion transfer, and 3D reconstruction rely on establishing correspondences between 3D shapes. This challenge has recently been approached by matching of semantic features from pre-trained vision models. However, despite their power, these features struggle to differentiate instances of the same semantic class such as "left hand" versus "right hand" which leads to substantial mapping errors. To solve this, we learn a surface-aware embedding space that is robust to these ambiguities. Importantly, our approach is self-supervised and requires only a small number of unpaired training meshes to infer features for new 3D shapes at test time. We achieve this by introducing a contrastive loss that preserves the semantic content of the features distilled from foundational models while disambiguating features located far apart on the shape's surface. We observe superior performance in correspondence matching benchmarks and enable downstream applications including in-part segmentation, pose alignment, and motion transfer. The project site is available at this https URL.

Title: The Human-Machine Identity Blur: A Unified Framework for Cybersecurity Risk Management in 2025

Authors: Kush Janani
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18255
Pdf URL: https://arxiv.org/pdf/2503.18255
Copy Paste: [[2503.18255]] The Human-Machine Identity Blur: A Unified Framework for Cybersecurity Risk Management in 2025(https://arxiv.org/abs/2503.18255)
Keywords: security, attack
Abstract: The modern enterprise is facing an unprecedented surge in digital identities, with machine identities now significantly outnumbering human identities. This paper examines the cybersecurity risks emerging from what we define as the "human-machine identity blur" - the point at which human and machine identities intersect, delegate authority, and create new attack surfaces. Drawing from industry data, expert insights, and real-world incident analysis, we identify key governance gaps in current identity management models that treat human and machine entities as separate domains. To address these challenges, we propose a Unified Identity Governance Framework based on four core principles: treating identity as a continuum rather than a binary distinction, applying consistent risk evaluation across all identity types, implementing continuous verification guided by zero trust principles, and maintaining governance throughout the entire identity lifecycle. Our research shows that organizations adopting this unified approach experience a 47 percent reduction in identity-related security incidents and a 62 percent improvement in incident response time. We conclude by offering a practical implementation roadmap and outlining future research directions as AI-driven systems become increasingly autonomous.

Title: Analyzing Islamophobic Discourse Using Semi-Coded Terms and LLMs

Authors: Raza Ul Mustafa, Roi Dupart, Gabrielle Smith, Noman Ashraf, Nathalie Japkowicz
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.18273
Pdf URL: https://arxiv.org/pdf/2503.18273
Copy Paste: [[2503.18273]] Analyzing Islamophobic Discourse Using Semi-Coded Terms and LLMs(https://arxiv.org/abs/2503.18273)
Keywords: large language model
Abstract: Islamophobia has evolved into a global phenomenon, attracting followers across the globe, particularly in Western societies. Thus, understanding Islamophobia's global spread and online dissemination is crucial. This paper performs a large-scale analysis of specialized, semi-coded Islamophobic terms (e.g., muzrat, pislam, mudslime, mohammedan, muzzies) circulating on extremist social platforms such as 4Chan, Gab, and Telegram. First, we use large language models (LLMs) to show their ability to understand these terms. Second, using Google Perspective API, we also find that Islamophobic text is more toxic compared to other kinds of hate speech. Finally, we use a BERT topic modeling approach to extract different topics and Islamophobic discourse on these social platforms. Our findings indicate that LLMs understand these Out-Of-Vocabulary (OOV) slurs; however, measures are still required to control such discourse. Our topic modeling also indicates that Islamophobic text is found across various political, conspiratorial, and far-right movements and is particularly directed against Muslim immigrants. Taken together, we performed the first study on Islamophobic semi-coded terms and shed a global light on Islamophobia.

Title: TrackID3x3: A Dataset and Algorithm for Multi-Player Tracking with Identification and Pose Estimation in 3x3 Basketball Full-court Videos

Authors: Kazuhiro Yamada, Li Yin, Qingrui Hu, Ning Ding, Shunsuke Iwashita, Jun Ichikawa, Kiwamu Kotani, Calvin Yeung, Keisuke Fujii
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18282
Pdf URL: https://arxiv.org/pdf/2503.18282
Copy Paste: [[2503.18282]] TrackID3x3: A Dataset and Algorithm for Multi-Player Tracking with Identification and Pose Estimation in 3x3 Basketball Full-court Videos(https://arxiv.org/abs/2503.18282)
Keywords: robust
Abstract: Multi-object tracking, player identification, and pose estimation are fundamental components of sports analytics, essential for analyzing player movements, performance, and tactical strategies. However, existing datasets and methodologies primarily target mainstream team sports such as soccer and conventional 5-on-5 basketball, often overlooking scenarios involving fixed-camera setups commonly used at amateur levels, less mainstream sports, or datasets that explicitly incorporate pose annotations. In this paper, we propose the TrackID3x3 dataset, the first publicly available comprehensive dataset specifically designed for multi-player tracking, player identification, and pose estimation in 3x3 basketball scenarios. The dataset comprises three distinct subsets (Indoor fixed-camera, Outdoor fixed-camera, and Drone camera footage), capturing diverse full-court camera perspectives and environments. We also introduce the Track-ID task, a simplified variant of the game state reconstruction task that excludes field detection and focuses exclusively on fixed-camera scenarios. To evaluate performance, we propose a baseline algorithm, called the Track-ID algorithm, tailored to assess tracking and identification quality. Furthermore, our benchmark experiments, utilizing recent multi-object tracking algorithms (e.g., BoT-SORT-ReID) and top-down pose estimation methods (HRNet, RTMPose, and SwinPose), demonstrate robust results and highlight remaining challenges. Our dataset and evaluation benchmarks provide a solid foundation for advancing automated analytics in 3x3 basketball. Dataset and code will be available at this https URL.

Title: CO-SPY: Combining Semantic and Pixel Features to Detect Synthetic Images by AI

Authors: Siyuan Cheng, Lingjuan Lyu, Zhenting Wang, Xiangyu Zhang, Vikash Sehwag
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18286
Pdf URL: https://arxiv.org/pdf/2503.18286
Copy Paste: [[2503.18286]] CO-SPY: Combining Semantic and Pixel Features to Detect Synthetic Images by AI(https://arxiv.org/abs/2503.18286)
Keywords: robust, generative
Abstract: With the rapid advancement of generative AI, it is now possible to synthesize high-quality images in a few seconds. Despite the power of these technologies, they raise significant concerns regarding misuse. Current efforts to distinguish between real and AI-generated images may lack generalization, being effective for only certain types of generative models and susceptible to post-processing techniques like JPEG compression. To overcome these limitations, we propose a novel framework, Co-Spy, that first enhances existing semantic features (e.g., the number of fingers in a hand) and artifact features (e.g., pixel value differences), and then adaptively integrates them to achieve more general and robust synthetic image detection. Additionally, we create Co-Spy-Bench, a comprehensive dataset comprising 5 real image datasets and 22 state-of-the-art generative models, including the latest models like FLUX. We also collect 50k synthetic images in the wild from the Internet to enable evaluation in a more practical setting. Our extensive evaluations demonstrate that our detector outperforms existing methods under identical training conditions, achieving an average accuracy improvement of approximately 11% to 34%. The code is available at this https URL.

Title: Sun-Shine: A Large Language Model for Tibetan Culture

Authors: Cheng Huang, Fan Gao, Nyima Tashi, Yutong Liu, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Xiao Feng, Yongbin Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.18288
Pdf URL: https://arxiv.org/pdf/2503.18288
Copy Paste: [[2503.18288]] Sun-Shine: A Large Language Model for Tibetan Culture(https://arxiv.org/abs/2503.18288)
Keywords: large language model
Abstract: Tibetan, a minority language in China, features a highly intricate grammatical structure, characterized by four verb tenses and a tense system with frequent irregularities, contributing to its extensive inflectional diversity. Recently, advances in Large Language Models (LLMs) have transformed the paradigm in many domains. Despite the success in other fields, current LLMs often fall short in catering to the needs of domain experts like Tibetans, and the potential of LLMs for Tibetan culture is under-explored. The intrinsic reasons are the immense and intricate nature of Tibetan culture as well as the necessity for higher granularity and richness in knowledge. Simultaneously, the complexity and uniqueness of its grammatical structure, coupled with its status as a minority ethnic language, contribute to data scarcity, which remains a fundamental challenge. To alleviate these issues, we introduce Llama-Sunshine (Sun-Shine), the first large language model for Tibetan culture, which is expert in various Tibetan language processing tasks. Sun-Shine incorporates state-of-the-art model architectures optimized for Tibetan's linguistic features. We also propose TIB-STC, a comprehensive dataset comprising diverse Tibetan texts such as literature, religious scripts, news, and conversational data, which is also the first large-scale dataset for Tibetan culture. Through comprehensive experiments, Sun-Shine not only demonstrates a higher level of knowledge expertise for Tibetan culture but also gains preliminary embodied intelligence capabilities in Tibetan language processing tasks, like language modeling, text classification, machine translation, and syntactic analysis. Moreover, it excels in low-resource scenarios, showcasing strong generalization capabilities.

Title: When is dataset cartography ineffective? Using training dynamics does not improve robustness against Adversarial SQuAD

Authors: Paul K. Mandal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18290
Pdf URL: https://arxiv.org/pdf/2503.18290
Copy Paste: [[2503.18290]] When is dataset cartography ineffective? Using training dynamics does not improve robustness against Adversarial SQuAD(https://arxiv.org/abs/2503.18290)
Keywords: robust
Abstract: In this paper, I investigate the effectiveness of dataset cartography for extractive question answering on the SQuAD dataset. I begin by analyzing annotation artifacts in SQuAD and evaluate the impact of two adversarial datasets, AddSent and AddOneSent, on an ELECTRA-small model. Using training dynamics, I partition SQuAD into easy-to-learn, ambiguous, and hard-to-learn subsets. I then compare the performance of models trained on these subsets to those trained on randomly selected samples of equal size. Results show that training on cartography-based subsets does not improve generalization to the SQuAD validation set or the AddSent adversarial set. While the hard-to-learn subset yields a slightly higher F1 score on the AddOneSent dataset, the overall gains are limited. These findings suggest that dataset cartography provides little benefit for adversarial robustness in SQuAD-style QA tasks. I conclude by comparing these results to prior findings on SNLI and discuss possible reasons for the observed differences.

Title: Fact-checking AI-generated news reports: Can LLMs catch their own lies?

Authors: Jiayi Yao, Haibo Sun, Nianwen Xue
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.18293
Pdf URL: https://arxiv.org/pdf/2503.18293
Copy Paste: [[2503.18293]] Fact-checking AI-generated news reports: Can LLMs catch their own lies?(https://arxiv.org/abs/2503.18293)
Keywords: large language model
Abstract: In this paper, we evaluate the ability of Large Language Models (LLMs) to assess the veracity of claims in "news reports" generated by themselves or other LLMs. Our goal is to determine whether LLMs can effectively fact-check their own content, using methods similar to those used to verify claims made by humans. Our findings indicate that LLMs are more effective at assessing claims in national or international news stories than in local news stories, better at evaluating static information than dynamic information, and better at verifying true claims compared to false ones. We hypothesize that this disparity arises because the former types of claims are better represented in the training data. Additionally, we find that incorporating retrieved results from a search engine in a Retrieval-Augmented Generation (RAG) setting significantly reduces the number of claims an LLM cannot assess. However, this approach also increases the occurrence of incorrect assessments, partly due to irrelevant or low-quality search results. This diagnostic study highlights the need for future research on fact-checking machine-generated reports to prioritize improving the precision and relevance of retrieved information to better support fact-checking efforts. Furthermore, claims about dynamic events and local news may require human-in-the-loop fact-checking systems to ensure accuracy and reliability.

Title: LGPS: A Lightweight GAN-Based Approach for Polyp Segmentation in Colonoscopy Images

Authors: Fiseha B. Tesema, Alejandro Guerra Manzanares, Tianxiang Cui, Qian Zhang, Moses Solomon, Sean He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18294
Pdf URL: https://arxiv.org/pdf/2503.18294
Copy Paste: [[2503.18294]] LGPS: A Lightweight GAN-Based Approach for Polyp Segmentation in Colonoscopy Images(https://arxiv.org/abs/2503.18294)
Keywords: robust, extraction, segmentation
Abstract: Colorectal cancer (CRC) is a major global cause of cancer-related deaths, with early polyp detection and removal during colonoscopy being crucial for prevention. While deep learning methods have shown promise in polyp segmentation, challenges such as high computational costs, difficulty in segmenting small or low-contrast polyps, and limited generalizability across datasets persist. To address these issues, we propose LGPS, a lightweight GAN-based framework for polyp segmentation. LGPS incorporates three key innovations: (1) a MobileNetV2 backbone enhanced with modified residual blocks and Squeeze-and-Excitation (ResE) modules for efficient feature extraction; (2) Convolutional Conditional Random Fields (ConvCRF) for precise boundary refinement; and (3) a hybrid loss function combining Binary Cross-Entropy, Weighted IoU Loss, and Dice Loss to address class imbalance and enhance segmentation accuracy. LGPS is validated on five benchmark datasets and compared with state-of-the-art (SOTA) methods. On the largest and most challenging PolypGen test dataset, LGPS achieves a Dice of 0.7299 and an IoU of 0.7867, outperforming all SOTA works and demonstrating robust generalization. With only 1.07 million parameters, LGPS is 17 times smaller than the smallest existing model, making it highly suitable for real-time clinical applications. Its lightweight design and strong performance underscore its potential for improving early CRC diagnosis. Code is available at this https URL.

Title: Surgical Action Planning with Large Language Models

Authors: Mengya Xu, Zhongzhen Huang, Jie Zhang, Xiaofan Zhang, Qi Dou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.18296
Pdf URL: https://arxiv.org/pdf/2503.18296
Copy Paste: [[2503.18296]] Surgical Action Planning with Large Language Models(https://arxiv.org/abs/2503.18296)
Keywords: privacy, large language model
Abstract: In robot-assisted minimally invasive surgery, we introduce the Surgical Action Planning (SAP) task, which generates future action plans from visual inputs to address the absence of intraoperative predictive planning in current intelligent applications. SAP shows great potential for enhancing intraoperative guidance and automating procedures. However, it faces challenges such as understanding instrument-action relationships and tracking surgical progress. Large Language Models (LLMs) show promise in understanding surgical video content but remain underexplored for predictive decision-making in SAP, as they focus mainly on retrospective analysis. Challenges like data privacy, computational demands, and modality-specific constraints further highlight significant research gaps. To tackle these challenges, we introduce LLM-SAP, a Large Language Models-based Surgical Action Planning framework that predicts future actions and generates text responses by interpreting natural language prompts of surgical goals. The text responses potentially support surgical education, intraoperative decision-making, procedure documentation, and skill analysis. LLM-SAP integrates two novel modules: the Near-History Focus Memory Module (NHF-MM) for modeling historical states and the prompts factory for action planning. We evaluate LLM-SAP on our constructed CholecT50-SAP dataset using models like Qwen2.5 and Qwen2-VL, demonstrating its effectiveness in next-action prediction. Pre-trained LLMs are tested zero-shot, and supervised fine-tuning (SFT) with LoRA is implemented to address data privacy concerns. Our experiments show that Qwen2.5-72B-SFT surpasses Qwen2.5-72B with a 19.3% higher accuracy.

Title: Image-to-Text for Medical Reports Using Adaptive Co-Attention and Triple-LSTM Module

Authors: Yishen Liu, Shengda Liu, Hudan Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18297
Pdf URL: https://arxiv.org/pdf/2503.18297
Copy Paste: [[2503.18297]] Image-to-Text for Medical Reports Using Adaptive Co-Attention and Triple-LSTM Module(https://arxiv.org/abs/2503.18297)
Keywords: transformer, large language model
Abstract: Medical report generation requires specialized expertise that general large models often fail to accurately capture. Moreover, the inherent repetition and similarity in medical data make it difficult for models to extract meaningful features, resulting in a tendency to overfit. In this paper, we therefore propose the Co-Attention Triple-LSTM Network (CA-TriNet), a multimodal deep learning model that combines transformer architectures with a Multi-LSTM network. Its Co-Attention module synergistically links a vision transformer with a text transformer to better differentiate medical images with similarities, augmented by an adaptive weight operator to catch and amplify image labels with minor similarities. Furthermore, its Triple-LSTM module refines generated sentences using targeted image objects. Extensive evaluations over three public datasets have demonstrated that CA-TriNet outperforms state-of-the-art models in terms of comprehensive ability, and even surpasses pre-trained large language models on some metrics.

Title: Diff-Palm: Realistic Palmprint Generation with Polynomial Creases and Intra-Class Variation Controllable Diffusion Models

Authors: Jianlong Jin, Chenglong Zhao, Ruixin Zhang, Sheng Shang, Jianqing Xu, Jingyun Zhang, ShaoMing Wang, Yang Zhao, Shouhong Ding, Wei Jia, Yunsheng Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18312
Pdf URL: https://arxiv.org/pdf/2503.18312
Copy Paste: [[2503.18312]] Diff-Palm: Realistic Palmprint Generation with Polynomial Creases and Intra-Class Variation Controllable Diffusion Models(https://arxiv.org/abs/2503.18312)
Keywords: diffusion
Abstract: Palmprint recognition is significantly limited by the lack of large-scale publicly available datasets. Previous methods have adopted Bézier curves to simulate the palm creases, which then serve as input for conditional GANs to generate realistic palmprints. However, without employing real data fine-tuning, the performance of the recognition model trained on these synthetic datasets would drastically decline, indicating a large gap between generated and real palmprints. This is primarily due to the utilization of an inaccurate palm crease representation and challenges in balancing intra-class variation with identity consistency. To address this, we introduce a polynomial-based palm crease representation that provides a new palm crease generation mechanism more closely aligned with the real distribution. We also propose the palm creases conditioned diffusion model with a novel intra-class variation control method. By applying our proposed $K$-step noise-sharing sampling, we are able to synthesize palmprint datasets with large intra-class variation and high identity consistency. Experimental results show that, for the first time, recognition models trained solely on our synthetic datasets, without any fine-tuning, outperform those trained on real datasets. Furthermore, our approach achieves superior recognition performance as the number of generated identities increases.

Title: LoTUS: Large-Scale Machine Unlearning with a Taste of Uncertainty

Authors: Christoforos N. Spartalis, Theodoros Semertzidis, Stratis Gavves, Petros Daras
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18314
Pdf URL: https://arxiv.org/pdf/2503.18314
Copy Paste: [[2503.18314]] LoTUS: Large-Scale Machine Unlearning with a Taste of Uncertainty(https://arxiv.org/abs/2503.18314)
Keywords: transformer
Abstract: We present LoTUS, a novel Machine Unlearning (MU) method that eliminates the influence of training samples from pre-trained models, avoiding retraining from scratch. LoTUS smooths the prediction probabilities of the model -- up to an information theoretic bound -- mitigating its over-confidence that stems from data memorization. We evaluate LoTUS on the Transformer and ResNet18 models, against eight baseline methods, on five public datasets. Beyond established MU benchmarks, we evaluate unlearning on a large-scale dataset (ImageNet1k) which deters retraining, simulating real-world conditions. Moreover, we introduce the novel Retrain-Free Jensen-Shannon Divergence (RF-JSD) metric to enable evaluation under real-world conditions. Experimental results show that LoTUS outperforms state-of-the-art methods in terms of both efficiency and effectiveness. Code: this https URL.
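
For readers unfamiliar with the divergence underlying the RF-JSD metric mentioned above, the following is a minimal sketch of the standard Jensen-Shannon divergence between two discrete distributions. It is the textbook definition only, not the paper's retrain-free variant; the function names are illustrative.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture m."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

The divergence is symmetric, equals zero for identical distributions, and is bounded above by log 2 (in nats), which makes it convenient for comparing the output distributions of two models.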

Title: Knowledge Transfer from LLMs to Provenance Analysis: A Semantic-Augmented Method for APT Detection

Authors: Fei Zuo, Junghwan Rhee, Yung Ryn Choe
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.18316
Pdf URL: https://arxiv.org/pdf/2503.18316
Copy Paste: [[2503.18316]] Knowledge Transfer from LLMs to Provenance Analysis: A Semantic-Augmented Method for APT Detection(https://arxiv.org/abs/2503.18316)
Keywords: security, attack, steal, large language model
Abstract: Advanced Persistent Threats (APTs) have caused significant losses across a wide range of sectors, including the theft of sensitive data and harm to system integrity. As attack techniques grow increasingly sophisticated and stealthy, the arms race between cyber defenders and attackers continues to intensify. The revolutionary impact of Large Language Models (LLMs) has opened up numerous opportunities in various fields, including cybersecurity. An intriguing question arises: can the extensive knowledge embedded in LLMs be harnessed for provenance analysis and play a positive role in identifying previously unknown malicious events? To seek a deeper understanding of this issue, we propose a new strategy for taking advantage of LLMs in provenance-based threat detection. In our design, a state-of-the-art LLM offers additional details in provenance data interpretation, leveraging its knowledge of system calls, software identity, and high-level understanding of application execution context. The advanced contextualized embedding capability is further utilized to capture the rich semantics of event descriptions. We comprehensively examine the quality of the resulting embeddings and find them promising. Subsequently, machine learning models built upon these embeddings demonstrate outstanding performance on real-world data. In our evaluation, supervised threat detection achieves a precision of 99.0%, and semi-supervised anomaly detection attains a precision of 96.9%.

Title: Improved Rates of Differentially Private Nonconvex-Strongly-Concave Minimax Optimization

Authors: Ruijia Zhang, Mingxi Lei, Meng Ding, Zihang Xiang, Jinhui Xu, Di Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.18317
Pdf URL: https://arxiv.org/pdf/2503.18317
Copy Paste: [[2503.18317]] Improved Rates of Differentially Private Nonconvex-Strongly-Concave Minimax Optimization(https://arxiv.org/abs/2503.18317)
Keywords: privacy, generative
Abstract: In this paper, we study the problem of (finite sum) minimax optimization in the Differential Privacy (DP) model. Unlike most of the previous studies on the (strongly) convex-concave settings or loss functions satisfying the Polyak-Lojasiewicz condition, here we mainly focus on the nonconvex-strongly-concave one, which encapsulates many models in deep learning such as deep AUC maximization. Specifically, we first analyze a DP version of Stochastic Gradient Descent Ascent (SGDA) and show that it is possible to get a DP estimator whose $l_2$-norm of the gradient for the empirical risk function is upper bounded by $\tilde{O}(\frac{d^{1/4}}{(n\epsilon)^{1/2}})$, where $d$ is the model dimension and $n$ is the sample size. We then propose a new method with less gradient noise variance and improve the upper bound to $\tilde{O}(\frac{d^{1/3}}{(n\epsilon)^{2/3}})$, which matches the best-known result for DP Empirical Risk Minimization with non-convex loss. We also discuss several lower bounds of private minimax optimization. Finally, experiments on AUC maximization, generative adversarial networks, and temporal difference learning with real-world data support our theoretical analysis.
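
To make the two rates quoted above concrete, the sketch below evaluates both expressions numerically. It drops the constants and logarithmic factors hidden by the $\tilde{O}$ notation, so the values only indicate scaling, not actual error; the function names are illustrative. Algebraically, the second rate is the smaller one exactly when $n\epsilon > d^{1/2}$.

```python
def sgda_rate(d, n, eps):
    """d^(1/4) / (n*eps)^(1/2), ignoring constants and log factors."""
    return d ** 0.25 / (n * eps) ** 0.5

def improved_rate(d, n, eps):
    """d^(1/3) / (n*eps)^(2/3), ignoring constants and log factors."""
    return d ** (1.0 / 3.0) / (n * eps) ** (2.0 / 3.0)

if __name__ == "__main__":
    # Example setting (assumed values): d = 10^4 parameters, n = 10^5 samples, eps = 1.
    d, n, eps = 10_000, 100_000, 1.0
    print(sgda_rate(d, n, eps), improved_rate(d, n, eps))
```

In the assumed setting the crossover sits at $n\epsilon = d^{1/2} = 100$, so with $n\epsilon = 10^5$ the second expression is the tighter of the two.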

Title: Plug-and-Play Interpretable Responsible Text-to-Image Generation via Dual-Space Multi-facet Concept Control

Authors: Basim Azam, Naveed Akhtar
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18324
Pdf URL: https://arxiv.org/pdf/2503.18324
Copy Paste: [[2503.18324]] Plug-and-Play Interpretable Responsible Text-to-Image Generation via Dual-Space Multi-facet Concept Control(https://arxiv.org/abs/2503.18324)
Keywords: fair, interpretability, diffusion, generative
Abstract: Ethical issues around text-to-image (T2I) models demand comprehensive control over the generative content. Existing techniques addressing these issues for responsible T2I models aim for the generated content to be fair and safe (non-violent/explicit). However, these methods remain bounded to handling the facets of responsibility concepts individually, while also lacking in interpretability. Moreover, they often require alteration to the original model, which compromises the model performance. In this work, we propose a unique technique to enable responsible T2I generation by simultaneously accounting for an extensive range of concepts for fair and safe content generation in a scalable manner. The key idea is to distill the target T2I pipeline with an external plug-and-play mechanism that learns an interpretable composite responsible space for the desired concepts, conditioned on the target T2I pipeline. We use knowledge distillation and concept whitening to enable this. At inference, the learned space is utilized to modulate the generative content. A typical T2I pipeline presents two plug-in points for our approach, namely, the text embedding space and the diffusion model latent space. We develop modules for both points and show the effectiveness of our approach with a range of strong results.

Title: Towards Training-free Anomaly Detection with Vision and Language Foundation Models

Authors: Jinjin Zhang, Guodong Wang, Yizhou Jin, Di Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18325
Pdf URL: https://arxiv.org/pdf/2503.18325
Copy Paste: [[2503.18325]] Towards Training-free Anomaly Detection with Vision and Language Foundation Models(https://arxiv.org/abs/2503.18325)
Keywords: robust
Abstract: Anomaly detection is valuable for real-world applications, such as industrial quality inspection. However, most approaches focus on detecting local structural anomalies while neglecting compositional anomalies incorporating logical constraints. In this paper, we introduce LogSAD, a novel multi-modal framework that requires no training for both Logical and Structural Anomaly Detection. First, we propose a match-of-thought architecture that employs advanced large multi-modal models (i.e. GPT-4V) to generate matching proposals, formulating interests and compositional rules of thought for anomaly detection. Second, we elaborate on multi-granularity anomaly detection, consisting of patch tokens, sets of interests, and composition matching with vision and language foundation models. Subsequently, we present a calibration module to align anomaly scores from different detectors, followed by integration strategies for the final decision. Consequently, our approach addresses both logical and structural anomaly detection within a unified framework and achieves state-of-the-art results without the need for training, even when compared to supervised approaches, highlighting its robustness and effectiveness. Code is available at this https URL.

Title: Mitigating Cache Noise in Test-Time Adaptation for Large Vision-Language Models

Authors: Haotian Zhai, Xinyu Chen, Can Zhang, Tianming Sha, Ruirui Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18334
Pdf URL: https://arxiv.org/pdf/2503.18334
Copy Paste: [[2503.18334]] Mitigating Cache Noise in Test-Time Adaptation for Large Vision-Language Models(https://arxiv.org/abs/2503.18334)
Keywords: robust
Abstract: Test-time adaptation (TTA) of visual language models has recently attracted significant attention as a solution to the performance degradation caused by distribution shifts in downstream tasks. However, existing cache-based TTA methods have certain limitations. They mainly rely on the accuracy of cached feature labels, and the presence of noisy pseudo-labels can cause these features to deviate from their true distribution. This makes cache retrieval methods based on similarity matching highly sensitive to outliers or extreme samples. Moreover, current methods lack effective mechanisms to model class distributions, which limits their ability to fully exploit the potential of cached information. To address these challenges, we introduce a comprehensive and reliable caching mechanism and propose a novel zero-shot TTA method called "Cache, Residual, Gaussian" (CRG). This method not only employs learnable residual parameters to better align positive and negative visual prototypes with text prototypes, thereby optimizing the quality of cached features, but also incorporates Gaussian Discriminant Analysis (GDA) to dynamically model intra-class feature distributions, further mitigating the impact of noisy features. Experimental results on 13 benchmarks demonstrate that CRG outperforms state-of-the-art TTA methods, showcasing exceptional robustness and adaptability.

Title: Coeff-Tuning: A Graph Filter Subspace View for Tuning Attention-Based Large Models

Authors: Zichen Miao, Wei Chen, Qiang Qiu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18337
Pdf URL: https://arxiv.org/pdf/2503.18337
Copy Paste: [[2503.18337]] Coeff-Tuning: A Graph Filter Subspace View for Tuning Attention-Based Large Models(https://arxiv.org/abs/2503.18337)
Keywords: transformer
Abstract: Transformer-based large pre-trained models have shown remarkable generalization ability, and various parameter-efficient fine-tuning (PEFT) methods have been proposed to customize these models on downstream tasks with minimal computational and memory budgets. Previous PEFT methods are primarily designed from a tensor-decomposition perspective that tries to effectively tune the linear transformation by finding the smallest subset of parameters to train. Our study adopts an orthogonal view by representing the attention operation as a graph convolution and formulating the multi-head attention maps as a convolutional filter subspace, with each attention map as a subspace element. In this paper, we propose to tune the large pre-trained transformers by learning a small set of combination coefficients that construct a more expressive filter subspace from the original multi-head attention maps. We show analytically and experimentally that the tuned filter subspace can effectively expand the feature space of the multi-head attention and further enhance the capacity of transformers. We further stabilize the fine-tuning with a residual parameterization of the tunable subspace coefficients, and enhance the generalization with a regularization design by directly applying dropout on the tunable coefficients during training. The tunable coefficients take a tiny number of parameters and can be combined with previous PEFT methods in a plug-and-play manner. Extensive experiments show that our approach achieves superior performance to PEFT baselines with negligible additional parameters.
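
One plausible reading of "combination coefficients" with a "residual parameterization" is that each tuned attention map is the original map plus a learned linear combination of all heads' maps, i.e. mixing with (I + C) where C starts at zero. The pure-Python sketch below illustrates that reading only; it is an assumption, not the paper's code, and the function name and nested-list layout are hypothetical.

```python
def mix_attention_heads(attn_maps, coeffs):
    """Recombine per-head attention maps with residual coefficients.

    attn_maps: list of H matrices (nested lists, each rows x cols).
    coeffs:    H x H tunable matrix C; the effective mixing matrix is I + C,
               so C = 0 reproduces the pre-trained maps exactly (assumption).
    """
    num_heads = len(attn_maps)
    rows, cols = len(attn_maps[0]), len(attn_maps[0][0])
    mixed = []
    for h in range(num_heads):
        out = [[0.0] * cols for _ in range(rows)]
        for k in range(num_heads):
            weight = coeffs[h][k] + (1.0 if h == k else 0.0)  # residual term
            for i in range(rows):
                for j in range(cols):
                    out[i][j] += weight * attn_maps[k][i][j]
        mixed.append(out)
    return mixed
```

Under this reading, the number of tunable parameters per attention layer is only H x H, independent of sequence length and hidden size, which is consistent with the abstract's claim of a tiny parameter count.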

Title: SPMTrack: Spatio-Temporal Parameter-Efficient Fine-Tuning with Mixture of Experts for Scalable Visual Tracking

Authors: Wenrui Cai, Qingjie Liu, Yunhong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18338
Pdf URL: https://arxiv.org/pdf/2503.18338
Copy Paste: [[2503.18338]] SPMTrack: Spatio-Temporal Parameter-Efficient Fine-Tuning with Mixture of Experts for Scalable Visual Tracking(https://arxiv.org/abs/2503.18338)
Keywords: extraction, transformer
Abstract: Most state-of-the-art trackers adopt a one-stream paradigm, using a single Vision Transformer for joint feature extraction and relation modeling of template and search region images. However, relation modeling between different image patches exhibits significant variations. For instance, background regions dominated by target-irrelevant information require reduced attention allocation, while foreground regions, particularly boundary areas, need to be emphasized. A single model may not effectively handle all kinds of relation modeling simultaneously. In this paper, we propose a novel tracker called SPMTrack based on a mixture-of-experts tailored for the visual tracking task (TMoE), combining the capability of multiple experts to handle diverse relation modeling more flexibly. Benefiting from TMoE, we extend relation modeling from image pairs to spatio-temporal context, further improving tracking accuracy with minimal increase in model parameters. Moreover, we employ TMoE as a parameter-efficient fine-tuning method, substantially reducing trainable parameters, which enables us to train SPMTrack of varying scales efficiently and preserve the generalization ability of pretrained models to achieve superior performance. We conduct experiments on seven datasets, and experimental results demonstrate that our method significantly outperforms current state-of-the-art trackers. The source code is available at this https URL.

Title: PS-EIP: Robust Photometric Stereo Based on Event Interval Profile

Authors: Kazuma Kitazawa, Takahito Aoto, Satoshi Ikehata, Tsuyoshi Takatani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18341
Pdf URL: https://arxiv.org/pdf/2503.18341
Copy Paste: [[2503.18341]] PS-EIP: Robust Photometric Stereo Based on Event Interval Profile(https://arxiv.org/abs/2503.18341)
Keywords: robust
Abstract: Recently, the energy-efficient photometric stereo method using an event camera has been proposed to recover surface normals from events triggered by changes in logarithmic Lambertian reflections under a moving directional light source. However, EventPS treats each event interval independently, making it sensitive to noise, shadows, and non-Lambertian reflections. This paper proposes Photometric Stereo based on Event Interval Profile (PS-EIP), a robust method that recovers pixelwise surface normals from a time-series profile of event intervals. By exploiting the continuity of the profile and introducing an outlier detection method based on profile shape, our approach enhances robustness against outliers from shadows and specular reflections. Experiments using real event data from 3D-printed objects demonstrate that PS-EIP significantly improves robustness to outliers compared to EventPS's deep-learning variant, EventPS-FCN, without relying on deep learning.

Title: Attacking and Improving the Tor Directory Protocol

Authors: Zhongtang Luo, Adithya Bhat, Kartik Nayak, Aniket Kate
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.18345
Pdf URL: https://arxiv.org/pdf/2503.18345
Copy Paste: [[2503.18345]] Attacking and Improving the Tor Directory Protocol(https://arxiv.org/abs/2503.18345)
Keywords: secure, security, privacy, attack
Abstract: The Tor network enhances clients' privacy by routing traffic through an overlay network of volunteered intermediate relays. Tor employs a distributed protocol among nine hard-coded Directory Authority (DA) servers to securely disseminate information about these relays to produce a new consensus document every hour. With a straightforward voting mechanism to ensure consistency, the protocol is expected to be secure even when a minority of those authorities get compromised. However, the current consensus protocol is flawed: it allows an equivocation attack that enables just a single compromised authority to create a valid consensus document with malicious relays. Importantly, the vulnerability is not innocuous: we demonstrate that the compromised authority can effectively trick a targeted client into using the equivocated consensus document in an undetectable manner. Moreover, even if we have archived Tor consensus documents available since its beginning, we cannot be sure that no client was ever tricked. We propose a two-stage solution to deal with this exploit. In the short term, we have developed and deployed TorEq, a monitor to detect such exploits reactively: the Tor clients can refer to the monitor before updating the consensus to ensure no equivocation. To solve the problem proactively, we first define the Tor DA consensus problem as the interactive consistency (IC) problem from the distributed computing literature. We then design DirCast, a novel secure Byzantine Broadcast protocol that requires minimal code change from the current Tor DA code base. Our protocol has near-optimal efficiency that uses optimistically five rounds and at most nine rounds to reach an agreement in the current nine-authority system. We are communicating with the Tor security team to incorporate the solutions into the Tor project.
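
The general idea behind detecting equivocation (one signer vouching for two conflicting documents for the same voting period) can be sketched as follows. This is a generic illustration, not TorEq's design: the observation format `(authority_id, period, digest)`, the function name, and the assumption that signatures were already verified upstream are all hypothetical.

```python
from collections import defaultdict

def find_equivocations(observations):
    """Return signers seen vouching for more than one digest in one period.

    observations: iterable of (authority_id, period, digest) tuples, where
    digest is a hash of a document the authority signed for that period.
    Signature verification is assumed to have happened before this step.
    """
    seen = defaultdict(set)
    for authority, period, digest in observations:
        seen[(authority, period)].add(digest)
    # Keep only (authority, period) pairs with conflicting digests.
    return {key: sorted(digests)
            for key, digests in seen.items() if len(digests) > 1}
```

A monitor of this kind is reactive: it can only flag conflicts among documents it has actually collected, which is why the abstract pairs it with a protocol-level fix.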

Title: Latent Embedding Adaptation for Human Preference Alignment in Diffusion Planners

Authors: Wen Zheng Terence Ng, Jianda Chen, Yuan Xu, Tianwei Zhang
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2503.18347
Pdf URL: https://arxiv.org/pdf/2503.18347
Copy Paste: [[2503.18347]] Latent Embedding Adaptation for Human Preference Alignment in Diffusion Planners(https://arxiv.org/abs/2503.18347)
Keywords: diffusion
Abstract: This work addresses the challenge of personalizing trajectories generated in automated decision-making systems by introducing a resource-efficient approach that enables rapid adaptation to individual users' preferences. Our method leverages a pretrained conditional diffusion model with Preference Latent Embeddings (PLE), trained on a large, reward-free offline dataset. The PLE serves as a compact representation for capturing specific user preferences. By adapting the pretrained model using our proposed preference inversion method, which directly optimizes the learnable PLE, we achieve superior alignment with human preferences compared to existing solutions like Reinforcement Learning from Human Feedback (RLHF) and Low-Rank Adaptation (LoRA). To better reflect practical applications, we create a benchmark experiment using real human preferences on diverse, high-reward trajectories.

Title: Diffusion-4K: Ultra-High-Resolution Image Synthesis with Latent Diffusion Models

Authors: Jinjin Zhang, Qiuyu Huang, Junjie Liu, Xiefan Guo, Di Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18352
Pdf URL: https://arxiv.org/pdf/2503.18352
Copy Paste: [[2503.18352]] Diffusion-4K: Ultra-High-Resolution Image Synthesis with Latent Diffusion Models(https://arxiv.org/abs/2503.18352)
Keywords: diffusion
Abstract: In this paper, we present Diffusion-4K, a novel framework for direct ultra-high-resolution image synthesis using text-to-image diffusion models. The core advancements include: (1) Aesthetic-4K Benchmark: addressing the absence of a publicly available 4K image synthesis dataset, we construct Aesthetic-4K, a comprehensive benchmark for ultra-high-resolution image generation. We curated a high-quality 4K dataset with carefully selected images and captions generated by GPT-4o. Additionally, we introduce GLCM Score and Compression Ratio metrics to evaluate fine details, combined with holistic measures such as FID, Aesthetics and CLIPScore for a comprehensive assessment of ultra-high-resolution images. (2) Wavelet-based Fine-tuning: we propose a wavelet-based fine-tuning approach for direct training with photorealistic 4K images, applicable to various latent diffusion models, demonstrating its effectiveness in synthesizing highly detailed 4K images. Consequently, Diffusion-4K achieves impressive performance in high-quality image synthesis and text prompt adherence, especially when powered by modern large-scale diffusion models (e.g., SD3-2B and Flux-12B). Extensive experimental results from our benchmark demonstrate the superiority of Diffusion-4K in ultra-high-resolution image synthesis.

Title: Cost-Sensitive Learning for Long-Tailed Temporal Action Segmentation

Authors: Zhanzhong Pang, Fadime Sener, Shrinivas Ramasubramanian, Angela Yao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18358
Pdf URL: https://arxiv.org/pdf/2503.18358
Copy Paste: [[2503.18358]] Cost-Sensitive Learning for Long-Tailed Temporal Action Segmentation(https://arxiv.org/abs/2503.18358)
Keywords: segmentation
Abstract: Temporal action segmentation in untrimmed procedural videos aims to densely label frames into action classes. These videos inherently exhibit long-tailed distributions, where actions vary widely in frequency and duration. In temporal action segmentation approaches, we identified a bi-level learning bias. This bias encompasses (1) a class-level bias, stemming from class imbalance favoring head classes, and (2) a transition-level bias arising from variations in transitions, prioritizing commonly observed transitions. As a remedy, we introduce a constrained optimization problem to alleviate both biases. We define learning states for action classes and their associated transitions and integrate them into the optimization process. We propose a novel cost-sensitive loss function formulated as a weighted cross-entropy loss, with weights adaptively adjusted based on the learning state of actions and their transitions. Experiments on three challenging temporal segmentation benchmarks and various frameworks demonstrate the effectiveness of our approach, resulting in significant improvements in both per-class frame-wise and segment-wise performance.
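
The building block named above, a weighted cross-entropy over per-frame class predictions, can be sketched in a few lines. The sketch uses fixed, hand-chosen class weights; the paper's adaptive weighting based on learning states of actions and transitions is not reproduced here, and all names are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one frame's class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_cross_entropy(frame_logits, labels, class_weights):
    """Weighted mean of per-frame negative log-likelihoods.

    frame_logits:  list of per-frame logit lists.
    labels:        list of ground-truth class indices, one per frame.
    class_weights: per-class cost; larger values penalize errors on that
                   class more (fixed here, adaptive in the paper).
    """
    loss_sum, weight_sum = 0.0, 0.0
    for logits, label in zip(frame_logits, labels):
        weight = class_weights[label]
        loss_sum += -weight * math.log(softmax(logits)[label])
        weight_sum += weight
    return loss_sum / weight_sum
```

Raising the weight of a rare (tail) class increases the loss contribution of frames where that class is misclassified, which counteracts the bias toward head classes described in the abstract.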

Title: Context-Enhanced Memory-Refined Transformer for Online Action Detection

Authors: Zhanzhong Pang, Fadime Sener, Angela Yao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18359
Pdf URL: https://arxiv.org/pdf/2503.18359
Copy Paste: [[2503.18359]] Context-Enhanced Memory-Refined Transformer for Online Action Detection(https://arxiv.org/abs/2503.18359)
Keywords: transformer
Abstract: Online Action Detection (OAD) detects actions in streaming videos using past observations. State-of-the-art OAD approaches model past observations and their interactions with an anticipated future. The past is encoded using short- and long-term memories to capture immediate and long-range dependencies, while anticipation compensates for missing future context. We identify a training-inference discrepancy in existing OAD methods that hinders learning effectiveness. The training uses varying lengths of short-term memory, while inference relies on a full-length short-term memory. As a remedy, we propose a Context-enhanced Memory-Refined Transformer (CMeRT). CMeRT introduces a context-enhanced encoder to improve frame representations using additional near-past context. It also features a memory-refined decoder to leverage near-future generation to enhance performance. CMeRT achieves state-of-the-art in online detection and anticipation on THUMOS'14, CrossTask, and EPIC-Kitchens-100.

Title: J&H: Evaluating the Robustness of Large Language Models Under Knowledge-Injection Attacks in Legal Domain

Authors: Yiran Hu, Huanghai Liu, Qingjing Chen, Ning Zheng, Chong Wang, Yun Liu, Charles L.A. Clarke, Weixing Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.18360
Pdf URL: https://arxiv.org/pdf/2503.18360
Copy Paste: [[2503.18360]] J&H: Evaluating the Robustness of Large Language Models Under Knowledge-Injection Attacks in Legal Domain(https://arxiv.org/abs/2503.18360)
Keywords: attack, robust, large language model
Abstract: As the scale and capabilities of Large Language Models (LLMs) increase, their applications in knowledge-intensive fields such as the legal domain have garnered widespread attention. However, it remains doubtful whether these LLMs make judgments based on domain knowledge for reasoning. If LLMs base their judgments solely on specific words or patterns, rather than on the underlying logic of the language, the "LLM-as-judges" paradigm poses substantial risks in real-world applications. To address this question, we propose a method of legal knowledge injection attacks for robustness testing, thereby inferring whether LLMs have learned legal knowledge and reasoning logic. In this paper, we propose J&H: an evaluation framework for detecting the robustness of LLMs under knowledge injection attacks in the legal domain. The aim of the framework is to explore whether LLMs perform deductive reasoning when accomplishing legal tasks. To further this aim, we have attacked each part of the reasoning logic underlying these tasks (major premise, minor premise, and conclusion generation). We have collected mistakes that legal experts might make in judicial decisions in the real world, such as typos, legal synonyms, and inaccurate retrieval of external legal statutes. In real legal practice, legal experts tend to overlook these mistakes and make judgments based on logic. LLMs, by contrast, are likely to be misled by such errors and may not utilize logic in their judgments. We conducted knowledge injection attacks on existing general and domain-specific LLMs. Current LLMs are not robust against the attacks employed in our experiments. In addition, we propose and compare several methods to enhance the knowledge robustness of LLMs.

Title: MaSS13K: A Matting-level Semantic Segmentation Benchmark

Authors: Chenxi Xie, Minghan Li, Hui Zeng, Jun Luo, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18364
Pdf URL: https://arxiv.org/pdf/2503.18364
Copy Paste: [[2503.18364]] MaSS13K: A Matting-level Semantic Segmentation Benchmark(https://arxiv.org/abs/2503.18364)
Keywords: segmentation
Abstract: High-resolution semantic segmentation is essential for applications such as image editing, bokeh imaging, and AR/VR. Unfortunately, existing datasets often have limited resolution and lack precise mask details and boundaries. In this work, we build a large-scale, matting-level semantic segmentation dataset, named MaSS13K, which consists of 13,348 real-world images, all at 4K resolution. MaSS13K provides high-quality mask annotations of a number of objects, which are categorized into seven categories: human, vegetation, ground, sky, water, building, and others. MaSS13K features precise masks, with an average mask complexity 20-50 times higher than existing semantic segmentation datasets. We consequently present a method specifically designed for high-resolution semantic segmentation, namely MaSSFormer, which employs an efficient pixel decoder that aggregates high-level semantic features and low-level texture features across three stages, aiming to produce high-resolution masks with minimal computational cost. Finally, we propose a new learning paradigm, which integrates the high-quality masks of the seven given categories with pseudo labels from new classes, enabling MaSSFormer to transfer its accurate segmentation capability to other classes of objects. Our proposed MaSSFormer is comprehensively evaluated on the MaSS13K benchmark together with 14 representative segmentation models. We expect that our meticulously annotated MaSS13K dataset and the MaSSFormer model can facilitate the research of high-resolution and high-quality semantic segmentation. Datasets and code can be found at this https URL.

Title: DiffusedWrinkles: A Diffusion-Based Model for Data-Driven Garment Animation

Authors: Raquel Vidaurre, Elena Garces, Dan Casas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18370
Pdf URL: https://arxiv.org/pdf/2503.18370
Copy Paste: [[2503.18370]] DiffusedWrinkles: A Diffusion-Based Model for Data-Driven Garment Animation(https://arxiv.org/abs/2503.18370)
Keywords: diffusion, generative
Abstract: We present a data-driven method for learning to generate animations of 3D garments using a 2D image diffusion model. In contrast to existing methods, typically based on fully connected networks, graph neural networks, or generative adversarial networks, which struggle to cope with parametric garments with fine wrinkle detail, our approach is able to synthesize high-quality 3D animations for a wide variety of garments and body shapes, while being agnostic to the garment mesh topology. Our key idea is to represent 3D garment deformations as a 2D layout-consistent texture that encodes 3D offsets with respect to a parametric garment template. Using this representation, we encode a large dataset of garments simulated in various motions and shapes and train a novel conditional diffusion model that is able to synthesize high-quality pose-shape-and-design dependent 3D garment deformations. Since our model is generative, we can synthesize various plausible deformations for a given target pose, shape, and design. Additionally, we show that we can further condition our model using an existing garment state, which enables the generation of temporally coherent sequences.

Title: Maximum Redundancy Pruning: A Principle-Driven Layerwise Sparsity Allocation for LLMs

Authors: Chang Gao, Kang Zhao, Jianfei Chen, Liping Jing
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18377
Pdf URL: https://arxiv.org/pdf/2503.18377
Copy Paste: [[2503.18377]] Maximum Redundancy Pruning: A Principle-Driven Layerwise Sparsity Allocation for LLMs(https://arxiv.org/abs/2503.18377)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated impressive capabilities, but their enormous size poses significant challenges for deployment in real-world applications. To address this issue, researchers have sought to apply network pruning techniques to LLMs. A critical challenge in pruning is allocating the sparsity for each layer. Recent sparsity allocation methods are often based on heuristics or search, which can easily lead to suboptimal performance. In this paper, we conduct an extensive investigation into various LLMs and report three significant findings: (1) the layerwise pruning sensitivity (LPS) of LLMs is highly non-uniform, (2) the choice of pruning metric affects LPS, and (3) the performance of a sparse model is related to the uniformity of its layerwise redundancy level. Based on these observations, we propose that the layerwise sparsity of LLMs should adhere to three principles: non-uniformity, pruning metric dependency, and uniform layerwise redundancy level in the pruned model. To this end, we propose Maximum Redundancy Pruning (MRP), an iterative pruning algorithm that prunes the most redundant layers (i.e., those with the highest non-outlier ratio) at each iteration. The resulting layerwise sparsity aligns with the outlined principles. We conduct extensive experiments on publicly available LLMs, including LLaMA2 and OPT, across various benchmarks. Experimental results validate the effectiveness of MRP, demonstrating its superiority over previous methods.
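
The iterative allocation described in this abstract can be pictured with a small greedy loop. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the outlier rule (a weight is an outlier when its magnitude exceeds k times the layer's mean magnitude, with pruned weights counted as zeros), the magnitude-based pruning metric, and the names non_outlier_ratio and allocate_sparsity are all hypothetical.

```python
def non_outlier_ratio(weights, mask, k=2.0):
    """Fraction of a layer's entries that are NOT outliers.

    Assumption (not from the paper): an entry is an outlier when its
    magnitude exceeds k times the layer's mean magnitude, and pruned
    entries are counted as zeros.
    """
    mags = [abs(w) if keep else 0.0 for w, keep in zip(weights, mask)]
    mean_mag = sum(mags) / len(mags)
    return sum(1 for a in mags if a <= k * mean_mag) / len(mags)


def allocate_sparsity(layers, target_sparsity, step=1, k=2.0):
    """Greedily prune the layer with the highest non-outlier ratio.

    Returns one boolean keep-mask per layer. Within the chosen layer the
    smallest-magnitude surviving weights are removed (a stand-in metric).
    """
    masks = [[True] * len(w) for w in layers]
    budget = int(target_sparsity * sum(len(w) for w in layers))
    pruned = 0
    while pruned < budget:
        # Only layers that still hold surviving weights can be pruned.
        candidates = [i for i, m in enumerate(masks) if any(m)]
        if not candidates:
            break
        idx = max(candidates,
                  key=lambda i: non_outlier_ratio(layers[i], masks[i], k))
        survivors = sorted((j for j, keep in enumerate(masks[idx]) if keep),
                           key=lambda j: abs(layers[idx][j]))
        n = min(step, budget - pruned, len(survivors))
        for j in survivors[:n]:
            masks[idx][j] = False
        pruned += n
    return masks


# Toy model: one layer with uniform weights, one layer with a large outlier.
toy_layers = [[1.0] * 8, [0.1] * 7 + [10.0]]
toy_masks = allocate_sparsity(toy_layers, target_sparsity=0.5)
print([m.count(False) / len(m) for m in toy_masks])
```

On this toy input the two layers end up with different sparsity levels, which is the non-uniformity the abstract argues for; the specific numbers depend entirely on the assumed outlier rule.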

Title: Exploring State Space Model in Wavelet Domain: An Infrared and Visible Image Fusion Network via Wavelet Transform and State Space Model

Authors: Tianpei Zhang, Yiming Zhu, Jufeng Zhao, Guangmang Cui, Yuchen Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18378
Pdf URL: https://arxiv.org/pdf/2503.18378
Copy Paste: [[2503.18378]] Exploring State Space Model in Wavelet Domain: An Infrared and Visible Image Fusion Network via Wavelet Transform and State Space Model(https://arxiv.org/abs/2503.18378)
Keywords: extraction
Abstract: Deep learning techniques have revolutionized infrared and visible image fusion (IVIF), showing remarkable efficacy on complex scenarios. However, current methods do not fully combine frequency domain features with global semantic information, which results in suboptimal extraction of global features across modalities and insufficient preservation of local texture details. To address these issues, we propose Wavelet-Mamba (W-Mamba), which integrates wavelet transform with the state-space model (SSM). Specifically, we introduce a Wavelet-SSM module, which incorporates wavelet-based frequency domain feature extraction and global information extraction through SSM, thereby effectively capturing both global and local features. Additionally, we propose a cross-modal feature attention modulation, which facilitates efficient interaction and fusion between different modalities. The experimental results indicate that our method achieves both visually compelling results and superior performance compared to current state-of-the-art methods. Our code is available at this https URL.

Title: PP-FormulaNet: Bridging Accuracy and Efficiency in Advanced Formula Recognition

Authors: Hongen Liu, Cheng Cui, Yuning Du, Yi Liu, Gang Pan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18382
Pdf URL: https://arxiv.org/pdf/2503.18382
Copy Paste: [[2503.18382]] PP-FormulaNet: Bridging Accuracy and Efficiency in Advanced Formula Recognition(https://arxiv.org/abs/2503.18382)
Keywords: robust
Abstract: Formula recognition is an important task in document intelligence. It involves converting mathematical expressions from document images into structured symbolic formats that computers can easily work with. LaTeX is the most common format used for this purpose. In this work, we present PP-FormulaNet, a state-of-the-art formula recognition model that excels in both accuracy and efficiency. To meet the diverse needs of applications, we have developed two specialized models: PP-FormulaNet-L, tailored for high-accuracy scenarios, and PP-FormulaNet-S, optimized for high-efficiency contexts. Our extensive evaluations reveal that PP-FormulaNet-L attains accuracy levels that surpass those of prominent models such as UniMERNet by a significant 6%. Conversely, PP-FormulaNet-S operates at speeds that are over 16 times faster. These advancements facilitate seamless integration of PP-FormulaNet into a broad spectrum of document processing environments that involve intricate mathematical formulas. Furthermore, we introduce a Formula Mining System, which is capable of extracting a vast amount of high-quality formula data. This system further enhances the robustness and applicability of our formula recognition model. Code and models are publicly available at PaddleOCR(this https URL) and PaddleX(this https URL).

Title: RoCA: Robust Contrastive One-class Time Series Anomaly Detection with Contaminated Data

Authors: Xudong Mou, Rui Wang, Bo Li, Tianyu Wo, Jie Sun, Hui Wang, Xudong Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18385
Pdf URL: https://arxiv.org/pdf/2503.18385
Copy Paste: [[2503.18385]] RoCA: Robust Contrastive One-class Time Series Anomaly Detection with Contaminated Data(https://arxiv.org/abs/2503.18385)
Keywords: robust
Abstract: The accumulation of time-series signals and the absence of labels make time-series Anomaly Detection (AD) a self-supervised task of deep learning. Methods based on normality assumptions face the following three limitations: (1) A single assumption could hardly characterize the whole normality or lead to some deviation. (2) Some assumptions may go against the principle of AD. (3) Their basic assumption is that the training data is uncontaminated (free of anomalies), which is unrealistic in practice, leading to a decline in robustness. This paper proposes a novel robust approach, RoCA, which is the first to address all of the above three challenges, as far as we are aware. It fuses the separated assumptions of one-class classification and contrastive learning in a single training process to characterize a more complete so-called normality. Additionally, it monitors the training data and computes a carefully designed anomaly score throughout the training process. This score helps identify latent anomalies, which are then used to define the classification boundary, inspired by the concept of outlier exposure. The performance on AIOps datasets improved by 6% compared to when contamination was not considered (COCA). On two large and high-dimensional multivariate datasets, the performance increased by 5% to 10%. RoCA achieves the highest average performance on both univariate and multivariate datasets. The source code is available at this https URL.

Title: Resource-Efficient Motion Control for Video Generation via Dynamic Mask Guidance

Authors: Sicong Feng, Jielong Yang, Li Peng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18386
Pdf URL: https://arxiv.org/pdf/2503.18386
Copy Paste: [[2503.18386]] Resource-Efficient Motion Control for Video Generation via Dynamic Mask Guidance(https://arxiv.org/abs/2503.18386)
Keywords: diffusion
Abstract: Recent advances in diffusion models bring new vitality to visual content creation. However, current text-to-video generation models still face significant challenges such as high training costs, substantial data requirements, and difficulties in maintaining consistency between given text and motion of the foreground object. To address these challenges, we propose mask-guided video generation, which can control video generation through mask motion sequences, while requiring limited training data. Our model enhances existing architectures by incorporating foreground masks for precise text-position matching and motion trajectory control. Through mask motion sequences, we guide the video generation process to maintain consistent foreground objects throughout the sequence. Additionally, through a first-frame sharing strategy and autoregressive extension approach, we achieve more stable and longer video generation. Extensive qualitative and quantitative experiments demonstrate that this approach excels in various video generation tasks, such as video editing and generating artistic videos, outperforming previous methods in terms of consistency and quality. Our generated results can be viewed in the supplementary materials.

Title: PDDM: Pseudo Depth Diffusion Model for RGB-PD Semantic Segmentation Based in Complex Indoor Scenes

Authors: Xinhua Xu, Hong Liu, Jianbing Wu, Jinfu Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18393
Pdf URL: https://arxiv.org/pdf/2503.18393
Copy Paste: [[2503.18393]] PDDM: Pseudo Depth Diffusion Model for RGB-PD Semantic Segmentation Based in Complex Indoor Scenes(https://arxiv.org/abs/2503.18393)
Keywords: diffusion, segmentation
Abstract: The integration of RGB and depth modalities significantly enhances the accuracy of segmenting complex indoor scenes, with depth data from RGB-D cameras playing a crucial role in this improvement. However, collecting an RGB-D dataset is more expensive than an RGB dataset due to the need for specialized depth sensors. Aligning depth and RGB images also poses challenges due to sensor positioning and issues like missing data and noise. In contrast, Pseudo Depth (PD) from high-precision depth estimation algorithms can eliminate the dependence on RGB-D sensors and alignment processes, as well as provide effective depth information and show significant potential in semantic segmentation. Therefore, to explore the practicality of utilizing pseudo depth instead of real depth for semantic segmentation, we design an RGB-PD segmentation pipeline to integrate RGB and pseudo depth and propose a Pseudo Depth Aggregation Module (PDAM) for fully exploiting the informative clues provided by the diverse pseudo depth maps. The PDAM aggregates multiple pseudo depth maps into a single modality, making it easily adaptable to other RGB-D segmentation methods. In addition, the pre-trained diffusion model serves as a strong feature extractor for RGB segmentation tasks, but multi-modal diffusion-based segmentation methods remain unexplored. Therefore, we present a Pseudo Depth Diffusion Model (PDDM) that adopts a large-scale text-image diffusion model as a feature extractor and a simple yet effective fusion strategy to integrate pseudo depth. To verify the applicability of pseudo depth and our PDDM, we perform extensive experiments on the NYUv2 and SUNRGB-D datasets. The experimental results demonstrate that pseudo depth can effectively enhance segmentation performance, and our PDDM achieves state-of-the-art performance, outperforming other methods by +6.98 mIoU on NYUv2 and +2.11 mIoU on SUNRGB-D.

Title: Solving Situation Puzzles with Large Language Model and External Reformulation

Authors: Kun Li, Xinwei Chen, Tianyou Song, Chengrui Zhou, Zhuoran Liu, Zhenyan Zhang, Jiangjian Guo, Qing Shan
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.18394
Pdf URL: https://arxiv.org/pdf/2503.18394
Copy Paste: [[2503.18394]] Solving Situation Puzzles with Large Language Model and External Reformulation(https://arxiv.org/abs/2503.18394)
Keywords: large language model
Abstract: In recent years, large language models (LLMs) have shown an impressive ability to perform arithmetic and symbolic reasoning tasks. However, we found that LLMs (e.g., ChatGPT) cannot perform well on reasoning that requires multiple rounds of dialogue, especially when solving situation puzzles. Specifically, LLMs tend to ask very detailed questions focusing on a specific aspect, or to repeat the same or similar questions, after several rounds of Q&A. To help LLMs get out of this dilemma, we propose a novel external reformulation methodology, where the situation puzzle is reformulated after several rounds of Q&A or when the LLM raises an incorrect guess. Experiments show that our method outperforms directly using LLMs for solving situation puzzles (e.g., in win rate and number of question/guess attempts), highlighting the potential of strategic problem reformulation to enhance the reasoning capabilities of LLMs in complex interactive scenarios.

Title: Knowledge Graph Enhanced Generative Multi-modal Models for Class-Incremental Learning

Authors: Xusheng Cao, Haori Lu, Linlan Huang, Fei Yang, Xialei Liu, Ming-Ming Cheng
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.18403
Pdf URL: https://arxiv.org/pdf/2503.18403
Copy Paste: [[2503.18403]] Knowledge Graph Enhanced Generative Multi-modal Models for Class-Incremental Learning(https://arxiv.org/abs/2503.18403)
Keywords: generative
Abstract: Continual learning in computer vision faces the critical challenge of catastrophic forgetting, where models struggle to retain prior knowledge while adapting to new tasks. Although recent studies have attempted to leverage the generalization capabilities of pre-trained models to mitigate overfitting on current tasks, models still tend to forget details of previously learned categories as tasks progress, leading to misclassification. To address these limitations, we introduce a novel Knowledge Graph Enhanced Generative Multi-modal model (KG-GMM) that builds an evolving knowledge graph throughout the learning process. Our approach utilizes relationships within the knowledge graph to augment the class labels and assigns different relations to similar categories to enhance model differentiation. During testing, we propose a Knowledge Graph Augmented Inference method that locates specific categories by analyzing relationships within the generated text, thereby reducing the loss of detailed information about old classes when learning new knowledge and alleviating forgetting. Experiments demonstrate that our method effectively leverages relational information to help the model correct mispredictions, achieving state-of-the-art results in both conventional CIL and few-shot CIL settings, confirming the efficacy of knowledge graphs at preserving knowledge in the continual learning scenarios.

Title: Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning

Authors: Sherry X. Chen, Misha Sra, Pradeep Sen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18406
Pdf URL: https://arxiv.org/pdf/2503.18406
Copy Paste: [[2503.18406]] Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning(https://arxiv.org/abs/2503.18406)
Keywords: diffusion, generative
Abstract: Although natural language instructions offer an intuitive way to guide automated image editing, deep-learning models often struggle to achieve high-quality results, largely due to challenges in creating large, high-quality training datasets. Previous work has typically relied on text-to-image (T2I) generative models to produce pairs of original and edited images that simulate the input/output of an instruction-guided image-editing model. However, these image pairs often fail to align with the specified edit instructions due to the limitations of T2I models, which negatively impacts models trained on such datasets. To address this, we present Instruct-CLIP, a self-supervised method that learns the semantic changes between original and edited images to refine and better align the instructions in existing datasets. Furthermore, we adapt Instruct-CLIP to handle noisy latent images and diffusion timesteps so that it can be used to train latent diffusion models (LDMs) [19] and efficiently enforce alignment between the edit instruction and the image changes in latent space at any step of the diffusion pipeline. We use Instruct-CLIP to correct the InstructPix2Pix dataset and obtain over 120K refined samples, which we then use to fine-tune their model, guided by our novel Instruct-CLIP-based loss function. The resulting model can produce edits that are more aligned with the given instructions. Our code and dataset are available at this https URL.

Title: VTD-CLIP: Video-to-Text Discretization via Prompting CLIP

Authors: Wencheng Zhu, Yuexin Wang, Hongxuan Li, Pengfei Zhu, Danqing Song, Qinghua Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18407
Pdf URL: https://arxiv.org/pdf/2503.18407
Copy Paste: [[2503.18407]] VTD-CLIP: Video-to-Text Discretization via Prompting CLIP(https://arxiv.org/abs/2503.18407)
Keywords: robust, interpretability
Abstract: Vision-language models bridge visual and linguistic understanding and have proven to be powerful for video recognition tasks. Existing approaches primarily rely on parameter-efficient fine-tuning of image-text pre-trained models, yet they often suffer from limited interpretability and poor generalization due to inadequate temporal modeling. To address these issues, we propose a simple yet effective video-to-text discretization framework. Our method repurposes the frozen text encoder to construct a visual codebook from video class labels due to the many-to-one contrastive alignment between visual and textual embeddings in multimodal pretraining. This codebook effectively transforms temporal visual data into textual tokens via feature lookups and offers interpretable video representations through explicit video modeling. Then, to enhance robustness against irrelevant or noisy frames, we introduce a confidence-aware fusion module that dynamically weights keyframes by assessing their semantic relevance via the codebook. Furthermore, our method incorporates learnable text prompts to conduct adaptive codebook updates. Extensive experiments on HMDB-51, UCF-101, SSv2, and Kinetics-400 have validated the superiority of our approach, which achieves competitive improvements over state-of-the-art methods. The code will be publicly available at this https URL.

Title: U-REPA: Aligning Diffusion U-Nets to ViTs

Authors: Yuchuan Tian, Hanting Chen, Mengyu Zheng, Yuchen Liang, Chao Xu, Yunhe Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18414
Pdf URL: https://arxiv.org/pdf/2503.18414
Copy Paste: [[2503.18414]] U-REPA: Aligning Diffusion U-Nets to ViTs(https://arxiv.org/abs/2503.18414)
Keywords: diffusion, transformer
Abstract: Representation Alignment (REPA), which aligns Diffusion Transformer (DiT) hidden states with ViT visual encoders, has proven highly effective in DiT training, demonstrating superior convergence properties, but it has not been validated on the canonical diffusion U-Net architecture, which shows faster convergence compared to DiTs. However, adapting REPA to U-Net architectures presents unique challenges: (1) different block functionalities necessitate revised alignment strategies; (2) spatial-dimension inconsistencies emerge from U-Net's spatial downsampling operations; (3) space gaps between U-Net and ViT hinder the effectiveness of tokenwise alignment. To counter these challenges, we propose U-REPA, a representation alignment paradigm that bridges U-Net hidden states and ViT features as follows: First, we observe that, due to skip connections, the middle stage of U-Net is the best alignment option. Second, we propose upsampling U-Net features after passing them through MLPs. Third, we observe difficulty when performing tokenwise similarity alignment, and therefore introduce a manifold loss that regularizes the relative similarity between samples. Experiments indicate that the resulting U-REPA achieves excellent generation quality and greatly accelerates convergence. With CFG guidance interval, U-REPA could reach $FID<1.5$ in 200 epochs or 1M iterations on ImageNet 256 $\times$ 256, and needs only half the total epochs to perform better than REPA. Codes are available at this https URL.
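
One way to read "a manifold loss that regularizes the relative similarity between samples" is as a penalty on the mismatch between the two models' pairwise similarity structures, rather than on the features themselves. The sketch below is an assumption-laden illustration, not the paper's loss: the use of cosine similarity, the mean-squared penalty, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(features):
    """Pairwise cosine similarities among the samples of one model."""
    return [[cosine(a, b) for b in features] for a in features]


def manifold_loss(unet_features, vit_features):
    """Mean squared gap between the two pairwise-similarity matrices.

    Hypothetical formulation: only relations between samples are
    compared, so the two feature spaces may differ in scale,
    orientation, or even dimensionality.
    """
    s_unet = similarity_matrix(unet_features)
    s_vit = similarity_matrix(vit_features)
    n = len(s_unet)
    return sum((s_unet[i][j] - s_vit[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)


# ViT features for two samples, and U-Net features that are a rotated,
# rescaled, higher-dimensional copy of the same geometry.
vit = [[1.0, 0.0], [0.0, 1.0]]
unet_aligned = [[0.0, 2.0, 0.0], [-2.0, 0.0, 0.0]]
unet_collapsed = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
print(manifold_loss(unet_aligned, vit), manifold_loss(unet_collapsed, vit))
```

Because the comparison happens between similarity matrices, a loss of this kind does not need the tokenwise correspondence that the abstract reports as difficult.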

Title: Panorama Generation From NFoV Image Done Right

Authors: Dian Zheng, Cheng Zhang, Xiao-Ming Wu, Cao Li, Chengfei Lv, Jian-Fang Hu, Wei-Shi Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18420
Pdf URL: https://arxiv.org/pdf/2503.18420
Copy Paste: [[2503.18420]] Panorama Generation From NFoV Image Done Right(https://arxiv.org/abs/2503.18420)
Keywords: diffusion
Abstract: Generating 360-degree panoramas from a narrow field of view (NFoV) image is a promising computer vision task for Virtual Reality (VR) applications. Existing methods mostly assess the generated panoramas with InceptionNet or CLIP based metrics, which tend to perceive the image quality and are not suitable for evaluating the distortion. In this work, we first propose a distortion-specific CLIP, named Distort-CLIP, to accurately evaluate the panorama distortion and discover the "visual cheating" phenomenon in previous works (i.e., tending to improve the visual results by sacrificing distortion accuracy). This phenomenon arises because prior methods employ a single network to learn the distinct panorama distortion and content completion at once, which leads the model to prioritize optimizing the latter. To address the phenomenon, we propose PanoDecouple, a decoupled diffusion model framework, which decouples the panorama generation into distortion guidance and content completion, aiming to generate panoramas with both accurate distortion and visual appeal. Specifically, we design a DistortNet for distortion guidance by imposing panorama-specific distortion prior and a modified condition registration mechanism; and a ContentNet for content completion by imposing perspective image information. Additionally, a distortion correction loss function with Distort-CLIP is introduced to constrain the distortion explicitly. Extensive experiments validate that PanoDecouple surpasses existing methods in both distortion and visual metrics.

Title: Breaking the Encoder Barrier for Seamless Video-Language Understanding

Authors: Handong Li, Yiyuan Zhang, Longteng Guo, Xiangyu Yue, Jing Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18422
Pdf URL: https://arxiv.org/pdf/2503.18422
Copy Paste: [[2503.18422]] Breaking the Encoder Barrier for Seamless Video-Language Understanding(https://arxiv.org/abs/2503.18422)
Keywords: large language model
Abstract: Most Video-Large Language Models (Video-LLMs) adopt an encoder-decoder framework, where a vision encoder extracts frame-wise features for processing by a language model. However, this approach incurs high computational costs, introduces resolution biases, and struggles to capture fine-grained multimodal interactions. To overcome these limitations, we propose ELVA, an encoder-free Video-LLM that directly models nuanced video-language interactions without relying on a vision encoder. ELVA employs token merging to construct a bottom-up hierarchical representation and incorporates a video guidance supervisor for direct spatiotemporal representation learning. Additionally, a hybrid-resolution mechanism strategically integrates high- and low-resolution frames as inputs to achieve an optimal balance between performance and efficiency. With only 7M publicly available video-text pairs, ELVA achieves performance on par with encoder-based Video-LLMs while reducing FLOPs by up to 95% and inference latency by 92%, offering a scalable and efficient solution for real-time video understanding.

Title: Teller: Real-Time Streaming Audio-Driven Portrait Animation with Autoregressive Motion Generation

Authors: Dingcheng Zhen, Shunshun Yin, Shiyang Qin, Hou Yi, Ziwei Zhang, Siyuan Liu, Gan Qi, Ming Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18429
Pdf URL: https://arxiv.org/pdf/2503.18429
Copy Paste: [[2503.18429]] Teller: Real-Time Streaming Audio-Driven Portrait Animation with Autoregressive Motion Generation(https://arxiv.org/abs/2503.18429)
Keywords: diffusion
Abstract: In this work, we introduce the first autoregressive framework for real-time, audio-driven portrait animation, a.k.a. talking head. Beyond the challenge of lengthy animation times, a critical challenge in realistic talking head generation lies in preserving the natural movement of diverse body parts. To this end, we propose Teller, the first streaming audio-driven portrait animation framework with autoregressive motion generation. Specifically, Teller first decomposes facial and body detail animation into two components: Facial Motion Latent Generation (FMLG) based on an autoregressive transformer, and movement authenticity refinement using an Efficient Temporal Module (ETM). Concretely, FMLG employs a Residual VQ model to map the facial motion latent from the implicit keypoint-based model into discrete motion tokens, which are then temporally sliced with audio embeddings. This enables the AR transformer to learn real-time, stream-based mappings from audio to motion. Furthermore, Teller incorporates ETM to capture finer motion details. This module ensures the physical consistency of body parts and accessories, such as neck muscles and earrings, improving the realism of these movements. Teller is designed to be efficient, surpassing the inference speed of diffusion-based models (Hallo 20.93s vs. Teller 0.92s for one second video generation), and achieves a real-time streaming performance of up to 25 FPS. Extensive experiments demonstrate that our method outperforms recent audio-driven portrait animation models, especially in small movements, as validated by human evaluations with a significant margin in quality and realism.

Title: Teaching LLMs for Step-Level Automatic Math Correction via Reinforcement Learning

Authors: Junsong Li, Jie Zhou, Yutao Yang, Bihao Zhan, Qianjun Pan, Yuyang Ding, Qin Chen, Jiang Bo, Xin Lin, Liang He
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.18432
Pdf URL: https://arxiv.org/pdf/2503.18432
Copy Paste: [[2503.18432]] Teaching LLMs for Step-Level Automatic Math Correction via Reinforcement Learning(https://arxiv.org/abs/2503.18432)
Keywords: large language model
Abstract: Automatic math correction aims to check students' solutions to mathematical problems via artificial intelligence technologies. Most existing studies focus on judging the final answer at the problem level, while they ignore detailed feedback on each step in a math problem-solving process, which requires abilities of semantic understanding and reasoning. In this paper, we propose a reinforcement learning (RL)-based method, named StepAMC, to boost large language models (LLMs) for step-level automatic math correction. In particular, we convert the step-level automatic math correction within the text classification task into an RL problem to enhance the reasoning capabilities of LLMs. Then, we design a space-constrained policy network to improve the stability of RL. Finally, we introduce a fine-grained reward network to convert the binary human feedback into a continuous value. We conduct extensive experiments over two benchmark datasets and the results show that our model outperforms eleven strong baselines.

Title: A Simple yet Effective Layout Token in Large Language Models for Document Understanding

Authors: Zhaoqing Zhu, Chuwei Luo, Zirui Shao, Feiyu Gao, Hangdi Xing, Qi Zheng, Ji Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18434
Pdf URL: https://arxiv.org/pdf/2503.18434
Copy Paste: [[2503.18434]] A Simple yet Effective Layout Token in Large Language Models for Document Understanding(https://arxiv.org/abs/2503.18434)
Keywords: large language model
Abstract: Recent methods that integrate spatial layouts with text for document understanding in large language models (LLMs) have shown promising results. A commonly used method is to represent layout information as text tokens and interleave them with text content as inputs to the LLMs. However, such a method still demonstrates limitations, as it requires additional position IDs for tokens that are used to represent layout information. Due to the constraint on max position IDs, assigning them to layout information reduces those available for text content, reducing the capacity for the model to learn from the text during training, while also introducing a large number of potentially untrained position IDs during long-context inference, which can hinder performance on document understanding tasks. To address these issues, we propose LayTokenLLM, a simple yet effective method for document understanding. LayTokenLLM represents layout information as a single token per text segment and uses a specialized positional encoding scheme. It shares position IDs between text and layout tokens, eliminating the need for additional position IDs. This design maintains the model's capacity to learn from text while mitigating long-context issues during inference. Furthermore, a novel pre-training objective called Next Interleaved Text and Layout Token Prediction (NTLP) is devised to enhance cross-modality learning between text and layout tokens. Extensive experiments show that LayTokenLLM outperforms existing layout-integrated LLMs and MLLMs of similar scales on multi-page document understanding tasks, as well as most single-page tasks.
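
The position-ID budget argument in this abstract can be made concrete with a toy count. The sketch below is a hedged illustration rather than LayTokenLLM's actual encoding: the choice that a layout token reuses the position ID of its segment's first text token, the figure of four text tokens per layout box in the baseline, and both function names are assumptions.

```python
def interleaved_position_ids(segment_lengths, layout_tokens_per_segment=4):
    """Baseline: layout is written out as ordinary text tokens.

    Every layout token and every text token consumes a fresh position ID.
    The default of 4 layout tokens per segment is an assumed figure.
    """
    ids, next_id = [], 0
    for n_text in segment_lengths:
        for _ in range(layout_tokens_per_segment + n_text):
            ids.append(next_id)
            next_id += 1
    return ids


def shared_position_ids(segment_lengths):
    """One layout token per segment that shares a position ID with text.

    Assumption: the layout token reuses the ID of the first text token of
    its segment, so layout never advances the position counter. Each
    segment is assumed to contain at least one text token.
    """
    ids, next_id = [], 0
    for n_text in segment_lengths:
        ids.append(next_id)          # layout token, shared ID
        for _ in range(n_text):
            ids.append(next_id)      # text token
            next_id += 1
    return ids


segments = [3, 2, 4]                 # text tokens in each segment
baseline = interleaved_position_ids(segments)
shared = shared_position_ids(segments)
print(max(baseline) + 1, max(shared) + 1)
```

Under the sharing scheme the number of distinct position IDs equals the number of text tokens, which is the property the abstract credits with preserving capacity for text and avoiding untrained position IDs at long context.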

Title: On the Perception Bottleneck of VLMs for Chart Understanding

Authors: Junteng Liu, Weihao Zeng, Xiwen Zhang, Yijun Wang, Zifei Shan, Junxian He
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2503.18435
Pdf URL: https://arxiv.org/pdf/2503.18435
Copy Paste: [[2503.18435]] On the Perception Bottleneck of VLMs for Chart Understanding(https://arxiv.org/abs/2503.18435)
Keywords: extraction
Abstract: Chart understanding requires models to effectively analyze and reason about numerical data, textual elements, and complex visual components. Our observations reveal that the perception capabilities of existing large vision-language models (LVLMs) constitute a critical bottleneck in this process. In this study, we delve into this perception bottleneck by decomposing it into two components: the vision encoder bottleneck, where the visual representation may fail to encapsulate the correct information, and the extraction bottleneck, where the language model struggles to extract the necessary information from the provided visual representations. Through comprehensive experiments, we find that (1) the information embedded within visual representations is substantially richer than what is typically captured by linear extractors, such as the widely used retrieval accuracy metric; (2) While instruction tuning effectively enhances the extraction capability of LVLMs, the vision encoder remains a critical bottleneck, demanding focused attention and improvement. Therefore, we further enhance the visual encoder to mitigate the vision encoder bottleneck under a contrastive learning framework. Empirical results demonstrate that our approach significantly mitigates the perception bottleneck and improves the ability of LVLMs to comprehend charts. Code is publicly available at this https URL.

Title: Distributionally Robust Federated Learning: An ADMM Algorithm

Authors: Wen Bai, Yi Wong, Xiao Qiao, Chin Pang Ho
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.18436
Pdf URL: https://arxiv.org/pdf/2503.18436
Copy Paste: [[2503.18436]] Distributionally Robust Federated Learning: An ADMM Algorithm(https://arxiv.org/abs/2503.18436)
Keywords: robust, federate
Abstract: Federated learning (FL) aims to train machine learning (ML) models collaboratively using decentralized data, bypassing the need for centralized data aggregation. Standard FL models often assume that all data come from the same unknown distribution. However, in practical situations, decentralized data frequently exhibit heterogeneity. We propose a novel FL model, Distributionally Robust Federated Learning (DRFL), that applies distributionally robust optimization to overcome the challenges posed by data heterogeneity and distributional ambiguity. We derive a tractable reformulation for DRFL and develop a novel solution method based on the alternating direction method of multipliers (ADMM) algorithm to solve this problem. Our experimental results demonstrate that DRFL outperforms standard FL models under data heterogeneity and ambiguity.

Title: ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation

Authors: Guosheng Zhao, Xiaofeng Wang, Chaojun Ni, Zheng Zhu, Wenkang Qin, Guan Huang, Xingang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18438
Pdf URL: https://arxiv.org/pdf/2503.18438
Copy Paste: [[2503.18438]] ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation(https://arxiv.org/abs/2503.18438)
Keywords: generative
Abstract: Combining reconstruction models with generative models has emerged as a promising paradigm for closed-loop simulation in autonomous driving. For example, ReconDreamer has demonstrated remarkable success in rendering large-scale maneuvers. However, a significant gap remains between the generated data and real-world sensor observations, particularly in terms of fidelity for structured elements, such as the ground surface. To address these challenges, we propose ReconDreamer++, an enhanced framework that significantly improves the overall rendering quality by mitigating the domain gap and refining the representation of the ground surface. Specifically, ReconDreamer++ introduces the Novel Trajectory Deformable Network (NTDNet), which leverages learnable spatial deformation mechanisms to bridge the domain gap between synthesized novel views and original sensor observations. Moreover, for structured elements such as the ground surface, we preserve geometric prior knowledge in 3D Gaussians, and the optimization process focuses on refining appearance attributes while preserving the underlying geometric structure. Experimental evaluations conducted on multiple datasets (Waymo, nuScenes, PandaSet, and EUVS) confirm the superior performance of ReconDreamer++. Specifically, on Waymo, ReconDreamer++ achieves performance comparable to Street Gaussians for the original trajectory while significantly outperforming ReconDreamer on novel trajectories. In particular, it achieves substantial improvements, including a 6.1% increase in NTA-IoU, a 23.0% improvement in FID, and a remarkable 4.5% gain in the ground surface metric NTL-IoU, highlighting its effectiveness in accurately reconstructing structured elements such as the road surface.

Title: Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness

Authors: Chenfei Liao, Kaiyu Lei, Xu Zheng, Junha Moon, Zhixiong Wang, Yixuan Wang, Danda Pani Paudel, Luc Van Gool, Xuming Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18445
Pdf URL: https://arxiv.org/pdf/2503.18445
Copy Paste: [[2503.18445]] Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness(https://arxiv.org/abs/2503.18445)
Keywords: robust, segmentation
Abstract: Multi-modal semantic segmentation (MMSS) addresses the limitations of single-modality data by integrating complementary information across modalities. Despite notable progress, a significant gap persists between research and real-world deployment due to variability and uncertainty in multi-modal data quality. Robustness has thus become essential for practical MMSS applications. However, the absence of standardized benchmarks for evaluating robustness hinders further advancement. To address this, we first survey existing MMSS literature and categorize representative methods to provide a structured overview. We then introduce a robustness benchmark that evaluates MMSS models under three scenarios: Entire-Missing Modality (EMM), Random-Missing Modality (RMM), and Noisy Modality (NM). From a probabilistic standpoint, we model modality failure under two conditions: (1) all damaged combinations are equally probable; (2) each modality fails independently following a Bernoulli distribution. Based on these, we propose four metrics, $mIoU^{Avg}_{EMM}$, $mIoU^{E}_{EMM}$, $mIoU^{Avg}_{RMM}$, and $mIoU^{E}_{RMM}$, to assess model robustness under EMM and RMM. This work provides the first dedicated benchmark for MMSS robustness, offering new insights and tools to advance the field. Source code is available at this https URL.
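
The two failure models named in the abstract can be sketched as two ways of aggregating per-combination mIoU scores. This is only an illustration: the function names, the score table, and the choice to drop the all-missing case and renormalize are assumptions, not the benchmark's definitions.

```python
from itertools import combinations


def damaged_subsets(modalities):
    """Yield every set of surviving modalities with at least one modality missing."""
    mods = sorted(modalities)
    for k in range(1, len(mods)):
        for kept in combinations(mods, k):
            yield frozenset(kept)


def average_miou(scores, modalities):
    """Condition (1): all damaged combinations are equally probable."""
    subsets = list(damaged_subsets(modalities))
    return sum(scores[s] for s in subsets) / len(subsets)


def expected_miou(scores, modalities, p_fail):
    """Condition (2): each modality fails independently with probability p_fail.

    Assumption: the all-missing case is excluded and the weights renormalized.
    """
    mods = sorted(modalities)
    total, norm = 0.0, 0.0
    for k in range(1, len(mods) + 1):
        for kept in combinations(mods, k):
            weight = (1 - p_fail) ** k * p_fail ** (len(mods) - k)
            total += weight * scores[frozenset(kept)]
            norm += weight
    return total / norm


# Hypothetical mIoU scores for a two-modality model.
scores = {
    frozenset({"rgb"}): 40.0,
    frozenset({"depth"}): 20.0,
    frozenset({"rgb", "depth"}): 60.0,
}
avg = average_miou(scores, {"rgb", "depth"})
exp = expected_miou(scores, {"rgb", "depth"}, p_fail=0.5)
print(avg, exp)
```

The uniform average only looks at damaged inputs, whereas the Bernoulli expectation also counts the case where nothing fails, so it is higher for a model that does well on complete input.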

Title: Latent Space Super-Resolution for Higher-Resolution Image Generation with Diffusion Models

Authors: Jinho Jeong, Sangmin Han, Jinwoo Kim, Seon Joo Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18446
Pdf URL: https://arxiv.org/pdf/2503.18446
Copy Paste: [[2503.18446]] Latent Space Super-Resolution for Higher-Resolution Image Generation with Diffusion Models(https://arxiv.org/abs/2503.18446)
Keywords: diffusion
Abstract: In this paper, we propose LSRNA, a novel framework for higher-resolution (exceeding 1K) image generation using diffusion models by leveraging super-resolution directly in the latent space. Existing diffusion models struggle with scaling beyond their training resolutions, often leading to structural distortions or content repetition. Reference-based methods address the issues by upsampling a low-resolution reference to guide higher-resolution generation. However, they face significant challenges: upsampling in latent space often causes manifold deviation, which degrades output quality. On the other hand, upsampling in RGB space tends to produce overly smoothed outputs. To overcome these limitations, LSRNA combines Latent space Super-Resolution (LSR) for manifold alignment and Region-wise Noise Addition (RNA) to enhance high-frequency details. Our extensive experiments demonstrate that integrating LSRNA outperforms state-of-the-art reference-based methods across various resolutions and metrics, while showing the critical role of latent space upsampling in preserving detail and sharpness. The code is available at this https URL.

Title: InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment

Authors: Yunhong Lu, Qichao Wang, Hengyuan Cao, Xierui Wang, Xiaoyin Xu, Min Zhang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.18454
Pdf URL: https://arxiv.org/pdf/2503.18454
Copy Paste: [[2503.18454]] InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment(https://arxiv.org/abs/2503.18454)
Keywords: diffusion, generative, large language model
Abstract: Without using an explicit reward, direct preference optimization (DPO) employs paired human preference data to fine-tune generative models, a method that has garnered considerable attention in large language models (LLMs). However, exploration of aligning text-to-image (T2I) diffusion models with human preferences remains limited. In comparison to supervised fine-tuning, existing methods that align diffusion models suffer from low training efficiency and subpar generation quality due to the long Markov chain process and the intractability of the reverse process. To address these limitations, we introduce DDIM-InPO, an efficient method for direct preference alignment of diffusion models. Our approach conceptualizes the diffusion model as a single-step generative model, allowing us to fine-tune the outputs of specific latent variables selectively. In order to accomplish this objective, we first assign implicit rewards to any latent variable directly via a reparameterization technique. Then we construct an inversion technique to estimate appropriate latent variables for preference optimization. This modification process enables the diffusion model to only fine-tune the outputs of latent variables that have a strong correlation with the preference dataset. Experimental results indicate that our DDIM-InPO achieves state-of-the-art performance with just 400 steps of fine-tuning, surpassing all preference aligning baselines for T2I diffusion models in human preference evaluation tasks.
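
For orientation, the standard DPO objective that the abstract builds on can be written for a single preference pair as below. This is the generic LLM-style DPO loss, not the DDIM-InPO objective; the function name, the argument names, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair of log-probabilities.

    loss = -log(sigmoid(beta * margin)), where margin is the difference between
    the policy-vs-reference log-ratio of the preferred and the rejected sample.
    """
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    z = beta * margin
    # Numerically stable form of log(1 + exp(-z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# A policy identical to the reference gives log(2); preferring the winner lowers it.
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
print(neutral, improved)
```

No reward model appears anywhere: the log-ratio against the frozen reference plays the role of an implicit reward.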

Title: Hiding Images in Diffusion Models by Editing Learned Score Functions

Authors: Haoyu Chen, Yunqiao Yang, Nan Zhong, Kede Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18459
Pdf URL: https://arxiv.org/pdf/2503.18459
Copy Paste: [[2503.18459]] Hiding Images in Diffusion Models by Editing Learned Score Functions(https://arxiv.org/abs/2503.18459)
Keywords: extraction, diffusion, generative
Abstract: Hiding data using neural networks (i.e., neural steganography) has achieved remarkable success across both discriminative classifiers and generative adversarial networks. However, the potential of data hiding in diffusion models remains relatively unexplored. Current methods exhibit limitations in achieving high extraction accuracy, model fidelity, and hiding efficiency due primarily to the entanglement of the hiding and extraction processes with multiple denoising diffusion steps. To address these, we describe a simple yet effective approach that embeds images at specific timesteps in the reverse diffusion process by editing the learned score functions. Additionally, we introduce a parameter-efficient fine-tuning method that combines gradient-based parameter selection with low-rank adaptation to enhance model fidelity and hiding efficiency. Comprehensive experiments demonstrate that our method extracts high-quality images at human-indistinguishable levels, replicates the original model behaviors at both sample and population levels, and embeds images orders of magnitude faster than prior methods. Besides, our method naturally supports multi-recipient scenarios through independent extraction channels.

Title: MuMA: 3D PBR Texturing via Multi-Channel Multi-View Generation and Agentic Post-Processing

Authors: Lingting Zhu, Jingrui Ye, Runze Zhang, Zeyu Hu, Yingda Yin, Lanjiong Li, Jinnan Chen, Shengju Qian, Xin Wang, Qingmin Liao, Lequan Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18461
Pdf URL: https://arxiv.org/pdf/2503.18461
Copy Paste: [[2503.18461]] MuMA: 3D PBR Texturing via Multi-Channel Multi-View Generation and Agentic Post-Processing(https://arxiv.org/abs/2503.18461)
Keywords: large language model
Abstract: Current methods for 3D generation still fall short in physically based rendering (PBR) texturing, primarily due to limited data and challenges in modeling multi-channel materials. In this work, we propose MuMA, a method for 3D PBR texturing through Multi-channel Multi-view generation and Agentic post-processing. Our approach features two key innovations: 1) We opt to model shaded and albedo appearance channels, where the shaded channels enable the integration of intrinsic decomposition modules for material properties. 2) Leveraging multimodal large language models, we emulate artists' techniques for material assessment and selection. Experiments demonstrate that MuMA achieves superior results in visual quality and material fidelity compared to existing methods.

Title: PALATE: Peculiar Application of the Law of Total Expectation to Enhance the Evaluation of Deep Generative Models

Authors: Tadeusz Dziarmaga, Marcin Kądziołka, Artur Kasymov, Marcin Mazur
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.18462
Pdf URL: https://arxiv.org/pdf/2503.18462
Copy Paste: [[2503.18462]] PALATE: Peculiar Application of the Law of Total Expectation to Enhance the Evaluation of Deep Generative Models(https://arxiv.org/abs/2503.18462)
Keywords: generative
Abstract: Deep generative models (DGMs) have caused a paradigm shift in the field of machine learning, yielding noteworthy advancements in domains such as image synthesis, natural language processing, and other related areas. However, a comprehensive evaluation of these models that accounts for the trichotomy between fidelity, diversity, and novelty in generated samples remains a formidable challenge. A recently introduced solution that has emerged as a promising approach in this regard is the Feature Likelihood Divergence (FLD), a method that offers a theoretically motivated practical tool, yet also exhibits some computational challenges. In this paper, we propose PALATE, a novel enhancement to the evaluation of DGMs that addresses limitations of existing metrics. Our approach is based on a peculiar application of the law of total expectation to random variables representing accessible real data. When combined with the MMD baseline metric and DINOv2 feature extractor, PALATE offers a holistic evaluation framework that matches or surpasses state-of-the-art solutions while providing superior computational efficiency and scalability to large-scale datasets. Through a series of experiments, we demonstrate the effectiveness of the PALATE enhancement, contributing a computationally efficient, holistic evaluation approach that advances the field of DGMs assessment, especially in detecting sample memorization and evaluating generalization capabilities.

Title: CFReID: Continual Few-shot Person Re-Identification

Authors: Hao Ni, Lianli Gao, Pengpeng Zeng, Heng Tao Shen, Jingkuan Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18469
Pdf URL: https://arxiv.org/pdf/2503.18469
Copy Paste: [[2503.18469]] CFReID: Continual Few-shot Person Re-Identification(https://arxiv.org/abs/2503.18469)
Keywords: privacy
Abstract: Real-world surveillance systems are dynamically evolving, requiring a person Re-identification model to continuously handle newly incoming data from various domains. To cope with these dynamics, Lifelong ReID (LReID) has been proposed to learn and accumulate knowledge across multiple domains incrementally. However, LReID models need to be trained on large-scale labeled data for each unseen domain, which are typically inaccessible due to privacy and cost concerns. In this paper, we propose a new paradigm called Continual Few-shot ReID (CFReID), which requires models to be incrementally trained using few-shot data and tested on all seen domains. Under few-shot conditions, CFReID faces two core challenges: 1) learning knowledge from few-shot data of unseen domains, and 2) avoiding catastrophic forgetting of seen domains. To tackle these two challenges, we propose a Stable Distribution Alignment (SDA) framework from a feature distribution perspective. Specifically, our SDA is composed of two modules, i.e., Meta Distribution Alignment (MDA) and Prototype-based Few-shot Adaptation (PFA). To support the study of CFReID, we establish an evaluation benchmark for CFReID on five publicly available ReID datasets. Extensive experiments demonstrate that our SDA can enhance the few-shot learning and anti-forgetting capabilities under few-shot conditions. Notably, our approach, using only 5% of the data, i.e., 32 IDs, significantly outperforms LReID's state-of-the-art performance, which requires 700 to 1,000 IDs.

Title: Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding

Authors: Xiangrui Liu, Yan Shu, Zheng Liu, Ao Li, Yang Tian, Bo Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18478
Pdf URL: https://arxiv.org/pdf/2503.18478
Copy Paste: [[2503.18478]] Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding(https://arxiv.org/abs/2503.18478)
Keywords: large language model
Abstract: Despite advanced token compression techniques, existing multimodal large language models (MLLMs) still struggle with hour-long video understanding. In this work, we propose Video-XL-Pro, an efficient method for extremely long video understanding, built upon Reconstructive Compression of Tokens (ReCoT), a learnable module that leverages self-supervised learning to generate comprehensive and compact video tokens. ReCoT introduces two key components: (i) Dynamic Token Synthesizer (DTS): DTS generates pseudo-video tokens from static image tokens by learning intra-token relationships, which are then used in masked video modeling. (ii) Semantic-Guided Masking (SGM): SGM adaptively masks redundant visual tokens to facilitate more effective reconstructive learning. To improve training efficiency in MLLMs fine-tuning, we introduce a video-specific dataset pruning strategy and design a simple yet Query-aware Selector that enables the model to precisely locate query-relevant video tokens. With only 3B parameters, Video-XL-Pro outperforms most 7B models trained on larger datasets across multiple long video understanding benchmarks. Moreover, it can process over 8K frames on a single A100 GPU while maintaining high-quality performance.

Title: Explaining Domain Shifts in Language: Concept erasing for Interpretable Image Classification

Authors: Zequn Zeng, Yudi Su, Jianqiao Sun, Tiansheng Wen, Hao Zhang, Zhengjue Wang, Bo Chen, Hongwei Liu, Jiawei Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18483
Pdf URL: https://arxiv.org/pdf/2503.18483
Copy Paste: [[2503.18483]] Explaining Domain Shifts in Language: Concept erasing for Interpretable Image Classification(https://arxiv.org/abs/2503.18483)
Keywords: large language model
Abstract: Concept-based models can map black-box representations to human-understandable concepts, which makes the decision-making process more transparent and allows users to understand the reason behind predictions. However, domain-specific concepts often impact the final predictions, which subsequently undermines the model's generalization capabilities and prevents the model from being used in high-stakes applications. In this paper, we propose a novel Language-guided Concept-Erasing (LanCE) framework. In particular, we empirically demonstrate that pre-trained vision-language models (VLMs) can approximate distinct visual domain shifts via domain descriptors, while prompting large language models (LLMs) can easily simulate a wide range of descriptors of unseen visual domains. Then, we introduce a novel plug-in domain descriptor orthogonality (DDO) regularizer to mitigate the impact of these domain-specific concepts on the final predictions. Notably, the DDO regularizer is agnostic to the design of concept-based models and we integrate it into several prevailing models. Through evaluation of domain generalization on four standard benchmarks and three newly introduced benchmarks, we demonstrate that DDO can significantly improve the out-of-distribution (OOD) generalization over the previous state-of-the-art concept-based models. Our code is available at this https URL.

Title: PM4Bench: A Parallel Multilingual Multi-Modal Multi-task Benchmark for Large Vision Language Model

Authors: Junyuan Gao, Jiahe Song, Jiang Wu, Runchuan Zhu, Guanlin Shen, Shasha Wang, Xingjian Wei, Haote Yang, Songyang Zhang, Weijia Li, Bin Wang, Dahua Lin, Lijun Wu, Conghui He
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2503.18484
Pdf URL: https://arxiv.org/pdf/2503.18484
Copy Paste: [[2503.18484]] PM4Bench: A Parallel Multilingual Multi-Modal Multi-task Benchmark for Large Vision Language Model(https://arxiv.org/abs/2503.18484)
Keywords: fair
Abstract: Existing multilingual benchmarks for Large Vision Language Models (LVLMs) suffer from limitations including language-specific content biases, disjointed multimodal input formats, and a lack of safety evaluation. To address these gaps, we propose PM4Bench, the first Parallel Multilingual Multi-Modal Multi-task Benchmark for LVLMs. PM4Bench features a parallel corpus design across 10 languages, enabling fair and accurate cross-lingual comparisons. It includes the vision setting where text and queries are embedded in images, requiring LVLMs to simultaneously "see", "read", and "think", aligning with real-world applications. Additionally, PM4Bench incorporates safety evaluations, addressing a critical oversight in existing multilingual benchmarks. Using PM4Bench, we evaluate 11 mainstream LVLMs, revealing significant cross-linguistic performance disparities, particularly in vision settings, and identifying OCR capability as a key determinant of these imbalances. We will release PM4Bench at this https URL.

Title: MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering

Authors: Shuo Yang, Siwen Luo, Soyeon Caren Han, Eduard Hovy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.18491
Pdf URL: https://arxiv.org/pdf/2503.18491
Copy Paste: [[2503.18491]] MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering(https://arxiv.org/abs/2503.18491)
Keywords: robust
Abstract: Visual Question Answering (VQA) requires reasoning across visual and textual modalities, yet Large Vision-Language Models (LVLMs) often lack integrated commonsense knowledge, limiting their robustness in real-world scenarios. To address this, we introduce MAGIC-VQA, a novel framework that enhances VQA by systematically integrating commonsense knowledge with LVLMs. MAGIC-VQA employs a three-stage process: (1) Explicit Knowledge Integration from external sources, (2) By-Type Post-Processing for contextual refinement, and (3) Implicit Knowledge Augmentation using a Graph Neural Network (GNN) for structured reasoning. While GNNs bring greater depth to structured inference, they enable superior relational inference beyond LVLMs. MAGIC-VQA bridges a key gap by unifying commonsense knowledge with LVLM-driven reasoning, eliminating the need for extensive pre-training or complex prompt tuning. Our framework achieves state-of-the-art performance on benchmark datasets, significantly improving commonsense reasoning in VQA.

Title: Statistically Testing Training Data for Unwanted Error Patterns using Rule-Oriented Regression

Authors: Stefan Rass, Martin Dallinger
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18497
Pdf URL: https://arxiv.org/pdf/2503.18497
Copy Paste: [[2503.18497]] Statistically Testing Training Data for Unwanted Error Patterns using Rule-Oriented Regression(https://arxiv.org/abs/2503.18497)
Keywords: explainability
Abstract: Artificial intelligence models trained from data can only be as good as the underlying data is. Biases in training data propagating through to the output of a machine learning model are a well-documented and well-understood phenomenon, but the machinery to prevent these undesired effects is much less developed. Efforts to ensure data is clean during collection, such as using bias-aware sampling, are most effective when the entity controlling data collection also trains the AI. In cases where the data is already available, how do we find out if the data was already manipulated, i.e., "poisoned", so that an undesired behavior would be trained into a machine learning model? This is a challenge fundamentally different to (just) improving approximation accuracy or efficiency, and we provide a method to test training data for flaws, to establish a trustworthy ground-truth for a subsequent training of machine learning models (of any kind). Unlike the well-studied problem of approximating data using fuzzy rules that are generated from the data, our method hinges on a prior definition of rules to happen before seeing the data to be tested. Therefore, the proposed method can also discover hidden error patterns, which may also have substantial influence. Our approach extends the abilities of conventional statistical testing by letting the "test-condition" be any Boolean condition to describe a pattern in the data, whose presence we wish to determine. The method puts fuzzy inference into a regression model, to get the best of the two: explainability from fuzzy logic with statistical properties and diagnostics from the regression, and finally also being applicable to "small data", hence not requiring large datasets as deep learning methods do. We provide an open source implementation for demonstration and experiments.

Title: Autoregressive Language Models for Knowledge Base Population: A case study in the space mission domain

Authors: Andrés García-Silva, José Manuel Gómez-Pérez
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.18502
Pdf URL: https://arxiv.org/pdf/2503.18502
Copy Paste: [[2503.18502]] Autoregressive Language Models for Knowledge Base Population: A case study in the space mission domain(https://arxiv.org/abs/2503.18502)
Keywords: large language model
Abstract: Knowledge base population (KBP) plays a crucial role in populating knowledge bases and keeping them up to date in organizations by leveraging domain corpora. Motivated by the increasingly large context windows supported by large language models, we propose to fine-tune an autoregressive language model for end-to-end KBP. Our case study involves the population of a space mission knowledge graph. To fine-tune the model we generate a dataset for end-to-end KBP tapping into existing domain resources. Our case study shows that fine-tuned language models of limited size can achieve competitive and even higher accuracy than larger models in the KBP task. Smaller models specialized for KBP offer affordable deployment and lower-cost inference. Moreover, KBP specialist models do not require the ontology to be included in the prompt, allowing for more space in the context for additional input text or output serialization.

Title: Deterministic Certification of Graph Neural Networks against Graph Poisoning Attacks with Arbitrary Perturbations

Authors: Jiate Li, Meng Pang, Yun Dong, Binghui Wang
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2503.18503
Pdf URL: https://arxiv.org/pdf/2503.18503
Copy Paste: [[2503.18503]] Deterministic Certification of Graph Neural Networks against Graph Poisoning Attacks with Arbitrary Perturbations(https://arxiv.org/abs/2503.18503)
Keywords: defense, attack, robust
Abstract: Graph neural networks (GNNs) are becoming the de facto method to learn on graph data and have achieved state-of-the-art results on node and graph classification tasks. However, recent works show GNNs are vulnerable to training-time poisoning attacks -- marginally perturbing edges, nodes, and/or node features of training graph(s) can largely degrade GNNs' testing performance. Most previous defenses against graph poisoning attacks are empirical and are soon broken by adaptive / stronger ones. A few provable defenses provide robustness guarantees, but have large gaps when applied in practice: 1) they restrict the attacker to only one type of perturbation; 2) they are designed for a particular GNN architecture or task; and 3) their robustness guarantees are not 100% accurate. In this work, we bridge all these gaps by developing PGNNCert, the first certified defense of GNNs against poisoning attacks under arbitrary (edge, node, and node feature) perturbations with deterministic robustness guarantees. Extensive evaluations on multiple node and graph classification datasets and GNNs demonstrate the effectiveness of PGNNCert to provably defend against arbitrary poisoning perturbations. PGNNCert is also shown to significantly outperform the state-of-the-art certified defenses against edge perturbation or node perturbation during GNN training.

Title: Can Text-to-Video Generation help Video-Language Alignment?

Authors: Luca Zanella, Massimiliano Mancini, Willi Menapace, Sergey Tulyakov, Yiming Wang, Elisa Ricci
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18507
Pdf URL: https://arxiv.org/pdf/2503.18507
Copy Paste: [[2503.18507]] Can Text-to-Video Generation help Video-Language Alignment?(https://arxiv.org/abs/2503.18507)
Keywords: large language model
Abstract: Recent video-language alignment models are trained on sets of videos, each with an associated positive caption and a negative caption generated by large language models. A problem with this procedure is that negative captions may introduce linguistic biases, i.e., concepts are seen only as negatives and never associated with a video. While a solution would be to collect videos for the negative captions, existing databases lack the fine-grained variations needed to cover all possible negatives. In this work, we study whether synthetic videos can help to overcome this issue. Our preliminary analysis with multiple generators shows that, while promising on some tasks, synthetic videos harm the performance of the model on others. We hypothesize this issue is linked to noise (semantic and visual) in the generated videos and develop a method, SynViTA, that accounts for those. SynViTA dynamically weights the contribution of each synthetic video based on how similar its target caption is w.r.t. the real counterpart. Moreover, a semantic consistency loss makes the model focus on fine-grained differences across captions, rather than differences in video appearance. Experiments show that, on average, SynViTA improves over existing methods on VideoCon test sets and SSv2-Temporal, SSv2-Events, and ATP-Hard benchmarks, being a first promising step for using synthetic videos when learning video-language models.

Title: Uncertainty-guided Perturbation for Image Super-Resolution Diffusion Model

Authors: Leheng Zhang, Weiyi You, Kexuan Shi, Shuhang Gu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.18512
Pdf URL: https://arxiv.org/pdf/2503.18512
Copy Paste: [[2503.18512]] Uncertainty-guided Perturbation for Image Super-Resolution Diffusion Model(https://arxiv.org/abs/2503.18512)
Keywords: diffusion
Abstract: Diffusion-based image super-resolution methods have demonstrated significant advantages over GAN-based approaches, particularly in terms of perceptual quality. Building upon a lengthy Markov chain, diffusion-based methods possess remarkable modeling capacity, enabling them to achieve outstanding performance in real-world scenarios. Unlike previous methods that focus on modifying the noise schedule or sampling process to enhance performance, our approach emphasizes the improved utilization of LR information. We find that different regions of the LR image can be viewed as corresponding to different timesteps in a diffusion process, where flat areas are closer to the target HR distribution but edge and texture regions are farther away. In these flat areas, applying a slight noise is more advantageous for the reconstruction. We associate this characteristic with uncertainty and propose to apply an uncertainty estimate to guide region-specific noise level control, a technique we refer to as Uncertainty-guided Noise Weighting. Pixels with lower uncertainty (i.e., flat regions) receive reduced noise to preserve more LR information, therefore improving performance. Furthermore, we modify the network architecture of previous methods to develop our Uncertainty-guided Perturbation Super-Resolution (UPSR) model. Extensive experimental results demonstrate that, despite reduced model size and training overhead, the proposed UPSR method outperforms current state-of-the-art methods across various datasets, both quantitatively and qualitatively.
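
The idea of region-specific noise control can be sketched as follows. The abstract does not give the weighting function, so the linear mapping, the floor `w_min`, the function names, and the toy uncertainty values are all assumptions; the sketch only shows the stated direction, namely that lower uncertainty means less injected noise.

```python
import random


def noise_weights(uncertainty, w_min=0.2):
    """Map per-pixel uncertainty linearly to a noise weight in [w_min, 1].

    The linear form and w_min are illustrative assumptions.
    """
    lo, hi = min(uncertainty), max(uncertainty)
    span = (hi - lo) or 1.0  # avoid division by zero on constant input
    return [w_min + (1.0 - w_min) * (u - lo) / span for u in uncertainty]


def perturb(lr_pixels, uncertainty, sigma=1.0, seed=0):
    """Add Gaussian noise whose per-pixel scale follows the uncertainty weights."""
    rng = random.Random(seed)
    weights = noise_weights(uncertainty)
    return [x + sigma * w * rng.gauss(0.0, 1.0) for x, w in zip(lr_pixels, weights)]


# Hypothetical 1D "image": flat region (low uncertainty) followed by an edge.
uncertainty = [0.0, 0.5, 1.0]
weights = noise_weights(uncertainty)
noisy = perturb([0.3, 0.3, 0.9], uncertainty)
print(weights)
```

Flat pixels keep most of their LR value because their noise scale is close to `w_min * sigma`, while edge and texture pixels receive the full `sigma`.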

Title: SciClaims: An End-to-End Generative System for Biomedical Claim Analysis

Authors: Raúl Ortega, José Manuel Gómez-Pérez
Subjects: cs.CL, cs.AI, cs.DL
Abstract URL: https://arxiv.org/abs/2503.18526
Pdf URL: https://arxiv.org/pdf/2503.18526
Copy Paste: [[2503.18526]] SciClaims: An End-to-End Generative System for Biomedical Claim Analysis(https://arxiv.org/abs/2503.18526)
Keywords: extraction, generative, large language model
Abstract: Validating key claims in scientific literature, particularly in biomedical research, is essential for ensuring accuracy and advancing knowledge. This process is critical in sectors like the pharmaceutical industry, where rapid scientific progress requires automation and deep domain expertise. However, current solutions have significant limitations. They lack end-to-end pipelines encompassing all claim extraction, evidence retrieval, and verification steps; rely on complex NLP and information retrieval pipelines prone to multiple failure points; and often fail to provide clear, user-friendly justifications for claim verification outcomes. To address these challenges, we introduce SciClaims, an advanced system powered by state-of-the-art large language models (LLMs) that seamlessly integrates the entire scientific claim analysis process. SciClaims outperforms previous approaches in both claim extraction and verification without requiring additional fine-tuning, setting a new benchmark for automated scientific claim analysis.

Title: AIM2PC: Aerial Image to 3D Building Point Cloud Reconstruction

Authors: Soulaimene Turki, Daniel Panangian, Houda Chaabouni-Chouayakh, Ksenia Bittner
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18527
Pdf URL: https://arxiv.org/pdf/2503.18527
Copy Paste: [[2503.18527]] AIM2PC: Aerial Image to 3D Building Point Cloud Reconstruction(https://arxiv.org/abs/2503.18527)
Keywords: diffusion
Abstract: Three-dimensional urban reconstruction of buildings from single-view images has attracted significant attention over the past two decades. However, recent methods primarily focus on rooftops from aerial images, often overlooking essential geometrical details. Additionally, there is a notable lack of datasets containing complete 3D point clouds for entire buildings, along with challenges in obtaining reliable camera pose information for aerial images. This paper addresses these challenges by presenting a novel methodology, AIM2PC, which utilizes our generated dataset that includes complete 3D point clouds and determined camera poses. Our approach takes features from a single aerial image as input and concatenates them with essential additional conditions, such as binary masks and Sobel edge maps, to enable more edge-aware reconstruction. By incorporating a point cloud diffusion model based on Centered denoising Diffusion Probabilistic Models (CDPM), we project these concatenated features onto the partially denoised point cloud using our camera poses at each diffusion step. The proposed method is able to reconstruct the complete 3D building point cloud, including wall information, and demonstrates superior performance compared to existing baseline techniques. To allow further comparisons with our methodology, the dataset has been made available at this https URL

Title: DiN: Diffusion Model for Robust Medical VQA with Semantic Noisy Labels

Authors: Erjian Guo, Zhen Zhao, Zicheng Wang, Tong Chen, Yunyi Liu, Luping Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18536
Pdf URL: https://arxiv.org/pdf/2503.18536
Copy Paste: [[2503.18536]] DiN: Diffusion Model for Robust Medical VQA with Semantic Noisy Labels(https://arxiv.org/abs/2503.18536)
Keywords: robust, diffusion
Abstract: Medical Visual Question Answering (Med-VQA) systems benefit the interpretation of medical images containing critical clinical information. However, the challenge of noisy labels and limited high-quality datasets remains underexplored. To address this, we establish the first benchmark for noisy labels in Med-VQA by simulating human mislabeling with semantically designed noise types. More importantly, we introduce the DiN framework, which leverages a diffusion model to handle noisy labels in Med-VQA. Unlike the dominant classification-based VQA approaches that directly predict answers, our Answer Diffuser (AD) module employs a coarse-to-fine process, refining answer candidates with a diffusion model for improved accuracy. The Answer Condition Generator (ACG) further enhances this process by generating task-specific conditional information via integrating answer embeddings with fused image-question features. To address label noise, our Noisy Label Refinement (NLR) module introduces a robust loss function and dynamic answer adjustment to further boost the performance of the AD module.

Title: Natural Language Processing for Electronic Health Records in Scandinavian Languages: Norwegian, Swedish, and Danish

Authors: Ashenafi Zebene Woldaregay, Jørgen Aarmo Lund, Phuong Dinh Ngo, Mariyam Tayefi, Joel Burman, Stine Hansen, Martin Hylleholt Sillesen, Hercules Dalianis, Robert Jenssen, Lindsetmo Rolf Ole, Karl Øyvind Mikalsen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.18539
Pdf URL: https://arxiv.org/pdf/2503.18539
Copy Paste: [[2503.18539]] Natural Language Processing for Electronic Health Records in Scandinavian Languages: Norwegian, Swedish, and Danish(https://arxiv.org/abs/2503.18539)
Keywords: transformer
Abstract: Background: Clinical natural language processing (NLP) refers to the use of computational methods for extracting, processing, and analyzing unstructured clinical text data, and holds huge potential to transform healthcare in various clinical tasks. Objective: The study aims to perform a systematic review to comprehensively assess and analyze the state-of-the-art NLP methods for the mainland Scandinavian clinical text. Method: A literature search was conducted in various online databases including PubMed, ScienceDirect, Google Scholar, ACM digital library, and IEEE Xplore between December 2022 and February 2024. Further, relevant references to the included articles were also used to solidify our search. The final pool includes articles that conducted clinical NLP in the mainland Scandinavian languages and were published in English between 2010 and 2024. Results: Out of the 113 articles, 18% (n=21) focus on Norwegian clinical text, 64% (n=72) on Swedish, 10% (n=11) on Danish, and 8% (n=9) focus on more than one language. Generally, the review identified positive developments across the region despite some observable gaps and disparities between the languages. There are substantial disparities in the level of adoption of transformer-based models. In essential tasks such as de-identification, there is significantly less research activity focusing on Norwegian and Danish compared to Swedish text. Further, the review identified a low level of sharing resources such as data, experimentation code, pre-trained models, and rate of adaptation and transfer learning in the region. Conclusion: The review presented a comprehensive assessment of the state-of-the-art Clinical NLP for electronic health records (EHR) text in mainland Scandinavian languages and highlighted the potential barriers and challenges that hinder the rapid advancement of the field in the region.

Title: HiRes-FusedMIM: A High-Resolution RGB-DSM Pre-trained Model for Building-Level Remote Sensing Applications

Authors: Guneet Mutreja, Philipp Schuegraf, Ksenia Bittner
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18540
Pdf URL: https://arxiv.org/pdf/2503.18540
Copy Paste: [[2503.18540]] HiRes-FusedMIM: A High-Resolution RGB-DSM Pre-trained Model for Building-Level Remote Sensing Applications(https://arxiv.org/abs/2503.18540)
Keywords: segmentation
Abstract: Recent advances in self-supervised learning have led to the development of foundation models that have significantly advanced performance in various computer vision tasks. However, despite their potential, these models often overlook the crucial role of high-resolution digital surface models (DSMs) in understanding urban environments, particularly for building-level analysis, which is essential for applications like digital twins. To address this gap, we introduce HiRes-FusedMIM, a novel pre-trained model specifically designed to leverage the rich information contained within high-resolution RGB and DSM data. HiRes-FusedMIM utilizes a dual-encoder simple masked image modeling (SimMIM) architecture with a multi-objective loss function that combines reconstruction and contrastive objectives, enabling it to learn powerful, joint representations from both modalities. We conducted a comprehensive evaluation of HiRes-FusedMIM on a diverse set of downstream tasks, including classification, semantic segmentation, and instance segmentation. Our results demonstrate that: 1) HiRes-FusedMIM outperforms previous state-of-the-art geospatial methods on several building-related datasets, including WHU Aerial and LoveDA, demonstrating its effectiveness in capturing and leveraging fine-grained building information; 2) Incorporating DSMs during pre-training consistently improves performance compared to using RGB data alone, highlighting the value of elevation information for building-level analysis; 3) The dual-encoder architecture of HiRes-FusedMIM, with separate encoders for RGB and DSM data, significantly outperforms a single-encoder model on the Vaihingen segmentation task, indicating the benefits of learning specialized representations for each modality. To facilitate further research and applications in this direction, we will publicly release the trained model weights.

Title: Distilling Stereo Networks for Performant and Efficient Leaner Networks

Authors: Rafia Rahim, Samuel Woerz, Andreas Zell
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18544
Pdf URL: https://arxiv.org/pdf/2503.18544
Copy Paste: [[2503.18544]] Distilling Stereo Networks for Performant and Efficient Leaner Networks(https://arxiv.org/abs/2503.18544)
Keywords: segmentation
Abstract: Knowledge distillation has been quite popular in vision for tasks like classification and segmentation; however, not much work has been done on distilling state-of-the-art stereo matching methods despite their range of applications. One of the reasons for its lack of use in stereo matching networks is the inherent complexity of these networks, where a typical network is composed of multiple two- and three-dimensional modules. In this work, we systematically combine the insights from state-of-the-art stereo methods with general knowledge-distillation techniques to develop a joint framework for stereo network distillation with competitive results and faster inference. Moreover, we show, via a detailed empirical analysis, that distilling knowledge from the stereo network requires careful design of the complete distillation pipeline, from the backbone to the right selection of distillation points and corresponding loss functions. This results in student networks that are not only leaner and faster but also give excellent performance. For instance, our student network, while performing better than performance-oriented methods like PSMNet [1], CFNet [2], and LEAStereo [3] on the benchmark SceneFlow dataset, is 8x, 5x, and 8x faster, respectively. Furthermore, compared to speed-oriented methods with inference time less than 100ms, our student networks perform better than all the tested methods. In addition, our student network also shows better generalization capabilities when tested on unseen datasets like ETH3D and Middlebury.

Title: Benchmarking Post-Hoc Unknown-Category Detection in Food Recognition

Authors: Lubnaa Abdur Rahman, Ioannis Papathanail, Lorenzo Brigato, Stavroula Mougiakakou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18548
Pdf URL: https://arxiv.org/pdf/2503.18548
Copy Paste: [[2503.18548]] Benchmarking Post-Hoc Unknown-Category Detection in Food Recognition(https://arxiv.org/abs/2503.18548)
Keywords: transformer
Abstract: Food recognition models often struggle to distinguish between seen and unseen samples, frequently misclassifying samples from unseen categories by assigning them an in-distribution (ID) label. This misclassification presents significant challenges when deploying these models in real-world applications, particularly within automatic dietary assessment systems, where incorrect labels can lead to cascading errors throughout the system. Ideally, such models should prompt the user when an unknown sample is encountered, allowing for corrective action. Given no prior research exploring food recognition in real-world settings, in this work we conduct an empirical analysis of various post-hoc out-of-distribution (OOD) detection methods for fine-grained food recognition. Our findings indicate that virtual logit matching (ViM) performed the best overall, likely due to its combination of logits and feature-space representations. Additionally, our work reinforces prior notions in the OOD domain, noting that models with higher ID accuracy performed better across the evaluated OOD detection methods. Furthermore, transformer-based architectures consistently outperformed convolution-based models in detecting OOD samples across various methods.

Title: The (Un)suitability of Passwords and Password Managers in Virtual Reality

Authors: Emiram Kablo, Yorick Last, Patricia Arias Cabarcos, Melanie Volkamer
Subjects: cs.CR, cs.HC
Abstract URL: https://arxiv.org/abs/2503.18550
Pdf URL: https://arxiv.org/pdf/2503.18550
Copy Paste: [[2503.18550]] The (Un)suitability of Passwords and Password Managers in Virtual Reality(https://arxiv.org/abs/2503.18550)
Keywords: secure, biometric
Abstract: As Virtual Reality (VR) expands into fields like healthcare and education, ensuring secure and user-friendly authentication becomes essential. Traditional password entry methods in VR are cumbersome and insecure, making password managers (PMs) a potential solution. To explore this field, we conducted a user study (n=126 VR users) where participants expressed a strong preference for simpler passwords and showed interest in biometric authentication and password managers. On these grounds, we provide the first in-depth evaluation of PMs in VR. We report findings from 91 cognitive walkthroughs, revealing that while PMs improve usability, they are not yet ready for prime time. Key features like cross-app autofill are missing, and user experiences highlight the need for better solutions. Based on consolidated user views and expert analysis, we make recommendations on how to move forward in improving VR authentication systems, ultimately creating more practical solutions for this growing field.

Title: Discriminative protein sequence modelling with Latent Space Diffusion

Authors: Eoin Quinn, Ghassene Jebali, Maxime Seince, Oliver Bent
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18551
Pdf URL: https://arxiv.org/pdf/2503.18551
Copy Paste: [[2503.18551]] Discriminative protein sequence modelling with Latent Space Diffusion(https://arxiv.org/abs/2503.18551)
Keywords: diffusion
Abstract: We explore a framework for protein sequence representation learning that decomposes the task between manifold learning and distributional modelling. Specifically, we present a Latent Space Diffusion architecture which combines a protein sequence autoencoder with a denoising diffusion model operating on its latent space. We obtain a one-parameter family of learned representations from the diffusion model, along with the autoencoder's latent representation. We propose and evaluate two autoencoder architectures: a homogeneous model forcing amino acids of the same type to be identically distributed in the latent space, and an inhomogeneous model employing a noise-based variant of masking. As a baseline we take a latent space learned by masked language modelling, and evaluate discriminative capability on a range of protein property prediction tasks. Our finding is twofold: the diffusion models trained on both our proposed variants display higher discriminative power than the one trained on the masked language model baseline, but none of the diffusion representations achieve the performance of the masked language model embeddings themselves.

Title: EvAnimate: Event-conditioned Image-to-Video Generation for Human Animation

Authors: Qiang Qu, Ming Li, Xiaoming Chen, Tongliang Liu
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2503.18552
Pdf URL: https://arxiv.org/pdf/2503.18552
Copy Paste: [[2503.18552]] EvAnimate: Event-conditioned Image-to-Video Generation for Human Animation(https://arxiv.org/abs/2503.18552)
Keywords: robust, diffusion
Abstract: Conditional human animation transforms a static reference image into a dynamic sequence by applying motion cues such as poses. These motion cues are typically derived from video data but are susceptible to limitations including low temporal resolution, motion blur, overexposure, and inaccuracies under low-light conditions. In contrast, event cameras provide data streams with exceptionally high temporal resolution, a wide dynamic range, and inherent resistance to motion blur and exposure issues. In this work, we propose EvAnimate, a framework that leverages event streams as motion cues to animate static human images. Our approach employs a specialized event representation that transforms asynchronous event streams into 3-channel slices with controllable slicing rates and appropriate slice density, ensuring compatibility with diffusion models. Subsequently, a dual-branch architecture generates high-quality videos by harnessing the inherent motion dynamics of the event streams, thereby enhancing both video quality and temporal consistency. Specialized data augmentation strategies further enhance cross-person generalization. Finally, we establish a new benchmark, including simulated event data for training and validation, and a real-world event dataset capturing human actions under normal and extreme scenarios. The experimental results demonstrate that EvAnimate achieves high temporal fidelity and robust performance in scenarios where traditional video-derived cues fall short.

Title: ATARS: An Aerial Traffic Atomic Activity Recognition and Temporal Segmentation Dataset

Authors: Zihao Chen, Hsuanyu Wu, Chi-Hsi Kung, Yi-Ting Chen, Yan-Tsung Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18553
Pdf URL: https://arxiv.org/pdf/2503.18553
Copy Paste: [[2503.18553]] ATARS: An Aerial Traffic Atomic Activity Recognition and Temporal Segmentation Dataset(https://arxiv.org/abs/2503.18553)
Keywords: segmentation
Abstract: Traffic Atomic Activity, which describes traffic patterns for topological intersection dynamics, is a crucial topic for the advancement of intelligent driving systems. However, existing atomic activity datasets are collected from an egocentric view, which cannot support the scenarios where traffic activities in an entire intersection are required. Moreover, existing datasets only provide video-level atomic activity annotations, which require exhausting efforts to manually trim the videos for recognition and limit their applications to untrimmed videos. To bridge this gap, we introduce the Aerial Traffic Atomic Activity Recognition and Segmentation (ATARS) dataset, the first aerial dataset designed for multi-label atomic activity analysis. We offer atomic activity labels for each frame, which accurately record the intervals for traffic activities. Moreover, we propose a novel task, Multi-label Temporal Atomic Activity Recognition, enabling the study of accurate temporal localization for atomic activity and easing the burden of manual video trimming for recognition. We conduct extensive experiments to evaluate existing state-of-the-art models on both atomic activity recognition and temporal atomic activity segmentation. The results highlight the unique challenges of our ATARS dataset, such as recognizing extremely small objects' activities. We further provide a comprehensive discussion analyzing these challenges and offer valuable insights for future directions to improve atomic activity recognition in aerial views. Our source code and dataset are available at this https URL

Title: AMD-Hummingbird: Towards an Efficient Text-to-Video Model

Authors: Takashi Isobe, He Cui, Dong Zhou, Mengmeng Ge, Dong Li, Emad Barsoum
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18559
Pdf URL: https://arxiv.org/pdf/2503.18559
Copy Paste: [[2503.18559]] AMD-Hummingbird: Towards an Efficient Text-to-Video Model(https://arxiv.org/abs/2503.18559)
Keywords: large language model
Abstract: Text-to-Video (T2V) generation has attracted significant attention for its ability to synthesize realistic videos from textual descriptions. However, existing models struggle to balance computational efficiency and high visual quality, particularly on resource-limited devices, e.g., iGPUs and mobile phones. Most prior work prioritizes visual fidelity while overlooking the need for smaller, more efficient models suitable for real-world deployment. To address this challenge, we propose a lightweight T2V framework, termed Hummingbird, which prunes existing models and enhances visual quality through visual feedback learning. Our approach reduces the size of the U-Net from 1.4 billion to 0.7 billion parameters, significantly improving efficiency while preserving high-quality video generation. Additionally, we introduce a novel data processing pipeline that leverages Large Language Models (LLMs) and Video Quality Assessment (VQA) models to enhance the quality of both text prompts and video data. To support user-driven training and style customization, we publicly release the full training code, including data processing and model training. Extensive experiments show that our method achieves a 31X speedup compared to state-of-the-art models such as VideoCrafter2, while also attaining the highest overall score on VBench. Moreover, our method supports the generation of videos with up to 26 frames, addressing the limitations of existing U-Net-based methods in long video generation. Notably, the entire training process requires only four GPUs, yet delivers performance competitive with existing leading methods. Hummingbird presents a practical and efficient solution for T2V generation, combining high performance, scalability, and flexibility for real-world applications.

Title: Self-Reported Confidence of Large Language Models in Gastroenterology: Analysis of Commercial, Open-Source, and Quantized Models

Authors: Nariman Naderi, Seyed Amir Ahmad Safavi-Naini, Thomas Savage, Zahra Atf, Peter Lewis, Girish Nadkarni, Ali Soroush
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2503.18562
Pdf URL: https://arxiv.org/pdf/2503.18562
Copy Paste: [[2503.18562]] Self-Reported Confidence of Large Language Models in Gastroenterology: Analysis of Commercial, Open-Source, and Quantized Models(https://arxiv.org/abs/2503.18562)
Keywords: large language model
Abstract: This study evaluated self-reported response certainty across several large language models (GPT, Claude, Llama, Phi, Mistral, Gemini, Gemma, and Qwen) using 300 gastroenterology board-style questions. The highest-performing models (GPT-o1 preview, GPT-4o, and Claude-3.5-Sonnet) achieved Brier scores of 0.15-0.2 and AUROC of 0.6. Although newer models demonstrated improved performance, all exhibited a consistent tendency towards overconfidence. Uncertainty estimation presents a significant challenge to the safe use of LLMs in healthcare. Keywords: Large Language Models; Confidence Elicitation; Artificial Intelligence; Gastroenterology; Uncertainty Quantification
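
For readers unfamiliar with the two metrics quoted in this abstract, the sketch below computes a Brier score and an AUROC from self-reported confidences and correctness labels. It uses the standard textbook definitions; the function names and the toy numbers are illustrative assumptions and do not reproduce the study's evaluation code or data.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence (0..1) and correctness (0 or 1).

    Lower is better; always answering 0.5 scores 0.25.
    """
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


def auroc(confidences, correct):
    """Probability that a correct answer gets higher confidence than an
    incorrect one, counting ties as one half. 0.5 means no discrimination."""
    pos = [c for c, y in zip(confidences, correct) if y == 1]
    neg = [c for c, y in zip(confidences, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example of overconfidence: the model reports 0.9 on every question
# but is right only half the time.
conf = [0.9, 0.9, 0.9, 0.9]
right = [1, 0, 1, 0]
overconfident_brier = brier_score(conf, right)   # worse than the 0.25 of a constant 0.5
overconfident_auroc = auroc(conf, right)         # no ability to rank right vs. wrong
```

The two metrics answer different questions: the Brier score penalizes miscalibrated confidence values, while AUROC only asks whether correct answers tend to receive higher confidence than incorrect ones.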

Title: Distil-xLSTM: Learning Attention Mechanisms through Recurrent Structures

Authors: Abdoul Majid O. Thiombiano, Brahim Hnich, Ali Ben Mrad, Mohamed Wiem Mkaouer
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.18565
Pdf URL: https://arxiv.org/pdf/2503.18565
Copy Paste: [[2503.18565]] Distil-xLSTM: Learning Attention Mechanisms through Recurrent Structures(https://arxiv.org/abs/2503.18565)
Keywords: transformer, large language model
Abstract: The current era of Natural Language Processing (NLP) is dominated by Transformer models. However, novel architectures relying on recurrent mechanisms, such as xLSTM and Mamba, have been proposed as alternatives to attention-based models. Although computation is done differently than with the attention mechanism, these recurrent models yield good results and sometimes even outperform state-of-the-art attention-based models. In this work, we propose Distil-xLSTM, an xLSTM-based Small Language Model (SLM) trained by distilling knowledge from a Large Language Model (LLM) that shows promising results while being compute and scale efficient. Our Distil-xLSTM focuses on approximating a transformer-based model's attention parametrization using its recurrent sequence mixing components and shows good results with minimal training.

Title: Anchor-based oversampling for imbalanced tabular data via contrastive and adversarial learning

Authors: Hadi Mohammadi, Ehsan Nazerfard, Mostafa Haghir Chehreghani
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18569
Pdf URL: https://arxiv.org/pdf/2503.18569
Copy Paste: [[2503.18569]] Anchor-based oversampling for imbalanced tabular data via contrastive and adversarial learning(https://arxiv.org/abs/2503.18569)
Keywords: security, generative
Abstract: Imbalanced data represent a distribution with more frequencies of one class (majority) than the other (minority). This phenomenon occurs across various domains, such as security, medical care and human activity. In imbalanced learning, classification algorithms are typically inclined to classify the majority class accurately, resulting in artificially high accuracy rates. As a result, many minority samples are mistakenly labelled as majority-class instances, resulting in a bias that benefits the majority class. This study presents a framework based on boundary anchor samples to tackle the imbalanced learning challenge. First, we select and use anchor samples to train a multilayer perceptron (MLP) classifier, which acts as a prior knowledge model and aids the adversarial and contrastive learning procedures. Then, we design a novel deep generative model called Anchor Stabilized Conditional Generative Adversarial Network, or Anch-SCGAN for short. Anch-SCGAN is supported with two generators for the minority and majority classes and a discriminator incorporating additional class-specific information from the pre-trained feature extractor MLP. In addition, we facilitate the generator's training procedure in two ways. First, we define a new generator loss function based on reprocessed anchor samples and contrastive learning. Second, we apply a scoring strategy to stabilize the adversarial training part in generators. We train Anch-SCGAN and further finetune it with anchor samples to improve the precision of the generated samples. Our experiments on 16 real-world imbalanced datasets illustrate that Anch-SCGAN outperforms the renowned methods in imbalanced learning.

Title: Adapting Video Diffusion Models for Time-Lapse Microscopy

Authors: Alexander Holmberg, Nils Mechtel, Wei Ouyang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18583
Pdf URL: https://arxiv.org/pdf/2503.18583
Copy Paste: [[2503.18583]] Adapting Video Diffusion Models for Time-Lapse Microscopy(https://arxiv.org/abs/2503.18583)
Keywords: diffusion, generative
Abstract: We present a domain adaptation of video diffusion models to generate highly realistic time-lapse microscopy videos of cell division in HeLa cells. Although state-of-the-art generative video models have advanced significantly for natural videos, they remain underexplored in microscopy domains. To address this gap, we fine-tune a pretrained video diffusion model on microscopy-specific sequences, exploring three conditioning strategies: (1) text prompts derived from numeric phenotypic measurements (e.g., proliferation rates, migration speeds, cell-death frequencies), (2) direct numeric embeddings of phenotype scores, and (3) image-conditioned generation, where an initial microscopy frame is extended into a complete video sequence. Evaluation using biologically meaningful morphological, proliferation, and migration metrics demonstrates that fine-tuning substantially improves realism and accurately captures critical cellular behaviors such as mitosis and migration. Notably, the fine-tuned model also generalizes beyond the training horizon, generating coherent cell dynamics even in extended sequences. However, precisely controlling specific phenotypic characteristics remains challenging, highlighting opportunities for future work to enhance conditioning methods. Our results demonstrate the potential for domain-specific fine-tuning of generative video models to produce biologically plausible synthetic microscopy data, supporting applications such as in-silico hypothesis testing and data augmentation.

Title: Unified Uncertainty-Aware Diffusion for Multi-Agent Trajectory Modeling

Authors: Guillem Capellera, Antonio Rubio, Luis Ferraz, Antonio Agudo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18589
Pdf URL: https://arxiv.org/pdf/2503.18589
Copy Paste: [[2503.18589]] Unified Uncertainty-Aware Diffusion for Multi-Agent Trajectory Modeling(https://arxiv.org/abs/2503.18589)
Keywords: diffusion
Abstract: Multi-agent trajectory modeling has primarily focused on forecasting future states, often overlooking broader tasks like trajectory completion, which are crucial for real-world applications such as correcting tracking data. Existing methods also generally predict agents' states without offering any state-wise measure of uncertainty. Moreover, popular multi-modal sampling methods lack any error probability estimates for each generated scene under the same prior observations, making it difficult to rank the predictions during inference time. We introduce U2Diff, a \textbf{unified} diffusion model designed to handle trajectory completion while providing state-wise \textbf{uncertainty} estimates jointly. This uncertainty estimation is achieved by augmenting the simple denoising loss with the negative log-likelihood of the predicted noise and propagating latent space uncertainty to the real state space. Additionally, we incorporate a Rank Neural Network in post-processing to enable \textbf{error probability} estimation for each generated mode, demonstrating a strong correlation with the error relative to ground truth. Our method outperforms the state-of-the-art solutions in trajectory completion and forecasting across four challenging sports datasets (NBA, Basketball-U, Football-U, Soccer-U), highlighting the effectiveness of uncertainty and error probability estimation. Video at this https URL
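
The loss augmentation mentioned in this abstract (a simple denoising loss plus a negative log-likelihood term on the predicted noise) can be sketched as follows. This is an illustration under stated assumptions, not the U2Diff implementation: it assumes a diagonal Gaussian likelihood with a predicted per-element log-variance `log_var`, a hypothetical weighting factor `lam`, and it drops the constant 0.5*log(2*pi) term.

```python
import math


def uncertainty_aware_loss(eps, eps_hat, log_var, lam=1.0):
    """Simple denoising loss plus a Gaussian NLL on the predicted noise.

    eps      true noise values (flattened)
    eps_hat  predicted noise values
    log_var  predicted log-variance per element (assumed model output)
    lam      weight of the NLL term (hypothetical hyperparameter)
    """
    n = len(eps)
    # Standard mean-squared-error denoising objective.
    simple = sum((e - p) ** 2 for e, p in zip(eps, eps_hat)) / n
    # Gaussian negative log-likelihood, constant term omitted.
    nll = sum(0.5 * (lv + (e - p) ** 2 / math.exp(lv))
              for e, p, lv in zip(eps, eps_hat, log_var)) / n
    return simple + lam * nll


# Usage: one element predicted exactly, one off by 1, unit variance everywhere.
loss = uncertainty_aware_loss([0.0, 1.0], [0.0, 0.0], [0.0, 0.0])
```

Under this formulation, the NLL term rewards the model for predicting larger variance where its noise estimate is poor, which is what yields a per-state uncertainty signal.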

Title: ClinText-SP and RigoBERTa Clinical: a new set of open resources for Spanish Clinical NLP

Authors: Guillem García Subies, Álvaro Barbero Jiménez, Paloma Martínez Fernández
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18594
Pdf URL: https://arxiv.org/pdf/2503.18594
Copy Paste: [[2503.18594]] ClinText-SP and RigoBERTa Clinical: a new set of open resources for Spanish Clinical NLP(https://arxiv.org/abs/2503.18594)
Keywords: robust
Abstract: We present a novel contribution to Spanish clinical natural language processing by introducing the largest publicly available clinical corpus, ClinText-SP, along with a state-of-the-art clinical encoder language model, RigoBERTa Clinical. Our corpus was meticulously curated from diverse open sources, including clinical cases from medical journals and annotated corpora from shared tasks, providing a rich and diverse dataset that was previously difficult to access. RigoBERTa Clinical, developed through domain-adaptive pretraining on this comprehensive dataset, significantly outperforms existing models on multiple clinical NLP benchmarks. By publicly releasing both the dataset and the model, we aim to empower the research community with robust resources that can drive further advancements in clinical NLP and ultimately contribute to improved healthcare applications.

Title: LinkAlign: Scalable Schema Linking for Real-World Large-Scale Multi-Database Text-to-SQL

Authors: Yihan Wang, Peiyu Liu, Xin Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.18596
Pdf URL: https://arxiv.org/pdf/2503.18596
Copy Paste: [[2503.18596]] LinkAlign: Scalable Schema Linking for Real-World Large-Scale Multi-Database Text-to-SQL(https://arxiv.org/abs/2503.18596)
Keywords: robust, extraction
Abstract: Schema linking is a critical bottleneck in achieving human-level performance in Text-to-SQL tasks, particularly in real-world large-scale multi-database scenarios. Addressing schema linking faces two major challenges: (1) Database Retrieval: selecting the correct database from a large schema pool in multi-database settings, while filtering out irrelevant ones. (2) Schema Item Grounding: accurately identifying the relevant tables and columns from within a large and redundant schema for SQL generation. To address this, we introduce LinkAlign, a novel framework that can effectively adapt existing baselines to real-world environments by systematically addressing schema linking. Our framework comprises three key steps: multi-round semantic enhanced retrieval and irrelevant information isolation for Challenge 1, and schema extraction enhancement for Challenge 2. We evaluate the schema linking performance of our method on the SPIDER and BIRD benchmarks, and its ability to adapt existing Text-to-SQL models to real-world environments on the SPIDER 2.0-lite benchmark. Experiments show that LinkAlign outperforms existing baselines in multi-database settings, demonstrating its effectiveness and robustness. In addition, our method ranks highest among models excluding those using long chain-of-thought reasoning LLMs. This work bridges the gap between current research and real-world scenarios, providing a practical solution for robust and scalable schema linking. The code is available at this https URL.

Title: LANGALIGN: Enhancing Non-English Language Models via Cross-Lingual Embedding Alignment

Authors: Jong Myoung Kim, Young-Jun Lee, Ho-Jin Choi, Sangkeun Jung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.18603
Pdf URL: https://arxiv.org/pdf/2503.18603
Copy Paste: [[2503.18603]] LANGALIGN: Enhancing Non-English Language Models via Cross-Lingual Embedding Alignment(https://arxiv.org/abs/2503.18603)
Keywords: large language model
Abstract: While Large Language Models have gained attention, many service developers still rely on embedding-based models due to practical constraints. In such cases, the quality of fine-tuning data directly impacts performance, and English datasets are often used as seed data for training non-English models. In this study, we propose LANGALIGN, which enhances target language processing by aligning English embedding vectors with those of the target language at the interface between the language model and the task header. Experiments on Korean, Japanese, and Chinese demonstrate that LANGALIGN significantly improves performance across all three languages. Additionally, we show that LANGALIGN can be applied in reverse to convert target language data into a format that an English-based model can process.

Title: Adventurer: Exploration with BiGAN for Deep Reinforcement Learning

Authors: Yongshuai Liu, Xin Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18612
Pdf URL: https://arxiv.org/pdf/2503.18612
Copy Paste: [[2503.18612]] Adventurer: Exploration with BiGAN for Deep Reinforcement Learning(https://arxiv.org/abs/2503.18612)
Keywords: generative
Abstract: Recent developments in deep reinforcement learning have been very successful in learning complex, previously intractable problems. Sample efficiency and local optimality, however, remain significant challenges. To address these challenges, novelty-driven exploration strategies have emerged and shown promising potential. Unfortunately, no single algorithm outperforms all others in all tasks and most of them struggle with tasks with high-dimensional and complex observations. In this work, we propose Adventurer, a novelty-driven exploration algorithm that is based on Bidirectional Generative Adversarial Networks (BiGAN), where BiGAN is trained to estimate state novelty. Intuitively, a generator that has been trained on the distribution of visited states should only be able to generate a state coming from the distribution of visited states. As a result, novel states using the generator to reconstruct input states from certain latent representations would lead to larger reconstruction errors. We show that BiGAN performs well in estimating state novelty for complex observations. This novelty estimation method can be combined with intrinsic-reward-based exploration. Our empirical results show that Adventurer produces competitive results on a range of popular benchmark tasks, including continuous robotic manipulation tasks (e.g. Mujoco robotics) and high-dimensional image-based tasks (e.g. Atari games).
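
Illustrative sketch (not from the paper): the intuition in this abstract is that a generator trained only on visited states reconstructs novel states poorly, so reconstruction error can serve as a novelty signal for an intrinsic reward. The toy below replaces the BiGAN with a memory-based stand-in that can only "generate" states it has already seen. The function names, the stand-in generator, and the weight `beta` are all assumptions for illustration.

```python
import math

def reconstruct(state, visited_states):
    """Stand-in for a trained generator: return the closest visited state.

    Assumption: a real BiGAN would encode and decode the state; here the
    generator can only emit states from its (tiny) training set.
    """
    return min(visited_states, key=lambda v: math.dist(v, state))

def novelty(state, visited_states):
    """Reconstruction error, used as the novelty estimate."""
    return math.dist(state, reconstruct(state, visited_states))

def shaped_reward(extrinsic, state, visited_states, beta=0.1):
    """Extrinsic reward plus a novelty bonus (intrinsic reward)."""
    return extrinsic + beta * novelty(state, visited_states)

visited = [(0.0, 0.0), (1.0, 0.0)]
familiar_bonus = novelty((1.0, 0.0), visited)  # already visited
novel_bonus = novelty((4.0, 4.0), visited)     # far from anything visited
```

States far from the visited distribution receive a larger bonus, which is the behavior the abstract attributes to reconstruction-based novelty estimation.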

Title: Generative Dataset Distillation using Min-Max Diffusion Model

Authors: Junqiao Fan, Yunjiao Zhou, Min Chang Jordan Ren, Jianfei Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18626
Pdf URL: https://arxiv.org/pdf/2503.18626
Copy Paste: [[2503.18626]] Generative Dataset Distillation using Min-Max Diffusion Model(https://arxiv.org/abs/2503.18626)
Keywords: diffusion, generative
Abstract: In this paper, we address the problem of generative dataset distillation that utilizes generative models to synthesize images. The generator may produce any number of images under a preserved evaluation time. In this work, we leverage the popular diffusion model as the generator to compute a surrogate dataset, boosted by a min-max loss to control the dataset's diversity and representativeness during training. However, the diffusion model is time-consuming when generating images, as it requires an iterative generation process. We observe a critical trade-off between the number of image samples and the image quality controlled by the diffusion steps and propose Diffusion Step Reduction to achieve optimal performance. This paper details our comprehensive method and its performance. Our model achieved 2nd place in the generative track of The First Dataset Distillation Challenge of ECCV2024, demonstrating its superior performance.

Title: Dig2DIG: Dig into Diffusion Information Gains for Image Fusion

Authors: Bing Cao, Baoshuo Cai, Changqing Zhang, Qinghua Hu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18627
Pdf URL: https://arxiv.org/pdf/2503.18627
Copy Paste: [[2503.18627]] Dig2DIG: Dig into Diffusion Information Gains for Image Fusion(https://arxiv.org/abs/2503.18627)
Keywords: diffusion, generative
Abstract: Image fusion integrates complementary information from multi-source images to generate more informative results. Recently, the diffusion model, which demonstrates unprecedented generative potential, has been explored in image fusion. However, these approaches typically incorporate predefined multimodal guidance into diffusion, failing to capture the dynamically changing significance of each modality, while lacking theoretical guarantees. To address this issue, we reveal a significant spatio-temporal imbalance in image denoising; specifically, the diffusion model produces dynamic information gains in different image regions with denoising steps. Based on this observation, we Dig into the Diffusion Information Gains (Dig2DIG) and theoretically derive a diffusion-based dynamic image fusion framework that provably reduces the upper bound of the generalization error. Accordingly, we introduce diffusion information gains (DIG) to quantify the information contribution of each modality at different denoising steps, thereby providing dynamic guidance during the fusion process. Extensive experiments on multiple fusion scenarios confirm that our method outperforms existing diffusion-based approaches in terms of both fusion quality and inference efficiency.

Title: Robust Lane Detection with Wavelet-Enhanced Context Modeling and Adaptive Sampling

Authors: Kunyang Li, Ming Hou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18631
Pdf URL: https://arxiv.org/pdf/2503.18631
Copy Paste: [[2503.18631]] Robust Lane Detection with Wavelet-Enhanced Context Modeling and Adaptive Sampling(https://arxiv.org/abs/2503.18631)
Keywords: robust
Abstract: Lane detection is critical for autonomous driving and advanced driver assistance systems (ADAS). While recent methods like CLRNet achieve strong performance, they struggle under adverse conditions such as extreme weather, illumination changes, occlusions, and complex curves. We propose a Wavelet-Enhanced Feature Pyramid Network (WE-FPN) to address these challenges. A wavelet-based non-local block is integrated before the feature pyramid to improve global context modeling, especially for occluded and curved lanes. Additionally, we design an adaptive preprocessing module to enhance lane visibility under poor lighting. An attention-guided sampling strategy further refines spatial features, boosting accuracy on distant and curved lanes. Experiments on CULane and TuSimple demonstrate that our approach significantly outperforms baselines in challenging scenarios, achieving better robustness and accuracy in real-world driving conditions.

Title: Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks

Authors: Nina Shvetsova, Arsha Nagrani, Bernt Schiele, Hilde Kuehne, Christian Rupprecht
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18637
Pdf URL: https://arxiv.org/pdf/2503.18637
Copy Paste: [[2503.18637]] Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks(https://arxiv.org/abs/2503.18637)
Keywords: robust
Abstract: We propose a new "Unbiased through Textual Description (UTD)" video benchmark based on unbiased subsets of existing video classification and retrieval datasets to enable a more robust assessment of video understanding capabilities. Namely, we tackle the problem that current video benchmarks may suffer from different representation biases, e.g., object bias or single-frame bias, where mere recognition of objects or utilization of only a single frame is sufficient for correct prediction. We leverage VLMs and LLMs to analyze and debias benchmarks from such representation biases. Specifically, we generate frame-wise textual descriptions of videos, filter them for specific information (e.g., only objects) and leverage them to examine representation biases across three dimensions: 1) concept bias - determining if a specific concept (e.g., objects) alone suffices for prediction; 2) temporal bias - assessing if temporal information contributes to prediction; and 3) common sense vs. dataset bias - evaluating whether zero-shot reasoning or dataset correlations contribute to prediction. We conduct a systematic analysis of 12 popular video classification and retrieval datasets and create new object-debiased test splits for these datasets. Moreover, we benchmark 30 state-of-the-art video models on original and debiased splits and analyze biases in the models. To facilitate the future development of more robust video understanding benchmarks and models, we release: "UTD-descriptions", a dataset with our rich structured descriptions for each dataset, and "UTD-splits", a dataset of object-debiased test splits.

Title: ZeroLM: Data-Free Transformer Architecture Search for Language Models

Authors: Zhen-Song Chen, Hong-Wei Ding, Xian-Jia Wang, Witold Pedrycz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.18646
Pdf URL: https://arxiv.org/pdf/2503.18646
Copy Paste: [[2503.18646]] ZeroLM: Data-Free Transformer Architecture Search for Language Models(https://arxiv.org/abs/2503.18646)
Keywords: robust, data-free, transformer
Abstract: Neural architecture search (NAS) provides a systematic framework for automating the design of neural network architectures, yet its widespread adoption is hindered by prohibitive computational requirements. Existing zero-cost proxy methods, while reducing search overhead, demonstrate inadequate performance in architecture ranking tasks, particularly for Transformer-based models where they often underperform simple parameter counting metrics. Current automated proxy discovery approaches suffer from extended search times, susceptibility to data overfitting, and structural complexity. This paper introduces a novel zero-cost proxy methodology that quantifies model capacity through efficient weight statistics computation while decomposing Transformer architectures into functionally distinct sub-modules, thereby optimizing the balance of their contributions to overall performance. Our comprehensive evaluation demonstrates the superiority of this approach, achieving a Spearman's rho of 0.76 and Kendall's tau of 0.53 on the FlexiBERT benchmark. The proposed method exhibits exceptional computational efficiency while maintaining robust performance across diverse NAS benchmark tasks, offering a practical solution for large-scale architecture search.
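
Illustrative sketch (not from the paper): the abstract reports proxy quality as rank correlation (Spearman's rho and Kendall's tau) between proxy scores and measured model performance. The snippet below shows how these two statistics can be computed from scratch. It assumes no tied values and uses the tau-a form of Kendall's tau. The scores and accuracies are made-up numbers, not results from the paper.

```python
def ranks(values):
    """Return 1-based ranks of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(x, y):
    """Spearman's rho via the rank-difference formula (no ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical proxy scores and accuracies for five architectures.
proxy_scores = [0.2, 0.5, 0.1, 0.9, 0.7]
accuracies = [71.0, 74.5, 70.2, 78.1, 79.3]
rho = spearman_rho(proxy_scores, accuracies)  # 0.9 for this toy data
tau = kendall_tau(proxy_scores, accuracies)   # 0.8 for this toy data
```

In this toy data only one pair of architectures is ranked in the wrong order by the proxy, which is why both statistics are close to 1.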

Title: Robust face recognition based on the wing loss and the $\ell_1$ regularization

Authors: Yaoyao Yun, Jianwen Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18652
Pdf URL: https://arxiv.org/pdf/2503.18652
Copy Paste: [[2503.18652]] Robust face recognition based on the wing loss and the $\ell_1$ regularization(https://arxiv.org/abs/2503.18652)
Keywords: robust
Abstract: In recent years, sparse sampling techniques based on regression analysis have witnessed extensive applications in face recognition research. Presently, numerous sparse sampling models based on regression analysis have been explored by various researchers. Nevertheless, the recognition rates of the majority of these models decrease significantly when confronted with highly occluded and highly damaged face images. In this paper, a new wing-constrained sparse coding model (WCSC) and its weighted version (WWCSC) are introduced to deal with the face recognition problem in complex circumstances, where the alternating direction method of multipliers (ADMM) algorithm is employed to solve the corresponding minimization problems. In addition, the performance of the proposed method is examined on four well-known facial databases, namely the ORL facial database, the Yale facial database, the AR facial database and the FERET facial database. Also, compared to the other methods in the literature, the WWCSC has a very high recognition rate even in complex situations where face images have high occlusion or high damage, which illustrates the robustness of the WWCSC method in facial recognition.

Title: Boosting Virtual Agent Learning and Reasoning: A Step-wise, Multi-dimensional, and Generalist Reward Model with Benchmark

Authors: Bingchen Miao, Yang Wu, Minghe Gao, Qifan Yu, Wendong Bu, Wenqiao Zhang, Yunfei Li, Siliang Tang, Tat-Seng Chua, Juncheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18665
Pdf URL: https://arxiv.org/pdf/2503.18665
Copy Paste: [[2503.18665]] Boosting Virtual Agent Learning and Reasoning: A Step-wise, Multi-dimensional, and Generalist Reward Model with Benchmark(https://arxiv.org/abs/2503.18665)
Keywords: large language model
Abstract: The development of Generalist Virtual Agents (GVAs) powered by Multimodal Large Language Models (MLLMs) has shown significant promise in autonomous task execution. However, current training paradigms face critical limitations, including reliance on outcome supervision and labor-intensive human annotations. To address these challenges, we propose Similar, a Step-wise Multi-dimensional Generalist Reward Model, which offers fine-grained signals for agent training and can select better actions for inference-time scaling. Specifically, we begin by systematically defining five dimensions for evaluating agent actions. Building on this framework, we design an MCTS-P algorithm to automatically collect and annotate step-wise, five-dimensional agent execution data. Using this data, we train Similar with the Triple-M strategy. Furthermore, we introduce the first benchmark in the virtual agent domain for step-wise, multi-dimensional reward model training and evaluation, named SRM. This benchmark consists of two components: SRMTrain, which serves as the training set for Similar, and SRMEval, a manually selected test set for evaluating the reward model. Experimental results demonstrate that Similar, through its step-wise, multi-dimensional assessment and synergistic gain, provides GVAs with effective intermediate signals during both training and inference-time scaling. The code is available at this https URL.

Title: Structure-Aware Correspondence Learning for Relative Pose Estimation

Authors: Yihan Chen, Wenfei Yang, Huan Ren, Shifeng Zhang, Tianzhu Zhang, Feng Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18671
Pdf URL: https://arxiv.org/pdf/2503.18671
Copy Paste: [[2503.18671]] Structure-Aware Correspondence Learning for Relative Pose Estimation(https://arxiv.org/abs/2503.18671)
Keywords: extraction
Abstract: Relative pose estimation provides a promising way for achieving object-agnostic pose estimation. Despite the success of existing 3D correspondence-based methods, the reliance on explicit feature matching suffers from small overlaps in visible regions and unreliable feature estimation for invisible regions. Inspired by humans' ability to assemble two object parts that have small or no overlapping regions by considering object structure, we propose a novel Structure-Aware Correspondence Learning method for Relative Pose Estimation, which consists of two key modules. First, a structure-aware keypoint extraction module is designed to locate a set of keypoints that can represent the structure of objects with different shapes and appearance, under the guidance of a keypoint based image reconstruction loss. Second, a structure-aware correspondence estimation module is designed to model the intra-image and inter-image relationships between keypoints to extract structure-aware features for correspondence estimation. By jointly leveraging these two modules, the proposed method can naturally estimate 3D-3D correspondences for unseen objects without explicit feature matching for precise relative pose estimation. Experimental results on the CO3D, Objaverse and LineMOD datasets demonstrate that the proposed method significantly outperforms prior methods, i.e., with 5.7° reduction in mean angular error on the CO3D dataset.

Title: Any6D: Model-free 6D Pose Estimation of Novel Objects

Authors: Taeyeop Lee, Bowen Wen, Minjun Kang, Gyuree Kang, In So Kweon, Kuk-Jin Yoon
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2503.18673
Pdf URL: https://arxiv.org/pdf/2503.18673
Copy Paste: [[2503.18673]] Any6D: Model-free 6D Pose Estimation of Novel Objects(https://arxiv.org/abs/2503.18673)
Keywords: robust
Abstract: We introduce Any6D, a model-free framework for 6D object pose estimation that requires only a single RGB-D anchor image to estimate both the 6D pose and size of unknown objects in novel scenes. Unlike existing methods that rely on textured 3D models or multiple viewpoints, Any6D leverages a joint object alignment process to enhance 2D-3D alignment and metric scale estimation for improved pose accuracy. Our approach integrates a render-and-compare strategy to generate and refine pose hypotheses, enabling robust performance in scenarios with occlusions, non-overlapping views, diverse lighting conditions, and large cross-environment variations. We evaluate our method on five challenging datasets: REAL275, Toyota-Light, HO3D, YCBINEOAT, and LM-O, demonstrating its effectiveness in significantly outperforming state-of-the-art methods for novel object pose estimation. Project page: this https URL

Title: Human Motion Unlearning

Authors: Edoardo De Matteis, Matteo Migliarini, Alessio Sampieri, Indro Spinelli, Fabio Galasso
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18674
Pdf URL: https://arxiv.org/pdf/2503.18674
Copy Paste: [[2503.18674]] Human Motion Unlearning(https://arxiv.org/abs/2503.18674)
Keywords: diffusion, generative
Abstract: We introduce the task of human motion unlearning to prevent the synthesis of toxic animations while preserving the general text-to-motion generative performance. Unlearning toxic motions is challenging as those can be generated from explicit text prompts and from implicit toxic combinations of safe motions (e.g., "kicking" is "loading and swinging a leg"). We propose the first motion unlearning benchmark by filtering toxic motions from the large and recent text-to-motion datasets of HumanML3D and Motion-X. We propose baselines, by adapting state-of-the-art image unlearning techniques to process spatio-temporal signals. Finally, we propose a novel motion unlearning model based on Latent Code Replacement, which we dub LCR. LCR is training-free and suitable to the discrete latent spaces of state-of-the-art text-to-motion diffusion models. LCR is simple and consistently outperforms baselines qualitatively and quantitatively. Project page: this https URL.

Title: NullSwap: Proactive Identity Cloaking Against Deepfake Face Swapping

Authors: Tianyi Wang, Harry Cheng, Xiao Zhang, Yinglong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18678
Pdf URL: https://arxiv.org/pdf/2503.18678
Copy Paste: [[2503.18678]] NullSwap: Proactive Identity Cloaking Against Deepfake Face Swapping(https://arxiv.org/abs/2503.18678)
Keywords: protect, defense, extraction, generative
Abstract: Suffering from performance bottlenecks in passively detecting high-quality Deepfake images due to the advancement of generative models, proactive perturbations offer a promising approach to disabling Deepfake manipulations by inserting signals into benign images. However, existing proactive perturbation approaches remain unsatisfactory in several aspects: 1) visual degradation due to direct element-wise addition; 2) limited effectiveness against face swapping manipulation; 3) unavoidable reliance on white- and grey-box settings to involve generative models during training. In this study, we analyze the essence of Deepfake face swapping and argue the necessity of protecting source identities rather than target images, and we propose NullSwap, a novel proactive defense approach that cloaks source image identities and nullifies face swapping under a pure black-box scenario. We design an Identity Extraction module to obtain facial identity features from the source image, while a Perturbation Block is then devised to generate identity-guided perturbations accordingly. Meanwhile, a Feature Block extracts shallow-level image features, which are then fused with the perturbation in the Cloaking Block for image reconstruction. Furthermore, to ensure adaptability across different identity extractors in face swapping algorithms, we propose Dynamic Loss Weighting to adaptively balance identity losses. Experiments demonstrate the outstanding ability of our approach to fool various identity recognition models, outperforming state-of-the-art proactive perturbations in preventing face swapping models from generating images with correct source identities.

Title: Commander-GPT: Fully Unleashing the Sarcasm Detection Capability of Multi-Modal Large Language Models

Authors: Yazhou Zhang, Chunwang Zou, Bo Wang, Jing Qin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18681
Pdf URL: https://arxiv.org/pdf/2503.18681
Copy Paste: [[2503.18681]] Commander-GPT: Fully Unleashing the Sarcasm Detection Capability of Multi-Modal Large Language Models(https://arxiv.org/abs/2503.18681)
Keywords: large language model
Abstract: Sarcasm detection, as a crucial research direction in the field of Natural Language Processing (NLP), has attracted widespread attention. Traditional sarcasm detection tasks have typically focused on single-modal approaches (e.g., text), but due to the implicit and subtle nature of sarcasm, such methods often fail to yield satisfactory results. In recent years, researchers have shifted the focus of sarcasm detection to multi-modal approaches. However, effectively leveraging multi-modal information to accurately identify sarcastic content remains a challenge that warrants further exploration. Leveraging the powerful integrated processing capabilities of Multi-Modal Large Language Models (MLLMs) for various information sources, we propose an innovative multi-modal Commander-GPT framework. Inspired by military strategy, we first decompose the sarcasm detection task into six distinct sub-tasks. A central commander (decision-maker) then assigns the best-suited large language model to address each specific sub-task. Ultimately, the detection results from each model are aggregated to identify sarcasm. We conducted extensive experiments on MMSD and MMSD 2.0, utilizing four multi-modal large language models and six prompting strategies. Our experiments demonstrate that our approach achieves state-of-the-art performance, with a 19.3% improvement in F1 score, without necessitating fine-tuning or ground-truth rationales.

Title: OCRT: Boosting Foundation Models in the Open World with Object-Concept-Relation Triad

Authors: Luyao Tang, Yuxuan Yuan, Chaoqi Chen, Zeyu Zhang, Yue Huang, Kun Zhang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.18695
Pdf URL: https://arxiv.org/pdf/2503.18695
Copy Paste: [[2503.18695]] OCRT: Boosting Foundation Models in the Open World with Object-Concept-Relation Triad(https://arxiv.org/abs/2503.18695)
Keywords: attack, robust, extraction
Abstract: Although foundation models (FMs) claim to be powerful, their generalization ability significantly decreases when faced with distribution shifts, weak supervision, or malicious attacks in the open world. On the other hand, most domain generalization or adversarial fine-tuning methods are task-related or model-specific, ignoring the universality in practical applications and the transferability between FMs. This paper delves into the problem of generalizing FMs to the out-of-domain data. We propose a novel framework, the Object-Concept-Relation Triad (OCRT), that enables FMs to extract sparse, high-level concepts and intricate relational structures from raw visual inputs. The key idea is to bind objects in visual scenes and a set of object-centric representations through unsupervised decoupling and iterative refinement. To be specific, we project the object-centric representations onto a semantic concept space that the model can readily interpret and estimate their importance to filter out irrelevant elements. Then, a concept-based graph, which has a flexible degree, is constructed to incorporate the set of concepts and their corresponding importance, enabling the extraction of high-order factors from informative concepts and facilitating relational reasoning among these concepts. Extensive experiments demonstrate that OCRT can substantially boost the generalizability and robustness of SAM and CLIP across multiple downstream tasks.

Title: LLaVAction: evaluating and training multi-modal large language models for action recognition

Authors: Shaokai Ye, Haozhe Qi, Alexander Mathis, Mackenzie W. Mathis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18712
Pdf URL: https://arxiv.org/pdf/2503.18712
Copy Paste: [[2503.18712]] LLaVAction: evaluating and training multi-modal large language models for action recognition(https://arxiv.org/abs/2503.18712)
Keywords: large language model
Abstract: Understanding human behavior requires measuring behavioral actions. Due to its complexity, behavior is best mapped onto a rich, semantic structure such as language. The recent development of multi-modal large language models (MLLMs) is a promising candidate for a wide range of action understanding tasks. In this work, we focus on evaluating and then improving MLLMs to perform action recognition. We reformulate EPIC-KITCHENS-100, one of the largest and most challenging egocentric action datasets, to the form of video multiple question answering (EPIC-KITCHENS-100-MQA). We show that when we sample difficult incorrect answers as distractors, leading MLLMs struggle to recognize the correct actions. We propose a series of methods that greatly improve the MLLMs' ability to perform action recognition, achieving state-of-the-art on both the EPIC-KITCHENS-100 validation set, as well as outperforming GPT-4o by 21 points in accuracy on EPIC-KITCHENS-100-MQA. Lastly, we show improvements on other action-related video benchmarks such as EgoSchema, PerceptionTest, LongVideoBench, VideoMME and MVBench, suggesting that MLLMs are a promising path forward for complex action tasks. Code and models are available at: this https URL.

Title: GS-Marker: Generalizable and Robust Watermarking for 3D Gaussian Splatting

Authors: Lijiang Li, Jinglu Wang, Xiang Ming, Yan Lu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.18718
Pdf URL: https://arxiv.org/pdf/2503.18718
Copy Paste: [[2503.18718]] GS-Marker: Generalizable and Robust Watermarking for 3D Gaussian Splatting(https://arxiv.org/abs/2503.18718)
Keywords: robust, watermark, generative
Abstract: In the Generative AI era, safeguarding 3D models has become increasingly urgent. While invisible watermarking is well-established for 2D images with encoder-decoder frameworks, generalizable and robust solutions for 3D remain elusive. The main difficulty arises from the renderer between the 3D encoder and 2D decoder, which disrupts direct gradient flow and complicates training. Existing 3D methods typically rely on per-scene iterative optimization, resulting in time inefficiency and limited generalization. In this work, we propose a single-pass watermarking approach for 3D Gaussian Splatting (3DGS), a well-known yet underexplored representation for watermarking. We identify two major challenges: (1) ensuring effective training generalized across diverse 3D models, and (2) reliably extracting watermarks from free-view renderings, even under distortions. Our framework, named GS-Marker, incorporates a 3D encoder to embed messages, distortion layers to enhance resilience against various distortions, and a 2D decoder to extract watermarks from renderings. A key innovation is the Adaptive Marker Control mechanism that adaptively perturbs the initially optimized 3DGS, escaping local minima and improving both training stability and convergence. Extensive experiments show that GS-Marker outperforms per-scene training approaches in terms of decoding accuracy and model fidelity, while also significantly reducing computation time.

Title: Boosting Resolution Generalization of Diffusion Transformers with Randomized Positional Encodings

Authors: Cong Liu, Liang Hou, Mingwu Zheng, Xin Tao, Pengfei Wan, Di Zhang, Kun Gai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18719
Pdf URL: https://arxiv.org/pdf/2503.18719
Copy Paste: [[2503.18719]] Boosting Resolution Generalization of Diffusion Transformers with Randomized Positional Encodings(https://arxiv.org/abs/2503.18719)
Keywords: diffusion, transformer
Abstract: Resolution generalization in image generation tasks enables the production of higher-resolution images with lower training resolution overhead. However, a significant challenge in resolution generalization, particularly in the widely used Diffusion Transformers, lies in the mismatch between the positional encodings encountered during testing and those used during training. While existing methods have employed techniques such as interpolation, extrapolation, or their combinations, none have fully resolved this issue. In this paper, we propose a novel two-dimensional randomized positional encodings (RPE-2D) framework that focuses on learning positional order of image patches instead of the specific distances between them, enabling seamless high- and low-resolution image generation without requiring high- and low-resolution image training. Specifically, RPE-2D independently selects positions over a broader range along both the horizontal and vertical axes, ensuring that all position encodings are trained during the inference phase, thus improving resolution generalization. Additionally, we propose a random data augmentation technique to enhance the modeling of position order. To address the issue of image cropping caused by the augmentation, we introduce corresponding micro-conditioning to enable the model to perceive the specific cropping patterns. On the ImageNet dataset, our proposed RPE-2D achieves state-of-the-art resolution generalization performance, outperforming existing competitive methods when trained at a resolution of $256 \times 256$ and inferred at $384 \times 384$ and $512 \times 512$, as well as when scaling from $512 \times 512$ to $768 \times 768$ and $1024 \times 1024$. It also exhibits outstanding capabilities in low-resolution image generation, multi-stage training acceleration and multi-resolution inheritance.
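
Illustrative sketch (not from the paper): the abstract says RPE-2D independently selects positions over a broader range along the horizontal and vertical axes, so that the model learns the order of patches instead of their exact distances. One way to picture this is to draw a sorted random subset of indices per axis, as below. The function name, the `max_len` range, and the use of uniform sampling without replacement are assumptions for illustration.

```python
import random

def sample_randomized_positions(height, width, max_len, rng=None):
    """Assign each patch in a height x width grid a (row, col) position.

    Row and column indices are drawn independently, without replacement,
    from range(max_len) and then sorted, so patch order is preserved
    while the gaps between neighboring positions vary.
    """
    rng = rng or random.Random()
    rows = sorted(rng.sample(range(max_len), height))
    cols = sorted(rng.sample(range(max_len), width))
    return [[(r, c) for c in cols] for r in rows]

# A 4 x 3 patch grid with positions drawn from a larger range of 16.
grid = sample_randomized_positions(4, 3, 16, random.Random(0))
```

Because `max_len` exceeds the training grid size, indices that would otherwise only appear at higher resolutions are also seen when training on small grids.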

Title: Thermalizer: Stable autoregressive neural emulation of spatiotemporal chaos

Authors: Chris Pedersen, Laure Zanna, Joan Bruna
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.18731
Pdf URL: https://arxiv.org/pdf/2503.18731
Copy Paste: [[2503.18731]] Thermalizer: Stable autoregressive neural emulation of spatiotemporal chaos(https://arxiv.org/abs/2503.18731)
Keywords: diffusion
Abstract: Autoregressive surrogate models (or emulators) of spatiotemporal systems provide an avenue for fast, approximate predictions, with broad applications across science and engineering. At inference time, however, these models are generally unable to provide predictions over long time rollouts due to accumulation of errors leading to diverging trajectories. In essence, emulators operate out of distribution, and controlling the online distribution quickly becomes intractable in large-scale settings. To address this fundamental issue, and focusing on time-stationary systems admitting an invariant measure, we leverage diffusion models to obtain an implicit estimator of the score of this invariant measure. We show that this model of the score function can be used to stabilize autoregressive emulator rollouts by applying on-the-fly denoising during inference, a process we call thermalization. Thermalizing an emulator rollout is shown to extend the time horizon of stable predictions by an order of magnitude in complex systems exhibiting turbulent and chaotic behavior, opening up a novel application of diffusion models in the context of neural emulation.

Title: SFDLA: Source-Free Document Layout Analysis

Authors: Sebastian Tewes, Yufan Chen, Omar Moured, Jiaming Zhang, Rainer Stiefelhagen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18742
Pdf URL: https://arxiv.org/pdf/2503.18742
Copy Paste: [[2503.18742]] SFDLA: Source-Free Document Layout Analysis(https://arxiv.org/abs/2503.18742)
Keywords: privacy
Abstract: Document Layout Analysis (DLA) is a fundamental task in document understanding. However, existing DLA and adaptation methods often require access to large-scale source data and target labels. These requirements severely limit their real-world applicability, particularly in privacy-sensitive and resource-constrained domains, such as financial statements, medical records, and proprietary business documents. According to our observation, directly transferring source-domain fine-tuned models to target domains often results in a significant performance drop (Avg. -32.64%). In this work, we introduce Source-Free Document Layout Analysis (SFDLA), which aims to adapt a pre-trained source DLA model to an unlabeled target domain without access to any source data. To address this challenge, we establish the first SFDLA benchmark, covering three major DLA datasets for geometric- and content-aware adaptation. Furthermore, we propose Document Layout Analysis Adapter (DLAdapter), a novel framework that is designed to improve source-free adaptation across document domains. Our method achieves a +4.21% improvement over the source-only baseline and a +2.26% gain over existing source-free methods from PubLayNet to DocLayNet. We believe this work will inspire the DLA community to further investigate source-free document understanding. To support future research of the community, the benchmark, models, and code will be publicly available at this https URL.

Title: Linguistics-aware Masked Image Modeling for Self-supervised Scene Text Recognition

Authors: Yifei Zhang, Chang Liu, Jin Wei, Xiaomeng Yang, Yu Zhou, Can Ma, Xiangyang Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18746
Pdf URL: https://arxiv.org/pdf/2503.18746
Copy Paste: [[2503.18746]] Linguistics-aware Masked Image Modeling for Self-supervised Scene Text Recognition(https://arxiv.org/abs/2503.18746)
Keywords: robust
Abstract: Text images are unique in their dual nature, encompassing both visual and linguistic information. The visual component encompasses structural and appearance-based features, while the linguistic dimension incorporates contextual and semantic elements. In scenarios with degraded visual quality, linguistic patterns serve as crucial supplements for comprehension, highlighting the necessity of integrating both aspects for robust scene text recognition (STR). Contemporary STR approaches often use language models or semantic reasoning modules to capture linguistic features, typically requiring large-scale annotated datasets. Self-supervised learning, which lacks annotations, presents challenges in disentangling linguistic features related to the global context. Typically, sequence contrastive learning emphasizes the alignment of local features, while masked image modeling (MIM) tends to exploit local structures to reconstruct visual patterns, resulting in limited linguistic knowledge. In this paper, we propose a Linguistics-aware Masked Image Modeling (LMIM) approach, which channels the linguistic information into the decoding process of MIM through a separate branch. Specifically, we design a linguistics alignment module to extract vision-independent features as linguistic guidance using inputs with different visual appearances. As features extend beyond mere visual structures, LMIM must consider the global context to achieve reconstruction. Extensive experiments on various benchmarks quantitatively demonstrate our state-of-the-art performance, and attention visualizations qualitatively show the simultaneous capture of both visual and linguistic information.

Title: Simulation-Driven Balancing of Competitive Game Levels with Reinforcement Learning

Authors: Florian Rupp, Manuel Eberhardinger, Kai Eckert
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.18748
Pdf URL: https://arxiv.org/pdf/2503.18748
Copy Paste: [[2503.18748]] Simulation-Driven Balancing of Competitive Game Levels with Reinforcement Learning(https://arxiv.org/abs/2503.18748)
Keywords: robust, fair
Abstract: The balancing process for game levels in competitive two-player contexts involves a lot of manual work and testing, particularly for non-symmetrical game levels. In this work, we frame game balancing as a procedural content generation task and propose an architecture for automatically balancing tile-based levels within the PCGRL framework (procedural content generation via reinforcement learning). Our architecture is divided into three parts: (1) a level generator, (2) a balancing agent, and (3) a reward modeling simulation. Through repeated simulations, the balancing agent receives rewards for adjusting the level towards a given balancing objective, such as equal win rates for all players. To this end, we propose new swap-based representations to improve the robustness of playability, thereby enabling agents to balance game levels more effectively and quickly compared to traditional PCGRL. By analyzing the agent's swapping behavior, we can infer which tile types have the most impact on the balance. We validate our approach in the Neural MMO (NMMO) environment in a competitive two-player scenario. In this extended conference paper, we present improved results, explore the applicability of the method to various forms of balancing beyond equal balancing, compare the performance to another search-based approach, and discuss the application of existing fairness metrics to game balancing.
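
To make the notion of a swap-based action concrete, here is a small illustrative sketch. It is not the authors' PCGRL representation: the grid encoding, tile names, and function names are hypothetical. It only shows one property of swapping two tiles in a tile-based level, namely that the multiset of tiles stays the same while their placement changes.

```python
from collections import Counter


def swap_tiles(level, pos_a, pos_b):
    """Return a copy of `level` with the tiles at `pos_a` and `pos_b` exchanged.

    `level` is a list of rows, and each position is a (row, column) pair.
    The input level is left unmodified.
    """
    new_level = [row[:] for row in level]
    (row_a, col_a), (row_b, col_b) = pos_a, pos_b
    new_level[row_a][col_a], new_level[row_b][col_b] = (
        new_level[row_b][col_b],
        new_level[row_a][col_a],
    )
    return new_level


def tile_counts(level):
    """Count how often each tile type occurs in the level."""
    return Counter(tile for row in level for tile in row)


if __name__ == "__main__":
    # Hypothetical 2x3 level; the tile names are placeholders.
    level = [["grass", "water", "food"], ["rock", "grass", "grass"]]
    swapped = swap_tiles(level, (0, 2), (1, 0))
    print(swapped)
    print(tile_counts(level) == tile_counts(swapped))
```

A balancing agent built on such an action would move resources around without adding or deleting them; whether this matches the paper's exact action space is an assumption.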

Title: Construction Identification and Disambiguation Using BERT: A Case Study of NPN

Authors: Wesley Scivetti, Nathan Schneider
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18751
Pdf URL: https://arxiv.org/pdf/2503.18751
Copy Paste: [[2503.18751]] Construction Identification and Disambiguation Using BERT: A Case Study of NPN(https://arxiv.org/abs/2503.18751)
Keywords: transformer
Abstract: Construction Grammar hypothesizes that knowledge of a language consists chiefly of knowledge of form-meaning pairs ("constructions") that include vocabulary, general grammar rules, and even idiosyncratic patterns. Recent work has shown that transformer language models represent at least some constructional patterns, including ones where the construction is rare overall. In this work, we probe BERT's representation of the form and meaning of a minor construction of English, the NPN (noun-preposition-noun) construction -- exhibited in such expressions as "face to face" and "day to day" -- which is known to be polysemous. We construct a benchmark dataset of semantically annotated corpus instances (including distractors that superficially resemble the construction). With this dataset, we train and evaluate probing classifiers. They achieve decent discrimination of the construction from distractors, as well as sense disambiguation among true instances of the construction, revealing that BERT embeddings carry indications of the construction's semantics. Moreover, artificially permuting the word order of true construction instances causes them to be rejected, indicating sensitivity to matters of form. We conclude that BERT does latently encode at least some knowledge of the NPN construction going beyond a surface syntactic pattern and lexical cues.
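
For orientation, the surface pattern behind expressions such as "face to face" and "day to day" can be matched with a simple regular expression. The sketch below is not the authors' pipeline; it is the kind of surface-only matcher the probing study goes beyond. It assumes the same word appears on both sides of the preposition, and the preposition list is a hypothetical choice.

```python
import re

# Hypothetical preposition list; the abstract does not enumerate the prepositions studied.
PREPOSITIONS = ("to", "by", "after", "upon", "for")

NPN_PATTERN = re.compile(
    r"\b([a-z]+)\s+(?:" + "|".join(PREPOSITIONS) + r")\s+\1\b"
)


def find_npn_candidates(text):
    """Return substrings shaped like 'word PREP word' with the same word on both sides.

    This is a purely surface match: it cannot tell nouns from other words,
    and it cannot separate true NPN instances from look-alike distractors.
    """
    return [match.group(0) for match in NPN_PATTERN.finditer(text.lower())]


if __name__ == "__main__":
    sentence = "They met face to face and worked day to day, then went to town."
    print(find_npn_candidates(sentence))
```

Because the matcher ignores part of speech and meaning, it would also flag distractors that only resemble the construction, which is where a learned representation becomes relevant.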

Title: EgoSurgery-HTS: A Dataset for Egocentric Hand-Tool Segmentation in Open Surgery Videos

Authors: Nathan Darjana, Ryo Fujii, Hideo Saito, Hiroki Kajita
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.18755
Pdf URL: https://arxiv.org/pdf/2503.18755
Copy Paste: [[2503.18755]] EgoSurgery-HTS: A Dataset for Egocentric Hand-Tool Segmentation in Open Surgery Videos(https://arxiv.org/abs/2503.18755)
Keywords: segmentation
Abstract: Egocentric open-surgery videos capture rich, fine-grained details essential for accurately modeling surgical procedures and human behavior in the operating room. A detailed, pixel-level understanding of hands and surgical tools is crucial for interpreting a surgeon's actions and intentions. We introduce EgoSurgery-HTS, a new dataset with pixel-wise annotations and a benchmark suite for segmenting surgical tools, hands, and interacting tools in egocentric open-surgery videos. Specifically, we provide a labeled dataset for (1) tool instance segmentation of 14 distinct surgical tools, (2) hand instance segmentation, and (3) hand-tool segmentation to label hands and the tools they manipulate. Using EgoSurgery-HTS, we conduct extensive evaluations of state-of-the-art segmentation methods and demonstrate significant improvements in the accuracy of hand and hand-tool segmentation in egocentric open-surgery videos compared to existing datasets. The dataset will be released at this https URL.

Title: Mechanistic Interpretability of Fine-Tuned Vision Transformers on Distorted Images: Decoding Attention Head Behavior for Transparent and Trustworthy AI

Authors: Nooshin Bahador
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18762
Pdf URL: https://arxiv.org/pdf/2503.18762
Copy Paste: [[2503.18762]] Mechanistic Interpretability of Fine-Tuned Vision Transformers on Distorted Images: Decoding Attention Head Behavior for Transparent and Trustworthy AI(https://arxiv.org/abs/2503.18762)
Keywords: robust, interpretability, transformer
Abstract: Mechanistic interpretability improves the safety, reliability, and robustness of large AI models. This study examined individual attention heads in vision transformers (ViTs) fine-tuned on distorted 2D spectrogram images containing irrelevant content (axis labels, titles, color bars). By introducing extraneous features, the study analyzed how transformer components processed unrelated information, using mechanistic interpretability to debug issues and reveal insights into transformer architectures. Attention maps assessed head contributions across layers. Heads in early layers (1 to 3) showed minimal task impact: ablating them increased MSE loss only slightly ($\mu$=0.11%, $\sigma$=0.09%), indicating a focus on less critical low-level features. In contrast, ablating deeper heads (e.g., layer 6) caused a threefold higher loss increase ($\mu$=0.34%, $\sigma$=0.02%), demonstrating greater task importance. Intermediate layers (6 to 11) exhibited monosemantic behavior, attending exclusively to chirp regions. Some early heads (1 to 4) were monosemantic but not task-relevant (e.g. text detectors, edge or corner detectors). Attention maps distinguished monosemantic heads (precise chirp localization) from polysemantic heads (multiple irrelevant regions). These findings revealed functional specialization in ViTs, showing how heads processed relevant vs. extraneous information. By decomposing transformers into interpretable components, this work enhanced model understanding, identified vulnerabilities, and advanced safer, more transparent AI.

Title: Good Keypoints for the Two-View Geometry Estimation Problem

Authors: Konstantin Pakulev, Alexander Vakhitov, Gonzalo Ferrer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18767
Pdf URL: https://arxiv.org/pdf/2503.18767
Copy Paste: [[2503.18767]] Good Keypoints for the Two-View Geometry Estimation Problem(https://arxiv.org/abs/2503.18767)
Keywords: robust
Abstract: Local features are essential to many modern downstream applications. Therefore, it is of interest to determine the properties of local features that contribute to the downstream performance for a better design of feature detectors and descriptors. In our work, we propose a new theoretical model for scoring feature points (keypoints) in the context of the two-view geometry estimation problem. The model determines two properties that a good keypoint for solving the homography estimation problem should have: be repeatable and have a small expected measurement error. This result provides key insights into why maximizing the number of correspondences doesn't always lead to better homography estimation accuracy. We use the developed model to design a method that detects keypoints that benefit the homography estimation introducing the Bounded NeSS-ST (BoNeSS-ST) keypoint detector. The novelty of BoNeSS-ST comes from strong theoretical foundations, a more accurate keypoint scoring due to subpixel refinement and a cost designed for superior robustness to low saliency keypoints. As a result, BoNeSS-ST outperforms prior self-supervised local feature detectors in both planar homography and epipolar geometry estimation problems.

Title: AlphaSpace: Enabling Robotic Actions through Semantic Tokenization and Symbolic Reasoning

Authors: Alan Dao (Gia Tuan Dao), Dinh Bach Vu, Bui Quang Huy
Subjects: cs.CL, cs.RO
Abstract URL: https://arxiv.org/abs/2503.18769
Pdf URL: https://arxiv.org/pdf/2503.18769
Copy Paste: [[2503.18769]] AlphaSpace: Enabling Robotic Actions through Semantic Tokenization and Symbolic Reasoning(https://arxiv.org/abs/2503.18769)
Keywords: large language model
Abstract: This paper presents AlphaSpace, a novel methodology designed to enhance the spatial reasoning capabilities of large language models (LLMs) for 3D Cartesian space navigation. AlphaSpace employs a semantics-based tokenization strategy, encoding height information through specialized semantic tokens, and integrates primarily symbolic synthetic reasoning data. This approach enables LLMs to accurately manipulate objects by positioning them at specific [x, y, z] coordinates. Experimental results demonstrate that AlphaSpace significantly outperforms existing models on manipulation subtasks, achieving a total accuracy of 66.67%, compared to 37.5% for GPT-4o and 29.17% for Claude 3.5 Sonnet.

Title: Frequency Dynamic Convolution for Dense Image Prediction

Authors: Linwei Chen, Lin Gu, Liang Li, Chenggang Yan, Ying Fu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18783
Pdf URL: https://arxiv.org/pdf/2503.18783
Copy Paste: [[2503.18783]] Frequency Dynamic Convolution for Dense Image Prediction(https://arxiv.org/abs/2503.18783)
Keywords: transformer, segmentation
Abstract: While Dynamic Convolution (DY-Conv) has shown promising performance by enabling adaptive weight selection through multiple parallel weights combined with an attention mechanism, the frequency response of these weights tends to exhibit high similarity, resulting in high parameter costs but limited adaptability. In this work, we introduce Frequency Dynamic Convolution (FDConv), a novel approach that mitigates these limitations by learning a fixed parameter budget in the Fourier domain. FDConv divides this budget into frequency-based groups with disjoint Fourier indices, enabling the construction of frequency-diverse weights without increasing the parameter cost. To further enhance adaptability, we propose Kernel Spatial Modulation (KSM) and Frequency Band Modulation (FBM). KSM dynamically adjusts the frequency response of each filter at the spatial level, while FBM decomposes weights into distinct frequency bands in the frequency domain and modulates them dynamically based on local content. Extensive experiments on object detection, segmentation, and classification validate the effectiveness of FDConv. We demonstrate that when applied to ResNet-50, FDConv achieves superior performance with a modest increase of +3.6M parameters, outperforming previous methods that require substantial increases in parameter budgets (e.g., CondConv +90M, KW +76.5M). Moreover, FDConv seamlessly integrates into a variety of architectures, including ConvNeXt, Swin-Transformer, offering a flexible and efficient solution for modern vision tasks. The code is made publicly available at this https URL.

Title: Leveraging Perturbation Robustness to Enhance Out-of-Distribution Detection

Authors: Wenxi Chen, Raymond A. Yeh, Shaoshuai Mou, Yan Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18784
Pdf URL: https://arxiv.org/pdf/2503.18784
Copy Paste: [[2503.18784]] Leveraging Perturbation Robustness to Enhance Out-of-Distribution Detection(https://arxiv.org/abs/2503.18784)
Keywords: robust
Abstract: Out-of-distribution (OOD) detection is the task of identifying inputs that deviate from the training data distribution. This capability is essential for safely deploying deep computer vision models in open-world environments. In this work, we propose a post-hoc method, Perturbation-Rectified OOD detection (PRO), based on the insight that prediction confidence for OOD inputs is more susceptible to reduction under perturbation than in-distribution (IND) inputs. Based on the observation, we propose an adversarial score function that searches for the local minimum scores near the original inputs by applying gradient descent. This procedure enhances the separability between IND and OOD samples. Importantly, the approach improves OOD detection performance without complex modifications to the underlying model architectures. We conduct extensive experiments using the OpenOOD benchmark. Our approach further pushes the limit of softmax-based OOD detection and is the leading post-hoc method for small-scale models. On a CIFAR-10 model with adversarial training, PRO effectively detects near-OOD inputs, achieving a reduction of more than 10% on FPR@95 compared to state-of-the-art methods.
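
The score described in the abstract above (confidence after a short search for a nearby minimum) can be illustrated with a toy example. This is a hedged sketch, not the PRO implementation: it assumes maximum softmax probability as the confidence score, uses finite differences instead of automatic differentiation, and takes fixed-size signed steps; `logits_fn`, `step`, `n_steps`, and `eps` are hypothetical names and values.

```python
import math


def max_softmax(logits):
    """Maximum softmax probability, computed in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    return max(exps) / sum(exps)


def perturbed_score(x, logits_fn, step=0.05, n_steps=5, eps=1e-4):
    """Lowest confidence found by a few descent steps on the input.

    The gradient of the confidence with respect to each input coordinate is
    estimated by finite differences, and each step moves every coordinate by
    `step` in the direction that lowers the confidence.
    """
    x = list(x)
    best = max_softmax(logits_fn(x))
    for _ in range(n_steps):
        base = max_softmax(logits_fn(x))
        grad = []
        for i in range(len(x)):
            bumped = x[:]
            bumped[i] += eps
            grad.append((max_softmax(logits_fn(bumped)) - base) / eps)
        x = [xi - step * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]
        best = min(best, max_softmax(logits_fn(x)))
    return best


if __name__ == "__main__":
    # Toy two-class model whose logits depend linearly on a single feature.
    toy_logits = lambda features: [features[0], -features[0]]
    print(max_softmax(toy_logits([2.0])), perturbed_score([2.0], toy_logits))
```

The abstract's insight is that this perturbed score should drop more for OOD inputs than for in-distribution ones; the toy model here only demonstrates the mechanics of the search.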

Title: Streaming Federated Learning with Markovian Data

Authors: Tan-Khiem Huynh, Malcolm Egan, Giovanni Neglia, Jean-Marie Gorce
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.18807
Pdf URL: https://arxiv.org/pdf/2503.18807
Copy Paste: [[2503.18807]] Streaming Federated Learning with Markovian Data(https://arxiv.org/abs/2503.18807)
Keywords: federate
Abstract: Federated learning (FL) is now recognized as a key framework for communication-efficient collaborative learning. Most theoretical and empirical studies, however, rely on the assumption that clients have access to pre-collected data sets, with limited investigation into scenarios where clients continuously collect data. In many real-world applications, particularly when data is generated by physical or biological processes, client data streams are often modeled by non-stationary Markov processes. Unlike standard i.i.d. sampling, the performance of FL with Markovian data streams remains poorly understood due to the statistical dependencies between client samples over time. In this paper, we investigate whether FL can still support collaborative learning with Markovian data streams. Specifically, we analyze the performance of Minibatch SGD, Local SGD, and a variant of Local SGD with momentum. We answer affirmatively under standard assumptions and smooth non-convex client objectives: the sample complexity is proportional to the inverse of the number of clients with a communication complexity comparable to the i.i.d. scenario. However, the sample complexity for Markovian data streams remains higher than for i.i.d. sampling.

Title: CRCL: Causal Representation Consistency Learning for Anomaly Detection in Surveillance Videos

Authors: Yang Liu, Hongjin Wang, Zepu Wang, Xiaoguang Zhu, Jing Liu, Peng Sun, Rui Tang, Jianwei Du, Victor C.M. Leung, Liang Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18808
Pdf URL: https://arxiv.org/pdf/2503.18808
Copy Paste: [[2503.18808]] CRCL: Causal Representation Consistency Learning for Anomaly Detection in Surveillance Videos(https://arxiv.org/abs/2503.18808)
Keywords: protect, robust
Abstract: Video Anomaly Detection (VAD) remains a fundamental yet formidable task in the video understanding community, with promising applications in areas such as information forensics and public safety protection. Due to the rarity and diversity of anomalies, existing methods only use easily collected regular events to model the inherent normality of normal spatial-temporal patterns in an unsupervised manner. Previous studies have shown that existing unsupervised VAD models are incapable of handling label-independent data offsets (e.g., scene changes) in real-world scenarios and may fail to respond to light anomalies due to the overgeneralization of deep neural networks. Inspired by causality learning, we argue that there exist causal factors that can adequately generalize the prototypical patterns of regular events and present significant deviations when anomalous instances occur. In this regard, we propose Causal Representation Consistency Learning (CRCL) to implicitly mine potential scene-robust causal variables in unsupervised video normality learning. Specifically, building on structural causal models, we propose scene-debiasing learning and causality-inspired normality learning to strip away entangled scene bias in deep representations and learn causal video normality, respectively. Extensive experiments on benchmarks validate the superiority of our method over conventional deep representation learning. Moreover, ablation studies and extension validation show that CRCL can cope with label-independent biases in multi-scene settings and maintain stable performance with only limited training data available.

Title: SKDU at De-Factify 4.0: Vision Transformer with Data Augmentation for AI-Generated Image Detection

Authors: Shrikant Malviya, Neelanjan Bhowmik, Stamos Katsigiannis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18812
Pdf URL: https://arxiv.org/pdf/2503.18812
Copy Paste: [[2503.18812]] SKDU at De-Factify 4.0: Vision Transformer with Data Augmentation for AI-Generated Image Detection(https://arxiv.org/abs/2503.18812)
Keywords: robust, diffusion, transformer
Abstract: The aim of this work is to explore the potential of pre-trained vision-language models, e.g. Vision Transformers (ViT), enhanced with advanced data augmentation strategies for the detection of AI-generated images. Our approach leverages a fine-tuned ViT model trained on the Defactify-4.0 dataset, which includes images generated by state-of-the-art models such as Stable Diffusion 2.1, Stable Diffusion XL, Stable Diffusion 3, DALL-E 3, and MidJourney. We employ perturbation techniques like flipping, rotation, Gaussian noise injection, and JPEG compression during training to improve model robustness and generalisation. The experimental results demonstrate that our ViT-based pipeline achieves state-of-the-art performance, significantly outperforming competing methods on both validation and test datasets.
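
The perturbations listed in the abstract above (flipping, rotation, Gaussian noise) are simple to express on a grayscale image stored as nested lists. This is an illustrative sketch only, not the authors' training pipeline: JPEG compression is omitted, and the 0.5 application probabilities, the noise level, and the function names are assumptions.

```python
import random


def horizontal_flip(image):
    """Mirror each row of the image left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to each pixel and clip to the range [0, 1]."""
    return [
        [min(1.0, max(0.0, pixel + rng.gauss(0.0, sigma))) for pixel in row]
        for row in image
    ]


def augment(image, rng, sigma=0.05):
    """Apply each geometric perturbation with probability 0.5, then add noise."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    if rng.random() < 0.5:
        image = rotate_90_clockwise(image)
    return add_gaussian_noise(image, sigma, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    image = [[0.1, 0.2], [0.3, 0.4]]  # pixel intensities in [0, 1]
    print(augment(image, rng))
```

In practice such transforms would be applied per batch by an image library; the pure-Python version is only meant to show what each perturbation does.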

Title: Defeating Prompt Injections by Design

Authors: Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, Florian Tramèr
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18813
Pdf URL: https://arxiv.org/pdf/2503.18813
Copy Paste: [[2503.18813]] Defeating Prompt Injections by Design(https://arxiv.org/abs/2503.18813)
Keywords: security, protect, defense, attack, robust, large language model
Abstract: Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an external environment. However, LLM agents are vulnerable to prompt injection attacks when handling untrusted data. In this paper we propose CaMeL, a robust defense that creates a protective system layer around the LLM, securing it even when underlying models may be susceptible to attacks. To operate, CaMeL explicitly extracts the control and data flows from the (trusted) query; therefore, the untrusted data retrieved by the LLM can never impact the program flow. To further improve security, CaMeL relies on a notion of a capability to prevent the exfiltration of private data over unauthorized data flows. We demonstrate the effectiveness of CaMeL by solving $67\%$ of tasks with provable security in AgentDojo [NeurIPS 2024], a recent agentic security benchmark.

Title: Interpretable and Fair Mechanisms for Abstaining Classifiers

Authors: Daphne Lenders, Andrea Pugnana, Roberto Pellungrini, Toon Calders, Dino Pedreschi, Fosca Giannotti
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18826
Pdf URL: https://arxiv.org/pdf/2503.18826
Copy Paste: [[2503.18826]] Interpretable and Fair Mechanisms for Abstaining Classifiers(https://arxiv.org/abs/2503.18826)
Keywords: fair
Abstract: Abstaining classifiers have the option to refrain from providing a prediction for instances that are difficult to classify. The abstention mechanism is designed to trade off the classifier's performance on the accepted data while ensuring a minimum number of predictions. In this setting, fairness concerns often arise when the abstention mechanism solely reduces errors for the majority groups of the data, resulting in increased performance differences across demographic groups. While several methods aim to reduce discrimination when abstaining, there is no mechanism that can do so in an explainable way. In this paper, we fill this gap by introducing the Interpretable and Fair Abstaining Classifier (IFAC), an algorithm that can reject predictions based on both their uncertainty and their unfairness. By rejecting possibly unfair predictions, our method reduces error and positive decision rate differences across demographic groups of the non-rejected data. Since the unfairness-based rejections are based on an interpretable-by-design method, i.e., rule-based fairness checks and situation testing, we create a transparent process that can empower human decision-makers to review the unfair predictions and make more just decisions for them. This explainable aspect is especially important in light of recent AI regulations, mandating that any high-risk decision task should be overseen by human experts to reduce discrimination risks.
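
The uncertainty-based half of an abstention mechanism can be sketched in a few lines. The example below is not IFAC: it leaves out the fairness-based rejections entirely and only illustrates rejecting the least confident predictions while guaranteeing a minimum share of accepted ones. The function name and the `min_coverage` parameter are hypothetical.

```python
import math


def abstain_by_confidence(confidences, min_coverage):
    """Return one boolean per instance: True means the prediction is accepted.

    The most confident predictions are kept until at least `min_coverage`
    (a fraction between 0 and 1) of all instances is covered; the rest are
    rejected.
    """
    if not 0.0 <= min_coverage <= 1.0:
        raise ValueError("min_coverage must lie in [0, 1]")
    total = len(confidences)
    n_keep = math.ceil(min_coverage * total)
    ranked = sorted(range(total), key=lambda i: confidences[i], reverse=True)
    accepted = set(ranked[:n_keep])
    return [i in accepted for i in range(total)]


if __name__ == "__main__":
    # Four predictions; keep at least half of them.
    print(abstain_by_confidence([0.9, 0.55, 0.8, 0.6], min_coverage=0.5))
```

A purely confidence-driven rule like this one is exactly the setting in which the abstract's fairness concern can arise, since nothing prevents the rejections from concentrating on one demographic group.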

Title: Unsupervised Detection of Fraudulent Transactions in E-commerce Using Contrastive Learning

Authors: Xuan Li, Yuting Peng, Xiaoxuan Sun, Yifei Duan, Zhou Fang, Tengda Tang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.18841
Pdf URL: https://arxiv.org/pdf/2503.18841
Copy Paste: [[2503.18841]] Unsupervised Detection of Fraudulent Transactions in E-commerce Using Contrastive Learning(https://arxiv.org/abs/2503.18841)
Keywords: security, robust
Abstract: With the rapid development of e-commerce, e-commerce platforms are facing an increasing number of fraud threats. Effectively identifying and preventing these fraudulent activities has become a critical research problem. Traditional fraud detection methods typically rely on supervised learning, which requires large amounts of labeled data. However, such data is often difficult to obtain, and the continuous evolution of fraudulent activities further reduces the adaptability and effectiveness of traditional methods. To address this issue, this study proposes an unsupervised e-commerce fraud detection algorithm based on SimCLR. The algorithm leverages the contrastive learning framework to effectively detect fraud by learning the underlying representations of transaction data in an unlabeled setting. Experimental results on the eBay platform dataset show that the proposed algorithm outperforms traditional unsupervised methods such as K-means, Isolation Forest, and Autoencoders in terms of accuracy, precision, recall, and F1 score, demonstrating strong fraud detection capabilities. The results confirm that the SimCLR-based unsupervised fraud detection method has broad application prospects in e-commerce platform security, improving both detection accuracy and robustness. In the future, with the increasing scale and diversity of datasets, the model's performance will continue to improve, and it could be integrated with real-time monitoring systems to provide more efficient security for e-commerce platforms.

Title: Secure Edge Computing Reference Architecture for Data-driven Structural Health Monitoring: Lessons Learned from Implementation and Benchmarking

Authors: Sheikh Muhammad Farjad, Sandeep Reddy Patllola, Yonas Kassa, George Grispos, Robin Gandhi
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.18857
Pdf URL: https://arxiv.org/pdf/2503.18857
Copy Paste: [[2503.18857]] Secure Edge Computing Reference Architecture for Data-driven Structural Health Monitoring: Lessons Learned from Implementation and Benchmarking(https://arxiv.org/abs/2503.18857)
Keywords: secure
Abstract: Structural Health Monitoring (SHM) plays a crucial role in maintaining aging and critical infrastructure, supporting applications such as smart cities and digital twinning. These applications demand machine learning models capable of processing large volumes of real-time sensor data at the network edge. However, existing approaches often neglect the challenges of deploying machine learning models at the edge or are constrained by vendor-specific platforms. This paper introduces a scalable and secure edge-computing reference architecture tailored for data-driven SHM. We share practical insights from deploying this architecture at the Memorial Bridge in New Hampshire, US, referred to as the Living Bridge project. Our solution integrates a commercial data acquisition system with off-the-shelf hardware running an open-source edge-computing platform, remotely managed and scaled through cloud services. To support the development of data-driven SHM systems, we propose a resource consumption benchmarking framework called edgeOps to evaluate the performance of machine learning models on edge devices. We study this framework by collecting resource utilization data for machine learning models typically used in SHM applications on two different edge computing hardware platforms. edgeOps was specifically studied on off-the-shelf Linux and ARM-based edge devices. Our findings demonstrate the impact of platform and model selection on system performance, providing actionable guidance for edge-based SHM system design.

Title: An End-to-End GSM/SMS Encrypted Approach for Smartphone Employing Advanced Encryption Standard(AES)

Authors: Wasim Abbas, Salaki Reynaldo Joshua, Asim Abbas, Je-Hoon Lee
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.18859
Pdf URL: https://arxiv.org/pdf/2503.18859
Copy Paste: [[2503.18859]] An End-to-End GSM/SMS Encrypted Approach for Smartphone Employing Advanced Encryption Standard(AES)(https://arxiv.org/abs/2503.18859)
Keywords: secure, security, protect
Abstract: Encryption is crucial for securing sensitive data during transmission over networks. Various encryption techniques exist, such as AES, DES, and RC4, with AES being the most renowned algorithm. We propose a methodology that enables users to encrypt text messages for secure transmission over cellular networks. This approach uses the AES algorithm, following the proposed protocols for encryption and decryption, to provide fast and reliable data protection. Users enter messages that are encrypted with a key at the sender's end and decrypted at the recipient's end, and the approach is compatible with any Android device. SMS are encrypted with the AES algorithm, making them resistant to brute-force attempts. As SMS has become a popular form of communication, it is important to protect the personal data, email alerts, banking details, and transaction information it carries. The proposed approach addresses these security concerns by encrypting messages using AES and cryptographic techniques, providing an effective solution for protecting sensitive data during SMS exchanges.

Title: HunyuanPortrait: Implicit Condition Control for Enhanced Portrait Animation

Authors: Zunnan Xu, Zhentao Yu, Zixiang Zhou, Jun Zhou, Xiaoyu Jin, Fa-Ting Hong, Xiaozhong Ji, Junwei Zhu, Chengfei Cai, Shiyu Tang, Qin Lin, Xiu Li, Qinglin Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18860
Pdf URL: https://arxiv.org/pdf/2503.18860
Copy Paste: [[2503.18860]] HunyuanPortrait: Implicit Condition Control for Enhanced Portrait Animation(https://arxiv.org/abs/2503.18860)
Keywords: diffusion
Abstract: We introduce HunyuanPortrait, a diffusion-based condition control method that employs implicit representations for highly controllable and lifelike portrait animation. Given a single portrait image as an appearance reference and video clips as driving templates, HunyuanPortrait can animate the character in the reference image by the facial expression and head pose of the driving videos. In our framework, we utilize pre-trained encoders to achieve the decoupling of portrait motion information and identity in videos. To do so, implicit representation is adopted to encode motion information and is employed as control signals in the animation phase. By leveraging the power of stable video diffusion as the main building block, we carefully design adapter layers to inject control signals into the denoising unet through attention mechanisms. These bring spatial richness of details and temporal consistency. HunyuanPortrait also exhibits strong generalization performance, which can effectively disentangle appearance and motion under different image styles. Our framework outperforms existing methods, demonstrating superior temporal consistency and controllability. Our project is available at this https URL.

Title: Exploring the Integration of Key-Value Attention Into Pure and Hybrid Transformers for Semantic Segmentation

Authors: DeShin Hwa, Tobias Holmes, Klaus Drechsler
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.18862
Pdf URL: https://arxiv.org/pdf/2503.18862
Copy Paste: [[2503.18862]] Exploring the Integration of Key-Value Attention Into Pure and Hybrid Transformers for Semantic Segmentation(https://arxiv.org/abs/2503.18862)
Keywords: transformer, segmentation
Abstract: While CNNs were long considered state of the art for image processing, the introduction of Transformer architectures has challenged this position. While achieving excellent results in image classification and segmentation, Transformers remain inherently reliant on large training datasets and remain computationally expensive. A newly introduced Transformer derivative named KV Transformer shows promising results in synthetic, NLP, and image classification tasks, while reducing complexity and memory usage. This is especially conducive to use cases where local inference is required, such as medical screening applications. We endeavoured to further evaluate the merit of KV Transformers on semantic segmentation tasks, specifically in the domain of medical imaging. By directly comparing traditional and KV variants of the same base architectures, we provide further insight into the practical tradeoffs of reduced model complexity. We observe a notable reduction in parameter count and multiply accumulate operations, while achieving similar performance from most of the KV variant models when directly compared to their QKV implementation.

Title: A semantic communication-based workload-adjustable transceiver for wireless AI-generated content (AIGC) delivery

Authors: Runze Cheng, Yao Sun, Lan Zhang, Lei Feng, Lei Zhang, Muhammad Ali Imran
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.18874
Pdf URL: https://arxiv.org/pdf/2503.18874
Copy Paste: [[2503.18874]] A semantic communication-based workload-adjustable transceiver for wireless AI-generated content (AIGC) delivery(https://arxiv.org/abs/2503.18874)
Keywords: diffusion, generative
Abstract: With the significant advances in generative AI (GAI) and the proliferation of mobile devices, providing high-quality AI-generated content (AIGC) services via wireless networks is becoming the future direction. However, the primary challenges of AIGC service delivery in wireless networks lie in unstable channels, limited bandwidth resources, and unevenly distributed computational resources. In this paper, we employ semantic communication (SemCom) in diffusion-based GAI models to propose a Resource-aware wOrkload-adjUstable TransceivEr (ROUTE) for AIGC delivery in dynamic wireless networks. Specifically, to relieve the communication resource bottleneck, SemCom is utilized to prioritize semantic information of the generated content. Then, to improve computational resource utilization in both edge and local and reduce AIGC semantic distortion in transmission, modified diffusion-based models are applied to adjust the computing workload and semantic density in cooperative content generation. Simulations verify the superiority of our proposed ROUTE in terms of latency and content quality compared to conventional AIGC approaches.

Title: I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders

Authors: Andrey Galichin, Alexey Dontsov, Polina Druzhinina, Anton Razzhigaev, Oleg Y. Rogov, Elena Tutubalina, Ivan Oseledets
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.18878
Pdf URL: https://arxiv.org/pdf/2503.18878
Copy Paste: [[2503.18878]] I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders(https://arxiv.org/abs/2503.18878)
Keywords: interpretability, large language model
Abstract: Large Language Models (LLMs) have achieved remarkable success in natural language processing. Recent advances have led to the development of a new class of reasoning LLMs; for example, the open-source DeepSeek-R1 has achieved state-of-the-art performance by integrating deep thinking and complex reasoning. Despite these impressive capabilities, the internal reasoning mechanisms of such models remain unexplored. In this work, we employ Sparse Autoencoders (SAEs), a method to learn a sparse decomposition of latent representations of a neural network into interpretable features, to identify features that drive reasoning in the DeepSeek-R1 series of models. First, we propose an approach to extract candidate "reasoning features" from SAE representations. We validate these features through empirical analysis and interpretability methods, demonstrating their direct correlation with the model's reasoning abilities. Crucially, we demonstrate that steering these features systematically enhances reasoning performance, offering the first mechanistic account of reasoning in LLMs. Code available at this https URL

Title: Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes

Authors: Hyeonggon Ryu, Seongyu Kim, Joon Son Chung, Arda Senocak
Subjects: cs.CV, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2503.18880
Pdf URL: https://arxiv.org/pdf/2503.18880
Copy Paste: [[2503.18880]] Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes(https://arxiv.org/abs/2503.18880)
Keywords: segmentation
Abstract: We present a unified model capable of simultaneously grounding both spoken language and non-speech sounds within a visual scene, addressing key limitations in current audio-visual grounding models. Existing approaches are typically limited to handling either speech or non-speech sounds independently, or at best, together but sequentially without mixing. This limitation prevents them from capturing the complexity of real-world audio sources that are often mixed. Our approach introduces a 'mix-and-separate' framework with audio-visual alignment objectives that jointly learn correspondence and disentanglement using mixed audio. Through these objectives, our model learns to produce distinct embeddings for each audio type, enabling effective disentanglement and grounding across mixed audio sources. Additionally, we created a new dataset to evaluate simultaneous grounding of mixed audio sources, demonstrating that our model outperforms prior methods. Our approach also achieves comparable or better performance in standard segmentation and cross-modal retrieval tasks, highlighting the benefits of our mix-and-separate approach.

Title: Efficient and Accurate Scene Text Recognition with Cascaded-Transformers

Authors: Savas Ozkan, Andrea Maracani, Hyowon Kim, Sijun Cho, Eunchung Noh, Jeongwon Min, Jung Min Cho, Mete Ozay
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18883
Pdf URL: https://arxiv.org/pdf/2503.18883
Copy Paste: [[2503.18883]] Efficient and Accurate Scene Text Recognition with Cascaded-Transformers(https://arxiv.org/abs/2503.18883)
Keywords: transformer
Abstract: In recent years, vision transformers with text decoders have demonstrated remarkable performance on Scene Text Recognition (STR) due to their ability to capture long-range dependencies and contextual relationships with high learning capacity. However, the computational and memory demands of these models are significant, limiting their deployment in resource-constrained applications. To address this challenge, we propose an efficient and accurate STR system. Specifically, we focus on improving the efficiency of encoder models by introducing a cascaded-transformers structure. This structure progressively reduces the vision token size during the encoding step, effectively eliminating redundant tokens and reducing computational cost. Our experimental results confirm that our STR system achieves comparable performance to state-of-the-art baselines while substantially decreasing computational requirements. In particular, for large models, the accuracy remains nearly the same (92.77 to 92.68), while computational complexity is almost halved with our structure.

Title: CFG-Zero*: Improved Classifier-Free Guidance for Flow Matching Models

Authors: Weichen Fan, Amber Yijia Zheng, Raymond A. Yeh, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18886
Pdf URL: https://arxiv.org/pdf/2503.18886
Copy Paste: [[2503.18886]] CFG-Zero*: Improved Classifier-Free Guidance for Flow Matching Models(https://arxiv.org/abs/2503.18886)
Keywords: diffusion
Abstract: Classifier-Free Guidance (CFG) is a widely adopted technique in diffusion/flow models to improve image fidelity and controllability. In this work, we first analytically study the effect of CFG on flow matching models trained on Gaussian mixtures where the ground-truth flow can be derived. We observe that in the early stages of training, when the flow estimation is inaccurate, CFG directs samples toward incorrect trajectories. Building on this observation, we propose CFG-Zero*, an improved CFG with two contributions: (a) optimized scale, where a scalar is optimized to correct for the inaccuracies in the estimated velocity, hence the * in the name; and (b) zero-init, which involves zeroing out the first few steps of the ODE solver. Experiments on both text-to-image (Lumina-Next, Stable Diffusion 3, and Flux) and text-to-video (Wan-2.1) generation demonstrate that CFG-Zero* consistently outperforms CFG, highlighting its effectiveness in guiding Flow Matching models. (Code is available at this http URL)
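
Note: the two components named in the abstract above (optimized scale and zero-init) can be illustrated with a minimal sketch. The abstract does not give the formula for the optimized scalar, so the least-squares projection coefficient used below is an assumption, and the function name, arguments, and default values are hypothetical rather than taken from the authors' code.

```python
def cfg_zero_star_velocity(v_cond, v_uncond, guidance_scale, step, zero_init_steps=1):
    """Sketch of one guided velocity for a flow-matching ODE solver step.

    v_cond / v_uncond are the conditional and unconditional velocity
    estimates, given here as flat lists of floats (hypothetical layout).
    """
    # zero-init: use a zero velocity for the first few solver steps,
    # where the velocity estimate is assumed to be least reliable
    if step < zero_init_steps:
        return [0.0] * len(v_cond)

    # optimized scale (assumed form): the scalar s that minimizes
    # ||v_cond - s * v_uncond||^2, i.e. a projection coefficient
    dot = sum(c * u for c, u in zip(v_cond, v_uncond))
    norm_sq = sum(u * u for u in v_uncond)
    s = dot / norm_sq if norm_sq > 0 else 1.0

    # classifier-free guidance with the rescaled unconditional branch
    return [s * u + guidance_scale * (c - s * u) for c, u in zip(v_cond, v_uncond)]
```

With s fixed to 1 and zero_init_steps set to 0, the function reduces to ordinary classifier-free guidance, which makes the two modifications easy to compare against the baseline.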

Title: AgentDropout: Dynamic Agent Elimination for Token-Efficient and High-Performance LLM-Based Multi-Agent Collaboration

Authors: Zhexuan Wang, Yutong Wang, Xuebo Liu, Liang Ding, Miao Zhang, Jie Liu, Min Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18891
Pdf URL: https://arxiv.org/pdf/2503.18891
Copy Paste: [[2503.18891]] AgentDropout: Dynamic Agent Elimination for Token-Efficient and High-Performance LLM-Based Multi-Agent Collaboration(https://arxiv.org/abs/2503.18891)
Keywords: robust, large language model
Abstract: Multi-agent systems (MAS) based on large language models (LLMs) have demonstrated significant potential in collaborative problem-solving. However, they still face substantial challenges of low communication efficiency and suboptimal task performance, making the careful design of the agents' communication topologies particularly important. Inspired by the management theory that roles in an efficient team are often dynamically adjusted, we propose AgentDropout, which identifies redundant agents and communication across different communication rounds by optimizing the adjacency matrices of the communication graphs and eliminates them to enhance both token efficiency and task performance. Compared to state-of-the-art methods, AgentDropout achieves an average reduction of 21.6% in prompt token consumption and 18.4% in completion token consumption, along with a performance improvement of 1.14 on the tasks. Furthermore, the extended experiments demonstrate that AgentDropout achieves notable domain transferability and structure robustness, revealing its reliability and effectiveness. We release our code at this https URL.

Title: xKV: Cross-Layer SVD for KV-Cache Compression

Authors: Chi-Chih Chang, Chien-Yu Lin, Yash Akhauri, Wei-Cheng Lin, Kai-Chiang Wu, Luis Ceze, Mohamed S. Abdelfattah
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.18893
Pdf URL: https://arxiv.org/pdf/2503.18893
Copy Paste: [[2503.18893]] xKV: Cross-Layer SVD for KV-Cache Compression(https://arxiv.org/abs/2503.18893)
Keywords: large language model
Abstract: Large Language Models (LLMs) with long context windows enable powerful applications but come at the cost of high memory consumption to store the Key and Value states (KV-Cache). Recent studies attempted to merge KV-cache from multiple layers into shared representations, yet these approaches either require expensive pretraining or rely on assumptions of high per-token cosine similarity across layers, which generally does not hold in practice. We find that the dominant singular vectors are remarkably well-aligned across multiple layers of the KV-Cache. Exploiting this insight, we propose xKV, a simple post-training method that applies Singular Value Decomposition (SVD) on the KV-Cache of grouped layers. xKV consolidates the KV-Cache of multiple layers into a shared low-rank subspace, significantly reducing KV-Cache sizes. Through extensive evaluations on the RULER long-context benchmark with widely-used LLMs (e.g., Llama-3.1 and Qwen2.5), xKV achieves up to 6.8x higher compression rates than the state-of-the-art inter-layer technique while improving accuracy by 2.7%. Moreover, xKV is compatible with the emerging Multi-Head Latent Attention (MLA) (e.g., DeepSeek-Coder-V2), yielding a notable 3x compression rate on coding tasks without performance degradation. These results highlight xKV's strong capability and versatility in addressing memory bottlenecks for long-context LLM inference. Our code is publicly available at: this https URL.

Title: Building Blocks for Robust and Effective Semi-Supervised Real-World Object Detection

Authors: Moussa Kassem Sbeyti, Nadja Klein, Azarm Nowzad, Fikret Sivrikaya, Sahin Albayrak
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.18903
Pdf URL: https://arxiv.org/pdf/2503.18903
Copy Paste: [[2503.18903]] Building Blocks for Robust and Effective Semi-Supervised Real-World Object Detection(https://arxiv.org/abs/2503.18903)
Keywords: robust
Abstract: Semi-supervised object detection (SSOD) based on pseudo-labeling significantly reduces dependence on large labeled datasets by effectively leveraging both labeled and unlabeled data. However, real-world applications of SSOD often face critical challenges, including class imbalance, label noise, and labeling errors. We present an in-depth analysis of SSOD under real-world conditions, uncovering causes of suboptimal pseudo-labeling and key trade-offs between label quality and quantity. Based on our findings, we propose four building blocks that can be seamlessly integrated into an SSOD framework. Rare Class Collage (RCC): a data augmentation method that enhances the representation of rare classes by creating collages of rare objects. Rare Class Focus (RCF): a stratified batch sampling strategy that ensures a more balanced representation of all classes during training. Ground Truth Label Correction (GLC): a label refinement method that identifies and corrects false, missing, and noisy ground truth labels by leveraging the consistency of teacher model predictions. Pseudo-Label Selection (PLS): a selection method for removing low-quality pseudo-labeled images, guided by a novel metric estimating the missing detection rate while accounting for class rarity. We validate our methods through comprehensive experiments on autonomous driving datasets, resulting in up to 6% increase in SSOD performance. Overall, our investigation and novel, data-centric, and broadly applicable building blocks enable robust and effective SSOD in complex, real-world scenarios. Code is available at this https URL.

Title: FFN Fusion: Rethinking Sequential Computation in Large Language Models

Authors: Akhiad Bercovich, Mohammad Dabbah, Omri Puny, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Ehud Karpas, Itay Levy, Zach Moshe, Najeeb Nabwani, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.18908
Pdf URL: https://arxiv.org/pdf/2503.18908
Copy Paste: [[2503.18908]] FFN Fusion: Rethinking Sequential Computation in Large Language Models(https://arxiv.org/abs/2503.18908)
Keywords: transformer, large language model
Abstract: We introduce FFN Fusion, an architectural optimization technique that reduces sequential computation in large language models by identifying and exploiting natural opportunities for parallelization. Our key insight is that sequences of Feed-Forward Network (FFN) layers, particularly those remaining after the removal of specific attention layers, can often be parallelized with minimal accuracy impact. We develop a principled methodology for identifying and fusing such sequences, transforming them into parallel operations that significantly reduce inference latency while preserving model behavior. Applying these techniques to Llama-3.1-405B-Instruct, we create Llama-Nemotron-Ultra-253B-Base (Ultra-253B-Base), an efficient and soon-to-be publicly available model that achieves a 1.71X speedup in inference latency and 35X lower per-token cost while maintaining strong performance across benchmarks. Through extensive experiments on models from 49B to 253B parameters, we demonstrate that FFN Fusion becomes increasingly effective at larger scales and can complement existing optimization techniques like quantization and pruning. Most intriguingly, we find that even full transformer blocks containing both attention and FFN layers can sometimes be parallelized, suggesting new directions for neural architecture design.

Title: Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training

Authors: Brian R. Bartoldson, Siddarth Venkatraman, James Diffenderfer, Moksh Jain, Tal Ben-Nun, Seanie Lee, Minsu Kim, Johan Obando-Ceron, Yoshua Bengio, Bhavya Kailkhura
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.18929
Pdf URL: https://arxiv.org/pdf/2503.18929
Copy Paste: [[2503.18929]] Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training(https://arxiv.org/abs/2503.18929)
Keywords: large language model
Abstract: Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. However, existing on-policy algorithms used for post-training are inherently incompatible with the use of experience replay buffers, which can be populated scalably by distributed off-policy actors to enhance exploration as compute increases. We propose efficiently obtaining this benefit of replay buffers via Trajectory Balance with Asynchrony (TBA), a massively scalable LLM RL system. In contrast to existing approaches, TBA uses a larger fraction of compute on search, constantly generating off-policy data for a central replay buffer. A training node simultaneously samples data from this buffer based on reward or recency to update the policy using Trajectory Balance (TB), a diversity-seeking RL objective introduced for GFlowNets. TBA offers three key advantages: (1) decoupled training and search, speeding up training wall-clock time by 4x or more; (2) improved diversity through large-scale off-policy sampling; and (3) scalable search for sparse reward settings. On mathematical reasoning, preference-tuning, and automated red-teaming (diverse and representative post-training tasks), TBA produces speed and performance improvements over strong baselines.
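
Note: for readers unfamiliar with Trajectory Balance, the sketch below shows the standard form of the objective from the GFlowNet literature for a single sampled sequence. It assumes autoregressive generation, where each partial sequence has exactly one parent, so the backward-policy term drops out. Whether TBA uses exactly this form, and how it estimates the log-partition term, is not stated in the abstract; the function and argument names are hypothetical.

```python
import math

def trajectory_balance_loss(log_z, token_logprobs, log_reward):
    """Squared Trajectory Balance residual for one sampled sequence.

    log_z          : estimate of the log-partition function (assumed scalar)
    token_logprobs : per-token log-probabilities under the current policy
    log_reward     : log of the (positive) reward assigned to the sequence
    """
    # forward log-probability of the whole trajectory
    forward_logprob = sum(token_logprobs)
    # the residual is zero exactly when Z * P_F(trajectory) == R(sequence)
    residual = log_z + forward_logprob - log_reward
    return residual ** 2

# a policy that samples the sequence with probability 0.25 is balanced
# against a reward of 0.25 when Z = 1, so the loss is (numerically) zero
balanced = trajectory_balance_loss(0.0, [math.log(0.5), math.log(0.5)], math.log(0.25))
```

Because the residual depends only on log-probabilities of a stored sequence, the loss can be evaluated on off-policy samples, which is what makes a replay buffer usable in this setting.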

Title: CoMP: Continual Multimodal Pre-training for Vision Foundation Models

Authors: Yitong Chen, Lingchen Meng, Wujian Peng, Zuxuan Wu, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18931
Pdf URL: https://arxiv.org/pdf/2503.18931
Copy Paste: [[2503.18931]] CoMP: Continual Multimodal Pre-training for Vision Foundation Models(https://arxiv.org/abs/2503.18931)
Keywords: segmentation
Abstract: Pre-trained Vision Foundation Models (VFMs) provide strong visual representations for a wide range of applications. In this paper, we continually pre-train prevailing VFMs in a multimodal manner such that they can effortlessly process visual inputs of varying sizes and produce visual representations that are more aligned with language representations, regardless of their original pre-training process. To this end, we introduce CoMP, a carefully designed multimodal pre-training pipeline. CoMP uses a Continual Rotary Position Embedding to support native resolution continual pre-training, and an Alignment Loss between visual and textual features through language prototypes to align multimodal representations. By three-stage training, our VFMs achieve remarkable improvements not only in multimodal understanding but also in other downstream tasks such as classification and segmentation. Remarkably, CoMP-SigLIP achieves scores of 66.7 on ChartQA and 75.9 on DocVQA with a 0.5B LLM, while maintaining an 87.4% accuracy on ImageNet-1K and a 49.5 mIoU on ADE20K under frozen chunk evaluation.

Title: SyncVP: Joint Diffusion for Synchronous Multi-Modal Video Prediction

Authors: Enrico Pallotta, Sina Mokhtarzadeh Azar, Shuai Li, Olga Zatsarynna, Juergen Gall
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18933
Pdf URL: https://arxiv.org/pdf/2503.18933
Copy Paste: [[2503.18933]] SyncVP: Joint Diffusion for Synchronous Multi-Modal Video Prediction(https://arxiv.org/abs/2503.18933)
Keywords: robust, diffusion
Abstract: Predicting future video frames is essential for decision-making systems, yet RGB frames alone often lack the information needed to fully capture the underlying complexities of the real world. To address this limitation, we propose a multi-modal framework for Synchronous Video Prediction (SyncVP) that incorporates complementary data modalities, enhancing the richness and accuracy of future predictions. SyncVP builds on pre-trained modality-specific diffusion models and introduces an efficient spatio-temporal cross-attention module to enable effective information sharing across modalities. We evaluate SyncVP on standard benchmark datasets, such as Cityscapes and BAIR, using depth as an additional modality. We furthermore demonstrate its generalization to other modalities on SYNTHIA with semantic information and ERA5-Land with climate data. Notably, SyncVP achieves state-of-the-art performance, even in scenarios where only one modality is present, demonstrating its robustness and potential for a wide range of applications.

Title: Training-free Diffusion Acceleration with Bottleneck Sampling

Authors: Ye Tian, Xin Xia, Yuxi Ren, Shanchuan Lin, Xing Wang, Xuefeng Xiao, Yunhai Tong, Ling Yang, Bin Cui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18940
Pdf URL: https://arxiv.org/pdf/2503.18940
Copy Paste: [[2503.18940]] Training-free Diffusion Acceleration with Bottleneck Sampling(https://arxiv.org/abs/2503.18940)
Keywords: diffusion
Abstract: Diffusion models have demonstrated remarkable capabilities in visual content generation but remain challenging to deploy due to their high computational cost during inference. This computational burden primarily arises from the quadratic complexity of self-attention with respect to image or video resolution. While existing acceleration methods often compromise output quality or necessitate costly retraining, we observe that most diffusion models are pre-trained at lower resolutions, presenting an opportunity to exploit these low-resolution priors for more efficient inference without degrading performance. In this work, we introduce Bottleneck Sampling, a training-free framework that leverages low-resolution priors to reduce computational overhead while preserving output fidelity. Bottleneck Sampling follows a high-low-high denoising workflow: it performs high-resolution denoising in the initial and final stages while operating at lower resolutions in intermediate steps. To mitigate aliasing and blurring artifacts, we further refine the resolution transition points and adaptively shift the denoising timesteps at each stage. We evaluate Bottleneck Sampling on both image and video generation tasks, where extensive experiments demonstrate that it accelerates inference by up to 3× for image generation and 2.5× for video generation, all while maintaining output quality comparable to the standard full-resolution sampling process across multiple evaluation metrics. Code is available at: this https URL
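
Note: the high-low-high workflow described in the abstract above can be pictured as a per-step resolution schedule. The sketch below is only an illustration of that idea; the function name, the step counts, and the resolutions are hypothetical, and it omits the transition-point refinement and timestep shifting that the abstract also mentions.

```python
def bottleneck_schedule(num_steps, high_res, low_res, head_steps, tail_steps):
    """Return the resolution to denoise at for each of num_steps steps.

    The first head_steps and last tail_steps run at high_res; every step
    in between runs at low_res (the "bottleneck").
    """
    if head_steps + tail_steps > num_steps:
        raise ValueError("head_steps + tail_steps must not exceed num_steps")
    schedule = []
    for i in range(num_steps):
        in_head = i < head_steps                # initial high-resolution stage
        in_tail = i >= num_steps - tail_steps   # final high-resolution stage
        schedule.append(high_res if (in_head or in_tail) else low_res)
    return schedule

# hypothetical example: 6 steps, 2 high-res at the start, 1 at the end
example = bottleneck_schedule(6, 1024, 512, head_steps=2, tail_steps=1)
```

Since self-attention cost grows quickly with resolution, the savings come from how many of the intermediate steps can be moved to the lower resolution.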

Title: Video-T1: Test-Time Scaling for Video Generation

Authors: Fangfu Liu, Hanyang Wang, Yimo Cai, Kaiyan Zhang, Xiaohang Zhan, Yueqi Duan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18942
Pdf URL: https://arxiv.org/pdf/2503.18942
Copy Paste: [[2503.18942]] Video-T1: Test-Time Scaling for Video Generation(https://arxiv.org/abs/2503.18942)
Keywords: large language model
Abstract: With the ability to scale training data, model size, and computational cost, video generation has achieved impressive results in digital creation, enabling users to express creativity across various domains. Recently, researchers in Large Language Models (LLMs) have expanded the scaling to test-time, which can significantly improve LLM performance by using more inference-time computation. Instead of scaling up video foundation models through expensive training costs, we explore the power of Test-Time Scaling (TTS) in video generation, aiming to answer the question: if a video generation model is allowed to use a non-trivial amount of inference-time compute, how much can it improve generation quality given a challenging text prompt? In this work, we reinterpret the test-time scaling of video generation as a search problem: sampling better trajectories from Gaussian noise space to the target video distribution. Specifically, we build the search space with test-time verifiers to provide feedback and heuristic algorithms to guide the search process. Given a text prompt, we first explore an intuitive linear search strategy by increasing noise candidates at inference time. As full-step denoising of all frames simultaneously requires heavy test-time computation costs, we further design a more efficient TTS method for video generation called Tree-of-Frames (ToF) that adaptively expands and prunes video branches in an autoregressive manner. Extensive experiments on text-conditioned video generation benchmarks demonstrate that increasing test-time compute consistently leads to significant improvements in the quality of videos. Project page: this https URL
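
Note: the linear search strategy mentioned in the abstract above resembles best-of-N selection: draw several initial noise candidates, generate a video from each, and keep the one a verifier scores highest. The sketch below assumes that reading; `generate` and `verifier` are hypothetical callables standing in for a video model and a test-time verifier, and nothing here reflects the authors' actual interfaces.

```python
import random

def linear_search(generate, verifier, prompt, num_candidates, seed=0):
    """Best-of-N search over initial noise seeds.

    generate(prompt, noise_seed) -> video   (hypothetical generator)
    verifier(prompt, video)      -> float   (hypothetical scorer, higher is better)
    """
    rng = random.Random(seed)  # reproducible stream of noise seeds
    best_video, best_score = None, float("-inf")
    for _ in range(num_candidates):
        noise_seed = rng.randrange(2 ** 32)
        video = generate(prompt, noise_seed)
        score = verifier(prompt, video)
        if score > best_score:
            best_video, best_score = video, score
    return best_video, best_score

# toy stand-ins: the "video" is a number and the verifier scores it directly
toy_generate = lambda prompt, noise_seed: noise_seed % 100
toy_verifier = lambda prompt, video: float(video)
```

Cost grows linearly with num_candidates because every candidate is fully denoised before it is scored, which is the inefficiency the abstract says Tree-of-Frames is designed to reduce.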

Title: SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding

Authors: Mingze Xu, Mingfei Gao, Shiyu Li, Jiasen Lu, Zhe Gan, Zhengfeng Lai, Meng Cao, Kai Kang, Yinfei Yang, Afshin Dehghan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18943
Pdf URL: https://arxiv.org/pdf/2503.18943
Copy Paste: [[2503.18943]] SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding(https://arxiv.org/abs/2503.18943)
Keywords: robust, large language model
Abstract: We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding. This model family employs the two-stream SlowFast mechanism, enabling efficient modeling of long-range temporal context to meet the demand for lightweight, mobile-friendly Video LLMs. We provide models ranging from 1B to 7B parameters, optimized through a streamlined training pipeline and a high-quality data mixture composed of publicly available datasets. Experimental results demonstrate that SF-LLaVA-1.5 achieves competitive performance on a wide range of video and image benchmarks, with robust results across all model sizes. Notably, SF-LLaVA-1.5 achieves state-of-the-art results in long-form video understanding (e.g., LongVideoBench and MLVU) and excels at small scales (1B and 3B) across various video benchmarks.

Title: DINO in the Room: Leveraging 2D Foundation Models for 3D Segmentation

Authors: Karim Abou Zeid, Kadir Yilmaz, Daan de Geus, Alexander Hermans, David Adrian, Timm Linder, Bastian Leibe
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18944
Pdf URL: https://arxiv.org/pdf/2503.18944
Copy Paste: [[2503.18944]] DINO in the Room: Leveraging 2D Foundation Models for 3D Segmentation(https://arxiv.org/abs/2503.18944)
Keywords: segmentation
Abstract: Vision foundation models (VFMs) trained on large-scale image datasets provide high-quality features that have significantly advanced 2D visual recognition. However, their potential in 3D vision remains largely untapped, despite the common availability of 2D images alongside 3D point cloud datasets. While significant research has been dedicated to 2D-3D fusion, recent state-of-the-art 3D methods predominantly focus on 3D data, leaving the integration of VFMs into 3D models underexplored. In this work, we challenge this trend by introducing DITR, a simple yet effective approach that extracts 2D foundation model features, projects them to 3D, and finally injects them into a 3D point cloud segmentation model. DITR achieves state-of-the-art results on both indoor and outdoor 3D semantic segmentation benchmarks. To enable the use of VFMs even when images are unavailable during inference, we further propose to distill 2D foundation models into a 3D backbone as a pretraining task. By initializing the 3D backbone with knowledge distilled from 2D VFMs, we create a strong basis for downstream 3D segmentation tasks, ultimately boosting performance across various datasets.

Title: Aether: Geometric-Aware Unified World Modeling

Authors: Aether Team, Haoyi Zhu, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Chunhua Shen, Jiangmiao Pang, Tong He
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2503.18945
Pdf URL: https://arxiv.org/pdf/2503.18945
Copy Paste: [[2503.18945]] Aether: Geometric-Aware Unified World Modeling(https://arxiv.org/abs/2503.18945)
Keywords: generative
Abstract: The integration of geometric reconstruction and generative modeling remains a critical challenge in developing AI systems capable of human-like spatial reasoning. This paper proposes Aether, a unified framework that enables geometry-aware reasoning in world models by jointly optimizing three core capabilities: (1) 4D dynamic reconstruction, (2) action-conditioned video prediction, and (3) goal-conditioned visual planning. Through task-interleaved feature learning, Aether achieves synergistic knowledge sharing across reconstruction, prediction, and planning objectives. Building upon video generation models, our framework demonstrates unprecedented synthetic-to-real generalization despite never observing real-world data during training. Furthermore, our approach achieves zero-shot generalization in both action following and reconstruction tasks, thanks to its intrinsic geometric modeling. Remarkably, even without real-world data, its reconstruction performance far exceeds that of domain-specific models. Additionally, Aether leverages a geometry-informed action space to seamlessly translate predictions into actions, enabling effective autonomous trajectory planning. We hope our work inspires the community to explore new frontiers in physically-reasonable world modeling and its applications.

Title: Tuning-Free Amodal Segmentation via the Occlusion-Free Bias of Inpainting Models

Authors: Jae Joong Lee, Bedrich Benes, Raymond A. Yeh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18947
Pdf URL: https://arxiv.org/pdf/2503.18947
Copy Paste: [[2503.18947]] Tuning-Free Amodal Segmentation via the Occlusion-Free Bias of Inpainting Models(https://arxiv.org/abs/2503.18947)
Keywords: robust, diffusion, segmentation
Abstract: Amodal segmentation aims to predict segmentation masks for both the visible and occluded regions of an object. Most existing works formulate this as a supervised learning problem, requiring manually annotated amodal masks or synthetic training data. Consequently, their performance depends on the quality of the datasets, which often lack diversity and scale. This work introduces a tuning-free approach that repurposes pretrained diffusion-based inpainting models for amodal segmentation. Our approach is motivated by the "occlusion-free bias" of inpainting models, i.e., the inpainted objects tend to be complete objects without occlusions. Specifically, we reconstruct the occluded regions of an object via inpainting and then apply segmentation, all without additional training or fine-tuning. Experiments on five datasets demonstrate the generalizability and robustness of our approach. On average, our approach produces masks that are 5.3% more accurate than the state of the art.

Title: Equivariant Image Modeling

Authors: Ruixiao Dong, Mengde Xu, Zigang Geng, Li Li, Han Hu, Shuyang Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18948
Pdf URL: https://arxiv.org/pdf/2503.18948
Copy Paste: [[2503.18948]] Equivariant Image Modeling(https://arxiv.org/abs/2503.18948)
Keywords: diffusion, generative
Abstract: Current generative models, such as autoregressive and diffusion approaches, decompose high-dimensional data distribution learning into a series of simpler subtasks. However, inherent conflicts arise during the joint optimization of these subtasks, and existing solutions fail to resolve such conflicts without sacrificing efficiency or scalability. We propose a novel equivariant image modeling framework that inherently aligns optimization targets across subtasks by leveraging the translation invariance of natural visual signals. Our method introduces (1) column-wise tokenization which enhances translational symmetry along the horizontal axis, and (2) windowed causal attention which enforces consistent contextual relationships across positions. Evaluated on class-conditioned ImageNet generation at 256x256 resolution, our approach achieves performance comparable to state-of-the-art AR models while using fewer computational resources. Systematic analysis demonstrates that enhanced equivariance reduces inter-task conflicts, significantly improving zero-shot generalization and enabling ultra-long image synthesis. This work establishes the first framework for task-aligned decomposition in generative modeling, offering insights into efficient parameter sharing and conflict-free optimization. The code and models are publicly available.
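To make the "windowed causal attention" idea concrete, here is a minimal sketch of an attention mask in which each position sees only itself and a fixed number of preceding positions, so every position has the same relative context. The function name `windowed_causal_mask` and the `window` parameter are assumptions for illustration; the paper's actual windowing over column tokens may differ.

```python
from typing import List


def windowed_causal_mask(seq_len: int, window: int) -> List[List[bool]]:
    """Return a seq_len x seq_len boolean mask.

    mask[i][j] is True when query position i may attend to key position j,
    i.e. when j is not in the future (j <= i) and lies within the last
    `window` positions ending at i.
    """
    if seq_len < 0 or window < 1:
        raise ValueError("seq_len must be >= 0 and window must be >= 1")
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    # Print the mask as rows of 1/0 for a quick visual check.
    for row in windowed_causal_mask(5, 2):
        print("".join("1" if allowed else "0" for allowed in row))
```

Because the allowed offsets depend only on `i - j`, the pattern is identical at every position once the window is full, which is the kind of position-consistent context the abstract attributes to this attention scheme.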

Title: Target-Aware Video Diffusion Models

Authors: Taeksoo Kim, Hanbyul Joo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18950
Pdf URL: https://arxiv.org/pdf/2503.18950
Copy Paste: [[2503.18950]] Target-Aware Video Diffusion Models(https://arxiv.org/abs/2503.18950)
Keywords: diffusion, transformer, segmentation
Abstract: We present a target-aware video diffusion model that generates videos from an input image in which an actor interacts with a specified target while performing a desired action. The target is defined by a segmentation mask and the desired action is described via a text prompt. Unlike existing controllable image-to-video diffusion models that often rely on dense structural or motion cues to guide the actor's movements toward the target, our target-aware model requires only a simple mask to indicate the target, leveraging the generalization capabilities of pretrained models to produce plausible actions. This makes our method particularly effective for human-object interaction (HOI) scenarios, where providing precise action guidance is challenging, and further enables the use of video diffusion models for high-level action planning in applications such as robotics. We build our target-aware model by extending a baseline model to incorporate the target mask as an additional input. To enforce target awareness, we introduce a special token that encodes the target's spatial information within the text prompt. We then fine-tune the model with our curated dataset using a novel cross-attention loss that aligns the cross-attention maps associated with this token with the input target mask. To further improve performance, we selectively apply this loss to the most semantically relevant transformer blocks and attention regions. Experimental results show that our target-aware model outperforms existing solutions in generating videos where actors interact accurately with the specified targets. We further demonstrate its efficacy in two downstream applications: video content creation and zero-shot 3D HOI motion synthesis.