2025-03-18

Title: Explainable Sentiment Analysis with DeepSeek-R1: Performance, Efficiency, and Few-Shot Learning

Authors: Donghao Huang, Zhaoxia Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11655
Pdf URL: https://arxiv.org/pdf/2503.11655
Copy Paste: [[2503.11655]] Explainable Sentiment Analysis with DeepSeek-R1: Performance, Efficiency, and Few-Shot Learning(https://arxiv.org/abs/2503.11655)
Keywords: interpretability, explainability, large language model
Abstract: Recent advancements in large language models (LLMs) have significantly enhanced sentiment analysis capabilities. However, the trade-offs between model performance, efficiency, and explainability of some latest models remain underexplored. This study presents the first comprehensive evaluation of the DeepSeek-R1 series of models, reasoning open-source LLMs, for sentiment analysis, comparing them against OpenAI's GPT-4 and GPT-4-mini. We systematically analyze their performance under few-shot prompting conditions, scaling up to 50-shot configurations to assess in-context learning effectiveness. Our experiments reveal that DeepSeek-R1 demonstrates competitive accuracy, particularly in multi-class sentiment tasks, while offering enhanced interpretability through its detailed reasoning process. Additionally, we highlight the impact of increasing few-shot examples on model performance and discuss key trade-offs between explainability and computational efficiency.

Title: TRUTH DECAY: Quantifying Multi-Turn Sycophancy in Language Models

Authors: Joshua Liu, Aarav Jain, Soham Takuri, Srihan Vege, Aslihan Akalin, Kevin Zhu, Sean O'Brien, Vasu Sharma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11656
Pdf URL: https://arxiv.org/pdf/2503.11656
Copy Paste: [[2503.11656]] TRUTH DECAY: Quantifying Multi-Turn Sycophancy in Language Models(https://arxiv.org/abs/2503.11656)
Keywords: large language model
Abstract: Rapid improvements in large language models have unveiled a critical challenge in human-AI interaction: sycophancy. In this context, sycophancy refers to the tendency of models to excessively agree with or flatter users, often at the expense of factual accuracy. While previous studies have primarily analyzed this behavior in single-turn interactions, its persistence and evolution in multi-step conversations remain largely unexplored. We introduce TRUTH DECAY, a benchmark specifically designed to evaluate sycophancy in extended dialogues, where language models must navigate iterative user feedback, challenges, and persuasion. We prompt models to elicit four types of sycophantic biases. We then propose and test sycophancy reduction strategies, evaluating their effectiveness beyond single-step interactions.

Title: Automating Mathematical Proof Generation Using Large Language Model Agents and Knowledge Graphs

Authors: Vincent Li, Yule Fu, Tim Knappe, Kevin Han, Kevin Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11657
Pdf URL: https://arxiv.org/pdf/2503.11657
Copy Paste: [[2503.11657]] Automating Mathematical Proof Generation Using Large Language Model Agents and Knowledge Graphs(https://arxiv.org/abs/2503.11657)
Keywords: large language model
Abstract: Large Language Models have demonstrated remarkable capabilities in natural language processing tasks, including mathematical problem-solving that requires multi-step logical reasoning. However, challenges persist in automating the identification of key mathematical concepts, understanding their interrelations, and formalizing proofs within a rigorous framework. We present a novel framework that leverages knowledge graphs to augment LLMs to construct and formalize mathematical proofs. Our results demonstrate significant performance improvements across multiple datasets, with using knowledge graphs, achieving up to a 34% success rate on the MUSTARDSAUCE dataset on o1-mini and consistently outperforming baseline approaches by 2-11% across different models. We show how this approach bridges the gap between natural language understanding and formal logic proof systems and achieve elevated results for foundation models over baseline.

Title: Zero Trust Architecture: A Systematic Literature Review

Authors: Muhammad Liman Gambo, Ahmad Almulhem
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.11659
Pdf URL: https://arxiv.org/pdf/2503.11659
Copy Paste: [[2503.11659]] Zero Trust Architecture: A Systematic Literature Review(https://arxiv.org/abs/2503.11659)
Keywords: security
Abstract: The increasing complexity of digital ecosystems and evolving cybersecurity threats have highlighted the limitations of traditional perimeter-based security models, leading to the growing adoption of Zero Trust Architecture (ZTA). ZTA operates on the principle of "never trust, always verify", enforcing continuous authentication, conditional access, dynamic trust evaluation, and the principle of least privilege to enhance security across diverse domains. This study applies the PRISMA framework to analyze 10 years of research (2016-2025) on ZTA, presenting a systematic literature review (SLR) that synthesizes its applications, enabling technologies, and associated challenges. It provides a detailed taxonomy that organizes ZTA's application domains, together with the emerging technologies that facilitate its implementation, and critically examines the barriers to ZTA adoption. Additionally, the study traces the historical evolution of ZTA alongside notable events and publications trends while highlighting some potential factors for the surge over the past few years. This comprehensive analysis serves as a practical guide for researchers and practitioners seeking to leverage ZTA for stronger, more adaptive security frameworks in a rapidly shifting threat landscape.

Title: LogitLens4LLMs: Extending Logit Lens Analysis to Modern Large Language Models

Authors: Zhenyu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11667
Pdf URL: https://arxiv.org/pdf/2503.11667
Copy Paste: [[2503.11667]] LogitLens4LLMs: Extending Logit Lens Analysis to Modern Large Language Models(https://arxiv.org/abs/2503.11667)
Keywords: transformer, large language model
Abstract: This paper introduces LogitLens4LLMs, a toolkit that extends the Logit Lens technique to modern large language models. While Logit Lens has been a crucial method for understanding internal representations of language models, it was previously limited to earlier model architectures. Our work overcomes the limitations of existing implementations, enabling the technique to be applied to state-of-the-art architectures (such as Qwen-2.5 and Llama-3.1) while automating key analytical workflows. By developing component-specific hooks to capture both attention mechanisms and MLP outputs, our implementation achieves full compatibility with the HuggingFace transformer library while maintaining low inference overhead. The toolkit provides both interactive exploration and batch processing capabilities, supporting large-scale layer-wise analyses. Through open-sourcing our implementation, we aim to facilitate deeper investigations into the internal mechanisms of large-scale language models. The toolkit is openly available at this https URL.

Title: MELON: Multimodal Mixture-of-Experts with Spectral-Temporal Fusion for Long-Term Mobility Estimation in Critical Care

Authors: Jiaqing Zhang, Miguel Contreras, Jessica Sena, Andrea Davidson, Yuanfang Ren, Ziyuan Guan, Tezcan Ozrazgat-Baslanti, Tyler J. Loftus, Subhash Nerella, Azra Bihorac, Parisa Rashidi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11695
Pdf URL: https://arxiv.org/pdf/2503.11695
Copy Paste: [[2503.11695]] MELON: Multimodal Mixture-of-Experts with Spectral-Temporal Fusion for Long-Term Mobility Estimation in Critical Care(https://arxiv.org/abs/2503.11695)
Keywords: robust, extraction
Abstract: Patient mobility monitoring in intensive care is critical for ensuring timely interventions and improving clinical outcomes. While accelerometry-based sensor data are widely adopted in training artificial intelligence models to estimate patient mobility, existing approaches face two key limitations highlighted in clinical practice: (1) modeling the long-term accelerometer data is challenging due to the high dimensionality, variability, and noise, and (2) the absence of efficient and robust methods for long-term mobility assessment. To overcome these challenges, we introduce MELON, a novel multimodal framework designed to predict 12-hour mobility status in the critical care setting. MELON leverages the power of a dual-branch network architecture, combining the strengths of spectrogram-based visual representations and sequential accelerometer statistical features. MELON effectively captures global and fine-grained mobility patterns by integrating a pre-trained image encoder for rich frequency-domain feature extraction and a Mixture-of-Experts encoder for sequence modeling. We trained and evaluated the MELON model on the multimodal dataset of 126 patients recruited from nine Intensive Care Units at the University of Florida Health Shands Hospital main campus in Gainesville, Florida. Experiments showed that MELON outperforms conventional approaches for 12-hour mobility status estimation with an overall area under the receiver operating characteristic curve (AUROC) of 0.82 (95\%, confidence interval 0.78-0.86). Notably, our experiments also revealed that accelerometer data collected from the wrist provides robust predictive performance compared with data from the ankle, suggesting a single-sensor solution that can reduce patient burden and lower deployment costs...

Title: Generalization of Video-Based Heart Rate Estimation Methods To Low Illumination and Elevated Heart Rates

Authors: Bhargav Acharya, William Saakyan, Barbara Hammer, Hanna Drimalla
Subjects: cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2503.11697
Pdf URL: https://arxiv.org/pdf/2503.11697
Copy Paste: [[2503.11697]] Generalization of Video-Based Heart Rate Estimation Methods To Low Illumination and Elevated Heart Rates(https://arxiv.org/abs/2503.11697)
Keywords: robust
Abstract: Heart rate is a physiological signal that provides information about an individual's health and affective state. Remote photoplethysmography (rPPG) allows the estimation of this signal from video recordings of a person's face. Classical rPPG methods make use of signal processing techniques, while recent rPPG methods utilize deep learning networks. Methods are typically evaluated on datasets collected in well-lit environments with participants at resting heart rates. However, little investigation has been done on how well these methods adapt to variations in illumination and heart rate. In this work, we systematically evaluate representative state-of-the-art methods for remote heart rate estimation. Specifically, we evaluate four classical methods and four deep learning-based rPPG estimation methods in terms of their generalization ability to changing scenarios, including low lighting conditions and elevated heart rates. For a thorough evaluation of existing approaches, we collected a novel dataset called CHILL, which systematically varies heart rate and lighting conditions. The dataset consists of recordings from 45 participants in four different scenarios. The video data was collected under two different lighting conditions (high and low) and normal and elevated heart rates. In addition, we selected two public datasets to conduct within- and cross-dataset evaluations of the rPPG methods. Our experimental results indicate that classical methods are not significantly impacted by low-light conditions. Meanwhile, some deep learning methods were found to be more robust to changes in lighting conditions but encountered challenges in estimating high heart rates. The cross-dataset evaluation revealed that the selected deep learning methods underperformed when influencing factors such as elevated heart rates and low lighting conditions were not present in the training set.

Title: A Survey of Direct Preference Optimization

Authors: Shunyu Liu, Wenkai Fang, Zetian Hu, Junjie Zhang, Yang Zhou, Kongcheng Zhang, Rongcheng Tu, Ting-En Lin, Fei Huang, Mingli Song, Yongbin Li, Dacheng Tao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.11701
Pdf URL: https://arxiv.org/pdf/2503.11701
Copy Paste: [[2503.11701]] A Survey of Direct Preference Optimization(https://arxiv.org/abs/2503.11701)
Keywords: robust, generative, large language model
Abstract: Large Language Models (LLMs) have demonstrated unprecedented generative capabilities, yet their alignment with human values remains critical for ensuring helpful and harmless deployments. While Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful paradigm for aligning LLMs with human preferences, its reliance on complex reward modeling introduces inherent trade-offs in computational efficiency and training stability. In this context, Direct Preference Optimization (DPO) has recently gained prominence as a streamlined alternative that directly optimizes LLMs using human preferences, thereby circumventing the need for explicit reward modeling. Owing to its theoretical elegance and computational efficiency, DPO has rapidly attracted substantial research efforts exploring its various implementations and applications. However, this field currently lacks systematic organization and comparative analysis. In this survey, we conduct a comprehensive overview of DPO and introduce a novel taxonomy, categorizing previous works into four key dimensions: data strategy, learning framework, constraint mechanism, and model property. We further present a rigorous empirical analysis of DPO variants across standardized benchmarks. Additionally, we discuss real-world applications, open challenges, and future directions for DPO. This work delivers both a conceptual framework for understanding DPO and practical guidance for practitioners, aiming to advance robust and generalizable alignment paradigms. All collected resources are available and will be continuously updated at this https URL.

Title: Refining Filter Global Feature Weighting for Fully-Unsupervised Clustering

Authors: Fabian Galis, Darian Onchis
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11706
Pdf URL: https://arxiv.org/pdf/2503.11706
Copy Paste: [[2503.11706]] Refining Filter Global Feature Weighting for Fully-Unsupervised Clustering(https://arxiv.org/abs/2503.11706)
Keywords: explainability
Abstract: In the context of unsupervised learning, effective clustering plays a vital role in revealing patterns and insights from unlabeled data. However, the success of clustering algorithms often depends on the relevance and contribution of features, which can differ between various datasets. This paper explores feature weighting for clustering and presents new weighting strategies, including methods based on SHAP (SHapley Additive exPlanations), a technique commonly used for providing explainability in various supervised machine learning tasks. By taking advantage of SHAP values in a way other than just to gain explainability, we use them to weight features and ultimately improve the clustering process itself in unsupervised scenarios. Our empirical evaluations across five benchmark datasets and clustering methods demonstrate that feature weighting based on SHAP can enhance unsupervised clustering quality, achieving up to a 22.69\% improvement over other weighting methods (from 0.586 to 0.719 in terms of the Adjusted Rand Index). Additionally, these situations where the weighted data boosts the results are highlighted and thoroughly explored, offering insight for practical applications.

Title: Privacy-Preserved Automated Scoring using Federated Learning for Educational Research

Authors: Ehsan Latif, Xiaoming Zhai
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11711
Pdf URL: https://arxiv.org/pdf/2503.11711
Copy Paste: [[2503.11711]] Privacy-Preserved Automated Scoring using Federated Learning for Educational Research(https://arxiv.org/abs/2503.11711)
Keywords: privacy, robust, federate
Abstract: Data privacy remains a critical concern in educational research, necessitating Institutional Review Board (IRB) certification and stringent data handling protocols to ensure compliance with ethical standards. Traditional approaches rely on anonymization and controlled data-sharing mechanisms to facilitate research while mitigating privacy risks. However, these methods still involve direct access to raw student data, posing potential vulnerabilities and being time-consuming. This study proposes a federated learning (FL) framework for automatic scoring in educational assessments, eliminating the need to share raw data. Our approach leverages client-side model training, where student responses are processed locally on edge devices, and only optimized model parameters are shared with a central aggregation server. To effectively aggregate heterogeneous model updates, we introduce an adaptive weighted averaging strategy, which dynamically adjusts weight contributions based on client-specific learning characteristics. This method ensures robust model convergence while preserving privacy. We evaluate our framework using assessment data from nine middle schools, comparing the accuracy of federated learning-based scoring models with traditionally trained centralized models. A statistical significance test (paired t-test, $t(8) = 2.29, p = 0.051$) confirms that the accuracy difference between the two approaches is not statistically significant, demonstrating that federated learning achieves comparable performance while safeguarding student data. Furthermore, our method significantly reduces data collection, processing, and deployment overhead, accelerating the adoption of AI-driven educational assessments in a privacy-compliant manner.

Title: Fine-Tuning Diffusion Generative Models via Rich Preference Optimization

Authors: Hanyang Zhao, Haoxian Chen, Yucheng Guo, Genta Indra Winata, Tingting Ou, Ziyu Huang, David D. Yao, Wenpin Tang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11720
Pdf URL: https://arxiv.org/pdf/2503.11720
Copy Paste: [[2503.11720]] Fine-Tuning Diffusion Generative Models via Rich Preference Optimization(https://arxiv.org/abs/2503.11720)
Keywords: diffusion, generative
Abstract: We introduce Rich Preference Optimization (RPO), a novel pipeline that leverages rich feedback signals to improve the curation of preference pairs for fine-tuning text-to-image diffusion models. Traditional methods, like Diffusion-DPO, often rely solely on reward model labeling, which can be opaque, offer limited insights into the rationale behind preferences, and are prone to issues such as reward hacking or overfitting. In contrast, our approach begins with generating detailed critiques of synthesized images to extract reliable and actionable image editing instructions. By implementing these instructions, we create refined images, resulting in synthetic, informative preference pairs that serve as enhanced tuning datasets. We demonstrate the effectiveness of our pipeline and the resulting datasets in fine-tuning state-of-the-art diffusion models.

Title: BACE-RUL: A Bi-directional Adversarial Network with Covariate Encoding for Machine Remaining Useful Life Prediction

Authors: Zekai Zhang, Dan Li, Shunyu Wu, Junya Cai, Bo Zhang, See Kiong Ng, Zibin Zheng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11730
Pdf URL: https://arxiv.org/pdf/2503.11730
Copy Paste: [[2503.11730]] BACE-RUL: A Bi-directional Adversarial Network with Covariate Encoding for Machine Remaining Useful Life Prediction(https://arxiv.org/abs/2503.11730)
Keywords: generative
Abstract: Prognostic and Health Management (PHM) are crucial ways to avoid unnecessary maintenance for Cyber-Physical Systems (CPS) and improve system reliability. Predicting the Remaining Useful Life (RUL) is one of the most challenging tasks for PHM. Existing methods require prior knowledge about the system, contrived assumptions, or temporal mining to model the life cycles of machine equipment/devices, resulting in diminished accuracy and limited applicability in real-world scenarios. This paper proposes a Bi-directional Adversarial network with Covariate Encoding for machine Remaining Useful Life (BACE-RUL) prediction, which only adopts sensor measurements from the current life cycle to predict RUL rather than relying on previous consecutive cycle recordings. The current sensor measurements of mechanical devices are encoded to a conditional space to better understand the implicit inner mechanical status. The predictor is trained as a conditional generative network with the encoded sensor measurements as its conditions. Various experiments on several real-world datasets, including the turbofan aircraft engine dataset and the dataset collected from degradation experiments of Li-Ion battery cells, show that the proposed model is a general framework and outperforms state-of-the-art methods.

Title: Industrial-Grade Sensor Simulation via Gaussian Splatting: A Modular Framework for Scalable Editing and Full-Stack Validation

Authors: Xianming Zeng, Sicong Du, Qifeng Chen, Lizhe Liu, Haoyu Shu, Jiaxuan Gao, Jiarun Liu, Jiulong Xu, Jianyun Xu, Mingxia Chen, Yiru Zhao, Peng Chen, Yapeng Xue, Chunming Zhao, Sheng Yang, Qiang Li
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.11731
Pdf URL: https://arxiv.org/pdf/2503.11731
Copy Paste: [[2503.11731]] Industrial-Grade Sensor Simulation via Gaussian Splatting: A Modular Framework for Scalable Editing and Full-Stack Validation(https://arxiv.org/abs/2503.11731)
Keywords: diffusion
Abstract: Sensor simulation is pivotal for scalable validation of autonomous driving systems, yet existing Neural Radiance Fields (NeRF) based methods face applicability and efficiency challenges in industrial workflows. This paper introduces a Gaussian Splatting (GS) based system to address these challenges: We first break down sensor simulator components and analyze the possible advantages of GS over NeRF. Then in practice, we refactor three crucial components through GS, to leverage its explicit scene representation and real-time rendering: (1) choosing the 2D neural Gaussian representation for physics-compliant scene and sensor modeling, (2) proposing a scene editing pipeline to leverage Gaussian primitives library for data augmentation, and (3) coupling a controllable diffusion model for scene expansion and harmonization. We implement this framework on a proprietary autonomous driving dataset supporting cameras and LiDAR sensors. We demonstrate through ablation studies that our approach reduces frame-wise simulation latency, achieves better geometric and photometric consistency, and enables interpretable explicit scene editing and expansion. Furthermore, we showcase how integrating such a GS-based sensor simulator with traffic and dynamic simulators enables full-stack testing of end-to-end autonomy algorithms. Our work provides both algorithmic insights and practical validation, establishing GS as a cornerstone for industrial-grade sensor simulation.

Title: CoLLMLight: Cooperative Large Language Model Agents for Network-Wide Traffic Signal Control

Authors: Zirui Yuan, Siqi Lai, Hao Liu
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.11739
Pdf URL: https://arxiv.org/pdf/2503.11739
Copy Paste: [[2503.11739]] CoLLMLight: Cooperative Large Language Model Agents for Network-Wide Traffic Signal Control(https://arxiv.org/abs/2503.11739)
Keywords: robust, large language model
Abstract: Traffic Signal Control (TSC) plays a critical role in urban traffic management by optimizing traffic flow and mitigating congestion. While Large Language Models (LLMs) have recently emerged as promising tools for TSC due to their exceptional problem-solving and generalization capabilities, existing approaches fail to address the essential need for inter-agent coordination, limiting their effectiveness in achieving network-wide optimization. To bridge this gap, we propose CoLLMLight, a cooperative LLM agent framework for TSC. Specifically, we first construct a structured spatiotemporal graph to capture real-time traffic dynamics and spatial relationships among neighboring intersections, enabling the LLM to reason about complex traffic interactions. Moreover, we introduce a complexity-aware reasoning mechanism that dynamically adapts reasoning depth based on real-time traffic conditions, ensuring optimal computational efficiency without sacrificing decision quality. Besides, we propose a fine-tuning strategy that leverages iterative simulation-driven data collection and environmental feedback to build a lightweight LLM tailored for cooperative TSC. Extensive experiments on both synthetic and real-world datasets demonstrate that CoLLMLight outperforms state-of-the-art methods in diverse traffic scenarios, showcasing its effectiveness, scalability, and robustness.

Title: BioMamba: Leveraging Spectro-Temporal Embedding in Bidirectional Mamba for Enhanced Biosignal Classification

Authors: Jian Qian, Teck Lun Goh, Bingyu Xie, Chengyao Zhu, Biao Wan, Yawen Guan, Patrick Yin Chiang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11741
Pdf URL: https://arxiv.org/pdf/2503.11741
Copy Paste: [[2503.11741]] BioMamba: Leveraging Spectro-Temporal Embedding in Bidirectional Mamba for Enhanced Biosignal Classification(https://arxiv.org/abs/2503.11741)
Keywords: robust
Abstract: Biological signals, such as electroencephalograms (EEGs) and electrocardiograms (ECGs), play a pivotal role in numerous clinical practices, such as diagnosing brain and cardiac arrhythmic diseases. Existing methods for biosignal classification rely on Attention-based frameworks with dense Feed Forward layers, which lead to inefficient learning, high computational overhead, and suboptimal performance. In this work, we introduce BioMamba, a Spectro-Temporal Embedding strategy applied to the Bidirectional Mamba framework with Sparse Feed Forward layers to enable effective learning of biosignal sequences. By integrating these three key components, BioMamba effectively addresses the limitations of existing methods. Extensive experiments demonstrate that BioMamba significantly outperforms state-of-the-art methods with marked improvement in classification performance. The advantages of the proposed BioMamba include (1) Reliability: BioMamba consistently delivers robust results, confirmed across six evaluation metrics. (2) Efficiency: We assess both model and training efficiency, the BioMamba demonstrates computational effectiveness by reducing model size and resource consumption compared to existing approaches. (3) Generality: With the capacity to effectively classify a diverse set of tasks, BioMamba demonstrates adaptability and effectiveness across various domains and applications.

Title: Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization

Authors: Shuyang Hao, Yiwei Wang, Bryan Hooi, Jun Liu, Muhao Chen, Zi Huang, Yujun Cai
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2503.11750
Pdf URL: https://arxiv.org/pdf/2503.11750
Copy Paste: [[2503.11750]] Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization(https://arxiv.org/abs/2503.11750)
Keywords: defense, attack
Abstract: In the realm of large vision-language models (LVLMs), adversarial jailbreak attacks serve as a red-teaming approach to identify safety vulnerabilities of these models and their associated defense mechanisms. However, we identify a critical limitation: not every adversarial optimization step leads to a positive outcome, and indiscriminately accepting optimization results at each step may reduce the overall attack success rate. To address this challenge, we introduce HKVE (Hierarchical Key-Value Equalization), an innovative jailbreaking framework that selectively accepts gradient optimization results based on the distribution of attention scores across different layers, ensuring that every optimization step positively contributes to the attack. Extensive experiments demonstrate HKVE's significant effectiveness, achieving attack success rates of 75.08% on MiniGPT4, 85.84% on LLaVA and 81.00% on Qwen-VL, substantially outperforming existing methods by margins of 20.43\%, 21.01\% and 26.43\% respectively. Furthermore, making every step effective not only leads to an increase in attack success rate but also allows for a reduction in the number of iterations, thereby lowering computational costs. Warning: This paper contains potentially harmful example data.

Title: reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs

Authors: Zhaofeng Wu, Michihiro Yasunaga, Andrew Cohen, Yoon Kim, Asli Celikyilmaz, Marjan Ghazvininejad
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11751
Pdf URL: https://arxiv.org/pdf/2503.11751
Copy Paste: [[2503.11751]] reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs(https://arxiv.org/abs/2503.11751)
Keywords: robust
Abstract: Reward models have become a staple in modern NLP, serving as not only a scalable text evaluator, but also an indispensable component in many alignment recipes and inference-time algorithms. However, while recent reward models increase performance on standard benchmarks, this may partly be due to overfitting effects, which would confound an understanding of their true capability. In this work, we scrutinize the robustness of reward models and the extent of such overfitting. We build **reWordBench**, which systematically transforms reward model inputs in meaning- or ranking-preserving ways. We show that state-of-the-art reward models suffer from substantial performance degradation even with minor input transformations, sometimes dropping to significantly below-random accuracy, suggesting brittleness. To improve reward model robustness, we propose to explicitly train them to assign similar scores to paraphrases, and find that this approach also improves robustness to other distinct kinds of transformations. For example, our robust reward model reduces such degradation by roughly half for the Chat Hard subset in RewardBench. Furthermore, when used in alignment, our robust reward models demonstrate better utility and lead to higher-quality outputs, winning in up to 59% of instances against a standardly trained RM.

Title: UBMF: Uncertainty-Aware Bayesian Meta-Learning Framework for Fault Diagnosis with Imbalanced Industrial Data

Authors: Zhixuan Lian, Shangyu Li, Qixuan Huang, Zijian Huang, Haifei Liu, Jianan Qiu, Puyu Yang, Laifa Tao
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.11774
Pdf URL: https://arxiv.org/pdf/2503.11774
Copy Paste: [[2503.11774]] UBMF: Uncertainty-Aware Bayesian Meta-Learning Framework for Fault Diagnosis with Imbalanced Industrial Data(https://arxiv.org/abs/2503.11774)
Keywords: robust, extraction
Abstract: Fault diagnosis of mechanical equipment involves data collection, feature extraction, and pattern recognition but is often hindered by the imbalanced nature of industrial data, introducing significant uncertainty and reducing diagnostic reliability. To address these challenges, this study proposes the Uncertainty-Aware Bayesian Meta-Learning Framework (UBMF), which integrates four key modules: data perturbation injection for enhancing feature robustness, cross-task self-supervised feature extraction for improving transferability, uncertainty-based sample filtering for robust out-of-domain generalization, and Bayesian meta-knowledge integration for fine-grained classification. Experimental results on ten open-source datasets under various imbalanced conditions, including cross-task, small-sample, and unseen-sample scenarios, demonstrate the superiority of UBMF, achieving an average improvement of 42.22% across ten Any-way 1-5-shot diagnostic tasks. This integrated framework effectively enhances diagnostic accuracy, generalization, and adaptability, providing a reliable solution for complex industrial fault diagnosis.

Title: Enhancing Resiliency of Sketch-based Security via LSB Sharing-based Dynamic Late Merging

Authors: Seungsam Yang, Seyed Mohammad Mehdi Mirnajafizadeh, Sian Kim, Rhongho Jang, DaeHun Nyang
Subjects: cs.CR, cs.NI
Abstract URL: https://arxiv.org/abs/2503.11777
Pdf URL: https://arxiv.org/pdf/2503.11777
Copy Paste: [[2503.11777]] Enhancing Resiliency of Sketch-based Security via LSB Sharing-based Dynamic Late Merging(https://arxiv.org/abs/2503.11777)
Keywords: security, attack
Abstract: With the exponentially growing Internet traffic, sketch data structure with a probabilistic algorithm has been expected to be an alternative solution for non-compromised (non-selective) security monitoring. While facilitating counting within a confined memory space, the sketch's memory efficiency and accuracy were further pushed to their limit through finer-grained and dynamic control of constrained memory space to adapt to the data stream's inherent skewness (i.e., Zipf distribution), namely small counters with extensions. In this paper, we unveil a vulnerable factor of the small counter design by introducing a new sketch-oriented attack, which threatens a stream of state-of-the-art sketches and their security applications. With the root cause analyses, we propose Siamese Counter with enhanced adversarial resiliency and verified feasibility with extensive experimental and theoretical analyses. Under a sketch pollution attack, Siamese Counter delivers 47% accurate results than a state-of-the-art scheme, and demonstrates up to 82% more accurate estimation under normal measurement scenarios.

Title: Rethinking Multi-modal Object Detection from the Perspective of Mono-Modality Feature Learning

Authors: Tianyi Zhao, Boyang Liu, Yanglei Gao, Yiming Sun, Maoxun Yuan, Xingxing Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11780
Pdf URL: https://arxiv.org/pdf/2503.11780
Copy Paste: [[2503.11780]] Rethinking Multi-modal Object Detection from the Perspective of Mono-Modality Feature Learning(https://arxiv.org/abs/2503.11780)
Keywords: extraction
Abstract: Multi-Modal Object Detection (MMOD), due to its stronger adaptability to various complex environments, has been widely applied in various applications. Extensive research is dedicated to the RGB-IR object detection, primarily focusing on how to integrate complementary features from RGB-IR modalities. However, they neglect the mono-modality insufficient learning problem that the decreased feature extraction capability in multi-modal joint learning. This leads to an unreasonable but prevalent phenomenon--Fusion Degradation, which hinders the performance improvement of the MMOD model. Motivated by this, in this paper, we introduce linear probing evaluation to the multi-modal detectors and rethink the multi-modal object detection task from the mono-modality learning perspective. Therefore, we construct an novel framework called M$^2$D-LIF, which consists of the Mono-Modality Distillation (M$^2$D) method and the Local Illumination-aware Fusion (LIF) module. The M$^2$D-LIF framework facilitates the sufficient learning of mono-modality during multi-modal joint training and explores a lightweight yet effective feature fusion manner to achieve superior object detection performance. Extensive experiments conducted on three MMOD datasets demonstrate that our M$^2$D-LIF effectively mitigates the Fusion Degradation phenomenon and outperforms the previous SOTA detectors.

Title: StyleMorpheus: A Style-Based 3D-Aware Morphable Face Model

Authors: Peizhi Yan, Rabab K. Ward, Dan Wang, Qiang Tang, Shan Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11792
Pdf URL: https://arxiv.org/pdf/2503.11792
Copy Paste: [[2503.11792]] StyleMorpheus: A Style-Based 3D-Aware Morphable Face Model(https://arxiv.org/abs/2503.11792)
Keywords: generative
Abstract: For 3D face modeling, the recently developed 3D-aware neural rendering methods are able to render photorealistic face images with arbitrary viewing directions. The training of the parametric controllable 3D-aware face models, however, still relies on a large-scale dataset that is lab-collected. To address this issue, this paper introduces "StyleMorpheus", the first style-based neural 3D Morphable Face Model (3DMM) that is trained on in-the-wild images. It inherits 3DMM's disentangled controllability (over face identity, expression, and appearance) but without the need for accurately reconstructed explicit 3D shapes. StyleMorpheus employs an auto-encoder structure. The encoder aims at learning a representative disentangled parametric code space and the decoder improves the disentanglement using shape and appearance-related style codes in the different sub-modules of the network. Furthermore, we fine-tune the decoder through style-based generative adversarial learning to achieve photorealistic 3D rendering quality. The proposed style-based design enables StyleMorpheus to achieve state-of-the-art 3D-aware face reconstruction results, while also allowing disentangled control of the reconstructed face. Our model achieves real-time rendering speed, allowing its use in virtual reality applications. We also demonstrate the capability of the proposed style-based design in face editing applications such as style mixing and color editing. Project homepage: this https URL.

Title: Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guidedd Visual Selection

Authors: Bangzheng Li, Fei Wang, Wenxuan Zhou, Nan Xu, Ben Zhou, Sheng Zhang, Hoifung Poon, Muhao Chen
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.11794
Pdf URL: https://arxiv.org/pdf/2503.11794
Copy Paste: [[2503.11794]] Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guidedd Visual Selection(https://arxiv.org/abs/2503.11794)
Keywords: large language model
Abstract: Vision-Language Models (VLMs) leverage aligned visual encoders to transform images into visual tokens, allowing them to be processed similarly to text by the backbone large language model (LLM). This unified input paradigm enables VLMs to excel in vision-language tasks such as visual question answering (VQA). To improve fine-grained visual reasoning, recent advancements in vision-language modeling introduce image cropping techniques that feed all encoded sub-images into the model. However, this approach significantly increases the number of visual tokens, leading to inefficiency and potential distractions for the LLM. To address the generalization challenges of image representation in VLMs, we propose a lightweight, universal framework that seamlessly integrates with existing VLMs to enhance their ability to process finegrained details. Our method leverages textual semantics to identify key visual areas, improving VQA performance without requiring any retraining of the VLM. Additionally, it incorporates textual signals into the visual encoding process, enhancing both efficiency and effectiveness. The proposed method, SEMCLIP, strengthens the visual understanding of a 7B VLM, LLaVA-1.5 by 3.3% on average across 7 benchmarks, and particularly by 5.3% on the challenging detailed understanding benchmark V*.

Title: Key, Value, Compress: A Systematic Exploration of KV Cache Compression Techniques

Authors: Neusha Javidnia, Bita Darvish Rouhani, Farinaz Koushanfar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11816
Pdf URL: https://arxiv.org/pdf/2503.11816
Copy Paste: [[2503.11816]] Key, Value, Compress: A Systematic Exploration of KV Cache Compression Techniques(https://arxiv.org/abs/2503.11816)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated exceptional capabilities in generating text, images, and video content. However, as context length grows, the computational cost of attention increases quadratically with the number of tokens, presenting significant efficiency challenges. This paper presents an analysis of various Key-Value (KV) cache compression strategies, offering a comprehensive taxonomy that categorizes these methods by their underlying principles and implementation techniques. Furthermore, we evaluate their impact on performance and inference latency, providing critical insights into their effectiveness. Our findings highlight the trade-offs involved in KV cache compression and its influence on handling long-context scenarios, paving the way for more efficient LLM implementations.

Title: Bridging the LLM Accessibility Divide? Performance, Fairness, and Cost of Closed versus Open LLMs for Automated Essay Scoring

Authors: Kezia Oketch, John P. Lalor, Yi Yang, Ahmed Abbasi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11827
Pdf URL: https://arxiv.org/pdf/2503.11827
Copy Paste: [[2503.11827]] Bridging the LLM Accessibility Divide? Performance, Fairness, and Cost of Closed versus Open LLMs for Automated Essay Scoring(https://arxiv.org/abs/2503.11827)
Keywords: fair, generative, large language model
Abstract: Closed large language models (LLMs) such as GPT-4 have set state-of-the-art results across a number of NLP tasks and have become central to NLP and machine learning (ML)-driven solutions. Closed LLMs' performance and wide adoption has sparked considerable debate about their accessibility in terms of availability, cost, and transparency. In this study, we perform a rigorous comparative analysis of nine leading LLMs, spanning closed, open, and open-source LLM ecosystems, across text assessment and generation tasks related to automated essay scoring. Our findings reveal that for few-shot learning-based assessment of human generated essays, open LLMs such as Llama 3 and Qwen2.5 perform comparably to GPT-4 in terms of predictive performance, with no significant differences in disparate impact scores when considering age- or race-related fairness. Moreover, Llama 3 offers a substantial cost advantage, being up to 37 times more cost-efficient than GPT-4. For generative tasks, we find that essays generated by top open LLMs are comparable to closed LLMs in terms of their semantic composition/embeddings and ML assessed scores. Our findings challenge the dominance of closed LLMs and highlight the democratizing potential of open LLMs, suggesting they can effectively bridge accessibility divides while maintaining competitive performance and fairness.

Title: Performance Analysis of Decentralized Federated Learning Deployments

Authors: Chengyan Jiang, Jiamin Fan, Talal Halabi, Israat Haque
Subjects: cs.LG, cs.DC, cs.NI
Abstract URL: https://arxiv.org/abs/2503.11828
Pdf URL: https://arxiv.org/pdf/2503.11828
Copy Paste: [[2503.11828]] Performance Analysis of Decentralized Federated Learning Deployments(https://arxiv.org/abs/2503.11828)
Keywords: privacy, robust, federate, large language model
Abstract: The widespread adoption of smartphones and smart wearable devices has led to the widespread use of Centralized Federated Learning (CFL) for training powerful machine learning models while preserving data privacy. However, CFL faces limitations due to its overreliance on a central server, which impacts latency and system robustness. Decentralized Federated Learning (DFL) is introduced to address these challenges. It facilitates direct collaboration among participating devices without relying on a central server. Each device can independently connect with other devices and share model parameters. This work explores crucial factors influencing the convergence and generalization capacity of DFL models, emphasizing network topologies, non-IID data distribution, and training strategies. We first derive the convergence rate of different DFL model deployment strategies. Then, we comprehensively analyze various network topologies (e.g., linear, ring, star, and mesh) with different degrees of non-IID data and evaluate them over widely adopted machine learning models (e.g., classical, deep neural networks, and Large Language Models) and real-world datasets. The results reveal that models converge to the optimal one for IID data. However, the convergence rate is inversely proportional to the degree of non-IID data distribution. Our findings will serve as valuable guidelines for designing effective DFL model deployments in practical applications.

Title: A Transformer and Prototype-based Interpretable Model for Contextual Sarcasm Detection

Authors: Ximing Wen, Rezvaneh Rezapour
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11838
Pdf URL: https://arxiv.org/pdf/2503.11838
Copy Paste: [[2503.11838]] A Transformer and Prototype-based Interpretable Model for Contextual Sarcasm Detection(https://arxiv.org/abs/2503.11838)
Keywords: interpretability, transformer
Abstract: Sarcasm detection, with its figurative nature, poses unique challenges for affective systems designed to perform sentiment analysis. While these systems typically perform well at identifying direct expressions of emotion, they struggle with sarcasm's inherent contradiction between literal and intended sentiment. Since transformer-based language models (LMs) are known for their efficient ability to capture contextual meanings, we propose a method that leverages LMs and prototype-based networks, enhanced by sentiment embeddings to conduct interpretable sarcasm detection. Our approach is intrinsically interpretable without extra post-hoc interpretability techniques. We test our model on three public benchmark datasets and show that our model outperforms the current state-of-the-art. At the same time, the prototypical layer enhances the model's inherent interpretability by generating explanations through similar examples in the reference time. Furthermore, we demonstrate the effectiveness of incongruity loss in the ablation study, which we construct using sentiment prototypes.

Title: Trust Under Siege: Label Spoofing Attacks against Machine Learning for Android Malware Detection

Authors: Tianwei Lan, Luca Demetrio, Farid Nait-Abdesselam, Yufei Han, Simone Aonzo
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.11841
Pdf URL: https://arxiv.org/pdf/2503.11841
Copy Paste: [[2503.11841]] Trust Under Siege: Label Spoofing Attacks against Machine Learning for Android Malware Detection(https://arxiv.org/abs/2503.11841)
Keywords: attack
Abstract: Machine learning (ML) malware detectors rely heavily on crowd-sourced AntiVirus (AV) labels, with platforms like VirusTotal serving as a trusted source of malware annotations. But what if attackers could manipulate these labels to classify benign software as malicious? We introduce label spoofing attacks, a new threat that contaminates crowd-sourced datasets by embedding minimal and undetectable malicious patterns into benign samples. These patterns coerce AV engines into misclassifying legitimate files as harmful, enabling poisoning attacks against ML-based malware classifiers trained on those data. We demonstrate this scenario by developing AndroVenom, a methodology for polluting realistic data sources, causing consequent poisoning attacks against ML malware detectors. Experiments show that not only state-of-the-art feature extractors are unable to filter such injection, but also various ML models experience Denial of Service already with 1% poisoned samples. Additionally, attackers can flip decisions of specific unaltered benign samples by modifying only 0.015% of the training data, threatening their reputation and market share and being unable to be stopped by anomaly detectors on training data. We conclude our manuscript by raising the alarm on the trustworthiness of the training process based on AV annotations, requiring further investigation on how to produce proper labels for ML malware detectors.

Title: Test-Time Training Provably Improves Transformers as In-context Learners

Authors: Halil Alperen Gozeten, M. Emrullah Ildiz, Xuechen Zhang, Mahdi Soltanolkotabi, Marco Mondelli, Samet Oymak
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.11842
Pdf URL: https://arxiv.org/pdf/2503.11842
Copy Paste: [[2503.11842]] Test-Time Training Provably Improves Transformers as In-context Learners(https://arxiv.org/abs/2503.11842)
Keywords: transformer
Abstract: Test-time training (TTT) methods explicitly update the weights of a model to adapt to the specific test instance, and they have found success in a variety of settings, including most recently language modeling and reasoning. To demystify this success, we investigate a gradient-based TTT algorithm for in-context learning, where we train a transformer model on the in-context demonstrations provided in the test prompt. Specifically, we provide a comprehensive theoretical characterization of linear transformers when the update rule is a single gradient step. Our theory (i) delineates the role of alignment between pretraining distribution and target task, (ii) demystifies how TTT can alleviate distribution shift, and (iii) quantifies the sample complexity of TTT including how it can significantly reduce the eventual sample size required for in-context learning. As our empirical contribution, we study the benefits of TTT for TabPFN, a tabular foundation model. In line with our theory, we demonstrate that TTT significantly reduces the required sample size for tabular classification (3 to 5 times fewer) unlocking substantial inference efficiency with a negligible training cost.

Title: Local Pan-Privacy for Federated Analytics

Authors: Vitaly Feldman, Audra McMillan, Guy N. Rothblum, Kunal Talwar
Subjects: cs.CR, cs.DS, cs.LG
Abstract URL: https://arxiv.org/abs/2503.11850
Pdf URL: https://arxiv.org/pdf/2503.11850
Copy Paste: [[2503.11850]] Local Pan-Privacy for Federated Analytics(https://arxiv.org/abs/2503.11850)
Keywords: privacy, federate
Abstract: Pan-privacy was proposed by Dwork et al. as an approach to designing a private analytics system that retains its privacy properties in the face of intrusions that expose the system's internal state. Motivated by federated telemetry applications, we study local pan-privacy, where privacy should be retained under repeated unannounced intrusions on the local state. We consider the problem of monitoring the count of an event in a federated system, where event occurrences on a local device should be hidden even from an intruder on that device. We show that under reasonable constraints, the goal of providing information-theoretic differential privacy under intrusion is incompatible with collecting telemetry information. We then show that this problem can be solved in a scalable way using standard cryptographic primitives.

Title: OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs

Authors: Ivan Kartáč, Mateusz Lango, Ondřej Dušek
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11858
Pdf URL: https://arxiv.org/pdf/2503.11858
Copy Paste: [[2503.11858]] OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs(https://arxiv.org/abs/2503.11858)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated great potential as evaluators of NLG systems, allowing for high-quality, reference-free, and multi-aspect assessments. However, existing LLM-based metrics suffer from two major drawbacks: reliance on proprietary models to generate training data or perform evaluations, and a lack of fine-grained, explanatory feedback. In this paper, we introduce OpeNLGauge, a fully open-source, reference-free NLG evaluation metric that provides accurate explanations based on error spans. OpeNLGauge is available as a two-stage ensemble of larger open-weight LLMs, or as a small fine-tuned evaluation model, with confirmed generalizability to unseen tasks, domains and aspects. Our extensive meta-evaluation shows that OpeNLGauge achieves competitive correlation with human judgments, outperforming state-of-the-art models on certain tasks while maintaining full reproducibility and providing explanations more than twice as accurate.

Title: FedALT: Federated Fine-Tuning through Adaptive Local Training with Rest-of-the-World LoRA

Authors: Jieming Bian, Lei Wang, Letian Zhang, Jie Xu
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.11880
Pdf URL: https://arxiv.org/pdf/2503.11880
Copy Paste: [[2503.11880]] FedALT: Federated Fine-Tuning through Adaptive Local Training with Rest-of-the-World LoRA(https://arxiv.org/abs/2503.11880)
Keywords: privacy, federate, large language model
Abstract: Fine-tuning large language models (LLMs) in federated settings enables privacy-preserving adaptation but suffers from cross-client interference due to model aggregation. Existing federated LoRA fine-tuning methods, primarily based on FedAvg, struggle with data heterogeneity, leading to harmful cross-client interference and suboptimal personalization. In this work, we propose \textbf{FedALT}, a novel personalized federated LoRA fine-tuning algorithm that fundamentally departs from FedAvg. Instead of using an aggregated model to initialize local training, each client continues training its individual LoRA while incorporating shared knowledge through a separate Rest-of-the-World (RoTW) LoRA component. To effectively balance local adaptation and global information, FedALT introduces an adaptive mixer that dynamically learns input-specific weightings between the individual and RoTW LoRA components using the Mixture-of-Experts (MoE) principle. Through extensive experiments on NLP benchmarks, we demonstrate that FedALT significantly outperforms state-of-the-art personalized federated LoRA fine-tuning methods, achieving superior local adaptation without sacrificing computational efficiency.

Title: GPT's Devastated and LLaMA's Content: Emotion Representation Alignment in LLMs for Keyword-based Generation

Authors: Shadab Choudhury, Asha Kumar, Lara J. Martin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11881
Pdf URL: https://arxiv.org/pdf/2503.11881
Copy Paste: [[2503.11881]] GPT's Devastated and LLaMA's Content: Emotion Representation Alignment in LLMs for Keyword-based Generation(https://arxiv.org/abs/2503.11881)
Keywords: large language model
Abstract: In controlled text generation using large language models (LLMs), gaps arise between the language model's interpretation and human expectations. We look at the problem of controlling emotions in keyword-based sentence generation for both GPT-4 and LLaMA-3. We selected four emotion representations: Words, Valence-Arousal-Dominance (VAD) dimensions expressed in both Lexical and Numeric forms, and Emojis. Our human evaluation looked at the Human-LLM alignment for each representation, as well as the accuracy and realism of the generated sentences. While representations like VAD break emotions into easy-to-compute components, our findings show that people agree more with how LLMs generate when conditioned on English words (e.g., "angry") rather than VAD scales. This difference is especially visible when comparing Numeric VAD to words. However, we found that converting the originally-numeric VAD scales to Lexical scales (e.g., +4.0 becomes "High") dramatically improved agreement. Furthermore, the perception of how much a generated sentence conveys an emotion is highly dependent on the LLM, representation type, and which emotion it is.

Title: DecAlign: Hierarchical Cross-Modal Alignment for Decoupled Multimodal Representation Learning

Authors: Chengxuan Qian, Shuo Xing, Shawn Li, Yue Zhao, Zhengzhong Tu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11892
Pdf URL: https://arxiv.org/pdf/2503.11892
Copy Paste: [[2503.11892]] DecAlign: Hierarchical Cross-Modal Alignment for Decoupled Multimodal Representation Learning(https://arxiv.org/abs/2503.11892)
Keywords: transformer
Abstract: Multimodal representation learning aims to capture both shared and complementary semantic information across multiple modalities. However, the intrinsic heterogeneity of diverse modalities presents substantial challenges to achieve effective cross-modal collaboration and integration. To address this, we introduce DecAlign, a novel hierarchical cross-modal alignment framework designed to decouple multimodal representations into modality-unique (heterogeneous) and modality-common (homogeneous) features. For handling heterogeneity, we employ a prototype-guided optimal transport alignment strategy leveraging gaussian mixture modeling and multi-marginal transport plans, thus mitigating distribution discrepancies while preserving modality-unique characteristics. To reinforce homogeneity, we ensure semantic consistency across modalities by aligning latent distribution matching with Maximum Mean Discrepancy regularization. Furthermore, we incorporate a multimodal transformer to enhance high-level semantic feature fusion, thereby further reducing cross-modal inconsistencies. Our extensive experiments on four widely used multimodal benchmarks demonstrate that DecAlign consistently outperforms existing state-of-the-art methods across five metrics. These results highlight the efficacy of DecAlign in enhancing superior cross-modal alignment and semantic consistency while preserving modality-unique features, marking a significant advancement in multimodal representation learning scenarios. Our project page is at this https URL and the code is available at this https URL.

Title: UStyle: Waterbody Style Transfer of Underwater Scenes by Depth-Guided Feature Synthesis

Authors: Md Abu Bakr Siddique, Junliang Liu, Piyush Singh, Md Jahidul Islam
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.11893
Pdf URL: https://arxiv.org/pdf/2503.11893
Copy Paste: [[2503.11893]] UStyle: Waterbody Style Transfer of Underwater Scenes by Depth-Guided Feature Synthesis(https://arxiv.org/abs/2503.11893)
Keywords: robust
Abstract: The concept of waterbody style transfer remains largely unexplored in the underwater imaging and vision literature. Traditional image style transfer (STx) methods primarily focus on artistic and photorealistic blending, often failing to preserve object and scene geometry in images captured in high-scattering mediums such as underwater. The wavelength-dependent nonlinear attenuation and depth-dependent backscattering artifacts further complicate learning underwater image STx from unpaired data. This paper introduces UStyle, the first data-driven learning framework for transferring waterbody styles across underwater images without requiring prior reference images or scene information. We propose a novel depth-aware whitening and coloring transform (DA-WCT) mechanism that integrates physics-based waterbody synthesis to ensure perceptually consistent stylization while preserving scene structure. To enhance style transfer quality, we incorporate carefully designed loss functions that guide UStyle to maintain colorfulness, lightness, structural integrity, and frequency-domain characteristics, as well as high-level content in VGG and CLIP (contrastive language-image pretraining) feature spaces. By addressing domain-specific challenges, UStyle provides a robust framework for no-reference underwater image STx, surpassing state-of-the-art (SOTA) methods that rely solely on end-to-end reconstruction loss. Furthermore, we introduce the UF7D dataset, a curated collection of high-resolution underwater images spanning seven distinct waterbody styles, establishing a benchmark to support future research in underwater image STx. The UStyle inference pipeline and UF7D dataset are released at: this https URL.

Title: Resolving UnderEdit & OverEdit with Iterative & Neighbor-Assisted Model Editing

Authors: Bhiman Kumar Baghel, Scott M. Jordan, Zheyuan Ryan Shi, Xiang Lorraine Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.11895
Pdf URL: https://arxiv.org/pdf/2503.11895
Copy Paste: [[2503.11895]] Resolving UnderEdit & OverEdit with Iterative & Neighbor-Assisted Model Editing(https://arxiv.org/abs/2503.11895)
Keywords: large language model
Abstract: Large Language Models (LLMs) are used in various downstream language tasks, making it crucial to keep their knowledge up-to-date, but both retraining and fine-tuning the model can be costly. Model editing offers an efficient and effective alternative by a single update to only a key subset of model parameters. While being efficient, these methods are not perfect. Sometimes knowledge edits are unsuccessful, i.e., UnderEdit, or the edit contaminated neighboring knowledge that should remain unchanged, i.e., OverEdit. To address these limitations, we propose iterative model editing, based on our hypothesis that a single parameter update is often insufficient, to mitigate UnderEdit, and neighbor-assisted model editing, which incorporates neighboring knowledge during editing to minimize OverEdit. Extensive experiments demonstrate that our methods effectively reduce UnderEdit up to 38 percentage points and OverEdit up to 6 percentage points across multiple model editing algorithms, LLMs, and benchmark datasets.

Title: PREAMBLE: Private and Efficient Aggregation of Block Sparse Vectors and Applications

Authors: Hilal Asi, Vitaly Feldman, Hannah Keller, Guy N. Rothblum, Kunal Talwar
Subjects: cs.CR, cs.DS, cs.LG
Abstract URL: https://arxiv.org/abs/2503.11897
Pdf URL: https://arxiv.org/pdf/2503.11897
Copy Paste: [[2503.11897]] PREAMBLE: Private and Efficient Aggregation of Block Sparse Vectors and Applications(https://arxiv.org/abs/2503.11897)
Keywords: secure, privacy, protect, federate
Abstract: We revisit the problem of secure aggregation of high-dimensional vectors in a two-server system such as Prio. These systems are typically used to aggregate vectors such as gradients in private federated learning, where the aggregate itself is protected via noise addition to ensure differential privacy. Existing approaches require communication scaling with the dimensionality, and thus limit the dimensionality of vectors one can efficiently process in this setup. We propose PREAMBLE: Private Efficient Aggregation Mechanism for BLock-sparse Euclidean Vectors. PREAMBLE is a novel extension of distributed point functions that enables communication- and computation-efficient aggregation of block-sparse vectors, which are sparse vectors where the non-zero entries occur in a small number of clusters of consecutive coordinates. We then show that PREAMBLE can be combined with random sampling and privacy amplification by sampling results, to allow asymptotically optimal privacy-utility trade-offs for vector aggregation, at a fraction of the communication cost. When coupled with recent advances in numerical privacy accounting, our approach incurs a negligible overhead in noise variance, compared to the Gaussian mechanism used with Prio.

Title: LLMs for Translation: Historical, Low-Resourced Languages and Contemporary AI Models

Authors: Merve Tekgurler
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11898
Pdf URL: https://arxiv.org/pdf/2503.11898
Copy Paste: [[2503.11898]] LLMs for Translation: Historical, Low-Resourced Languages and Contemporary AI Models(https://arxiv.org/abs/2503.11898)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable adaptability in performing various tasks, including machine translation (MT), without explicit training. Models such as OpenAI's GPT-4 and Google's Gemini are frequently evaluated on translation benchmarks and utilized as translation tools due to their high performance. This paper examines Gemini's performance in translating an 18th-century Ottoman Turkish manuscript, Prisoner of the Infidels: The Memoirs of Osman Agha of Timisoara, into English. The manuscript recounts the experiences of Osman Agha, an Ottoman subject who spent 11 years as a prisoner of war in Austria, and includes his accounts of warfare and violence. Our analysis reveals that Gemini's safety mechanisms flagged between 14 and 23 percent of the manuscript as harmful, resulting in untranslated passages. These safety settings, while effective in mitigating potential harm, hinder the model's ability to provide complete and accurate translations of historical texts. Through real historical examples, this study highlights the inherent challenges and limitations of current LLM safety implementations in the handling of sensitive and context-rich materials. These real-world instances underscore potential failures of LLMs in contemporary translation scenarios, where accurate and comprehensive translations are crucial-for example, translating the accounts of modern victims of war for legal proceedings or humanitarian documentation.

Title: Spatio-temporal Fourier Transformer (StFT) for Long-term Dynamics Prediction

Authors: Da Long, Shandian Zhe, Samuel Williams, Leonid Oliker, Zhe Bai
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2503.11899
Pdf URL: https://arxiv.org/pdf/2503.11899
Copy Paste: [[2503.11899]] Spatio-temporal Fourier Transformer (StFT) for Long-term Dynamics Prediction(https://arxiv.org/abs/2503.11899)
Keywords: transformer, generative
Abstract: Simulating the long-term dynamics of multi-scale and multi-physics systems poses a significant challenge in understanding complex phenomena across science and engineering. The complexity arises from the intricate interactions between scales and the interplay of diverse physical processes. Neural operators have emerged as promising models for predicting such dynamics due to their flexibility and computational efficiency. However, they often fail to effectively capture multi-scale interactions or quantify the uncertainties inherent in the predictions. These limitations lead to rapid error accumulation, particularly in long-term forecasting of systems characterized by complex and coupled dynamics. To address these challenges, we propose a spatio-temporal Fourier transformer (StFT), in which each transformer block is designed to learn dynamics at a specific scale. By leveraging a structured hierarchy of StFT blocks, the model explicitly captures dynamics across both macro- and micro- spatial scales. Furthermore, a generative residual correction mechanism is integrated to estimate and mitigate predictive uncertainties, enhancing both the accuracy and reliability of long-term forecasts. Evaluations conducted on three benchmark datasets (plasma, fluid, and atmospheric dynamics) demonstrate the advantages of our approach over state-of-the-art ML methods.

Title: Upcycling Text-to-Image Diffusion Models for Multi-Task Capabilities

Authors: Ruchika Chavhan, Abhinav Mehrotra, Malcolm Chadwick, Alberto Gil Ramos, Luca Morreale, Mehdi Noroozi, Sourav Bhattacharya
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11905
Pdf URL: https://arxiv.org/pdf/2503.11905
Copy Paste: [[2503.11905]] Upcycling Text-to-Image Diffusion Models for Multi-Task Capabilities(https://arxiv.org/abs/2503.11905)
Keywords: diffusion
Abstract: Text-to-image synthesis has witnessed remarkable advancements in recent years. Many attempts have been made to adopt text-to-image models to support multiple tasks. However, existing approaches typically require resource-intensive re-training or additional parameters to accommodate for the new tasks, which makes the model inefficient for on-device deployment. We propose Multi-Task Upcycling (MTU), a simple yet effective recipe that extends the capabilities of a pre-trained text-to-image diffusion model to support a variety of image-to-image generation tasks. MTU replaces Feed-Forward Network (FFN) layers in the diffusion model with smaller FFNs, referred to as experts, and combines them with a dynamic routing mechanism. To the best of our knowledge, MTU is the first multi-task diffusion modeling approach that seamlessly blends multi-tasking with on-device compatibility, by mitigating the issue of parameter inflation. We show that the performance of MTU is on par with the single-task fine-tuned diffusion models across several tasks including image editing, super-resolution, and inpainting, while maintaining similar latency and computational load (GFLOPs) as the single-task fine-tuned models.

Title: A Survey on SAR ship classification using Deep Learning

Authors: Ch Muhammad Awais, Marco Reggiannini, Davide Moroni, Emanuele Salerno
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11906
Pdf URL: https://arxiv.org/pdf/2503.11906
Copy Paste: [[2503.11906]] A Survey on SAR ship classification using Deep Learning(https://arxiv.org/abs/2503.11906)
Keywords: interpretability, explainability
Abstract: Deep learning (DL) has emerged as a powerful tool for Synthetic Aperture Radar (SAR) ship classification. This survey comprehensively analyzes the diverse DL techniques employed in this domain. We identify critical trends and challenges, highlighting the importance of integrating handcrafted features, utilizing public datasets, data augmentation, fine-tuning, explainability techniques, and fostering interdisciplinary collaborations to improve DL model performance. This survey establishes a first-of-its-kind taxonomy for categorizing relevant research based on DL models, handcrafted feature use, SAR attribute utilization, and the impact of fine-tuning. We discuss the methodologies used in SAR ship classification tasks and the impact of different techniques. Finally, the survey explores potential avenues for future research, including addressing data scarcity, exploring novel DL architectures, incorporating interpretability techniques, and establishing standardized performance metrics. By addressing these challenges and leveraging advancements in DL, researchers can contribute to developing more accurate and efficient ship classification systems, ultimately enhancing maritime surveillance and related applications.

Title: LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama

Authors: Naome A. Etori, Kevin Lu, Randu Karisa, Arturs Kanepajs
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11911
Pdf URL: https://arxiv.org/pdf/2503.11911
Copy Paste: [[2503.11911]] LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama(https://arxiv.org/abs/2503.11911)
Keywords: robust, large language model
Abstract: As large language models (LLMs) rapidly advance, evaluating their performance is critical. LLMs are trained on multilingual data, but their reasoning abilities are mainly evaluated using English datasets. Hence, robust evaluation frameworks are needed using high-quality non-English datasets, especially low-resource languages (LRLs). This study evaluates eight state-of-the-art (SOTA) LLMs on Latvian and Giriama using a Massive Multitask Language Understanding (MMLU) subset curated with native speakers for linguistic and cultural relevance. Giriama is benchmarked for the first time. Our evaluation shows that OpenAI's o1 model outperforms others across all languages, scoring 92.8\% in English, 88.8\% in Latvian, and 70.8\% in Giriama on 0-shot tasks. Mistral-large (35.6\%) and Llama-70B IT (41\%) have weak performance, on both Latvian and Giriama. Our results underscore the need for localized benchmarks and human evaluations in advancing cultural AI contextualization.

Title: Implementation of classical client universal blind quantum computation with 8-state RSP in current architecture

Authors: Aman Gupta, Daniel Prasanth, Venkat Chandra Gunja
Subjects: cs.CR, quant-ph
Abstract URL: https://arxiv.org/abs/2503.11913
Pdf URL: https://arxiv.org/pdf/2503.11913
Copy Paste: [[2503.11913]] Implementation of classical client universal blind quantum computation with 8-state RSP in current architecture(https://arxiv.org/abs/2503.11913)
Keywords: secure, security
Abstract: The future of quantum computing architecture is most likely the one in which a large number of clients are either fully classical or have a very limited quantum capability while a very small number of servers having the capability to perform quantum computations and most quantum computational tasks are delegated to these quantum servers. In this architecture, it becomes very crucial that a classical/semi-classical client is able to keep the delegated data/ computation secure against eavesdroppers as well as the server itself, known as the blindness feature. In 2009, A. Broadbent et. al proposed a universal blind quantum computation (UBQC) protocol based on measurement-based quantum computation (MBQC) that enables a semi-classical client to delegate universal quantum computation to a quantum server, interactively and fetch the results while the computation itself remains blind to the server. In this work, we propose an implementation (with examples) of UBQC in the current quantum computing architecture, a fully classical client, a quantum server (IBM Quantum) and the computation does not proceed interactively (projective measurement basis is not decided by previous measurement outcome). We combined UBQC with the 8-state remote state preparation (RSP) protocol, to blindly prepare the initial cluster state, which is an initial resource state in UBQC protocol, to allow a completely classical client to perform delegated blind quantum computation. Such an implementation has already been shown to be secure in a game-based security setting, which is the weakest security model.

Title: A Framework for Evaluating Emerging Cyberattack Capabilities of AI

Authors: Mikel Rodriguez, Raluca Ada Popa, Four Flynn, Lihao Liang, Allan Dafoe, Anna Wang
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11917
Pdf URL: https://arxiv.org/pdf/2503.11917
Copy Paste: [[2503.11917]] A Framework for Evaluating Emerging Cyberattack Capabilities of AI(https://arxiv.org/abs/2503.11917)
Keywords: security, defense, attack
Abstract: As frontier models become more capable, the community has attempted to evaluate their ability to enable cyberattacks. Performing a comprehensive evaluation and prioritizing defenses are crucial tasks in preparing for AGI safely. However, current cyber evaluation efforts are ad-hoc, with no systematic reasoning about the various phases of attacks, and do not provide a steer on how to use targeted defenses. In this work, we propose a novel approach to AI cyber capability evaluation that (1) examines the end-to-end attack chain, (2) helps to identify gaps in the evaluation of AI threats, and (3) helps defenders prioritize targeted mitigations and conduct AI-enabled adversary emulation to support red teaming. To achieve these goals, we propose adapting existing cyberattack chain frameworks to AI systems. We analyze over 12,000 instances of real-world attempts to use AI in cyberattacks catalogued by Google's Threat Intelligence Group. Using this analysis, we curate a representative collection of seven cyberattack chain archetypes and conduct a bottleneck analysis to identify areas of potential AI-driven cost disruption. Our evaluation benchmark consists of 50 new challenges spanning different phases of cyberattacks. Based on this, we devise targeted cybersecurity model evaluations, report on the potential for AI to amplify offensive cyber capabilities across specific attack phases, and conclude with recommendations on prioritizing defenses. In all, we consider this to be the most comprehensive AI cyber risk evaluation framework published so far.

Title: Practical Implications of Implementing Local Differential Privacy for Smart grids

Authors: Khadija Hafeez, Mubashir Husain Rehmani, Sumita Mishra, Donna OShea
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.11920
Pdf URL: https://arxiv.org/pdf/2503.11920
Copy Paste: [[2503.11920]] Practical Implications of Implementing Local Differential Privacy for Smart grids(https://arxiv.org/abs/2503.11920)
Keywords: privacy, attack
Abstract: Recent smart grid advancements enable near-realtime reporting of electricity consumption, raising concerns about consumer privacy. Differential privacy (DP) has emerged as a viable privacy solution, where a calculated amount of noise is added to the data by a trusted third party, or individual users perturb their information locally, and only send the randomized data to an aggregator for analysis safeguarding users and aggregators privacy. However, the practical implementation of a Local DP-based (LDP) privacy model for smart grids has its own challenges. In this paper, we discuss the challenges of implementing an LDP-based model for smart grids. We compare existing LDP mechanisms in smart grids for privacy preservation of numerical data and discuss different methods for selecting privacy parameters in the existing literature, their limitations and the non-existence of an optimal method for selecting the privacy parameters. We also discuss the challenges of translating theoretical models of LDP into a practical setting for smart grids for different utility functions, the impact of the size of data set on privacy and accuracy, and vulnerability of LDP-based smart grids to manipulation attacks. Finally, we discuss future directions in research for better practical applications in LDP based models for smart grids.

Title: RePanda: Pandas-powered Tabular Verification and Reasoning

Authors: Atoosa Malemir Chegini, Keivan Rezaei, Hamid Eghbalzadeh, Soheil Feizi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.11921
Pdf URL: https://arxiv.org/pdf/2503.11921
Copy Paste: [[2503.11921]] RePanda: Pandas-powered Tabular Verification and Reasoning(https://arxiv.org/abs/2503.11921)
Keywords: robust
Abstract: Fact-checking tabular data is essential for ensuring the accuracy of structured information. However, existing methods often rely on black-box models with opaque reasoning. We introduce RePanda, a structured fact verification approach that translates claims into executable pandas queries, enabling interpretable and verifiable reasoning. To train RePanda, we construct PanTabFact, a structured dataset derived from the TabFact train set, where claims are paired with executable queries generated using DeepSeek-Chat and refined through automated error correction. Fine-tuning DeepSeek-coder-7B-instruct-v1.5 on PanTabFact, RePanda achieves 84.09% accuracy on the TabFact test set. To evaluate Out-of-Distribution (OOD) generalization, we interpret question-answer pairs from WikiTableQuestions as factual claims and refer to this dataset as WikiFact. Without additional fine-tuning, RePanda achieves 84.72% accuracy on WikiFact, significantly outperforming all other baselines and demonstrating strong OOD robustness. Notably, these results closely match the zero-shot performance of DeepSeek-Chat (671B), indicating that our fine-tuning approach effectively distills structured reasoning from a much larger model into a compact, locally executable 7B model. Beyond fact verification, RePanda extends to tabular question answering by generating executable queries that retrieve precise answers. To support this, we introduce PanWiki, a dataset mapping WikiTableQuestions to pandas queries. Fine-tuning on PanWiki, RePanda achieves 75.1% accuracy in direct answer retrieval. These results highlight the effectiveness of structured execution-based reasoning for tabular verification and question answering. We have publicly released the dataset on Hugging Face at datasets/AtoosaChegini/PanTabFact.

Title: REGEN: A Dataset and Benchmarks with Natural Language Critiques and Narratives

Authors: Kun Su, Krishna Sayana, Hubert Pham, James Pine, Yuri Vasilevski, Raghavendra Vasudeva, Marialena Kyriakidi, Liam Hebert, Ambarish Jash, Anushya Subbiah, Sukhdeep Sodhi
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.11924
Pdf URL: https://arxiv.org/pdf/2503.11924
Copy Paste: [[2503.11924]] REGEN: A Dataset and Benchmarks with Natural Language Critiques and Narratives(https://arxiv.org/abs/2503.11924)
Keywords: generative, large language model
Abstract: This paper introduces a novel dataset REGEN (Reviews Enhanced with GEnerative Narratives), designed to benchmark the conversational capabilities of recommender Large Language Models (LLMs), addressing the limitations of existing datasets that primarily focus on sequential item prediction. REGEN extends the Amazon Product Reviews dataset by inpainting two key natural language features: (1) user critiques, representing user "steering" queries that lead to the selection of a subsequent item, and (2) narratives, rich textual outputs associated with each recommended item taking into account prior context. The narratives include product endorsements, purchase explanations, and summaries of user preferences. Further, we establish an end-to-end modeling benchmark for the task of conversational recommendation, where models are trained to generate both recommendations and corresponding narratives conditioned on user history (items and critiques). For this joint task, we introduce a modeling framework LUMEN (LLM-based Unified Multi-task Model with Critiques, Recommendations, and Narratives) which uses an LLM as a backbone for critiquing, retrieval and generation. We also evaluate the dataset's quality using standard auto-rating techniques and benchmark it by training both traditional and LLM-based recommender models. Our results demonstrate that incorporating critiques enhances recommendation quality by enabling the recommender to learn language understanding and integrate it with recommendation signals. Furthermore, LLMs trained on our dataset effectively generate both recommendations and contextual narratives, achieving performance comparable to state-of-the-art recommenders and language models.

Title: Generating a Biometrically Unique and Realistic Iris Database

Authors: Jingxuan Zhang, Robert J. Hart, Ziqian Bi, Shiaofen Fang, Susan Walsh
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.11930
Pdf URL: https://arxiv.org/pdf/2503.11930
Copy Paste: [[2503.11930]] Generating a Biometrically Unique and Realistic Iris Database(https://arxiv.org/abs/2503.11930)
Keywords: security, privacy, attack, biometric, diffusion
Abstract: The use of the iris as a biometric identifier has increased dramatically over the last 30 years, prompting privacy and security concerns about the use of iris images in research. It can be difficult to acquire iris image databases due to ethical concerns, and this can be a barrier for those performing biometrics research. In this paper, we describe and show how to create a database of realistic, biometrically unidentifiable colored iris images by training a diffusion model within an open-source diffusion framework. Not only were we able to verify that our model is capable of creating iris textures that are biometrically unique from the training data, but we were also able to verify that our model output creates a full distribution of realistic iris pigmentations. We highlight the fact that the utility of diffusion networks to achieve these criteria with relative ease, warrants additional research in its use within the context of iris database generation and presentation attack security.

Title: Att-Adapter: A Robust and Precise Domain-Specific Multi-Attributes T2I Diffusion Adapter via Conditional Variational Autoencoder

Authors: Wonwoong Cho, Yan-Ying Chen, Matthew Klenk, David I. Inouye, Yanxia Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11937
Pdf URL: https://arxiv.org/pdf/2503.11937
Copy Paste: [[2503.11937]] Att-Adapter: A Robust and Precise Domain-Specific Multi-Attributes T2I Diffusion Adapter via Conditional Variational Autoencoder(https://arxiv.org/abs/2503.11937)
Keywords: robust, diffusion
Abstract: Text-to-Image (T2I) Diffusion Models have achieved remarkable performance in generating high quality images. However, enabling precise control of continuous attributes, especially multiple attributes simultaneously, in a new domain (e.g., numeric values like eye openness or car width) with text-only guidance remains a significant challenge. To address this, we introduce the Attribute (Att) Adapter, a novel plug-and-play module designed to enable fine-grained, multi-attributes control in pretrained diffusion models. Our approach learns a single control adapter from a set of sample images that can be unpaired and contain multiple visual attributes. The Att-Adapter leverages the decoupled cross attention module to naturally harmonize the multiple domain attributes with text conditioning. We further introduce Conditional Variational Autoencoder (CVAE) to the Att-Adapter to mitigate overfitting, matching the diverse nature of the visual world. Evaluations on two public datasets show that Att-Adapter outperforms all LoRA-based baselines in controlling continuous attributes. Additionally, our method enables a broader control range and also improves disentanglement across multiple attributes, surpassing StyleGAN-based techniques. Notably, Att-Adapter is flexible, requiring no paired synthetic data for training, and is easily scalable to multiple attributes within a single model.

Title: Your Text Encoder Can Be An Object-Level Watermarking Controller

Authors: Naresh Kumar Devulapally, Mingzhen Huang, Vishal Asnani, Shruti Agarwal, Siwei Lyu, Vishnu Suresh Lokhande
Subjects: cs.CV, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.11945
Pdf URL: https://arxiv.org/pdf/2503.11945
Copy Paste: [[2503.11945]] Your Text Encoder Can Be An Object-Level Watermarking Controller(https://arxiv.org/abs/2503.11945)
Keywords: protect, robust, watermark, diffusion
Abstract: Invisible watermarking of AI-generated images can help with copyright protection, enabling detection and identification of AI-generated media. In this work, we present a novel approach to watermark images of T2I Latent Diffusion Models (LDMs). By only fine-tuning text token embeddings $W_*$, we enable watermarking in selected objects or parts of the image, offering greater flexibility compared to traditional full-image watermarking. Our method leverages the text encoder's compatibility across various LDMs, allowing plug-and-play integration for different LDMs. Moreover, introducing the watermark early in the encoding stage improves robustness to adversarial perturbations in later stages of the pipeline. Our approach achieves $99\%$ bit accuracy ($48$ bits) with a $10^5 \times$ reduction in model parameters, enabling efficient watermarking.

Title: Integration of Explainable AI Techniques with Large Language Models for Enhanced Interpretability for Sentiment Analysis

Authors: Thivya Thogesan, Anupiya Nugaliyadde, Kok Wai Wong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11948
Pdf URL: https://arxiv.org/pdf/2503.11948
Copy Paste: [[2503.11948]] Integration of Explainable AI Techniques with Large Language Models for Enhanced Interpretability for Sentiment Analysis(https://arxiv.org/abs/2503.11948)
Keywords: interpretability, explainability, large language model
Abstract: Interpretability remains a key difficulty in sentiment analysis with Large Language Models (LLMs), particularly in high-stakes applications where it is crucial to comprehend the rationale behind forecasts. This research addressed this by introducing a technique that applies SHAP (Shapley Additive Explanations) by breaking down LLMs into components such as embedding layer,encoder,decoder and attention layer to provide a layer-by-layer knowledge of sentiment prediction. The approach offers a clearer overview of how model interpret and categorise sentiment by breaking down LLMs into these parts. The method is evaluated using the Stanford Sentiment Treebank (SST-2) dataset, which shows how different sentences affect different layers. The effectiveness of layer-wise SHAP analysis in clarifying sentiment-specific token attributions is demonstrated by experimental evaluations, which provide a notable enhancement over current whole-model explainability techniques. These results highlight how the suggested approach could improve the reliability and transparency of LLM-based sentiment analysis in crucial applications.

Title: SPOC: Spatially-Progressing Object State Change Segmentation in Video

Authors: Priyanka Mandikal, Tushar Nagarajan, Alex Stoken, Zihui Xue, Kristen Grauman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11953
Pdf URL: https://arxiv.org/pdf/2503.11953
Copy Paste: [[2503.11953]] SPOC: Spatially-Progressing Object State Change Segmentation in Video(https://arxiv.org/abs/2503.11953)
Keywords: segmentation
Abstract: Object state changes in video reveal critical information about human and agent activity. However, existing methods are limited to temporal localization of when the object is in its initial state (e.g., the unchopped avocado) versus when it has completed a state change (e.g., the chopped avocado), which limits applicability for any task requiring detailed information about the progress of the actions and its spatial localization. We propose to deepen the problem by introducing the spatially-progressing object state change segmentation task. The goal is to segment at the pixel-level those regions of an object that are actionable and those that are transformed. We introduce the first model to address this task, designing a VLM-based pseudo-labeling approach, state-change dynamics constraints, and a novel WhereToChange benchmark built on in-the-wild Internet videos. Experiments on two datasets validate both the challenge of the new task as well as the promise of our model for localizing exactly where and how fast objects are changing in video. We further demonstrate useful implications for tracking activity progress to benefit robotic agents. Project page: this https URL

Title: CHOrD: Generation of Collision-Free, House-Scale, and Organized Digital Twins for 3D Indoor Scenes with Controllable Floor Plans and Optimal Layouts

Authors: Chong Su, Yingbin Fu, Zheyuan Hu, Jing Yang, Param Hanji, Shaojun Wang, Xuan Zhao, Cengiz Öztireli, Fangcheng Zhong
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2503.11958
Pdf URL: https://arxiv.org/pdf/2503.11958
Copy Paste: [[2503.11958]] CHOrD: Generation of Collision-Free, House-Scale, and Organized Digital Twins for 3D Indoor Scenes with Controllable Floor Plans and Optimal Layouts(https://arxiv.org/abs/2503.11958)
Keywords: robust
Abstract: We introduce CHOrD, a novel framework for scalable synthesis of 3D indoor scenes, designed to create house-scale, collision-free, and hierarchically structured indoor digital twins. In contrast to existing methods that directly synthesize the scene layout as a scene graph or object list, CHOrD incorporates a 2D image-based intermediate layout representation, enabling effective prevention of collision artifacts by successfully capturing them as out-of-distribution (OOD) scenarios during generation. Furthermore, unlike existing methods, CHOrD is capable of generating scene layouts that adhere to complex floor plans with multi-modal controls, enabling the creation of coherent, house-wide layouts robust to both geometric and semantic variations in room structures. Additionally, we propose a novel dataset with expanded coverage of household items and room configurations, as well as significantly improved data quality. CHOrD demonstrates state-of-the-art performance on both the 3D-FRONT and our proposed datasets, delivering photorealistic, spatially coherent indoor scene synthesis adaptable to arbitrary floor plan variations.

Title: HInter: Exposing Hidden Intersectional Bias in Large Language Models

Authors: Badr Souani, Ezekiel Soremekun, Mike Papadakis, Setsuko Yokoyama, Sudipta Chattopadhyay, Yves Le Traon
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11962
Pdf URL: https://arxiv.org/pdf/2503.11962
Copy Paste: [[2503.11962]] HInter: Exposing Hidden Intersectional Bias in Large Language Models(https://arxiv.org/abs/2503.11962)
Keywords: large language model
Abstract: Large Language Models (LLMs) may portray discrimination towards certain individuals, especially those characterized by multiple attributes (aka intersectional bias). Discovering intersectional bias in LLMs is challenging, as it involves complex inputs on multiple attributes (e.g. race and gender). To address this challenge, we propose HInter, a test technique that synergistically combines mutation analysis, dependency parsing and metamorphic oracles to automatically detect intersectional bias in LLMs. HInter generates test inputs by systematically mutating sentences using multiple mutations, validates inputs via a dependency invariant and detects biases by checking the LLM response on the original and mutated sentences. We evaluate HInter using six LLM architectures and 18 LLM models (GPT3.5, Llama2, BERT, etc) and find that 14.61% of the inputs generated by HInter expose intersectional bias. Results also show that our dependency invariant reduces false positives (incorrect test inputs) by an order of magnitude. Finally, we observed that 16.62% of intersectional bias errors are hidden, meaning that their corresponding atomic cases do not trigger biases. Overall, this work emphasize the importance of testing LLMs for intersectional bias.

Title: Effective and Efficient Cross-City Traffic Knowledge Transfer A Privacy-Preserving Perspective

Authors: Zhihao Zeng, Ziquan Fang, Yuting Huang, Lu Chen, Yunjun Gao
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2503.11963
Pdf URL: https://arxiv.org/pdf/2503.11963
Copy Paste: [[2503.11963]] Effective and Efficient Cross-City Traffic Knowledge Transfer A Privacy-Preserving Perspective(https://arxiv.org/abs/2503.11963)
Keywords: secure, privacy, protect, robust, federate
Abstract: Traffic prediction targets forecasting future traffic conditions using historical traffic data, serving a critical role in urban computing and transportation management. To mitigate the scarcity of traffic data while maintaining data privacy, numerous Federated Traffic Knowledge Transfer (FTT) approaches have been developed, which use transfer learning and federated learning to transfer traffic knowledge from data-rich cities to data-scarce cities, enhancing traffic prediction capabilities for the latter. However, current FTT approaches face challenges such as privacy leakage, cross-city data distribution discrepancies, low data quality, and inefficient knowledge transfer, limiting their privacy protection, effectiveness, robustness, and efficiency in real-world applications. To this end, we propose FedTT, an effective, efficient, and privacy-aware cross-city traffic knowledge transfer framework that transforms the traffic data domain from the data-rich cities and trains traffic models using the transformed data for the data-scarce cities. First, to safeguard data privacy, we propose a traffic secret transmission method that securely transmits and aggregates traffic domain-transformed data from source cities using a lightweight secret aggregation approach. Second, to mitigate the impact of traffic data distribution discrepancies on model performance, we introduce a traffic domain adapter to uniformly transform traffic data from the source cities' domains to that of the target city. Third, to improve traffic data quality, we design a traffic view imputation method to fill in and predict missing traffic data. Finally, to enhance transfer efficiency, FedTT is equipped with a federated parallel training method that enables the simultaneous training of multiple modules. Extensive experiments using 4 real-life datasets demonstrate that FedTT outperforms the 14 state-of-the-art baselines.

Title: Entropy-regularized Gradient Estimators for Approximate Bayesian Inference

Authors: Jasmeet Kaur
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.11964
Pdf URL: https://arxiv.org/pdf/2503.11964
Copy Paste: [[2503.11964]] Entropy-regularized Gradient Estimators for Approximate Bayesian Inference(https://arxiv.org/abs/2503.11964)
Keywords: robust
Abstract: Effective uncertainty quantification is important for training modern predictive models with limited data, enhancing both accuracy and robustness. While Bayesian methods are effective for this purpose, they can be challenging to scale. When employing approximate Bayesian inference, ensuring the quality of samples from the posterior distribution in a computationally efficient manner is essential. This paper addresses the estimation of the Bayesian posterior to generate diverse samples by approximating the gradient flow of the Kullback-Leibler (KL) divergence and the cross entropy of the target approximation under the metric induced by the Stein Operator. It presents empirical evaluations on classification tasks to assess the method's performance and discuss its effectiveness for Model-Based Reinforcement Learning that uses uncertainty-aware network dynamics models.

Title: Revisiting Gradient Descent: A Dual-Weight Method for Improved Learning

Authors: Xi Wang, Hideaki Shimazaki
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.11965
Pdf URL: https://arxiv.org/pdf/2503.11965
Copy Paste: [[2503.11965]] Revisiting Gradient Descent: A Dual-Weight Method for Improved Learning(https://arxiv.org/abs/2503.11965)
Keywords: robust
Abstract: We introduce a novel framework for learning in neural networks by decomposing each neuron's weight vector into two distinct parts, $W_1$ and $W_2$, thereby modeling contrastive information directly at the neuron level. Traditional gradient descent stores both positive (target) and negative (non-target) feature information in a single weight vector, often obscuring fine-grained distinctions. Our approach, by contrast, maintains separate updates for target and non-target features, ultimately forming a single effective weight $W = W_1 - W_2$ that is more robust to noise and class imbalance. Experimental results on both regression (California Housing, Wine Quality) and classification (MNIST, Fashion-MNIST, CIFAR-10) tasks suggest that this decomposition enhances generalization and resists overfitting, especially when training data are sparse or noisy. Crucially, the inference complexity remains the same as in the standard $WX + \text{bias}$ setup, offering a practical solution for improved learning without additional inference-time overhead.

Title: Evaluation of Intra-operative Patient-specific Methods for Point Cloud Completion for Minimally Invasive Liver Interventions

Authors: Nakul Poudel, Zixin Yang, Kelly Merrell, Richard Simon, Cristian A. Linte
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11969
Pdf URL: https://arxiv.org/pdf/2503.11969
Copy Paste: [[2503.11969]] Evaluation of Intra-operative Patient-specific Methods for Point Cloud Completion for Minimally Invasive Liver Interventions(https://arxiv.org/abs/2503.11969)
Keywords: robust, transformer
Abstract: The registration between the pre-operative model and the intra-operative surface is crucial in image-guided liver surgery, as it facilitates the effective use of pre-operative information during the procedure. However, the intra-operative surface, usually represented as a point cloud, often has limited coverage, especially in laparoscopic surgery, and is prone to holes and noise, posing significant challenges for registration methods. Point cloud completion methods have the potential to alleviate these issues. Thus, we explore six state-of-the-art point cloud completion methods to identify the optimal completion method for liver surgery applications. We focus on a patient-specific approach for liver point cloud completion from a partial liver surface under three cases: canonical pose, non-canonical pose, and canonical pose with noise. The transformer-based method, AdaPoinTr, outperforms all other methods to generate a complete point cloud from the given partial liver point cloud under the canonical pose. On the other hand, our findings reveal substantial performance degradation of these methods under non-canonical poses and noisy settings, highlighting the limitations of these methods, which suggests the need for a robust point completion method for its application in image-guided liver surgery.

Title: No LLM is Free From Bias: A Comprehensive Study of Bias Evaluation in Large Language models

Authors: Charaka Vinayak Kumar, Ashok Urlana, Gopichand Kanumolu, Bala Mallikarjunarao Garlapati, Pruthwik Mishra
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11985
Pdf URL: https://arxiv.org/pdf/2503.11985
Copy Paste: [[2503.11985]] No LLM is Free From Bias: A Comprehensive Study of Bias Evaluation in Large Language models(https://arxiv.org/abs/2503.11985)
Keywords: large language model
Abstract: Advancements in Large Language Models (LLMs) have increased the performance of different natural language understanding as well as generation tasks. Although LLMs have breached the state-of-the-art performance in various tasks, they often reflect different forms of bias present in the training data. In the light of this perceived limitation, we provide a unified evaluation of benchmarks using a set of representative LLMs that cover different forms of biases starting from physical characteristics to socio-economic categories. Moreover, we propose five prompting approaches to carry out the bias detection task across different aspects of bias. Further, we formulate three research questions to gain valuable insight in detecting biases in LLMs using different approaches and evaluation metrics across benchmarks. The results indicate that each of the selected LLMs suffer from one or the other form of bias with the LLaMA3.1-8B model being the least biased. Finally, we conclude the paper with the identification of key challenges and possible future directions.

Title: Applications of Large Language Model Reasoning in Feature Generation

Authors: Dharani Chandra
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11989
Pdf URL: https://arxiv.org/pdf/2503.11989
Copy Paste: [[2503.11989]] Applications of Large Language Model Reasoning in Feature Generation(https://arxiv.org/abs/2503.11989)
Keywords: extraction, large language model
Abstract: Large Language Models (LLMs) have revolutionized natural language processing through their state of art reasoning capabilities. This paper explores the convergence of LLM reasoning techniques and feature generation for machine learning tasks. We examine four key reasoning approaches: Chain of Thought, Tree of Thoughts, Retrieval-Augmented Generation, and Thought Space Exploration. Our analysis reveals how these approaches can be used to identify effective feature generation rules without having to manually specify search spaces. The paper categorizes LLM-based feature generation methods across various domains including finance, healthcare, and text analytics. LLMs can extract key information from clinical notes and radiology reports in healthcare, by enabling more efficient data utilization. In finance, LLMs facilitate text generation, summarization, and entity extraction from complex documents. We analyze evaluation methodologies for assessing feature quality and downstream performance, with particular attention to OCTree's decision tree reasoning approach that provides language-based feedback for iterative improvements. Current challenges include hallucination, computational efficiency, and domain adaptation. As of March 2025, emerging approaches include inference-time compute scaling, reinforcement learning, and supervised fine-tuning with model distillation. Future directions point toward multimodal feature generation, self-improving systems, and neuro-symbolic approaches. This paper provides a detailed overview of an emerging field that promises to automate and enhance feature engineering through language model reasoning.

Title: Fraesormer: Learning Adaptive Sparse Transformer for Efficient Food Recognition

Authors: Shun Zou, Yi Zou, Mingya Zhang, Shipeng Luo, Zhihao Chen, Guangwei Gao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11995
Pdf URL: https://arxiv.org/pdf/2503.11995
Copy Paste: [[2503.11995]] Fraesormer: Learning Adaptive Sparse Transformer for Efficient Food Recognition(https://arxiv.org/abs/2503.11995)
Keywords: transformer
Abstract: In recent years, Transformer has witnessed significant progress in food recognition. However, most existing approaches still face two critical challenges in lightweight food recognition: (1) the quadratic complexity and redundant feature representation from interactions with irrelevant tokens; (2) static feature recognition and single-scale representation, which overlook the unstructured, non-fixed nature of food images and the need for multi-scale features. To address these, we propose an adaptive and efficient sparse Transformer architecture (Fraesormer) with two core designs: Adaptive Top-k Sparse Partial Attention (ATK-SPA) and Hierarchical Scale-Sensitive Feature Gating Network (HSSFGN). ATK-SPA uses a learnable Gated Dynamic Top-K Operator (GDTKO) to retain critical attention scores, filtering low query-key matches that hinder feature aggregation. It also introduces a partial channel mechanism to reduce redundancy and promote expert information flow, enabling local-global collaborative modeling. HSSFGN employs gating mechanism to achieve multi-scale feature representation, enhancing contextual semantic information. Extensive experiments show that Fraesormer outperforms state-of-the-art methods. code is available at this https URL.

Title: 3D Gaussian Splatting against Moving Objects for High-Fidelity Street Scene Reconstruction

Authors: Peizhen Zheng, Longfei Wei, Dongjing Jiang, Jianfei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12001
Pdf URL: https://arxiv.org/pdf/2503.12001
Copy Paste: [[2503.12001]] 3D Gaussian Splatting against Moving Objects for High-Fidelity Street Scene Reconstruction(https://arxiv.org/abs/2503.12001)
Keywords: robust
Abstract: The accurate reconstruction of dynamic street scenes is critical for applications in autonomous driving, augmented reality, and virtual reality. Traditional methods relying on dense point clouds and triangular meshes struggle with moving objects, occlusions, and real-time processing constraints, limiting their effectiveness in complex urban environments. While multi-view stereo and neural radiance fields have advanced 3D reconstruction, they face challenges in computational efficiency and handling scene dynamics. This paper proposes a novel 3D Gaussian point distribution method for dynamic street scene reconstruction. Our approach introduces an adaptive transparency mechanism that eliminates moving objects while preserving high-fidelity static scene details. Additionally, iterative refinement of Gaussian point distribution enhances geometric accuracy and texture representation. We integrate directional encoding with spatial position optimization to optimize storage and rendering efficiency, reducing redundancy while maintaining scene integrity. Experimental results demonstrate that our method achieves high reconstruction quality, improved rendering performance, and adaptability in large-scale dynamic environments. These contributions establish a robust framework for real-time, high-precision 3D reconstruction, advancing the practicality of dynamic scene modeling across multiple applications. The source code for this work is available to the public at this https URL

Title: ROS-SAM: High-Quality Interactive Segmentation for Remote Sensing Moving Object

Authors: Zhe Shan, Yang Liu, Lei Zhou, Cheng Yan, Heng Wang, Xia Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12006
Pdf URL: https://arxiv.org/pdf/2503.12006
Copy Paste: [[2503.12006]] ROS-SAM: High-Quality Interactive Segmentation for Remote Sensing Moving Object(https://arxiv.org/abs/2503.12006)
Keywords: segmentation
Abstract: The availability of large-scale remote sensing video data underscores the importance of high-quality interactive segmentation. However, challenges such as small object sizes, ambiguous features, and limited generalization make it difficult for current methods to achieve this goal. In this work, we propose ROS-SAM, a method designed to achieve high-quality interactive segmentation while preserving generalization across diverse remote sensing data. The ROS-SAM is built upon three key innovations: 1) LoRA-based fine-tuning, which enables efficient domain adaptation while maintaining SAM's generalization ability, 2) Enhancement of deep network layers to improve the discriminability of extracted features, thereby reducing misclassifications, and 3) Integration of global context with local boundary details in the mask decoder to generate high-quality segmentation masks. Additionally, we design the data pipeline to ensure the model learns to better handle objects at varying scales during training while focusing on high-quality predictions during inference. Experiments on remote sensing video datasets show that the redesigned data pipeline boosts the IoU by 6%, while ROS-SAM increases the IoU by 13%. Finally, when evaluated on existing remote sensing object tracking datasets, ROS-SAM demonstrates impressive zero-shot capabilities, generating masks that closely resemble manual annotations. These results confirm ROS-SAM as a powerful tool for fine-grained segmentation in remote sensing applications. Code is available at this https URL.

Title: Winning the MIDST Challenge: New Membership Inference Attacks on Diffusion Models for Tabular Data Synthesis

Authors: Xiaoyu Wu, Yifei Pang, Terrance Liu, Steven Wu
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2503.12008
Pdf URL: https://arxiv.org/pdf/2503.12008
Copy Paste: [[2503.12008]] Winning the MIDST Challenge: New Membership Inference Attacks on Diffusion Models for Tabular Data Synthesis(https://arxiv.org/abs/2503.12008)
Keywords: privacy, attack, membership infer, diffusion
Abstract: Tabular data synthesis using diffusion models has gained significant attention for its potential to balance data utility and privacy. However, existing privacy evaluations often rely on heuristic metrics or weak membership inference attacks (MIA), leaving privacy risks inadequately assessed. In this work, we conduct a rigorous MIA study on diffusion-based tabular synthesis, revealing that state-of-the-art attacks designed for image models fail in this setting. We identify noise initialization as a key factor influencing attack efficacy and propose a machine-learning-driven approach that leverages loss features across different noises and time steps. Our method, implemented with a lightweight MLP, effectively learns membership signals, eliminating the need for manual optimization. Experimental results from the MIDST Challenge @ SaTML 2025 demonstrate the effectiveness of our approach, securing first place across all tracks. Code is available at this https URL.

Title: UniMamba: Unified Spatial-Channel Representation Learning with Group-Efficient Mamba for LiDAR-based 3D Object Detection

Authors: Xin Jin, Haisheng Su, Kai Liu, Cong Ma, Wei Wu, Fei Hui, Junchi Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12009
Pdf URL: https://arxiv.org/pdf/2503.12009
Copy Paste: [[2503.12009]] UniMamba: Unified Spatial-Channel Representation Learning with Group-Efficient Mamba for LiDAR-based 3D Object Detection(https://arxiv.org/abs/2503.12009)
Keywords: transformer
Abstract: Recent advances in LiDAR 3D detection have demonstrated the effectiveness of Transformer-based frameworks in capturing the global dependencies from point cloud spaces, which serialize the 3D voxels into the flattened 1D sequence for iterative self-attention. However, the spatial structure of 3D voxels will be inevitably destroyed during the serialization process. Besides, due to the considerable number of 3D voxels and quadratic complexity of Transformers, multiple sequences are grouped before feeding to Transformers, leading to a limited receptive field. Inspired by the impressive performance of State Space Models (SSM) achieved in the field of 2D vision tasks, in this paper, we propose a novel Unified Mamba (UniMamba), which seamlessly integrates the merits of 3D convolution and SSM in a concise multi-head manner, aiming to perform "local and global" spatial context aggregation efficiently and simultaneously. Specifically, a UniMamba block is designed which mainly consists of spatial locality modeling, complementary Z-order serialization and local-global sequential aggregator. The spatial locality modeling module integrates 3D submanifold convolution to capture the dynamic spatial position embedding before serialization. Then the efficient Z-order curve is adopted for serialization both horizontally and vertically. Furthermore, the local-global sequential aggregator adopts the channel grouping strategy to efficiently encode both "local and global" spatial inter-dependencies using multi-head SSM. Additionally, an encoder-decoder architecture with stacked UniMamba blocks is formed to facilitate multi-scale spatial learning hierarchically. Extensive experiments are conducted on three popular datasets: nuScenes, Waymo and Argoverse 2. Particularly, our UniMamba achieves 70.2 mAP on the nuScenes dataset.

Title: Mixed-feature Logistic Regression Robust to Distribution Shifts

Authors: Qingshi Sun, Nathan Justin, Andres Gomez, Phebe Vayanos
Subjects: cs.LG, math.OC, stat.ML
Abstract URL: https://arxiv.org/abs/2503.12012
Pdf URL: https://arxiv.org/pdf/2503.12012
Copy Paste: [[2503.12012]] Mixed-feature Logistic Regression Robust to Distribution Shifts(https://arxiv.org/abs/2503.12012)
Keywords: robust, interpretability
Abstract: Logistic regression models are widely used in the social and behavioral sciences and in high-stakes domains, due to their simplicity and interpretability properties. At the same time, such domains are permeated by distribution shifts, where the distribution generating the data changes between training and deployment. In this paper, we study a distributionally robust logistic regression problem that seeks the model that will perform best against adversarial realizations of the data distribution drawn from a suitably constructed Wasserstein ambiguity set. Our model and solution approach differ from prior work in that we can capture settings where the likelihood of distribution shifts can vary across features, significantly broadening the applicability of our model relative to the state-of-the-art. We propose a graph-based solution approach that can be integrated into off-the-shelf optimization solvers. We evaluate the performance of our model and algorithms on numerous publicly available datasets. Our solution achieves a 408x speed-up relative to the state-of-the-art. Additionally, compared to the state-of-the-art, our model reduces average calibration error by up to 36.19% and worst-case calibration error by up to 41.70%, while increasing the average area under the ROC curve (AUC) by up to 18.02% and worst-case AUC by up to 48.37%.

Title: QDM: Quadtree-Based Region-Adaptive Sparse Diffusion Models for Efficient Image Super-Resolution

Authors: Donglin Yang, Paul Vicol, Xiaojuan Qi, Renjie Liao, Xiaofan Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12015
Pdf URL: https://arxiv.org/pdf/2503.12015
Copy Paste: [[2503.12015]] QDM: Quadtree-Based Region-Adaptive Sparse Diffusion Models for Efficient Image Super-Resolution(https://arxiv.org/abs/2503.12015)
Keywords: diffusion
Abstract: Deep learning-based super-resolution (SR) methods often perform pixel-wise computations uniformly across entire images, even in homogeneous regions where high-resolution refinement is redundant. We propose the Quadtree Diffusion Model (QDM), a region-adaptive diffusion framework that leverages a quadtree structure to selectively enhance detail-rich regions while reducing computations in homogeneous areas. By guiding the diffusion with a quadtree derived from the low-quality input, QDM identifies key regions-represented by leaf nodes-where fine detail is essential and applies minimal refinement elsewhere. This mask-guided, two-stream architecture adaptively balances quality and efficiency, producing high-fidelity outputs with low computational redundancy. Experiments demonstrate QDM's effectiveness in high-resolution SR tasks across diverse image types, particularly in medical imaging (e.g., CT scans), where large homogeneous regions are prevalent. Furthermore, QDM outperforms or is comparable to state-of-the-art SR methods on standard benchmarks while significantly reducing computational costs, highlighting its efficiency and suitability for resource-limited environments. Our code is available at this https URL.

Title: A Survey on Federated Fine-tuning of Large Language Models

Authors: Yebo Wu, Chunlin Tian, Jingguang Li, He Sun, Kahou Tam, Li Li, Chengzhong Xu
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2503.12016
Pdf URL: https://arxiv.org/pdf/2503.12016
Copy Paste: [[2503.12016]] A Survey on Federated Fine-tuning of Large Language Models(https://arxiv.org/abs/2503.12016)
Keywords: privacy, federate, large language model
Abstract: Large Language Models (LLMs) have achieved remarkable success across a wide range of tasks, with fine-tuning playing a pivotal role in adapting them to specific downstream applications. Federated Learning (FL) offers a promising approach that enables collaborative model adaptation while ensuring data privacy, i.e., FedLLM. In this survey, we provide a systematic and thorough review of the integration of LLMs with FL. Specifically, we first trace the historical evolution of both LLMs and FL, while summarizing relevant prior surveys. We then present an in-depth analysis of the fundamental challenges encountered in deploying FedLLM. Following this, we conduct an extensive study of existing parameter-efficient fine-tuning (PEFT) methods and explore their applicability in FL. Furthermore, we introduce a comprehensive evaluation benchmark to rigorously assess FedLLM performance and discuss its diverse real-world applications across multiple domains. Finally, we identify critical open challenges and outline promising research directions to drive future advancements in FedLLM. We maintain an active \href{this https URL}{GitHub repository} tracking cutting-edge advancements. This survey serves as a foundational resource for researchers and practitioners, offering insights into the evolving landscape of federated fine-tuning for LLMs while guiding future innovations in privacy-preserving AI.

Title: Compose Your Aesthetics: Empowering Text-to-Image Models with the Principles of Art

Authors: Zhe Jin, Tat-Seng Chua
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12018
Pdf URL: https://arxiv.org/pdf/2503.12018
Copy Paste: [[2503.12018]] Compose Your Aesthetics: Empowering Text-to-Image Models with the Principles of Art(https://arxiv.org/abs/2503.12018)
Keywords: diffusion
Abstract: Text-to-Image (T2I) diffusion models (DM) have garnered widespread adoption due to their capability in generating high-fidelity outputs and accessibility to anyone able to put imagination into words. However, DMs are often predisposed to generate unappealing outputs, much like the random images on the internet they were trained on. Existing approaches to address this are founded on the implicit premise that visual aesthetics is universal, which is limiting. Aesthetics in the T2I context should be about personalization and we propose the novel task of aesthetics alignment which seeks to align user-specified aesthetics with the T2I generation output. Inspired by how artworks provide an invaluable perspective to approach aesthetics, we codify visual aesthetics using the compositional framework artists employ, known as the Principles of Art (PoA). To facilitate this study, we introduce CompArt, a large-scale compositional art dataset building on top of WikiArt with PoA analysis annotated by a capable Multimodal LLM. Leveraging the expressive power of LLMs and training a lightweight and transferrable adapter, we demonstrate that T2I DMs can effectively offer 10 compositional controls through user-specified PoA conditions. Additionally, we design an appropriate evaluation framework to assess the efficacy of our approach.

Title: Leveraging Motion Information for Better Self-Supervised Video Correspondence Learning

Authors: Zihan Zhoua, Changrui Daia, Aibo Songa, Xiaolin Fang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12026
Pdf URL: https://arxiv.org/pdf/2503.12026
Copy Paste: [[2503.12026]] Leveraging Motion Information for Better Self-Supervised Video Correspondence Learning(https://arxiv.org/abs/2503.12026)
Keywords: segmentation
Abstract: Self-supervised video correspondence learning depends on the ability to accurately associate pixels between video frames that correspond to the same visual object. However, achieving reliable pixel matching without supervision remains a major challenge. To address this issue, recent research has focused on feature learning techniques that aim to encode unique pixel representations for matching. Despite these advances, existing methods still struggle to achieve exact pixel correspondences and often suffer from false matches, limiting their effectiveness in self-supervised settings. To this end, we explore an efficient self-supervised Video Correspondence Learning framework (MER) that aims to accurately extract object details from unlabeled videos. First, we design a dedicated Motion Enhancement Engine that emphasizes capturing the dynamic motion of objects in videos. In addition, we introduce a flexible sampling strategy for inter-pixel correspondence information (Multi-Cluster Sampler) that enables the model to pay more attention to the pixel changes of important objects in motion. Through experiments, our algorithm outperforms the state-of-the-art competitors on video correspondence learning tasks such as video object segmentation and video object keypoint tracking.

Title: Real-Time Manipulation Action Recognition with a Factorized Graph Sequence Encoder

Authors: Enes Erdogan, Eren Erdal Aksoy, Sanem Sariel
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12034
Pdf URL: https://arxiv.org/pdf/2503.12034
Copy Paste: [[2503.12034]] Real-Time Manipulation Action Recognition with a Factorized Graph Sequence Encoder(https://arxiv.org/abs/2503.12034)
Keywords: extraction
Abstract: Recognition of human manipulation actions in real-time is essential for safe and effective human-robot interaction and collaboration. The challenge lies in developing a model that is both lightweight enough for real-time execution and capable of generalization. While some existing methods in the literature can run in real-time, they struggle with temporal scalability, i.e., they fail to adapt to long-duration manipulations effectively. To address this, leveraging the generalizable scene graph representations, we propose a new Factorized Graph Sequence Encoder network that not only runs in real-time but also scales effectively in the temporal dimension, thanks to its factorized encoder architecture. Additionally, we introduce Hand Pooling operation, a simple pooling operation for more focused extraction of the graph-level embeddings. Our model outperforms the previous state-of-the-art real-time approach, achieving a 14.3\% and 5.6\% improvement in F1-macro score on the KIT Bimanual Action (Bimacs) Dataset and Collaborative Action (CoAx) Dataset, respectively. Moreover, we conduct an extensive ablation study to validate our network design choices. Finally, we compare our model with its architecturally similar RGB-based model on the Bimacs dataset and show the limitations of this model in contrast to ours on such an object-centric manipulation dataset.

Title: PSGait: Multimodal Gait Recognition using Parsing Skeleton

Authors: Hangrui Xu, Chuanrui Zhang, Zhengxian Wu, Peng Jiao, Haoqian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12047
Pdf URL: https://arxiv.org/pdf/2503.12047
Copy Paste: [[2503.12047]] PSGait: Multimodal Gait Recognition using Parsing Skeleton(https://arxiv.org/abs/2503.12047)
Keywords: robust, biometric
Abstract: Gait recognition has emerged as a robust biometric modality due to its non-intrusive nature and resilience to occlusion. Conventional gait recognition methods typically rely on silhouettes or skeletons. Despite their success in gait recognition for controlled laboratory environments, they usually fail in real-world scenarios due to their limited information entropy for gait representations. To achieve accurate gait recognition in the wild, we propose a novel gait representation, named Parsing Skeleton. This representation innovatively introduces the skeleton-guided human parsing method to capture fine-grained body dynamics, so they have much higher information entropy to encode the shapes and dynamics of fine-grained human parts during walking. Moreover, to effectively explore the capability of the parsing skeleton representation, we propose a novel parsing skeleton-based gait recognition framework, named PSGait, which takes parsing skeletons and silhouettes as input. By fusing these two modalities, the resulting image sequences are fed into gait recognition models for enhanced individual differentiation. We conduct comprehensive benchmarks on various datasets to evaluate our model. PSGait outperforms existing state-of-the-art multimodal methods. Furthermore, as a plug-and-play method, PSGait leads to a maximum improvement of 10.9% in Rank-1 accuracy across various gait recognition models. These results demonstrate the effectiveness and versatility of parsing skeletons for gait recognition in the wild, establishing PSGait as a new state-of-the-art approach for multimodal gait recognition.

Title: TACO: Taming Diffusion for in-the-wild Video Amodal Completion

Authors: Ruijie Lu, Yixin Chen, Yu Liu, Jiaxiang Tang, Junfeng Ni, Diwen Wan, Gang Zeng, Siyuan Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12049
Pdf URL: https://arxiv.org/pdf/2503.12049
Copy Paste: [[2503.12049]] TACO: Taming Diffusion for in-the-wild Video Amodal Completion(https://arxiv.org/abs/2503.12049)
Keywords: robust, diffusion
Abstract: Humans can infer complete shapes and appearances of objects from limited visual cues, relying on extensive prior knowledge of the physical world. However, completing partially observable objects while ensuring consistency across video frames remains challenging for existing models, especially for unstructured, in-the-wild videos. This paper tackles the task of Video Amodal Completion (VAC), which aims to generate the complete object consistently throughout the video given a visual prompt specifying the object of interest. Leveraging the rich, consistent manifolds learned by pre-trained video diffusion models, we propose a conditional diffusion model, TACO, that repurposes these manifolds for VAC. To enable its effective and robust generalization to challenging in-the-wild scenarios, we curate a large-scale synthetic dataset with multiple difficulty levels by systematically imposing occlusions onto un-occluded videos. Building on this, we devise a progressive fine-tuning paradigm that starts with simpler recovery tasks and gradually advances to more complex ones. We demonstrate TACO's versatility on a wide range of in-the-wild videos from Internet, as well as on diverse, unseen datasets commonly used in autonomous driving, robotic manipulation, and scene understanding. Moreover, we show that TACO can be effectively applied to various downstream tasks like object reconstruction and pose estimation, highlighting its potential to facilitate physical world understanding and reasoning. Our project page is available at this https URL.

Title: TLUE: A Tibetan Language Understanding Evaluation Benchmark

Authors: Fan Gao, Cheng Huang, Nyima Tashi, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Xiao Feng, Yongbin Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12051
Pdf URL: https://arxiv.org/pdf/2503.12051
Copy Paste: [[2503.12051]] TLUE: A Tibetan Language Understanding Evaluation Benchmark(https://arxiv.org/abs/2503.12051)
Keywords: large language model
Abstract: Large language models (LLMs) have made tremendous progress in recent years, but low-resource languages, such as Tibetan, remain significantly underrepresented in their evaluation. Despite Tibetan being spoken by over seven million people, it has largely been neglected in the development and assessment of LLMs. To address this gap, we present TLUE (A Tibetan Language Understanding Evaluation Benchmark), the first large-scale benchmark for assessing LLMs' capabilities in Tibetan. TLUE comprises two major components: (1) a comprehensive multi-task understanding benchmark spanning 5 domains and 67 subdomains, and (2) a safety benchmark covering 7 subdomains. We evaluate a diverse set of state-of-the-art LLMs. Experimental results demonstrate that most LLMs perform below the random baseline, highlighting the considerable challenges LLMs face in processing Tibetan, a low-resource language. TLUE provides an essential foundation for driving future research and progress in Tibetan language understanding and underscores the need for greater inclusivity in LLM development.

Title: Tailor: An Integrated Text-Driven CG-Ready Human and Garment Generation System

Authors: Zhiyao Sun, Yu-Hui Wen, Matthieu Lin, Ho-Jui Fang, Sheng Ye, Tian Lv, Yong-Jin Liu
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2503.12052
Pdf URL: https://arxiv.org/pdf/2503.12052
Copy Paste: [[2503.12052]] Tailor: An Integrated Text-Driven CG-Ready Human and Garment Generation System(https://arxiv.org/abs/2503.12052)
Keywords: diffusion, generative, large language model
Abstract: Creating detailed 3D human avatars with garments typically requires specialized expertise and labor-intensive processes. Although recent advances in generative AI have enabled text-to-3D human/clothing generation, current methods fall short in offering accessible, integrated pipelines for producing ready-to-use clothed avatars. To solve this, we introduce Tailor, an integrated text-to-avatar system that generates high-fidelity, customizable 3D humans with simulation-ready garments. Our system includes a three-stage pipeline. We first employ a large language model to interpret textual descriptions into parameterized body shapes and semantically matched garment templates. Next, we develop topology-preserving deformation with novel geometric losses to adapt garments precisely to body geometries. Furthermore, an enhanced texture diffusion module with a symmetric local attention mechanism ensures both view consistency and photorealistic details. Quantitative and qualitative evaluations demonstrate that Tailor outperforms existing SoTA methods in terms of fidelity, usability, and diversity. Code will be available for academic use.

Title: Revisiting Training-Inference Trigger Intensity in Backdoor Attacks

Authors: Chenhao Lin, Chenyang Zhao, Shiwei Wang, Longtian Wang, Chao Shen, Zhengyu Zhao
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12058
Pdf URL: https://arxiv.org/pdf/2503.12058
Copy Paste: [[2503.12058]] Revisiting Training-Inference Trigger Intensity in Backdoor Attacks(https://arxiv.org/abs/2503.12058)
Keywords: security, defense, attack, steal
Abstract: Backdoor attacks typically place a specific trigger on certain training data, such that the model makes prediction errors on inputs with that trigger during inference. Despite the core role of the trigger, existing studies have commonly believed a perfect match between training-inference triggers is optimal. In this paper, for the first time, we systematically explore the training-inference trigger relation, particularly focusing on their mismatch, based on a Training-Inference Trigger Intensity Manipulation (TITIM) workflow. TITIM specifically investigates the training-inference trigger intensity, such as the size or the opacity of a trigger, and reveals new insights into trigger generalization and overfitting. These new insights challenge the above common belief by demonstrating that the training-inference trigger mismatch can facilitate attacks in two practical scenarios, posing more significant security threats than previously thought. First, when the inference trigger is fixed, using training triggers with mixed intensities leads to stronger attacks than using any single intensity. For example, on CIFAR-10 with ResNet-18, mixing training triggers with 1.0 and 0.1 opacities improves the worst-case attack success rate (ASR) (over different testing opacities) of the best single-opacity attack from 10.61\% to 92.77\%. Second, intentionally using certain mismatched training-inference triggers can improve the attack stealthiness, i.e., better bypassing defenses. For example, compared to the training/inference intensity of 1.0/1.0, using 1.0/0.7 decreases the area under the curve (AUC) of the Scale-Up defense from 0.96 to 0.62, while maintaining a high attack ASR (99.65\% vs. 91.62\%). The above new insights are validated to be generalizable across different backdoor attacks, models, datasets, tasks, and (digital/physical) domains.

Title: A Comprehensive Survey on Knowledge Distillation

Authors: Amir M. Mansourian, Rozhan Ahmadi, Masoud Ghafouri, Amir Mohammad Babaei, Elaheh Badali Golezani, Zeynab Yasamani Ghamchi, Vida Ramezanian, Alireza Taherian, Kimia Dinashi, Amirali Miri, Shohreh Kasaei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12067
Pdf URL: https://arxiv.org/pdf/2503.12067
Copy Paste: [[2503.12067]] A Comprehensive Survey on Knowledge Distillation(https://arxiv.org/abs/2503.12067)
Keywords: diffusion, transformer, large language model
Abstract: Deep Neural Networks (DNNs) have achieved notable performance in the fields of computer vision and natural language processing with various applications in both academia and industry. However, with recent advancements in DNNs and transformer models with a tremendous number of parameters, deploying these large models on edge devices causes serious issues such as high runtime and memory consumption. This is especially concerning with the recent large-scale foundation models, Vision-Language Models (VLMs), and Large Language Models (LLMs). Knowledge Distillation (KD) is one of the prominent techniques proposed to address the aforementioned problems using a teacher-student architecture. More specifically, a lightweight student model is trained using additional knowledge from a cumbersome teacher model. In this work, a comprehensive survey of knowledge distillation methods is proposed. This includes reviewing KD from different aspects: distillation sources, distillation schemes, distillation algorithms, distillation by modalities, applications of distillation, and comparison among existing methods. In contrast to most existing surveys, which are either outdated or simply update former surveys, this work proposes a comprehensive survey with a new point of view and representation structure that categorizes and investigates the most recent methods in knowledge distillation. This survey considers various critically important subcategories, including KD for diffusion models, 3D inputs, foundational models, transformers, and LLMs. Furthermore, existing challenges in KD and possible future research directions are discussed. Github page of the project: this https URL

Title: Prototype-Based Image Prompting for Weakly Supervised Histopathological Image Segmentation

Authors: Qingchen Tang, Lei Fan, Maurice Pagnucco, Yang Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12068
Pdf URL: https://arxiv.org/pdf/2503.12068
Copy Paste: [[2503.12068]] Prototype-Based Image Prompting for Weakly Supervised Histopathological Image Segmentation(https://arxiv.org/abs/2503.12068)
Keywords: segmentation
Abstract: Weakly supervised image segmentation with image-level labels has drawn attention due to the high cost of pixel-level annotations. Traditional methods using Class Activation Maps (CAMs) often highlight only the most discriminative regions, leading to incomplete masks. Recent approaches that introduce textual information struggle with histopathological images due to inter-class homogeneity and intra-class heterogeneity. In this paper, we propose a prototype-based image prompting framework for histopathological image segmentation. It constructs an image bank from the training set using clustering, extracting multiple prototype features per class to capture intra-class heterogeneity. By designing a matching loss between input features and class-specific prototypes using contrastive learning, our method addresses inter-class homogeneity and guides the model to generate more accurate CAMs. Experiments on four datasets (LUAD-HistoSeg, BCSS-WSSS, GCSS, and BCSS) show that our method outperforms existing weakly supervised segmentation approaches, setting new benchmarks in histopathological image segmentation.

Title: Robust Dataset Distillation by Matching Adversarial Trajectories

Authors: Wei Lai, Tianyu Ding, ren dongdong, Lei Wang, Jing Huo, Yang Gao, Wenbin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12069
Pdf URL: https://arxiv.org/pdf/2503.12069
Copy Paste: [[2503.12069]] Robust Dataset Distillation by Matching Adversarial Trajectories(https://arxiv.org/abs/2503.12069)
Keywords: attack, robust
Abstract: Dataset distillation synthesizes compact datasets that enable models to achieve performance comparable to training on the original large-scale datasets. However, existing distillation methods overlook the robustness of the model, resulting in models that are vulnerable to adversarial attacks when trained on distilled data. To address this limitation, we introduce the task of ``robust dataset distillation", a novel paradigm that embeds adversarial robustness into the synthetic datasets during the distillation process. We propose Matching Adversarial Trajectories (MAT), a method that integrates adversarial training into trajectory-based dataset distillation. MAT incorporates adversarial samples during trajectory generation to obtain robust training trajectories, which are then used to guide the distillation process. As experimentally demonstrated, even through natural training on our distilled dataset, models can achieve enhanced adversarial robustness while maintaining competitive accuracy compared to existing distillation methods. Our work highlights robust dataset distillation as a new and important research direction and provides a strong baseline for future research to bridge the gap between efficient training and adversarial robustness.

Title: Information-Guided Identification of Training Data Imprint in (Proprietary) Large Language Models

Authors: Abhilasha Ravichander, Jillian Fisher, Taylor Sorensen, Ximing Lu, Yuchen Lin, Maria Antoniak, Niloofar Mireshghallah, Chandra Bhagavatula, Yejin Choi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12072
Pdf URL: https://arxiv.org/pdf/2503.12072
Copy Paste: [[2503.12072]] Information-Guided Identification of Training Data Imprint in (Proprietary) Large Language Models(https://arxiv.org/abs/2503.12072)
Keywords: large language model
Abstract: High-quality training data has proven crucial for developing performant large language models (LLMs). However, commercial LLM providers disclose few, if any, details about the data used for training. This lack of transparency creates multiple challenges: it limits external oversight and inspection of LLMs for issues such as copyright infringement, it undermines the agency of data authors, and it hinders scientific research on critical issues such as data contamination and data selection. How can we recover what training data is known to LLMs? In this work, we demonstrate a new method to identify training data known to proprietary LLMs like GPT-4 without requiring any access to model weights or token probabilities, by using information-guided probes. Our work builds on a key observation: text passages with high surprisal are good search material for memorization probes. By evaluating a model's ability to successfully reconstruct high-surprisal tokens in text, we can identify a surprising number of texts memorized by LLMs.

Title: V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents

Authors: Zhengrong Yue, Shaobin Zhuang, Kunchang Li, Yanbo Ding, Yali Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12077
Pdf URL: https://arxiv.org/pdf/2503.12077
Copy Paste: [[2503.12077]] V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents(https://arxiv.org/abs/2503.12077)
Keywords: robust, large language model
Abstract: Despite the recent advancement in video stylization, most existing methods struggle to render any video with complex transitions, based on an open style description of user query. To fill this gap, we introduce a generic multi-agent system for video stylization, V-Stylist, by a novel collaboration and reflection paradigm of multi-modal large language models. Specifically, our V-Stylist is a systematical workflow with three key roles: (1) Video Parser decomposes the input video into a number of shots and generates their text prompts of key shot content. Via a concise video-to-shot prompting paradigm, it allows our V-Stylist to effectively handle videos with complex transitions. (2) Style Parser identifies the style in the user query and progressively search the matched style model from a style tree. Via a robust tree-of-thought searching paradigm, it allows our V-Stylist to precisely specify vague style preference in the open user query. (3) Style Artist leverages the matched model to render all the video shots into the required style. Via a novel multi-round self-reflection paradigm, it allows our V-Stylist to adaptively adjust detail control, according to the style requirement. With such a distinct design of mimicking human professionals, our V-Stylist achieves a major breakthrough over the primary challenges for effective and automatic video stylization. Moreover,we further construct a new benchmark Text-driven Video Stylization Benchmark (TVSBench), which fills the gap to assess stylization of complex videos on open user queries. Extensive experiments show that, V-Stylist achieves the state-of-the-art, e.g.,V-Stylist surpasses FRESCO and ControlVideo by 6.05% and 4.51% respectively in overall average metrics, marking a significant advance in video stylization.

Title: SFMNet: Sparse Focal Modulation for 3D Object Detection

Authors: Oren Shrout, Ayellet Tal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12093
Pdf URL: https://arxiv.org/pdf/2503.12093
Copy Paste: [[2503.12093]] SFMNet: Sparse Focal Modulation for 3D Object Detection(https://arxiv.org/abs/2503.12093)
Keywords: transformer
Abstract: We propose SFMNet, a novel 3D sparse detector that combines the efficiency of sparse convolutions with the ability to model long-range dependencies. While traditional sparse convolution techniques efficiently capture local structures, they struggle with modeling long-range relationships. However, capturing long-range dependencies is fundamental for 3D object detection. In contrast, transformers are designed to capture these long-range dependencies through attention mechanisms. But, they come with high computational costs, due to their quadratic query-key-value interactions. Furthermore, directly applying attention to non-empty voxels is inefficient due to the sparse nature of 3D scenes. Our SFMNet is built on a novel Sparse Focal Modulation (SFM) module, which integrates short- and long-range contexts with linear complexity by leveraging a new hierarchical sparse convolution design. This approach enables SFMNet to achieve high detection performance with improved efficiency, making it well-suited for large-scale LiDAR scenes. We show that our detector achieves state-of-the-art performance on autonomous driving datasets.

Title: E-SAM: Training-Free Segment Every Entity Model

Authors: Weiming Zhang, Dingwen Xiao, Lei Chen, Lin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12094
Pdf URL: https://arxiv.org/pdf/2503.12094
Copy Paste: [[2503.12094]] E-SAM: Training-Free Segment Every Entity Model(https://arxiv.org/abs/2503.12094)
Keywords: segmentation
Abstract: Entity Segmentation (ES) aims at identifying and segmenting distinct entities within an image without the need for predefined class labels. This characteristic makes ES well-suited to open-world applications with adaptation to diverse and dynamically changing environments, where new and previously unseen entities may appear frequently. Existing ES methods either require large annotated datasets or high training costs, limiting their scalability and adaptability. Recently, the Segment Anything Model (SAM), especially in its Automatic Mask Generation (AMG) mode, has shown potential for holistic image segmentation. However, it struggles with over-segmentation and under-segmentation, making it less effective for ES. In this paper, we introduce E-SAM, a novel training-free framework that exhibits exceptional ES capability. Specifically, we first propose Multi-level Mask Generation (MMG) that hierarchically processes SAM's AMG outputs to generate reliable object-level masks while preserving fine details at other levels. Entity-level Mask Refinement (EMR) then refines these object-level masks into accurate entity-level masks. That is, it separates overlapping masks to address the redundancy issues inherent in SAM's outputs and merges similar masks by evaluating entity-level consistency. Lastly, Under-Segmentation Refinement (USR) addresses under-segmentation by generating additional high-confidence masks fused with EMR outputs to produce the final ES map. These three modules are seamlessly optimized to achieve the best ES without additional training overhead. Extensive experiments demonstrate that E-SAM achieves state-of-the-art performance compared to prior ES methods, demonstrating a significant improvement by +30.1 on benchmark metrics.

Title: Towards Vision Zero: The Accid3nD Dataset

Authors: Walter Zimmer, Ross Greer, Daniel Lehmberg, Marc Pavel, Holger Caesar, Xingcheng Zhou, Ahmed Ghita, Mohan Trivedi, Rui Song, Hu Cao, Akshay Gopalkrishnan, Alois C. Knoll
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12095
Pdf URL: https://arxiv.org/pdf/2503.12095
Copy Paste: [[2503.12095]] Towards Vision Zero: The Accid3nD Dataset(https://arxiv.org/abs/2503.12095)
Keywords: robust
Abstract: Even though a significant amount of work has been done to increase the safety of transportation networks, accidents still occur regularly. They must be understood as unavoidable and sporadic outcomes of traffic networks. No public dataset contains 3D annotations of real-world accidents recorded from roadside sensors. We present the Accid3nD dataset, a collection of real-world highway accidents in different weather and lighting conditions. It contains vehicle crashes at high-speed driving with 2,634,233 labeled 2D bounding boxes, instance masks, and 3D bounding boxes with track IDs. In total, the dataset contains 111,945 labeled frames recorded from four roadside cameras and LiDARs at 25 Hz. The dataset contains six object classes and is provided in the OpenLABEL format. We propose an accident detection model that combines a rule-based approach with a learning-based one. Experiments and ablation studies on our dataset show the robustness of our proposed method. The dataset, model, and code are available on our website: this https URL.

Title: Large Language Models in Legislative Content Analysis: A Dataset from the Polish Parliament

Authors: Arkadiusz Bryłkowski, Jakub Klikowski
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12100
Pdf URL: https://arxiv.org/pdf/2503.12100
Copy Paste: [[2503.12100]] Large Language Models in Legislative Content Analysis: A Dataset from the Polish Parliament(https://arxiv.org/abs/2503.12100)
Keywords: large language model
Abstract: Large language models (LLMs) are among the best methods for processing natural language, partly due to their versatility. At the same time, domain-specific LLMs are more practical in real-life applications. This work introduces a novel natural language dataset created by acquired data from official legislative authorities' websites. The study focuses on formulating three natural language processing (NLP) tasks to evaluate the effectiveness of LLMs on legislative content analysis within the context of the Polish legal system. Key findings highlight the potential of LLMs in automating and enhancing legislative content analysis while emphasizing specific challenges, such as understanding legal context. The research contributes to the advancement of NLP in the legal field, particularly in the Polish language. It has been demonstrated that even commonly accessible data can be practically utilized for legislative content analysis.

Title: A Speech-to-Video Synthesis Approach Using Spatio-Temporal Diffusion for Vocal Tract MRI

Authors: Paula Andrea Pérez-Toro, Tomás Arias-Vergara, Fangxu Xing, Xiaofeng Liu, Maureen Stone, Jiachen Zhuo, Juan Rafael Orozco-Arroyave, Elmar Nöth, Jana Hutter, Jerry L. Prince, Andreas Maier, Jonghye Woo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12102
Pdf URL: https://arxiv.org/pdf/2503.12102
Copy Paste: [[2503.12102]] A Speech-to-Video Synthesis Approach Using Spatio-Temporal Diffusion for Vocal Tract MRI(https://arxiv.org/abs/2503.12102)
Keywords: diffusion
Abstract: Understanding the relationship between vocal tract motion during speech and the resulting acoustic signal is crucial for aided clinical assessment and developing personalized treatment and rehabilitation strategies. Toward this goal, we introduce an audio-to-video generation framework for creating Real Time/cine-Magnetic Resonance Imaging (RT-/cine-MRI) visuals of the vocal tract from speech signals. Our framework first preprocesses RT-/cine-MRI sequences and speech samples to achieve temporal alignment, ensuring synchronization between visual and audio data. We then employ a modified stable diffusion model, integrating structural and temporal blocks, to effectively capture movement characteristics and temporal dynamics in the synchronized data. This process enables the generation of MRI sequences from new speech inputs, improving the conversion of audio into visual data. We evaluated our framework on healthy controls and tongue cancer patients by analyzing and comparing the vocal tract movements in synthesized videos. Our framework demonstrated adaptability to new speech inputs and effective generalization. In addition, positive human evaluations confirmed its effectiveness, with realistic and accurate visualizations, suggesting its potential for outpatient therapy and personalized simulation of vocal tract visualizations.

Title: ChronosX: Adapting Pretrained Time Series Models with Exogenous Variables

Authors: Sebastian Pineda Arango, Pedro Mercado, Shubham Kapoor, Abdul Fatir Ansari, Lorenzo Stella, Huibin Shen, Hugo Senetaire, Caner Turkmen, Oleksandr Shchur, Danielle C. Maddix, Michael Bohlke-Schneider, Yuyang Wang, Syama Sundar Rangapuram
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12107
Pdf URL: https://arxiv.org/pdf/2503.12107
Copy Paste: [[2503.12107]] ChronosX: Adapting Pretrained Time Series Models with Exogenous Variables(https://arxiv.org/abs/2503.12107)
Keywords: large language model
Abstract: Covariates provide valuable information on external factors that influence time series and are critical in many real-world time series forecasting tasks. For example, in retail, covariates may indicate promotions or peak dates such as holiday seasons that heavily influence demand forecasts. Recent advances in pretraining large language model architectures for time series forecasting have led to highly accurate forecasters. However, the majority of these models do not readily use covariates as they are often specific to a certain task or domain. This paper introduces a new method to incorporate covariates into pretrained time series forecasting models. Our proposed approach incorporates covariate information into pretrained forecasting models through modular blocks that inject past and future covariate information, without necessarily modifying the pretrained model in consideration. In order to evaluate our approach, we introduce a benchmark composed of 32 different synthetic datasets with varying dynamics to evaluate the effectivity of forecasting models with covariates. Extensive evaluations on both synthetic and real datasets show that our approach effectively incorporates covariate information into pretrained models, outperforming existing baselines.

Title: RECSIP: REpeated Clustering of Scores Improving the Precision

Authors: André Schamschurko, Nenad Petrovic, Alois Christian Knoll
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12108
Pdf URL: https://arxiv.org/pdf/2503.12108
Copy Paste: [[2503.12108]] RECSIP: REpeated Clustering of Scores Improving the Precision(https://arxiv.org/abs/2503.12108)
Keywords: large language model
Abstract: The latest research on Large Language Models (LLMs) has demonstrated significant advancement in the field of Natural Language Processing (NLP). However, despite this progress, there is still a lack of reliability in these models. This is due to the stochastic architecture of LLMs, which presents a challenge for users attempting to ascertain the reliability of a model's response. These responses may cause serious harm in high-risk environments or expensive failures in industrial contexts. Therefore, we introduce the framework REpeated Clustering of Scores Improving the Precision (RECSIP) which focuses on improving the precision of LLMs by asking multiple models in parallel, scoring and clustering their responses to ensure a higher reliability on the response. The evaluation of our reference implementation recsip on the benchmark MMLU-Pro using the models GPT-4o, Claude and Gemini shows an overall increase of 5.8 per cent points compared to the best used model.

Title: MT-RewardTree: A Comprehensive Framework for Advancing LLM-Based Machine Translation via Reward Modeling

Authors: Zhaopeng Feng, Jiahan Ren, Jiayuan Su, Jiamei Zheng, Zhihang Tang, Hongwei Wang, Zuozhu Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12123
Pdf URL: https://arxiv.org/pdf/2503.12123
Copy Paste: [[2503.12123]] MT-RewardTree: A Comprehensive Framework for Advancing LLM-Based Machine Translation via Reward Modeling(https://arxiv.org/abs/2503.12123)
Keywords: large language model
Abstract: Process reward models (PRMs) have shown success in complex reasoning tasks for large language models (LLMs). However, their application to machine translation (MT) remains underexplored due to the lack of systematic methodologies and evaluation benchmarks. To address this gap, we introduce \textbf{MT-RewardTree}, a comprehensive framework for constructing, evaluating, and deploying process reward models in MT. Unlike traditional vanilla preference pair construction, we propose a novel method for automatically generating token-level preference pairs using approximate Monte Carlo Tree Search (MCTS), which mitigates the prohibitive cost of human annotation for fine-grained steps. Then, we establish the first MT-specific reward model benchmark and provide a systematic comparison of different reward modeling architectures, revealing that token-level supervision effectively captures fine-grained preferences. Experimental results demonstrate that our MT-PRM-Qwen-2.5-3B achieves state-of-the-art performance in both token-level and sequence-level evaluation given the same input prefix. Furthermore, we showcase practical applications where PRMs enable test-time alignment for LLMs without additional alignment training and significantly improve performance in hypothesis ensembling. Our work provides valuable insights into the role of reward models in MT research. Our code and data are released in \href{this https URL}{this https URL\_RewardTreePage}.

Title: Robust Isolation Forest using Soft Sparse Random Projection and Valley Emphasis Method

Authors: Hun Kang, Kyoungok Kim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12125
Pdf URL: https://arxiv.org/pdf/2503.12125
Copy Paste: [[2503.12125]] Robust Isolation Forest using Soft Sparse Random Projection and Valley Emphasis Method(https://arxiv.org/abs/2503.12125)
Keywords: robust
Abstract: Isolation Forest (iForest) is an unsupervised anomaly detection algorithm designed to effectively detect anomalies under the assumption that anomalies are ``few and different." Various studies have aimed to enhance iForest, but the resulting algorithms often exhibited significant performance disparities across datasets. Additionally, the challenge of isolating rare and widely distributed anomalies persisted in research focused on improving splits. To address these challenges, we introduce Robust iForest (RiForest). RiForest leverages both existing features and random hyperplanes obtained through soft sparse random projection to identify superior split features for anomaly detection, independent of datasets. It utilizes the underutilized valley emphasis method for optimal split point determination and incorporates sparsity randomization in soft sparse random projection for enhanced anomaly detection robustness. Across 24 benchmark datasets, experiments demonstrate RiForest's consistent outperformance of existing algorithms in anomaly detection, emphasizing stability and robustness to noise variables.

Title: DiffGAP: A Lightweight Diffusion Module in Contrastive Space for Bridging Cross-Model Gap

Authors: Shentong Mo, Zehua Chen, Fan Bao, Jun Zhu
Subjects: cs.CV, cs.AI, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2503.12131
Pdf URL: https://arxiv.org/pdf/2503.12131
Copy Paste: [[2503.12131]] DiffGAP: A Lightweight Diffusion Module in Contrastive Space for Bridging Cross-Model Gap(https://arxiv.org/abs/2503.12131)
Keywords: robust, diffusion, generative
Abstract: Recent works in cross-modal understanding and generation, notably through models like CLAP (Contrastive Language-Audio Pretraining) and CAVP (Contrastive Audio-Visual Pretraining), have significantly enhanced the alignment of text, video, and audio embeddings via a single contrastive loss. However, these methods often overlook the bidirectional interactions and inherent noises present in each modality, which can crucially impact the quality and efficacy of cross-modal integration. To address this limitation, we introduce DiffGAP, a novel approach incorporating a lightweight generative module within the contrastive space. Specifically, our DiffGAP employs a bidirectional diffusion process tailored to bridge the cross-modal gap more effectively. This involves a denoising process on text and video embeddings conditioned on audio embeddings and vice versa, thus facilitating a more nuanced and robust cross-modal interaction. Our experimental results on VGGSound and AudioCaps datasets demonstrate that DiffGAP significantly improves performance in video/text-audio generation and retrieval tasks, confirming its effectiveness in enhancing cross-modal understanding and generation capabilities.

Title: A State Alignment-Centric Approach to Federated System Identification: The FedAlign Framework

Authors: Ertuğrul Keçeci, Müjde Güzelkaya, Tufan Kumbasar
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2503.12137
Pdf URL: https://arxiv.org/pdf/2503.12137
Copy Paste: [[2503.12137]] A State Alignment-Centric Approach to Federated System Identification: The FedAlign Framework(https://arxiv.org/abs/2503.12137)
Keywords: federate
Abstract: This paper presents FedAlign, a Federated Learning (FL) framework particularly designed for System Identification (SYSID) tasks by aligning state representations. Local workers can learn State-Space Models (SSMs) with equivalent representations but different dynamics. We demonstrate that directly aggregating these local SSMs via FedAvg results in a global model with altered system dynamics. FedAlign overcomes this problem by employing similarity transformation matrices to align state representations of local SSMs, thereby establishing a common parameter basin that retains the dynamics of local SSMs. FedAlign computes similarity transformation matrices via two distinct approaches: FedAlign-A and FedAlign-O. In FedAlign-A, we represent the global SSM in controllable canonical form (CCF). We apply control theory to analytically derive similarity transformation matrices that convert each local SSM into this form. Yet, establishing global SSM in CCF brings additional alignment challenges in multi input - multi output SYSID as CCF representation is not unique, unlike in single input - single output SYSID. In FedAlign-O, we address these alignment challenges by reformulating the local parameter basin alignment problem as an optimization task. We determine the parameter basin of a local worker as the common parameter basin and solve least square problems to obtain similarity transformation matrices needed to align the remaining local SSMs. Through the experiments conducted on synthetic and real-world datasets, we show that FedAlign outperforms FedAvg, converges faster, and provides improved stability of the global SSM thanks to the efficient alignment of local parameter basins.

Title: Point-Cache: Test-time Dynamic and Hierarchical Cache for Robust and Generalizable Point Cloud Analysis

Authors: Hongyu Sun, Qiuhong Ke, Ming Cheng, Yongcai Wang, Deying Li, Chenhui Gou, Jianfei Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12150
Pdf URL: https://arxiv.org/pdf/2503.12150
Copy Paste: [[2503.12150]] Point-Cache: Test-time Dynamic and Hierarchical Cache for Robust and Generalizable Point Cloud Analysis(https://arxiv.org/abs/2503.12150)
Keywords: robust
Abstract: This paper proposes a general solution to enable point cloud recognition models to handle distribution shifts at test time. Unlike prior methods, which rely heavily on training data-often inaccessible during online inference-and are limited to recognizing a fixed set of point cloud classes predefined during training, we explore a more practical and challenging scenario: adapting the model solely based on online test data to recognize both previously seen classes and novel, unseen classes at test time. To this end, we develop Point-Cache, a hierarchical cache model that captures essential clues of online test samples, particularly focusing on the global structure of point clouds and their local-part details. Point-Cache, which serves as a rich 3D knowledge base, is dynamically managed to prioritize the inclusion of high-quality samples. Designed as a plug-and-play module, our method can be flexibly integrated into large multimodal 3D models to support open-vocabulary point cloud recognition. Notably, our solution operates with efficiency comparable to zero-shot inference, as it is entirely training-free. Point-Cache demonstrates substantial gains across 8 challenging benchmarks and 4 representative large 3D models, highlighting its effectiveness. Code is available at this https URL.

Title: Improving LLM-based Document-level Machine Translation with Multi-Knowledge Fusion

Authors: Bin Liu, Xinglin Lyu, Junhui Li, Daimeng Wei, Min Zhang, Shimin Tao, Hao Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12152
Pdf URL: https://arxiv.org/pdf/2503.12152
Copy Paste: [[2503.12152]] Improving LLM-based Document-level Machine Translation with Multi-Knowledge Fusion(https://arxiv.org/abs/2503.12152)
Keywords: large language model
Abstract: Recent studies in prompting large language model (LLM) for document-level machine translation (DMT) primarily focus on the inter-sentence context by flatting the source document into a long sequence. This approach relies solely on the sequence of sentences within the document. However, the complexity of document-level sequences is greater than that of shorter sentence-level sequences, which may limit LLM's ability in DMT when only this single-source knowledge is used. In this paper, we propose an enhanced approach by incorporating multiple sources of knowledge, including both the document summarization and entity translation, to enhance the performance of LLM-based DMT. Given a source document, we first obtain its summarization and translation of entities via LLM as the additional knowledge. We then utilize LLMs to generate two translations of the source document by fusing these two single knowledge sources, respectively. Finally, recognizing that different sources of knowledge may aid or hinder the translation of different sentences, we refine and rank the translations by leveraging a multi-knowledge fusion strategy to ensure the best results. Experimental results in eight document-level translation tasks show that our approach achieves an average improvement of 0.8, 0.6, and 0.4 COMET scores over the baseline without extra knowledge for LLaMA3-8B-Instruct, Mistral-Nemo-Instruct, and GPT-4o-mini, respectively.

Title: Efficient and Privacy-Preserved Link Prediction via Condensed Graphs

Authors: Yunbo Long, Liming Xu, Alexandra Brintrup
Subjects: cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2503.12156
Pdf URL: https://arxiv.org/pdf/2503.12156
Copy Paste: [[2503.12156]] Efficient and Privacy-Preserved Link Prediction via Condensed Graphs(https://arxiv.org/abs/2503.12156)
Keywords: privacy
Abstract: Link prediction is crucial for uncovering hidden connections within complex networks, enabling applications such as identifying potential customers and products. However, this research faces significant challenges, including concerns about data privacy, as well as high computational and storage costs, especially when dealing with large-scale networks. Condensed graphs, which are much smaller than the original graphs while retaining essential information, has become an effective solution to both maintain data utility and preserve privacy. Existing methods, however, initialize synthetic graphs through random node selection without considering node connectivity, and are mainly designed for node classification tasks. As a result, their potential for privacy-preserving link prediction remains largely unexplored. We introduce HyDRO\textsuperscript{+}, a graph condensation method guided by algebraic Jaccard similarity, which leverages local connectivity information to optimize condensed graph structures. Extensive experiments on four real-world networks show that our method outperforms state-of-the-art methods and even the original networks in balancing link prediction accuracy and privacy preservation. Moreover, our method achieves nearly 20* faster training and reduces storage requirements by 452*, as demonstrated on the Computers dataset, compared to link prediction on the original networks. This work represents the first attempt to leverage condensed graphs for privacy-preserving link prediction information sharing in real-world complex networks. It offers a promising pathway for preserving link prediction information while safeguarding privacy, advancing the use of graph condensation in large-scale networks with privacy concerns.

Title: Probabilistic Graph Circuits: Deep Generative Models for Tractable Probabilistic Inference over Graphs

Authors: Milan Papež, Martin Rektoris, Václav Šmídl, Tomáš Pevný
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12162
Pdf URL: https://arxiv.org/pdf/2503.12162
Copy Paste: [[2503.12162]] Probabilistic Graph Circuits: Deep Generative Models for Tractable Probabilistic Inference over Graphs(https://arxiv.org/abs/2503.12162)
Keywords: generative
Abstract: Deep generative models (DGMs) have recently demonstrated remarkable success in capturing complex probability distributions over graphs. Although their excellent performance is attributed to powerful and scalable deep neural networks, it is, at the same time, exactly the presence of these highly non-linear transformations that makes DGMs intractable. Indeed, despite representing probability distributions, intractable DGMs deny probabilistic foundations by their inability to answer even the most basic inference queries without approximations or design choices specific to a very narrow range of queries. To address this limitation, we propose probabilistic graph circuits (PGCs), a framework of tractable DGMs that provide exact and efficient probabilistic inference over (arbitrary parts of) graphs. Nonetheless, achieving both exactness and efficiency is challenging in the permutation-invariant setting of graphs. We design PGCs that are inherently invariant and satisfy these two requirements, yet at the cost of low expressive power. Therefore, we investigate two alternative strategies to achieve the invariance: the first sacrifices the efficiency, and the second sacrifices the exactness. We demonstrate that ignoring the permutation invariance can have severe consequences in anomaly detection, and that the latter approach is competitive with, and sometimes better than, existing intractable DGMs in the context of molecular graph generation.

Title: PLM: Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing

Authors: Cheng Deng, Luoyang Sun, Jiwen Jiang, Yongcheng Zeng, Xinjian Wu, Wenxin Zhao, Qingfa Xiao, Jiachuan Wang, Lei Chen, Lionel M. Ni, Haifeng Zhang, Jun Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12167
Pdf URL: https://arxiv.org/pdf/2503.12167
Copy Paste: [[2503.12167]] PLM: Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing(https://arxiv.org/abs/2503.12167)
Keywords: large language model
Abstract: While scaling laws have been continuously validated in large language models (LLMs) with increasing model parameters, the inherent tension between the inference demands of LLMs and the limited resources of edge devices poses a critical challenge to the development of edge intelligence. Recently, numerous small language models have emerged, aiming to distill the capabilities of LLMs into smaller footprints. However, these models often retain the fundamental architectural principles of their larger counterparts, still imposing considerable strain on the storage and bandwidth capacities of edge devices. In this paper, we introduce the PLM, a Peripheral Language Model, developed through a co-design process that jointly optimizes model architecture and edge system constraints. The PLM utilizes a Multi-head Latent Attention mechanism and employs the squared ReLU activation function to encourage sparsity, thereby reducing peak memory footprint during inference. During training, we collect and reorganize open-source datasets, implement a multi-phase training strategy, and empirically investigate the Warmup-Stable-Decay-Constant (WSDC) learning rate scheduler. Additionally, we incorporate Reinforcement Learning from Human Feedback (RLHF) by adopting the ARIES preference learning approach. Following a two-phase SFT process, this method yields performance gains of 2% in general tasks, 9% in the GSM8K task, and 11% in coding tasks. In addition to its novel architecture, evaluation results demonstrate that PLM outperforms existing small language models trained on publicly available data while maintaining the lowest number of activated parameters. Furthermore, deployment across various edge devices, including consumer-grade GPUs, mobile phones, and Raspberry Pis, validates PLM's suitability for peripheral applications. The PLM series models are publicly available at this https URL.

Title: Learning Extremely High Density Crowds as Active Matters

Authors: Feixiang He, Jiangbei Yue, Jialin Zhu, Armin Seyfried, Dan Casas, Julien Pettré, He Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12168
Pdf URL: https://arxiv.org/pdf/2503.12168
Copy Paste: [[2503.12168]] Learning Extremely High Density Crowds as Active Matters(https://arxiv.org/abs/2503.12168)
Keywords: interpretability
Abstract: Video-based high-density crowd analysis and prediction has been a long-standing topic in computer vision. It is notoriously difficult due to, but not limited to, the lack of high-quality data and complex crowd dynamics. Consequently, it has been relatively under studied. In this paper, we propose a new approach that aims to learn from in-the-wild videos, often with low quality where it is difficult to track individuals or count heads. The key novelty is a new physics prior to model crowd dynamics. We model high-density crowds as active matter, a continumm with active particles subject to stochastic forces, named 'crowd material'. Our physics model is combined with neural networks, resulting in a neural stochastic differential equation system which can mimic the complex crowd dynamics. Due to the lack of similar research, we adapt a range of existing methods which are close to ours for comparison. Through exhaustive evaluation, we show our model outperforms existing methods in analyzing and forecasting extremely high-density crowds. Furthermore, since our model is a continuous-time physics model, it can be used for simulation and analysis, providing strong interpretability. This is categorically different from most deep learning methods, which are discrete-time models and black-boxes.

Title: SEAL: Semantic Aware Image Watermarking

Authors: Kasra Arabi, R. Teal Witter, Chinmay Hegde, Niv Cohen
Subjects: cs.LG, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2503.12172
Pdf URL: https://arxiv.org/pdf/2503.12172
Copy Paste: [[2503.12172]] SEAL: Semantic Aware Image Watermarking(https://arxiv.org/abs/2503.12172)
Keywords: attack, robust, watermark, diffusion, generative
Abstract: Generative models have rapidly evolved to generate realistic outputs. However, their synthetic outputs increasingly challenge the clear distinction between natural and AI-generated content, necessitating robust watermarking techniques. Watermarks are typically expected to preserve the integrity of the target image, withstand removal attempts, and prevent unauthorized replication onto unrelated images. To address this need, recent methods embed persistent watermarks into images produced by diffusion models using the initial noise. Yet, to do so, they either distort the distribution of generated images or rely on searching through a long dictionary of used keys for detection. In this paper, we propose a novel watermarking method that embeds semantic information about the generated image directly into the watermark, enabling a distortion-free watermark that can be verified without requiring a database of key patterns. Instead, the key pattern can be inferred from the semantic embedding of the image using locality-sensitive hashing. Furthermore, conditioning the watermark detection on the original image content improves robustness against forgery attacks. To demonstrate that, we consider two largely overlooked attack strategies: (i) an attacker extracting the initial noise and generating a novel image with the same pattern; (ii) an attacker inserting an unrelated (potentially harmful) object into a watermarked image, possibly while preserving the watermark. We empirically validate our method's increased robustness to these attacks. Taken together, our results suggest that content-aware watermarks can mitigate risks arising from image-generative models.

Title: Multi-Agent Systems Execute Arbitrary Malicious Code

Authors: Harold Triedman, Rishi Jha, Vitaly Shmatikov
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.12188
Pdf URL: https://arxiv.org/pdf/2503.12188
Copy Paste: [[2503.12188]] Multi-Agent Systems Execute Arbitrary Malicious Code(https://arxiv.org/abs/2503.12188)
Keywords: security, attack
Abstract: Multi-agent systems coordinate LLM-based agents to perform tasks on users' behalf. In real-world applications, multi-agent systems will inevitably interact with untrusted inputs, such as malicious Web content, files, email attachments, etc. Using several recently proposed multi-agent frameworks as concrete examples, we demonstrate that adversarial content can hijack control and communication within the system to invoke unsafe agents and functionalities. This results in a complete security breach, up to execution of arbitrary malicious code on the user's device and/or exfiltration of sensitive data from the user's containerized environment. We show that control-flow hijacking attacks succeed even if the individual agents are not susceptible to direct or indirect prompt injection, and even if they refuse to perform harmful actions.

Title: Breaking the Box: Enhancing Remote Sensing Image Segmentation with Freehand Sketches

Authors: Ying Zang, Yuncan Gao, Jiangi Zhang, Yuangi Hu, Runlong Cao, Lanyun Zhu, Qi Zhu, Deyi Ji, Renjun Xu, Tianrun Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12191
Pdf URL: https://arxiv.org/pdf/2503.12191
Copy Paste: [[2503.12191]] Breaking the Box: Enhancing Remote Sensing Image Segmentation with Freehand Sketches(https://arxiv.org/abs/2503.12191)
Keywords: robust, segmentation
Abstract: This work advances zero-shot interactive segmentation for remote sensing imagery through three key contributions. First, we propose a novel sketch-based prompting method, enabling users to intuitively outline objects, surpassing traditional point or box prompts. Second, we introduce LTL-Sensing, the first dataset pairing human sketches with remote sensing imagery, setting a benchmark for future research. Third, we present LTL-Net, a model featuring a multi-input prompting transport module tailored for freehand sketches. Extensive experiments show our approach significantly improves segmentation accuracy and robustness over state-of-the-art methods like SAM, fostering more intuitive human-AI collaboration in remote sensing analysis and enhancing its applications.

Title: STAY Diffusion: Styled Layout Diffusion Model for Diverse Layout-to-Image Generation

Authors: Ruyu Wang, Xuefeng Hou, Sabrina Schmedding, Marco F. Huber
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12213
Pdf URL: https://arxiv.org/pdf/2503.12213
Copy Paste: [[2503.12213]] STAY Diffusion: Styled Layout Diffusion Model for Diverse Layout-to-Image Generation(https://arxiv.org/abs/2503.12213)
Keywords: diffusion
Abstract: In layout-to-image (L2I) synthesis, controlled complex scenes are generated from coarse information like bounding boxes. Such a task is exciting to many downstream applications because the input layouts offer strong guidance to the generation process while remaining easily reconfigurable by humans. In this paper, we proposed STyled LAYout Diffusion (STAY Diffusion), a diffusion-based model that produces photo-realistic images and provides fine-grained control of stylized objects in scenes. Our approach learns a global condition for each layout, and a self-supervised semantic map for weight modulation using a novel Edge-Aware Normalization (EA Norm). A new Styled-Mask Attention (SM Attention) is also introduced to cross-condition the global condition and image feature for capturing the objects' relationships. These measures provide consistent guidance through the model, enabling more accurate and controllable image generation. Extensive benchmarking demonstrates that our STAY Diffusion presents high-quality images while surpassing previous state-of-the-art methods in generation diversity, accuracy, and controllability.

Title: Cross-Modal Diffusion for Biomechanical Dynamical Systems Through Local Manifold Alignment

Authors: Sharmita Dey, Sarath Ravindran Nair
Subjects: cs.LG, math.DS
Abstract URL: https://arxiv.org/abs/2503.12214
Pdf URL: https://arxiv.org/pdf/2503.12214
Copy Paste: [[2503.12214]] Cross-Modal Diffusion for Biomechanical Dynamical Systems Through Local Manifold Alignment(https://arxiv.org/abs/2503.12214)
Keywords: robust, diffusion
Abstract: We present a mutually aligned diffusion framework for cross-modal biomechanical motion generation, guided by a dynamical systems perspective. By treating each modality, e.g., observed joint angles ($X$) and ground reaction forces ($Y$), as complementary observations of a shared underlying locomotor dynamical system, our method aligns latent representations at each diffusion step, so that one modality can help denoise and disambiguate the other. Our alignment approach is motivated by the fact that local time windows of $X$ and $Y$ represent the same phase of an underlying dynamical system, thereby benefiting from a shared latent manifold. We introduce a simple local latent manifold alignment (LLMA) strategy that incorporates first-order and second-order alignment within the latent space for robust cross-modal biomechanical generation without bells and whistles. Through experiments on multimodal human biomechanics data, we show that aligning local latent dynamics across modalities improves generation fidelity and yields better representations.

Title: Gun Detection Using Combined Human Pose and Weapon Appearance

Authors: Amulya Reddy Maligireddy, Manohar Reddy Uppula, Nidhi Rastogi, Yaswanth Reddy Parla
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12215
Pdf URL: https://arxiv.org/pdf/2503.12215
Copy Paste: [[2503.12215]] Gun Detection Using Combined Human Pose and Weapon Appearance(https://arxiv.org/abs/2503.12215)
Keywords: security, robust
Abstract: The increasing frequency of firearm-related incidents has necessitated advancements in security and surveillance systems, particularly in firearm detection within public spaces. Traditional gun detection methods rely on manual inspections and continuous human monitoring of CCTV footage, which are labor-intensive and prone to high false positive and negative rates. To address these limitations, we propose a novel approach that integrates human pose estimation with weapon appearance recognition using deep learning techniques. Unlike prior studies that focus on either body pose estimation or firearm detection in isolation, our method jointly analyzes posture and weapon presence to enhance detection accuracy in real-world, dynamic environments. To train our model, we curated a diverse dataset comprising images from open-source repositories such as IMFDB and Monash Guns, supplemented with AI-generated and manually collected images from web sources. This dataset ensures robust generalization and realistic performance evaluation under various surveillance conditions. Our research aims to improve the precision and reliability of firearm detection systems, contributing to enhanced public safety and threat mitigation in high-risk areas.

Title: TFHE-Coder: Evaluating LLM-agentic Fully Homomorphic Encryption Code Generation

Authors: Mayank Kumar, Jiaqi Xue, Mengxin Zheng, Qian Lou
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.12217
Pdf URL: https://arxiv.org/pdf/2503.12217
Copy Paste: [[2503.12217]] TFHE-Coder: Evaluating LLM-agentic Fully Homomorphic Encryption Code Generation(https://arxiv.org/abs/2503.12217)
Keywords: secure, privacy
Abstract: Fully Homomorphic Encryption over the torus (TFHE) enables computation on encrypted data without decryption, making it a cornerstone of secure and confidential computing. Despite its potential in privacy preserving machine learning, secure multi party computation, private blockchain transactions, and secure medical diagnostics, its adoption remains limited due to cryptographic complexity and usability challenges. While various TFHE libraries and compilers exist, practical code generation remains a hurdle. We propose a compiler integrated framework to evaluate LLM inference and agentic optimization for TFHE code generation, focusing on logic gates and ReLU activation. Our methodology assesses error rates, compilability, and structural similarity across open and closedsource LLMs. Results highlight significant limitations in off-the-shelf models, while agentic optimizations such as retrieval augmented generation (RAG) and few-shot prompting reduce errors and enhance code fidelity. This work establishes the first benchmark for TFHE code generation, demonstrating how LLMs, when augmented with domain-specific feedback, can bridge the expertise gap in FHE code generation.

Title: Adaptive Label Correction for Robust Medical Image Segmentation with Noisy Labels

Authors: Chengxuan Qian, Kai Han, Siqi Ma, Chongwen Lyu, Zhenlong Yuan, Jun Chen, Zhe Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12218
Pdf URL: https://arxiv.org/pdf/2503.12218
Copy Paste: [[2503.12218]] Adaptive Label Correction for Robust Medical Image Segmentation with Noisy Labels(https://arxiv.org/abs/2503.12218)
Keywords: robust, segmentation
Abstract: Deep learning has shown remarkable success in medical image analysis, but its reliance on large volumes of high-quality labeled data limits its applicability. While noisy labeled data are easier to obtain, directly incorporating them into training can degrade model performance. To address this challenge, we propose a Mean Teacher-based Adaptive Label Correction (ALC) self-ensemble framework for robust medical image segmentation with noisy labels. The framework leverages the Mean Teacher architecture to ensure consistent learning under noise perturbations. It includes an adaptive label refinement mechanism that dynamically captures and weights differences across multiple disturbance versions to enhance the quality of noisy labels. Additionally, a sample-level uncertainty-based label selection algorithm is introduced to prioritize high-confidence samples for network updates, mitigating the impact of noisy annotations. Consistency learning is integrated to align the predictions of the student and teacher networks, further enhancing model robustness. Extensive experiments on two public datasets demonstrate the effectiveness of the proposed framework, showing significant improvements in segmentation performance. By fully exploiting the strengths of the Mean Teacher structure, the ALC framework effectively processes noisy labels, adapts to challenging scenarios, and achieves competitive results compared to state-of-the-art methods.

Title: A Bubble-Cluster Federated Learning Framework for Privacy-Preserving Demand Forecasting on Heterogeneous Retail Data

Authors: Yunbo Long, Liming Xu, Ge Zheng, Alexandra Brintrup
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2503.12220
Pdf URL: https://arxiv.org/pdf/2503.12220
Copy Paste: [[2503.12220]] A Bubble-Cluster Federated Learning Framework for Privacy-Preserving Demand Forecasting on Heterogeneous Retail Data(https://arxiv.org/abs/2503.12220)
Keywords: secure, privacy, robust, federate, transformer
Abstract: Federated learning (FL) enables retailers to share model parameters for demand forecasting while maintaining privacy. However, heterogeneous data across diverse regions, driven by factors such as varying consumer behavior, poses challenges to the effectiveness of federated learning. To tackle this challenge, we propose Bubble-Cluster Federated Learning (BFL), a novel clustering-based federated learning framework tailored for sales prediction. By leveraging differential privacy and feature importance distribution, BFL groups retailers into distinct "bubbles", each forming its own federated learning (FL) system to effectively isolate data heterogeneity. Within each bubble, Transformer models are designed to predict local sales for each client. Our experiments demonstrate that BFL significantly surpasses FedAvg and outperforms local learning in demand forecasting performance across all participating clients. Compared to local learning, BFL can achieve a 5.4\% improvement in R\textsuperscript{2}, a 69\% reduction in RMSE, and a 45\% decrease in MAE. Our study highlights BFL's adaptability in enabling effective federated learning through dynamic adjustments to noise levels and the range of clients participating in each bubble. This approach strategically groups participants into distinct "bubbles" while proactively identifying and filtering out risky clients that could compromise the FL system. The findings demonstrate BFL's ability to enhance collaborative learning in regression tasks on heterogeneous data, achieving a balance between forecasting accuracy and privacy preservation in retail applications. Additionally, BFL's capability to detect and neutralize poisoned data from clients enhances the system's robustness and reliability, ensuring more secure and effective federated learning.

Title: Interpretation Gaps in LLM-Assisted Comprehension of Privacy Documents

Authors: Rinku Dewri
Subjects: cs.CL, cs.CY, cs.IR
Abstract URL: https://arxiv.org/abs/2503.12225
Pdf URL: https://arxiv.org/pdf/2503.12225
Copy Paste: [[2503.12225]] Interpretation Gaps in LLM-Assisted Comprehension of Privacy Documents(https://arxiv.org/abs/2503.12225)
Keywords: privacy, large language model
Abstract: This article explores the gaps that can manifest when using a large language model (LLM) to obtain simplified interpretations of data practices from a complex privacy policy. We exemplify these gaps to showcase issues in accuracy, completeness, clarity and representation, while advocating for continued research to realize an LLM's true potential in revolutionizing privacy management through personal assistants and automated compliance checking.

Title: Research on Large Language Model Cross-Cloud Privacy Protection and Collaborative Training based on Federated Learning

Authors: Ze Yang, Yihong Jin, Yihan Zhang, Juntian Liu, Xinhe Xu
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12226
Pdf URL: https://arxiv.org/pdf/2503.12226
Copy Paste: [[2503.12226]] Research on Large Language Model Cross-Cloud Privacy Protection and Collaborative Training based on Federated Learning(https://arxiv.org/abs/2503.12226)
Keywords: security, privacy, protect, federate, large language model
Abstract: The fast development of large language models (LLMs) and popularization of cloud computing have led to increasing concerns on privacy safeguarding and data security of cross-cloud model deployment and training as the key challenges. We present a new framework for addressing these issues along with enabling privacy preserving collaboration on training between distributed clouds based on federated learning. Our mechanism encompasses cutting-edge cryptographic primitives, dynamic model aggregation techniques, and cross-cloud data harmonization solutions to enhance security, efficiency, and scalability to the traditional federated learning paradigm. Furthermore, we proposed a hybrid aggregation scheme to mitigate the threat of Data Leakage and to optimize the aggregation of model updates, thus achieving substantial enhancement on the model effectiveness and stability. Experimental results demonstrate that the training efficiency, privacy protection, and model accuracy of the proposed model compare favorably to those of the traditional federated learning method.

Title: LIAM: Multimodal Transformer for Language Instructions, Images, Actions and Semantic Maps

Authors: Yihao Wang, Raphael Memmesheimer, Sven Behnke
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2503.12230
Pdf URL: https://arxiv.org/pdf/2503.12230
Copy Paste: [[2503.12230]] LIAM: Multimodal Transformer for Language Instructions, Images, Actions and Semantic Maps(https://arxiv.org/abs/2503.12230)
Keywords: transformer, large language model
Abstract: The availability of large language models and open-vocabulary object perception methods enables more flexibility for domestic service robots. The large variability of domestic tasks can be addressed without implementing each task individually by providing the robot with a task description along with appropriate environment information. In this work, we propose LIAM - an end-to-end model that predicts action transcripts based on language, image, action, and map inputs. Language and image inputs are encoded with a CLIP backbone, for which we designed two pre-training tasks to fine-tune its weights and pre-align the latent spaces. We evaluate our method on the ALFRED dataset, a simulator-generated benchmark for domestic tasks. Our results demonstrate the importance of pre-aligning embedding spaces from different modalities and the efficacy of incorporating semantic maps.

Title: From Laboratory to Real World: A New Benchmark Towards Privacy-Preserved Visible-Infrared Person Re-Identification

Authors: Yan Jiang, Hao Yu, Xu Cheng, Haoyu Chen, Zhaodong Sun, Guoying Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12232
Pdf URL: https://arxiv.org/pdf/2503.12232
Copy Paste: [[2503.12232]] From Laboratory to Real World: A New Benchmark Towards Privacy-Preserved Visible-Infrared Person Re-Identification(https://arxiv.org/abs/2503.12232)
Keywords: privacy, federate
Abstract: Aiming to match pedestrian images captured under varying lighting conditions, visible-infrared person re-identification (VI-ReID) has drawn intensive research attention and achieved promising results. However, in real-world surveillance contexts, data is distributed across multiple devices/entities, raising privacy and ownership concerns that make existing centralized training impractical for VI-ReID. To tackle these challenges, we propose L2RW, a benchmark that brings VI-ReID closer to real-world applications. The rationale of L2RW is that integrating decentralized training into VI-ReID can address privacy concerns in scenarios with limited data-sharing regulation. Specifically, we design protocols and corresponding algorithms for different privacy sensitivity levels. In our new benchmark, we ensure the model training is done in the conditions that: 1) data from each camera remains completely isolated, or 2) different data entities (e.g., data controllers of a certain region) can selectively share the data. In this way, we simulate scenarios with strict privacy constraints which is closer to real-world conditions. Intensive experiments with various server-side federated algorithms are conducted, showing the feasibility of decentralized VI-ReID training. Notably, when evaluated in unseen domains (i.e., new data entities), our L2RW, trained with isolated data (privacy-preserved), achieves performance comparable to SOTAs trained with shared data (privacy-unrestricted). We hope this work offers a novel research entry for deploying VI-ReID that fits real-world scenarios and can benefit the community.

Title: A Novel Double Pruning method for Imbalanced Data using Information Entropy and Roulette Wheel Selection for Breast Cancer Diagnosis

Authors: Soufiane Bacha, Huansheng Ning, Belarbi Mostefa, Doreen Sebastian Sarwatt, Sahraoui Dhelim
Subjects: cs.LG, cs.AI, cs.IT
Abstract URL: https://arxiv.org/abs/2503.12239
Pdf URL: https://arxiv.org/pdf/2503.12239
Copy Paste: [[2503.12239]] A Novel Double Pruning method for Imbalanced Data using Information Entropy and Roulette Wheel Selection for Breast Cancer Diagnosis(https://arxiv.org/abs/2503.12239)
Keywords: privacy
Abstract: Accurate illness diagnosis is vital for effective treatment and patient safety. Machine learning models are widely used for cancer diagnosis based on historical medical data. However, data imbalance remains a major challenge, leading to hindering classifier performance and reliability. The SMOTEBoost method addresses this issue by generating synthetic data to balance the dataset, but it may overlook crucial overlapping regions near the decision boundary and can produce noisy samples. This paper proposes RE-SMOTEBoost, an enhanced version of SMOTEBoost, designed to overcome these limitations. Firstly, RE-SMOTEBoost focuses on generating synthetic samples in overlapping regions to better capture the decision boundary using roulette wheel selection. Secondly, it incorporates a filtering mechanism based on information entropy to reduce noise, and borderline cases and improve the quality of generated data. Thirdly, we introduce a double regularization penalty to control the synthetic samples proximity to the decision boundary and avoid class overlap. These enhancements enable higher-quality oversampling of the minority class, resulting in a more balanced and effective training dataset. The proposed method outperforms existing state-of-the-art techniques when evaluated on imbalanced datasets. Compared to the top-performing sampling algorithms, RE-SMOTEBoost demonstrates a notable improvement of 3.22\% in accuracy and a variance reduction of 88.8\%. These results indicate that the proposed model offers a solid solution for medical settings, effectively overcoming data scarcity and severe imbalance caused by limited samples, data collection difficulties, and privacy constraints.

Title: RePerformer: Immersive Human-centric Volumetric Videos from Playback to Photoreal Reperformance

Authors: Yuheng Jiang, Zhehao Shen, Chengcheng Guo, Yu Hong, Zhuo Su, Yingliang Zhang, Marc Habermann, Lan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12242
Pdf URL: https://arxiv.org/pdf/2503.12242
Copy Paste: [[2503.12242]] RePerformer: Immersive Human-centric Volumetric Videos from Playback to Photoreal Reperformance(https://arxiv.org/abs/2503.12242)
Keywords: robust
Abstract: Human-centric volumetric videos offer immersive free-viewpoint experiences, yet existing methods focus either on replaying general dynamic scenes or animating human avatars, limiting their ability to re-perform general dynamic scenes. In this paper, we present RePerformer, a novel Gaussian-based representation that unifies playback and re-performance for high-fidelity human-centric volumetric videos. Specifically, we hierarchically disentangle the dynamic scenes into motion Gaussians and appearance Gaussians which are associated in the canonical space. We further employ a Morton-based parameterization to efficiently encode the appearance Gaussians into 2D position and attribute maps. For enhanced generalization, we adopt 2D CNNs to map position maps to attribute maps, which can be assembled into appearance Gaussians for high-fidelity rendering of the dynamic scenes. For re-performance, we develop a semantic-aware alignment module and apply deformation transfer on motion Gaussians, enabling photo-real rendering under novel motions. Extensive experiments validate the robustness and effectiveness of RePerformer, setting a new benchmark for playback-then-reperformance paradigm in human-centric volumetric videos.

Title: Electromagnetic Side-Channel Analysis of PRESENT Lightweight Cipher

Authors: Nilupulee A Gunathilake, Owen Lo, William J Buchanan, Ahmed Al-Dubai
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.12248
Pdf URL: https://arxiv.org/pdf/2503.12248
Copy Paste: [[2503.12248]] Electromagnetic Side-Channel Analysis of PRESENT Lightweight Cipher(https://arxiv.org/abs/2503.12248)
Keywords: protect, attack, robust
Abstract: Side-channel vulnerabilities pose an increasing threat to cryptographically protected devices. Consequently, it is crucial to observe information leakages through physical parameters such as power consumption and electromagnetic (EM) radiation to reduce susceptibility during interactions with cryptographic functions. EM side-channel attacks are becoming more prevalent. PRESENT is a promising lightweight cryptographic algorithm expected to be incorporated into Internet-of-Things (IoT) devices in the future. This research investigates the EM side-channel robustness of PRESENT using a correlation attack model. This work extends our previous Correlation EM Analysis (CEMA) of PRESENT with improved results. The attack targets the Substitution box (S-box) and can retrieve 8 bytes of the 10-byte encryption key with a minimum of 256 EM waveforms. This paper presents the process of EM attack modelling, encompassing both simple and correlation attacks, followed by a critical analysis.

Title: Cracking the PUMA Challenge in 24 Hours with CellViT++ and nnU-Net

Authors: Negar Shahamiri, Moritz Rempe, Lukas Heine, Jens Kleesiek, Fabian Hörst
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12269
Pdf URL: https://arxiv.org/pdf/2503.12269
Copy Paste: [[2503.12269]] Cracking the PUMA Challenge in 24 Hours with CellViT++ and nnU-Net(https://arxiv.org/abs/2503.12269)
Keywords: extraction, segmentation
Abstract: Automatic tissue segmentation and nuclei detection is an important task in pathology, aiding in biomarker extraction and discovery. The panoptic segmentation of nuclei and tissue in advanced melanoma (PUMA) challenge aims to improve tissue segmentation and nuclei detection in melanoma histopathology. Unlike many challenge submissions focusing on extensive model tuning, our approach emphasizes delivering a deployable solution within a 24-hour development timeframe, using out-of-the-box frameworks. The pipeline combines two models, namely CellViT++ for nuclei detection and nnU-Net for tissue segmentation. Our results demonstrate a significant improvement in tissue segmentation, achieving a Dice score of 0.750, surpassing the baseline score of 0.629. For nuclei detection, we obtained results comparable to the baseline in both challenge tracks. The code is publicly available at this https URL.

Title: Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection

Authors: Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Arsh Koneru, Yusuke Kato, Kazuki Kozuka, Aditya Grover
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12271
Pdf URL: https://arxiv.org/pdf/2503.12271
Copy Paste: [[2503.12271]] Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection(https://arxiv.org/abs/2503.12271)
Keywords: diffusion, transformer
Abstract: The predominant approach to advancing text-to-image generation has been training-time scaling, where larger models are trained on more data using greater computational resources. While effective, this approach is computationally expensive, leading to growing interest in inference-time scaling to improve performance. Currently, inference-time scaling for text-to-image diffusion models is largely limited to best-of-N sampling, where multiple images are generated per prompt and a selection model chooses the best output. Inspired by the recent success of reasoning models like DeepSeek-R1 in the language domain, we introduce an alternative to naive best-of-N sampling by equipping text-to-image Diffusion Transformers with in-context reflection capabilities. We propose Reflect-DiT, a method that enables Diffusion Transformers to refine their generations using in-context examples of previously generated images alongside textual feedback describing necessary improvements. Instead of passively relying on random sampling and hoping for a better result in a future generation, Reflect-DiT explicitly tailors its generations to address specific aspects requiring enhancement. Experimental results demonstrate that Reflect-DiT improves performance on the GenEval benchmark (+0.19) using SANA-1.0-1.6B as a base model. Additionally, it achieves a new state-of-the-art score of 0.81 on GenEval while generating only 20 samples per prompt, surpassing the previous best score of 0.80, which was obtained using a significantly larger model (SANA-1.5-4.8B) with 2048 samples under the best-of-N approach.

Title: Exploration of VLMs for Driver Monitoring Systems Applications

Authors: Paola Natalia Cañas, Marcos Nieto, Oihana Otaegui, Igor Rodríguez
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.12281
Pdf URL: https://arxiv.org/pdf/2503.12281
Copy Paste: [[2503.12281]] Exploration of VLMs for Driver Monitoring Systems Applications(https://arxiv.org/abs/2503.12281)
Keywords: large language model
Abstract: In recent years, we have witnessed significant progress in emerging deep learning models, particularly Large Language Models (LLMs) and Vision-Language Models (VLMs). These models have demonstrated promising results, indicating a new era of Artificial Intelligence (AI) that surpasses previous methodologies. Their extensive knowledge and zero-shot capabilities suggest a paradigm shift in developing deep learning solutions, moving from data capturing and algorithm training to just writing appropriate prompts. While the application of these technologies has been explored across various industries, including automotive, there is a notable gap in the scientific literature regarding their use in Driver Monitoring Systems (DMS). This paper presents our initial approach to implementing VLMs in this domain, utilising the Driver Monitoring Dataset to evaluate their performance and discussing their advantages and challenges when implemented in real-world scenarios.

Title: Toward Foundation Models for Online Complex Event Detection in CPS-IoT: A Case Study

Authors: Liying Han, Gaofeng Dong, Xiaomin Ouyang, Lance Kaplan, Federico Cerutti, Mani Srivastava
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12282
Pdf URL: https://arxiv.org/pdf/2503.12282
Copy Paste: [[2503.12282]] Toward Foundation Models for Online Complex Event Detection in CPS-IoT: A Case Study(https://arxiv.org/abs/2503.12282)
Keywords: large language model
Abstract: Complex events (CEs) play a crucial role in CPS-IoT applications, enabling high-level decision-making in domains such as smart monitoring and autonomous systems. However, most existing models focus on short-span perception tasks, lacking the long-term reasoning required for CE detection. CEs consist of sequences of short-time atomic events (AEs) governed by spatiotemporal dependencies. Detecting them is difficult due to long, noisy sensor data and the challenge of filtering out irrelevant AEs while capturing meaningful patterns. This work explores CE detection as a case study for CPS-IoT foundation models capable of long-term reasoning. We evaluate three approaches: (1) leveraging large language models (LLMs), (2) employing various neural architectures that learn CE rules from data, and (3) adopting a neurosymbolic approach that integrates neural models with symbolic engines embedding human knowledge. Our results show that the state-space model, Mamba, which belongs to the second category, outperforms all methods in accuracy and generalization to longer, unseen sensor traces. These findings suggest that state-space models could be a strong backbone for CPS-IoT foundation models for long-span reasoning tasks.

Title: Bi-Criteria Optimization for Combinatorial Bandits: Sublinear Regret and Constraint Violation under Bandit Feedback

Authors: Vaneet Aggarwal, Shweta Jain, Subham Pokhriyal, Christopher John Quinn
Subjects: cs.LG, cs.AI, cs.GT, eess.SY, stat.ML
Abstract URL: https://arxiv.org/abs/2503.12285
Pdf URL: https://arxiv.org/pdf/2503.12285
Copy Paste: [[2503.12285]] Bi-Criteria Optimization for Combinatorial Bandits: Sublinear Regret and Constraint Violation under Bandit Feedback(https://arxiv.org/abs/2503.12285)
Keywords: fair
Abstract: In this paper, we study bi-criteria optimization for combinatorial multi-armed bandits (CMAB) with bandit feedback. We propose a general framework that transforms discrete bi-criteria offline approximation algorithms into online algorithms with sublinear regret and cumulative constraint violation (CCV) guarantees. Our framework requires the offline algorithm to provide an $(\alpha, \beta)$-bi-criteria approximation ratio with $\delta$-resilience and utilize $\texttt{N}$ oracle calls to evaluate the objective and constraint functions. We prove that the proposed framework achieves sub-linear regret and CCV, with both bounds scaling as ${O}\left(\delta^{2/3} \texttt{N}^{1/3}T^{2/3}\log^{1/3}(T)\right)$. Crucially, the framework treats the offline algorithm with $\delta$-resilience as a black box, enabling flexible integration of existing approximation algorithms into the CMAB setting. To demonstrate its versatility, we apply our framework to several combinatorial problems, including submodular cover, submodular cost covering, and fair submodular maximization. These applications highlight the framework's broad utility in adapting offline guarantees to online bi-criteria optimization under bandit feedback.

Title: Integrating Chain-of-Thought and Retrieval Augmented Generation Enhances Rare Disease Diagnosis from Clinical Notes

Authors: Da Wu, Zhanliang Wang, Quan Nguyen, Kai Wang
Subjects: cs.CL, cs.AI, q-bio.GN, q-bio.QM
Abstract URL: https://arxiv.org/abs/2503.12286
Pdf URL: https://arxiv.org/pdf/2503.12286
Copy Paste: [[2503.12286]] Integrating Chain-of-Thought and Retrieval Augmented Generation Enhances Rare Disease Diagnosis from Clinical Notes(https://arxiv.org/abs/2503.12286)
Keywords: large language model
Abstract: Background: Several studies show that large language models (LLMs) struggle with phenotype-driven gene prioritization for rare diseases. These studies typically use Human Phenotype Ontology (HPO) terms to prompt foundation models like GPT and LLaMA to predict candidate genes. However, in real-world settings, foundation models are not optimized for domain-specific tasks like clinical diagnosis, yet inputs are unstructured clinical notes rather than standardized terms. How LLMs can be instructed to predict candidate genes or disease diagnosis from unstructured clinical notes remains a major challenge. Methods: We introduce RAG-driven CoT and CoT-driven RAG, two methods that combine Chain-of-Thought (CoT) and Retrieval Augmented Generation (RAG) to analyze clinical notes. A five-question CoT protocol mimics expert reasoning, while RAG retrieves data from sources like HPO and OMIM (Online Mendelian Inheritance in Man). We evaluated these approaches on rare disease datasets, including 5,980 Phenopacket-derived notes, 255 literature-based narratives, and 220 in-house clinical notes from Childrens Hospital of Philadelphia. Results: We found that recent foundations models, including Llama 3.3-70B-Instruct and DeepSeek-R1-Distill-Llama-70B, outperformed earlier versions such as Llama 2 and GPT-3.5. We also showed that RAG-driven CoT and CoT-driven RAG both outperform foundation models in candidate gene prioritization from clinical notes; in particular, both methods with DeepSeek backbone resulted in a top-10 gene accuracy of over 40% on Phenopacket-derived clinical notes. RAG-driven CoT works better for high-quality notes, where early retrieval can anchor the subsequent reasoning steps in domain-specific evidence, while CoT-driven RAG has advantage when processing lengthy and noisy notes.

Title: The Lucie-7B LLM and the Lucie Training Dataset: Open resources for multilingual language generation

Authors: Olivier Gouvert, Julie Hunter, Jérôme Louradour, Christophe Cerisara, Evan Dufraisse, Yaya Sy, Laura Rivière, Jean-Pierre Lorré, OpenLLM-France community
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12294
Pdf URL: https://arxiv.org/pdf/2503.12294
Copy Paste: [[2503.12294]] The Lucie-7B LLM and the Lucie Training Dataset: Open resources for multilingual language generation(https://arxiv.org/abs/2503.12294)
Keywords: large language model
Abstract: We present both the Lucie Training Dataset and the Lucie-7B foundation model. The Lucie Training Dataset is a multilingual collection of textual corpora centered around French and designed to offset anglo-centric biases found in many datasets for large language model pretraining. Its French data is pulled not only from traditional web sources, but also from French cultural heritage documents, filling an important gap in modern datasets. Beyond French, which makes up the largest share of the data, we added documents to support several other European languages, including English, Spanish, German, and Italian. Apart from its value as a resource for French language and culture, an important feature of this dataset is that it prioritizes data rights by minimizing copyrighted material. In addition, building on the philosophy of past open projects, it is redistributed in the form used for training and its processing is described on Hugging Face and GitHub. The Lucie-7B foundation model is trained on equal amounts of data in French and English -- roughly 33% each -- in an effort to better represent cultural aspects of French-speaking communities. We also describe two instruction fine-tuned models, Lucie-7B-Instruct-v1.1 and Lucie-7B-Instruct-human-data, which we release as demonstrations of Lucie-7B in use. These models achieve promising results compared to state-of-the-art models, demonstrating that an open approach prioritizing data rights can still deliver strong performance. We see these models as an initial step toward developing more performant, aligned models in the near future. Model weights for Lucie-7B and the Lucie instruct models, along with intermediate checkpoints for the former, are published on Hugging Face, while model training and data preparation code is available on GitHub. This makes Lucie-7B one of the first OSI compliant language models according to the new OSI definition.

Title: Towards Learning High-Precision Least Squares Algorithms with Sequence Models

Authors: Jerry Liu, Jessica Grogan, Owen Dugan, Ashish Rao, Simran Arora, Atri Rudra, Christopher Ré
Subjects: cs.LG, math.NA
Abstract URL: https://arxiv.org/abs/2503.12295
Pdf URL: https://arxiv.org/pdf/2503.12295
Copy Paste: [[2503.12295]] Towards Learning High-Precision Least Squares Algorithms with Sequence Models(https://arxiv.org/abs/2503.12295)
Keywords: transformer
Abstract: This paper investigates whether sequence models can learn to perform numerical algorithms, e.g. gradient descent, on the fundamental problem of least squares. Our goal is to inherit two properties of standard algorithms from numerical analysis: (1) machine precision, i.e. we want to obtain solutions that are accurate to near floating point error, and (2) numerical generality, i.e. we want them to apply broadly across problem instances. We find that prior approaches using Transformers fail to meet these criteria, and identify limitations present in existing architectures and training procedures. First, we show that softmax Transformers struggle to perform high-precision multiplications, which prevents them from precisely learning numerical algorithms. Second, we identify an alternate class of architectures, comprised entirely of polynomials, that can efficiently represent high-precision gradient descent iterates. Finally, we investigate precision bottlenecks during training and address them via a high-precision training recipe that reduces stochastic gradient noise. Our recipe enables us to train two polynomial architectures, gated convolutions and linear attention, to perform gradient descent iterates on least squares problems. For the first time, we demonstrate the ability to train to near machine precision. Applied iteratively, our models obtain 100,000x lower MSE than standard Transformers trained end-to-end and they incur a 10,000x smaller generalization gap on out-of-distribution problems. We make progress towards end-to-end learning of numerical algorithms for least squares.

Title: One Goal, Many Challenges: Robust Preference Optimization Amid Content-Aware and Multi-Source Noise

Authors: Amirabbas Afzali, Amirhossein Afsharrad, Seyed Shahabeddin Mousavi, Sanjay Lall
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.12301
Pdf URL: https://arxiv.org/pdf/2503.12301
Copy Paste: [[2503.12301]] One Goal, Many Challenges: Robust Preference Optimization Amid Content-Aware and Multi-Source Noise(https://arxiv.org/abs/2503.12301)
Keywords: attack, robust, large language model
Abstract: Large Language Models (LLMs) have made significant strides in generating human-like responses, largely due to preference alignment techniques. However, these methods often assume unbiased human feedback, which is rarely the case in real-world scenarios. This paper introduces Content-Aware Noise-Resilient Preference Optimization (CNRPO), a novel framework that addresses multiple sources of content-dependent noise in preference learning. CNRPO employs a multi-objective optimization approach to separate true preferences from content-aware noises, effectively mitigating their impact. We leverage backdoor attack mechanisms to efficiently learn and control various noise sources within a single model. Theoretical analysis and extensive experiments on different synthetic noisy datasets demonstrate that CNRPO significantly improves alignment with primary human preferences while controlling for secondary noises and biases, such as response length and harmfulness.

Title: Towards Self-Improving Systematic Cognition for Next-Generation Foundation MLLMs

Authors: Xiaoying Zhang, Da Peng, Yipeng Zhang, Zonghao Guo, Chengyue Wu, Chi Chen, Wei Ke, Helen Meng, Maosong Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12303
Pdf URL: https://arxiv.org/pdf/2503.12303
Copy Paste: [[2503.12303]] Towards Self-Improving Systematic Cognition for Next-Generation Foundation MLLMs(https://arxiv.org/abs/2503.12303)
Keywords: large language model
Abstract: Despite their impressive capabilities, Multimodal Large Language Models (MLLMs) face challenges with fine-grained perception and complex reasoning. Prevalent pre-training approaches focus on enhancing perception by training on high-quality image captions due to the extremely high cost of collecting chain-of-thought (CoT) reasoning data for improving reasoning. While leveraging advanced MLLMs for caption generation enhances scalability, the outputs often lack comprehensiveness and accuracy. In this paper, we introduce Self-Improving Cognition (SIcog), a self-learning framework designed to construct next-generation foundation MLLMs by enhancing their systematic cognitive capabilities through multimodal pre-training with self-generated data. Specifically, we propose chain-of-description, an approach that improves an MLLM's systematic perception by enabling step-by-step visual understanding, ensuring greater comprehensiveness and accuracy. Additionally, we adopt a structured CoT reasoning technique to enable MLLMs to integrate in-depth multimodal reasoning. To construct a next-generation foundation MLLM with self-improved cognition, SIcog first equips an MLLM with systematic perception and reasoning abilities using minimal external annotations. The enhanced models then generate detailed captions and CoT reasoning data, which are further curated through self-consistency. This curated data is ultimately used to refine the MLLM during multimodal pre-training, facilitating next-generation foundation MLLM construction. Extensive experiments on both low- and high-resolution MLLMs across diverse benchmarks demonstrate that, with merely 213K self-generated pre-training samples, SIcog produces next-generation foundation MLLMs with significantly improved cognition, achieving benchmark-leading performance compared to prevalent pre-training approaches.

Title: Empirical Privacy Variance

Authors: Yuzheng Hu, Fan Wu, Ruicheng Xian, Yuhang Liu, Lydia Zakynthinou, Pritish Kamath, Chiyuan Zhang, David Forsyth
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2503.12314
Pdf URL: https://arxiv.org/pdf/2503.12314
Copy Paste: [[2503.12314]] Empirical Privacy Variance(https://arxiv.org/abs/2503.12314)
Keywords: privacy
Abstract: We propose the notion of empirical privacy variance and study it in the context of differentially private fine-tuning of language models. Specifically, we show that models calibrated to the same $(\varepsilon, \delta)$-DP guarantee using DP-SGD with different hyperparameter configurations can exhibit significant variations in empirical privacy, which we quantify through the lens of memorization. We investigate the generality of this phenomenon across multiple dimensions and discuss why it is surprising and relevant. Through regression analysis, we examine how individual and composite hyperparameters influence empirical privacy. The results reveal a no-free-lunch trade-off: existing practices of hyperparameter tuning in DP-SGD, which focus on optimizing utility under a fixed privacy budget, often come at the expense of empirical privacy. To address this, we propose refined heuristics for hyperparameter selection that explicitly account for empirical privacy, showing that they are both precise and practically useful. Finally, we take preliminary steps to understand empirical privacy variance. We propose two hypotheses, identify limitations in existing techniques like privacy auditing, and outline open questions for future research.

Title: Leveraging Vision Capabilities of Multimodal LLMs for Automated Data Extraction from Plots

Authors: Maciej P. Polak, Dane Morgan
Subjects: cs.CV, cond-mat.mtrl-sci, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12326
Pdf URL: https://arxiv.org/pdf/2503.12326
Copy Paste: [[2503.12326]] Leveraging Vision Capabilities of Multimodal LLMs for Automated Data Extraction from Plots(https://arxiv.org/abs/2503.12326)
Keywords: extraction, large language model
Abstract: Automated data extraction from research texts has been steadily improving, with the emergence of large language models (LLMs) accelerating progress even further. Extracting data from plots in research papers, however, has been such a complex task that it has predominantly been confined to manual data extraction. We show that current multimodal large language models, with proper instructions and engineered workflows, are capable of accurately extracting data from plots. This capability is inherent to the pretrained models and can be achieved with a chain-of-thought sequence of zero-shot engineered prompts we call PlotExtract, without the need to fine-tune. We demonstrate PlotExtract here and assess its performance on synthetic and published plots. We consider only plots with two axes in this analysis. For plots identified as extractable, PlotExtract finds points with over 90% precision (and around 90% recall) and errors in x and y position of around 5% or lower. These results prove that multimodal LLMs are a viable path for high-throughput data extraction for plots and in many circumstances can replace the current manual methods of data extraction.

Title: CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era

Authors: Kanzhi Cheng, Wenpo Song, Jiaxin Fan, Zheng Ma, Qiushi Sun, Fangzhi Xu, Chenyang Yan, Nuo Chen, Jianbing Zhang, Jiajun Chen
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2503.12329
Pdf URL: https://arxiv.org/pdf/2503.12329
Copy Paste: [[2503.12329]] CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era(https://arxiv.org/abs/2503.12329)
Keywords: robust
Abstract: Image captioning has been a longstanding challenge in vision-language research. With the rise of LLMs, modern Vision-Language Models (VLMs) generate detailed and comprehensive image descriptions. However, benchmarking the quality of such captions remains unresolved. This paper addresses two key questions: (1) How well do current VLMs actually perform on image captioning, particularly compared to humans? We built CapArena, a platform with over 6000 pairwise caption battles and high-quality human preference votes. Our arena-style evaluation marks a milestone, showing that leading models like GPT-4o achieve or even surpass human performance, while most open-source models lag behind. (2) Can automated metrics reliably assess detailed caption quality? Using human annotations from CapArena, we evaluate traditional and recent captioning metrics, as well as VLM-as-a-Judge. Our analysis reveals that while some metrics (e.g., METEOR) show decent caption-level agreement with humans, their systematic biases lead to inconsistencies in model ranking. In contrast, VLM-as-a-Judge demonstrates robust discernment at both the caption and model levels. Building on these insights, we release CapArena-Auto, an accurate and efficient automated benchmark for detailed captioning, achieving 94.3% correlation with human rankings at just $4 per test. Data and resources will be open-sourced at this https URL.

Title: VideoMAP: Toward Scalable Mamba-based Video Autoregressive Pretraining

Authors: Yunze Liu, Peiran Wu, Cheng Liang, Junxiao Shen, Limin Wang, Li Yi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12332
Pdf URL: https://arxiv.org/pdf/2503.12332
Copy Paste: [[2503.12332]] VideoMAP: Toward Scalable Mamba-based Video Autoregressive Pretraining(https://arxiv.org/abs/2503.12332)
Keywords: transformer, large language model
Abstract: Recent Mamba-based architectures for video understanding demonstrate promising computational efficiency and competitive performance, yet struggle with overfitting issues that hinder their scalability. To overcome this challenge, we introduce VideoMAP, a Hybrid Mamba-Transformer framework featuring a novel pre-training approach. VideoMAP uses a 4:1 Mamba-to-Transformer ratio, effectively balancing computational cost and model capacity. This architecture, combined with our proposed frame-wise masked autoregressive pre-training strategy, delivers significant performance gains when scaling to larger models. Additionally, VideoMAP exhibits impressive sample efficiency, significantly outperforming existing methods with less training data. Experiments show that VideoMAP outperforms existing models across various datasets, including Kinetics-400, Something-Something V2, Breakfast, and COIN. Furthermore, we demonstrate the potential of VideoMAP as a visual encoder for multimodal large language models, highlighting its ability to reduce memory usage and enable the processing of longer video sequences. The code is open-source at this https URL

Title: GS-3I: Gaussian Splatting for Surface Reconstruction from Illumination-Inconsistent Images

Authors: Tengfei Wang, Yongmao Hou, Zhaoning Zhang, Yiwei Xu, Zongqian Zhan, Xin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12335
Pdf URL: https://arxiv.org/pdf/2503.12335
Copy Paste: [[2503.12335]] GS-3I: Gaussian Splatting for Surface Reconstruction from Illumination-Inconsistent Images(https://arxiv.org/abs/2503.12335)
Keywords: robust
Abstract: Accurate geometric surface reconstruction, providing essential environmental information for navigation and manipulation tasks, is critical for enabling robotic self-exploration and interaction. Recently, 3D Gaussian Splatting (3DGS) has gained significant attention in the field of surface reconstruction due to its impressive geometric quality and computational efficiency. While recent relevant advancements in novel view synthesis under inconsistent illumination using 3DGS have shown promise, the challenge of robust surface reconstruction under such conditions is still being explored. To address this challenge, we propose a method called GS-3I. Specifically, to mitigate 3D Gaussian optimization bias caused by underexposed regions in single-view images, based on Convolutional Neural Network (CNN), a tone mapping correction framework is introduced. Furthermore, inconsistent lighting across multi-view images, resulting from variations in camera settings and complex scene illumination, often leads to geometric constraint mismatches and deviations in the reconstructed surface. To overcome this, we propose a normal compensation mechanism that integrates reference normals extracted from single-view image with normals computed from multi-view observations to effectively constrain geometric inconsistencies. Extensive experimental evaluations demonstrate that GS-3I can achieve robust and accurate surface reconstruction across complex illumination scenarios, highlighting its effectiveness and versatility in this critical challenge. this https URL

Title: Augmented Adversarial Trigger Learning

Authors: Zhe Wang, Yanjun Qi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12339
Pdf URL: https://arxiv.org/pdf/2503.12339
Copy Paste: [[2503.12339]] Augmented Adversarial Trigger Learning(https://arxiv.org/abs/2503.12339)
Keywords: attack
Abstract: Gradient optimization-based adversarial attack methods automate the learning of adversarial triggers to generate jailbreak prompts or leak system prompts. In this work, we take a closer look at the optimization objective of adversarial trigger learning and propose ATLA: Adversarial Trigger Learning with Augmented objectives. ATLA improves the negative log-likelihood loss used by previous studies into a weighted loss formulation that encourages the learned adversarial triggers to optimize more towards response format tokens. This enables ATLA to learn an adversarial trigger from just one query-response pair and the learned trigger generalizes well to other similar queries. We further design a variation to augment trigger optimization with an auxiliary loss that suppresses evasive responses. We showcase how to use ATLA to learn adversarial suffixes jailbreaking LLMs and to extract hidden system prompts. Empirically we demonstrate that ATLA consistently outperforms current state-of-the-art techniques, achieving nearly 100% success in attacking while requiring 80% fewer queries. ATLA learned jailbreak suffixes demonstrate high generalization to unseen queries and transfer well to new LLMs.

Title: SVD-LLM V2: Optimizing Singular Value Truncation for Large Language Model Compression

Authors: Xin Wang, Samiul Alam, Zhongwei Wan, Hui Shen, Mi Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12340
Pdf URL: https://arxiv.org/pdf/2503.12340
Copy Paste: [[2503.12340]] SVD-LLM V2: Optimizing Singular Value Truncation for Large Language Model Compression(https://arxiv.org/abs/2503.12340)
Keywords: large language model
Abstract: Despite significant advancements, the practical deployment of Large Language Models (LLMs) is often hampered by their immense sizes, highlighting the need for effective compression techniques. Singular Value Decomposition (SVD) is a promising LLM compression technique. However, existing SVD-based compression methods fall short in reducing truncation losses, leading to less competitive performance in compressed models. In this work, we introduce SVD-LLM V2, a SVD-based LLM compression method that optimizes singular value truncation in SVD compression with two techniques. First, SVD-LLM V2 proposes to use theoretical truncation loss of weight matrices to assign a unique compression ratio to each weight matrix at different layers to accommodate weight redundancy heterogeneity. Second, SVD-LLM V2 proposes loss-optimized weight truncation to ensure that the truncated singular values result in a lower and more stable truncation loss in practice. We evaluate SVD-LLM V2 on ten datasets and five LLMs at various scales. Our results show SVD-LLM V2 outperforms state-of-the-art SVD-based LLM compression methods. Our code is available at this https URL

Title: EXPRESS: An LLM-Generated Explainable Property Valuation System with Neighbor Imputation

Authors: Wei-Wei Du, Yung-Chien Wang, Wen-Chih Peng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.12344
Pdf URL: https://arxiv.org/pdf/2503.12344
Copy Paste: [[2503.12344]] EXPRESS: An LLM-Generated Explainable Property Valuation System with Neighbor Imputation(https://arxiv.org/abs/2503.12344)
Keywords: interpretability
Abstract: The demand for property valuation has attracted significant attention from sellers, buyers, and customers applying for loans. Reviews of existing approaches have revealed shortcomings in terms of not being able to handle missing value situations, as well as lacking interpretability, which means they cannot be used in real-world applications. To address these challenges, we propose an LLM-Generated EXplainable PRopErty valuation SyStem with neighbor imputation called EXPRESS, which provides the customizable missing value imputation technique, and addresses the opaqueness of prediction by providing the feature-wise explanation generated by LLM. The dynamic nearest neighbor search finds similar properties depending on different application scenarios by property configuration set by users (e.g., house age as criteria for the house in rural areas, and locations for buildings in urban areas). Motivated by the human appraisal procedure, we generate feature-wise explanations to provide users with a more intuitive understanding of the prediction results.

Title: General Table Question Answering via Answer-Formula Joint Generation

Authors: Zhongyuan Wang, Richong Zhang, Zhijie Nie
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12345
Pdf URL: https://arxiv.org/pdf/2503.12345
Copy Paste: [[2503.12345]] General Table Question Answering via Answer-Formula Joint Generation(https://arxiv.org/abs/2503.12345)
Keywords: large language model
Abstract: Advanced table question answering (TableQA) methods prompt large language models (LLMs) to generate answer text, SQL query, Python code, or custom operations, which impressively improve the complex reasoning problems in the TableQA task. However, these methods lack the versatility to cope with specific question types or table structures. In contrast, the Spreadsheet Formula, the widely-used and well-defined operation language for tabular data, has not been thoroughly explored to solve TableQA. In this paper, we first attempt to use Formula as the logical form for solving complex reasoning on the tables with different structures. Specifically, we construct a large Formula-annotated TableQA dataset \texttt{FromulaQA} from existing datasets. In addition, we propose \texttt{TabAF}, a general table answering framework to solve multiple types of tasks over multiple types of tables simultaneously. Unlike existing methods, \texttt{TabAF} decodes answers and Formulas with a single LLM backbone, demonstrating great versatility and generalization. \texttt{TabAF} based on Llama3.1-70B achieves new state-of-the-art performance on the WikiTableQuestion, HiTab and TabFact.

Title: Synthesizing Privacy-Preserving Text Data via Finetuning without Finetuning Billion-Scale LLMs

Authors: Bowen Tan, Zheng Xu, Eric Xing, Zhiting Hu, Shanshan Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12347
Pdf URL: https://arxiv.org/pdf/2503.12347
Copy Paste: [[2503.12347]] Synthesizing Privacy-Preserving Text Data via Finetuning without Finetuning Billion-Scale LLMs(https://arxiv.org/abs/2503.12347)
Keywords: privacy, large language model
Abstract: Synthetic data offers a promising path to train models while preserving data privacy. Differentially private (DP) finetuning of large language models (LLMs) as data generator is effective, but is impractical when computation resources are limited. Meanwhile, prompt-based methods such as private evolution, depend heavily on the manual prompts, and ineffectively use private information in their iterative data selection process. To overcome these limitations, we propose CTCL (Data Synthesis with ConTrollability and CLustering), a novel framework for generating privacy-preserving synthetic data without extensive prompt engineering or billion-scale LLM finetuning. CTCL pretrains a lightweight 140M conditional generator and a clustering-based topic model on large-scale public data. To further adapt to the private domain, the generator is DP finetuned on private data for fine-grained textual information, while the topic model extracts a DP histogram representing distributional information. The DP generator then samples according to the DP histogram to synthesize a desired number of data examples. Evaluation across five diverse domains demonstrates the effectiveness of our framework, particularly in the strong privacy regime. Systematic ablation validates the design of each framework component and highlights the scalability of our approach.

Title: ProbDiffFlow: An Efficient Learning-Free Framework for Probabilistic Single-Image Optical Flow Estimation

Authors: Mo Zhou, Jianwei Wang, Xuanmeng Zhang, Dylan Campbell, Kai Wang, Long Yuan, Wenjie Zhang, Xuemin Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12348
Pdf URL: https://arxiv.org/pdf/2503.12348
Copy Paste: [[2503.12348]] ProbDiffFlow: An Efficient Learning-Free Framework for Probabilistic Single-Image Optical Flow Estimation(https://arxiv.org/abs/2503.12348)
Keywords: diffusion
Abstract: This paper studies optical flow estimation, a critical task in motion analysis with applications in autonomous navigation, action recognition, and film production. Traditional optical flow methods require consecutive frames, which are often unavailable due to limitations in data acquisition or real-world scene disruptions. Thus, single-frame optical flow estimation is emerging in the literature. However, existing single-frame approaches suffer from two major limitations: (1) they rely on labeled training data, making them task-specific, and (2) they produce deterministic predictions, failing to capture motion uncertainty. To overcome these challenges, we propose ProbDiffFlow, a training-free framework that estimates optical flow distributions from a single image. Instead of directly predicting motion, ProbDiffFlow follows an estimation-by-synthesis paradigm: it first generates diverse plausible future frames using a diffusion-based model, then estimates motion from these synthesized samples using a pre-trained optical flow model, and finally aggregates the results into a probabilistic flow distribution. This design eliminates the need for task-specific training while capturing multiple plausible motions. Experiments on both synthetic and real-world datasets demonstrate that ProbDiffFlow achieves superior accuracy, diversity, and efficiency, outperforming existing single-image and two-frame baselines.

Title: ResLPR: A LiDAR Data Restoration Network and Benchmark for Robust Place Recognition Against Weather Corruptions

Authors: Wenqing Kuang (1), Xiongwei Zhao (2), Yehui Shen (1), Congcong Wen (3), Huimin Lu (1), Zongtan Zhou (1), Xieyuanli Chen (1) ((1) National University of Defense Technology, (2) Harbin Institute of Technology, (3) New York University Abu Dhabi)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12350
Pdf URL: https://arxiv.org/pdf/2503.12350
Copy Paste: [[2503.12350]] ResLPR: A LiDAR Data Restoration Network and Benchmark for Robust Place Recognition Against Weather Corruptions(https://arxiv.org/abs/2503.12350)
Keywords: robust
Abstract: LiDAR-based place recognition (LPR) is a key component for autonomous driving, and its resilience to environmental corruption is critical for safety in high-stakes applications. While state-of-the-art (SOTA) LPR methods perform well in clean weather, they still struggle with weather-induced corruption commonly encountered in driving scenarios. To tackle this, we propose ResLPRNet, a novel LiDAR data restoration network that largely enhances LPR performance under adverse weather by restoring corrupted LiDAR scans using a wavelet transform-based network. ResLPRNet is efficient, lightweight and can be integrated plug-and-play with pretrained LPR models without substantial additional computational cost. Given the lack of LPR datasets under adverse weather, we introduce ResLPR, a novel benchmark that examines SOTA LPR methods under a wide range of LiDAR distortions induced by severe snow, fog, and rain conditions. Experiments on our proposed WeatherKITTI and WeatherNCLT datasets demonstrate the resilience and notable gains achieved by using our restoration method with multiple LPR approaches in challenging weather scenarios. Our code and benchmark are publicly available here: this https URL.

Title: Probabilistic Neural Networks (PNNs) with t-Distributed Outputs: Adaptive Prediction Intervals Beyond Gaussian Assumptions

Authors: Farhad Pourkamali-Anaraki
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.12354
Pdf URL: https://arxiv.org/pdf/2503.12354
Copy Paste: [[2503.12354]] Probabilistic Neural Networks (PNNs) with t-Distributed Outputs: Adaptive Prediction Intervals Beyond Gaussian Assumptions(https://arxiv.org/abs/2503.12354)
Keywords: robust
Abstract: Traditional neural network regression models provide only point estimates, failing to capture predictive uncertainty. Probabilistic neural networks (PNNs) address this limitation by producing output distributions, enabling the construction of prediction intervals. However, the common assumption of Gaussian output distributions often results in overly wide intervals, particularly in the presence of outliers or deviations from normality. To enhance the adaptability of PNNs, we propose t-Distributed Neural Networks (TDistNNs), which generate t-distributed outputs, parameterized by location, scale, and degrees of freedom. The degrees of freedom parameter allows TDistNNs to model heavy-tailed predictive distributions, improving robustness to non-Gaussian data and enabling more adaptive uncertainty quantification. We develop a novel loss function tailored for the t-distribution and derive efficient gradient computations for seamless integration into deep learning frameworks. Empirical evaluations on synthetic and real-world data demonstrate that TDistNNs improve the balance between coverage and interval width. Notably, for identical architectures, TDistNNs consistently produce narrower prediction intervals than Gaussian-based PNNs while maintaining proper coverage. This work contributes a flexible framework for uncertainty estimation in neural networks tasked with regression, particularly suited to settings involving complex output distributions.

Title: Localized Concept Erasure for Text-to-Image Diffusion Models Using Training-Free Gated Low-Rank Adaptation

Authors: Byung Hyun Lee, Sungjin Lim, Se Young Chun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12356
Pdf URL: https://arxiv.org/pdf/2503.12356
Copy Paste: [[2503.12356]] Localized Concept Erasure for Text-to-Image Diffusion Models Using Training-Free Gated Low-Rank Adaptation(https://arxiv.org/abs/2503.12356)
Keywords: robust, diffusion
Abstract: Fine-tuning based concept erasing has demonstrated promising results in preventing generation of harmful contents from text-to-image diffusion models by removing target concepts while preserving remaining concepts. To maintain the generation capability of diffusion models after concept erasure, it is necessary to remove only the image region containing the target concept when it locally appears in an image, leaving other regions intact. However, prior arts often compromise fidelity of the other image regions in order to erase the localized target concept appearing in a specific area, thereby reducing the overall performance of image generation. To address these limitations, we first introduce a framework called localized concept erasure, which allows for the deletion of only the specific area containing the target concept in the image while preserving the other regions. As a solution for the localized concept erasure, we propose a training-free approach, dubbed Gated Low-rank adaptation for Concept Erasure (GLoCE), that injects a lightweight module into the diffusion model. GLoCE consists of low-rank matrices and a simple gate, determined only by several generation steps for concepts without training. By directly applying GLoCE to image embeddings and designing the gate to activate only for target concepts, GLoCE can selectively remove only the region of the target concepts, even when target and remaining concepts coexist within an image. Extensive experiments demonstrated GLoCE not only improves the image fidelity to text prompts after erasing the localized target concepts, but also outperforms prior arts in efficacy, specificity, and robustness by large margin and can be extended to mass concept erasure.

Title: ASD Classification on Dynamic Brain Connectome using Temporal Random Walk with Transformer-based Dynamic Network Embedding

Authors: Suchanuch Piriyasatit, Chaohao Yuan, Ercan Engin Kuruoglu
Subjects: cs.LG, q-bio.NC
Abstract URL: https://arxiv.org/abs/2503.12366
Pdf URL: https://arxiv.org/pdf/2503.12366
Copy Paste: [[2503.12366]] ASD Classification on Dynamic Brain Connectome using Temporal Random Walk with Transformer-based Dynamic Network Embedding(https://arxiv.org/abs/2503.12366)
Keywords: transformer
Abstract: Autism Spectrum Disorder (ASD) is a complex neurological condition characterized by varied developmental impairments, especially in communication and social interaction. Accurate and early diagnosis of ASD is crucial for effective intervention, which is enhanced by richer representations of brain activity. The brain functional connectome, which refers to the statistical relationships between different brain regions measured through neuroimaging, provides crucial insights into brain function. Traditional static methods often fail to capture the dynamic nature of brain activity, in contrast, dynamic brain connectome analysis provides a more comprehensive view by capturing the temporal variations in the brain. We propose BrainTWT, a novel dynamic network embedding approach that captures temporal evolution of the brain connectivity over time and considers also the dynamics between different temporal network snapshots. BrainTWT employs temporal random walks to capture dynamics across different temporal network snapshots and leverages the Transformer's ability to model long term dependencies in sequential data to learn the discriminative embeddings from these temporal sequences using temporal structure prediction tasks. The experimental evaluation, utilizing the Autism Brain Imaging Data Exchange (ABIDE) dataset, demonstrates that BrainTWT outperforms baseline methods in ASD classification.

Title: L2COcc: Lightweight Camera-Centric Semantic Scene Completion via Distillation of LiDAR Model

Authors: Ruoyu Wang, Yukai Ma, Yi Yao, Sheng Tao, Haoang Li, Zongzhi Zhu, Yong Liu, Xingxing Zuo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12369
Pdf URL: https://arxiv.org/pdf/2503.12369
Copy Paste: [[2503.12369]] L2COcc: Lightweight Camera-Centric Semantic Scene Completion via Distillation of LiDAR Model(https://arxiv.org/abs/2503.12369)
Keywords: transformer
Abstract: Semantic Scene Completion (SSC) constitutes a pivotal element in autonomous driving perception systems, tasked with inferring the 3D semantic occupancy of a scene from sensory data. To improve accuracy, prior research has implemented various computationally demanding and memory-intensive 3D operations, imposing significant computational requirements on the platform during training and testing. This paper proposes L2COcc, a lightweight camera-centric SSC framework that also accommodates LiDAR inputs. With our proposed efficient voxel transformer (EVT) and cross-modal knowledge modules, including feature similarity distillation (FSD), TPV distillation (TPVD) and prediction alignment distillation (PAD), our method substantially reduce computational burden while maintaining high accuracy. The experimental evaluations demonstrate that our proposed method surpasses the current state-of-the-art vision-based SSC methods regarding accuracy on both the SemanticKITTI and SSCBench-KITTI-360 benchmarks, respectively. Additionally, our method is more lightweight, exhibiting a reduction in both memory consumption and inference time by over 23% compared to the current state-of-the-arts method. Code is available at our project page:this https URL.

Title: Deepfake Detection with Optimized Hybrid Model: EAR Biometric Descriptor via Improved RCNN

Authors: Ruchika Sharma, Rudresh Dwivedi
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2503.12381
Pdf URL: https://arxiv.org/pdf/2503.12381
Copy Paste: [[2503.12381]] Deepfake Detection with Optimized Hybrid Model: EAR Biometric Descriptor via Improved RCNN(https://arxiv.org/abs/2503.12381)
Keywords: robust, biometric
Abstract: Deepfake is a widely used technology employed in recent years to create pernicious content such as fake news, movies, and rumors by altering and substituting facial information from various sources. Given the ongoing evolution of deepfakes investigation of continuous identification and prevention is crucial. Due to recent technological advancements in AI (Artificial Intelligence) distinguishing deepfakes and artificially altered images has become challenging. This approach introduces the robust detection of subtle ear movements and shape changes to generate ear descriptors. Further, we also propose a novel optimized hybrid deepfake detection model that considers the ear biometric descriptors via enhanced RCNN (Region-Based Convolutional Neural Network). Initially, the input video is converted into frames and preprocessed through resizing, normalization, grayscale conversion, and filtering processes followed by face detection using the Viola-Jones technique. Next, a hybrid model comprising DBN (Deep Belief Network) and Bi-GRU (Bidirectional Gated Recurrent Unit) is utilized for deepfake detection based on ear descriptors. The output from the detection phase is determined through improved score-level fusion. To enhance the performance, the weights of both detection models are optimally tuned using the SU-JFO (Self-Upgraded Jellyfish Optimization method). Experimentation is conducted based on four scenarios: compression, noise, rotation, pose, and illumination on three different datasets. The performance results affirm that our proposed method outperforms traditional models such as CNN (Convolution Neural Network), SqueezeNet, LeNet, LinkNet, LSTM (Long Short-Term Memory), DFP (Deepfake Predictor) [1], and ResNext+CNN+LSTM [2] in terms of various performance metrics viz. accuracy, specificity, and precision.

Title: Pathology Image Restoration via Mixture of Prompts

Authors: Jiangdong Cai, Yan Chen, Zhenrong Shen, Haotian Jiang, Honglin Xiong, Kai Xuan, Lichi Zhang, Qian Wang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.12399
Pdf URL: https://arxiv.org/pdf/2503.12399
Copy Paste: [[2503.12399]] Pathology Image Restoration via Mixture of Prompts(https://arxiv.org/abs/2503.12399)
Keywords: extraction, diffusion, transformer
Abstract: In digital pathology, acquiring all-in-focus images is essential to high-quality imaging and high-efficient clinical workflow. Traditional scanners achieve this by scanning at multiple focal planes of varying depths and then merging them, which is relatively slow and often struggles with complex tissue defocus. Recent prevailing image restoration technique provides a means to restore high-quality pathology images from scans of single focal planes. However, existing image restoration methods are inadequate, due to intricate defocus patterns in pathology images and their domain-specific semantic complexities. In this work, we devise a two-stage restoration solution cascading a transformer and a diffusion model, to benefit from their powers in preserving image fidelity and perceptual quality, respectively. We particularly propose a novel mixture of prompts for the two-stage solution. Given initial prompt that models defocus in microscopic imaging, we design two prompts that describe the high-level image semantics from pathology foundation model and the fine-grained tissue structures via edge extraction. We demonstrate that, by feeding the prompt mixture to our method, we can restore high-quality pathology images from single-focal-plane scans, implying high potentials of the mixture of prompts to clinical usage. Code will be publicly available at this https URL.

Title: MExD: An Expert-Infused Diffusion Model for Whole-Slide Image Classification

Authors: Jianwei Zhao, Xin Li, Fan Yang, Qiang Zhai, Ao Luo, Yang Zhao, Hong Cheng, Huazhu Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12401
Pdf URL: https://arxiv.org/pdf/2503.12401
Copy Paste: [[2503.12401]] MExD: An Expert-Infused Diffusion Model for Whole-Slide Image Classification(https://arxiv.org/abs/2503.12401)
Keywords: robust, diffusion, generative
Abstract: Whole Slide Image (WSI) classification poses unique challenges due to the vast image size and numerous non-informative regions, which introduce noise and cause data imbalance during feature aggregation. To address these issues, we propose MExD, an Expert-Infused Diffusion Model that combines the strengths of a Mixture-of-Experts (MoE) mechanism with a diffusion model for enhanced classification. MExD balances patch feature distribution through a novel MoE-based aggregator that selectively emphasizes relevant information, effectively filtering noise, addressing data imbalance, and extracting essential features. These features are then integrated via a diffusion-based generative process to directly yield the class distribution for the WSI. Moving beyond conventional discriminative approaches, MExD represents the first generative strategy in WSI classification, capturing fine-grained details for robust and precise results. Our MExD is validated on three widely-used benchmarks-Camelyon16, TCGA-NSCLC, and BRACS consistently achieving state-of-the-art performance in both binary and multi-class tasks.

Title: SAM2-ELNet: Label Enhancement and Automatic Annotation for Remote Sensing Segmentation

Authors: Jianhao Yang, Wenshuo Yu, Yuanchao Lv, Jiance Sun, Bokang Sun, Mingyang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12404
Pdf URL: https://arxiv.org/pdf/2503.12404
Copy Paste: [[2503.12404]] SAM2-ELNet: Label Enhancement and Automatic Annotation for Remote Sensing Segmentation(https://arxiv.org/abs/2503.12404)
Keywords: extraction, segmentation
Abstract: Remote sensing image segmentation is crucial for environmental monitoring, disaster assessment, and resource management, directly affecting the accuracy and efficiency of surface information extraction. The performance of existing supervised models in remote sensing image segmentation tasks highly depends on the quality of label data. However, current label data mainly relies on manual annotation, which comes with high time costs and is subject to subjective interference, resulting in distortion of label boundaries and often a loss of detail. To solve the above problems, our work proposes an Edge-enhanced Labeling Network, called SAM2-ELNet, which incorporates a labeling module and an edge attention mechanism. This model effectively addresses issues such as label detail loss, fragmentation, and inaccurate boundaries. Due to the scarcity of manually annotated remote sensing data, the feature extraction capabilities of traditional neural networks are limited. Our method uses the Hiera backbone of the pre-trained self-supervised large model segment anything model 2 (SAM2) as the encoder, achieves high-quality and efficient feature extraction even with small samples by fine-tuning on downstream tasks. This study compared the training effects of original and enhanced labels on the manually annotated Deep-SAR Oil Spill (SOS) dataset. Results showed that the model trained with enhanced labels performed better and had a lower final loss, indicating closer alignment with the real data distribution. Our work also explores the potential of extending the model into an efficient automatic annotation framework through generalization experiments, facilitating large-scale remote sensing image interpretation and intelligent recognition.

Title: HKCanto-Eval: A Benchmark for Evaluating Cantonese Language Understanding and Cultural Comprehension in LLMs

Authors: Tsz Chung Cheng, Chung Shing Cheng, Chaak Ming Lau, Eugene Tin-Ho Lam, Chun Yat Wong, Hoi On Yu, Cheuk Hei Chong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12440
Pdf URL: https://arxiv.org/pdf/2503.12440
Copy Paste: [[2503.12440]] HKCanto-Eval: A Benchmark for Evaluating Cantonese Language Understanding and Cultural Comprehension in LLMs(https://arxiv.org/abs/2503.12440)
Keywords: robust, large language model
Abstract: The ability of language models to comprehend and interact in diverse linguistic and cultural landscapes is crucial. The Cantonese language used in Hong Kong presents unique challenges for natural language processing due to its rich cultural nuances and lack of dedicated evaluation datasets. The HKCanto-Eval benchmark addresses this gap by evaluating the performance of large language models (LLMs) on Cantonese language understanding tasks, extending to English and Written Chinese for cross-lingual evaluation. HKCanto-Eval integrates cultural and linguistic nuances intrinsic to Hong Kong, providing a robust framework for assessing language models in realistic scenarios. Additionally, the benchmark includes questions designed to tap into the underlying linguistic metaknowledge of the models. Our findings indicate that while proprietary models generally outperform open-weight models, significant limitations remain in handling Cantonese-specific linguistic and cultural knowledge, highlighting the need for more targeted training data and evaluation methods. The code can be accessed at this https URL

Title: Consistent-Point: Consistent Pseudo-Points for Semi-Supervised Crowd Counting and Localization

Authors: Yuda Zou, Zelong Liu, Yuliang Gu, Bo Du, Yongchao Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12441
Pdf URL: https://arxiv.org/pdf/2503.12441
Copy Paste: [[2503.12441]] Consistent-Point: Consistent Pseudo-Points for Semi-Supervised Crowd Counting and Localization(https://arxiv.org/abs/2503.12441)
Keywords: security
Abstract: Crowd counting and localization are important in applications such as public security and traffic management. Existing methods have achieved impressive results thanks to extensive laborious annotations. This paper propose a novel point-localization-based semi-supervised crowd counting and localization method termed Consistent-Point. We identify and address two inconsistencies of pseudo-points, which have not been adequately explored. To enhance their position consistency, we aggregate the positions of neighboring auxiliary proposal-points. Additionally, an instance-wise uncertainty calibration is proposed to improve the class consistency of pseudo-points. By generating more consistent pseudo-points, Consistent-Point provides more stable supervision to the training process, yielding improved results. Extensive experiments across five widely used datasets and three different labeled ratio settings demonstrate that our method achieves state-of-the-art performance in crowd localization while also attaining impressive crowd counting results. The code will be available.

Title: BREEN: Bridge Data-Efficient Encoder-Free Multimodal Learning with Learnable Queries

Authors: Tianle Li, Yongming Rao, Winston Hu, Yu Cheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12446
Pdf URL: https://arxiv.org/pdf/2503.12446
Copy Paste: [[2503.12446]] BREEN: Bridge Data-Efficient Encoder-Free Multimodal Learning with Learnable Queries(https://arxiv.org/abs/2503.12446)
Keywords: large language model
Abstract: Encoder-free multimodal large language models(MLLMs) eliminate the need for a well-trained vision encoder by directly processing image tokens before the language model. While this approach reduces computational overhead and model complexity, it often requires large amounts of training data to effectively capture the visual knowledge typically encoded by vision models like CLIP. The absence of a vision encoder implies that the model is likely to rely on substantial data to learn the necessary visual-semantic alignments. In this work, we present BREEN, a data-efficient encoder-free multimodal architecture that mitigates this issue. BREEN leverages a learnable query and image experts to achieve comparable performance with significantly less training data. The learnable query, positioned between image and text tokens, is supervised by the output of a pretrained CLIP model to distill visual knowledge, bridging the gap between visual and textual modalities. Additionally, the image expert processes image tokens and learnable queries independently, improving efficiency and reducing interference with the LLM's textual capabilities. BREEN achieves comparable performance to prior encoder-free state-of-the-art models like Mono-InternVL, using only 13 million text-image pairs in training about one percent of the data required by existing methods. Our work highlights a promising direction for data-efficient encoder-free multimodal learning, offering an alternative to traditional encoder-based approaches.

Title: Causality Model for Semantic Understanding on Videos

Authors: Li Yicong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12447
Pdf URL: https://arxiv.org/pdf/2503.12447
Copy Paste: [[2503.12447]] Causality Model for Semantic Understanding on Videos(https://arxiv.org/abs/2503.12447)
Keywords: robust
Abstract: After a decade of prosperity, the development of video understanding has reached a critical juncture, where the sole reliance on massive data and complex architectures is no longer a one-size-fits-all solution to all situations. The presence of ubiquitous data imbalance hampers DNNs from effectively learning the underlying causal mechanisms, leading to significant performance drops when encountering distribution shifts, such as long-tail imbalances and perturbed imbalances. This realization has prompted researchers to seek alternative methodologies to capture causal patterns in video data. To tackle these challenges and increase the robustness of DNNs, causal modeling emerged as a principle to discover the true causal patterns behind the observed correlations. This thesis focuses on the domain of semantic video understanding and explores the potential of causal modeling to advance two fundamental tasks: Video Relation Detection (VidVRD) and Video Question Answering (VideoQA).

Title: ISLR101: an Iranian Word-Level Sign Language Recognition Dataset

Authors: Hossein Ranjbar, Alireza Taheri
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.12451
Pdf URL: https://arxiv.org/pdf/2503.12451
Copy Paste: [[2503.12451]] ISLR101: an Iranian Word-Level Sign Language Recognition Dataset(https://arxiv.org/abs/2503.12451)
Keywords: fair
Abstract: Sign language recognition involves modeling complex multichannel information, such as hand shapes and movements while relying on sufficient sign language-specific data. However, sign languages are often under-resourced, posing a significant challenge for research and development in this field. To address this gap, we introduce ISLR101, the first publicly available Iranian Sign Language dataset for isolated sign language recognition. This comprehensive dataset includes 4,614 videos covering 101 distinct signs, recorded by 10 different signers (3 deaf individuals, 2 sign language interpreters, and 5 L2 learners) against varied backgrounds, with a resolution of 800x600 pixels and a frame rate of 25 frames per second. It also includes skeleton pose information extracted using OpenPose. We establish both a visual appearance-based and a skeleton-based framework as baseline models, thoroughly training and evaluating them on ISLR101. These models achieve 97.01% and 94.02% accuracy on the test set, respectively. Additionally, we publish the train, validation, and test splits to facilitate fair comparisons.

Title: Shape Bias and Robustness Evaluation via Cue Decomposition for Image Classification and Segmentation

Authors: Edgar Heinert, Thomas Gottwald, Annika Mütze, Matthias Rottmann
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12453
Pdf URL: https://arxiv.org/pdf/2503.12453
Copy Paste: [[2503.12453]] Shape Bias and Robustness Evaluation via Cue Decomposition for Image Classification and Segmentation(https://arxiv.org/abs/2503.12453)
Keywords: robust, segmentation
Abstract: Previous works studied how deep neural networks (DNNs) perceive image content in terms of their biases towards different image cues, such as texture and shape. Previous methods to measure shape and texture biases are typically style-transfer-based and limited to DNNs for image classification. In this work, we provide a new evaluation procedure consisting of 1) a cue-decomposition method that comprises two AI-free data pre-processing methods extracting shape and texture cues, respectively, and 2) a novel cue-decomposition shape bias evaluation metric that leverages the cue-decomposition data. For application purposes we introduce a corresponding cue-decomposition robustness metric that allows for the estimation of the robustness of a DNN w.r.t. image corruptions. In our numerical experiments, our findings for biases in image classification DNNs align with those of previous evaluation metrics. However, our cue-decomposition robustness metric shows superior results in terms of estimating the robustness of DNNs. Furthermore, our results for DNNs on the semantic segmentation datasets Cityscapes and ADE20k for the first time shed light into the biases of semantic segmentation DNNs.

Title: Learning Privacy from Visual Entities

Authors: Alessio Xompero (1), Andrea Cavallaro (2 and 3) ((1) Queen Mary University of London, (2) Idiap Research Institute, (3) École Polytechnique Fédérale de Lausanne)
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.12464
Pdf URL: https://arxiv.org/pdf/2503.12464
Copy Paste: [[2503.12464]] Learning Privacy from Visual Entities(https://arxiv.org/abs/2503.12464)
Keywords: privacy
Abstract: Subjective interpretation and content diversity make predicting whether an image is private or public a challenging task. Graph neural networks combined with convolutional neural networks (CNNs), which consist of 14,000 to 500 millions parameters, generate features for visual entities (e.g., scene and object types) and identify the entities that contribute to the decision. In this paper, we show that using a simpler combination of transfer learning and a CNN to relate privacy with scene types optimises only 732 parameters while achieving comparable performance to that of graph-based methods. On the contrary, end-to-end training of graph-based methods can mask the contribution of individual components to the classification performance. Furthermore, we show that a high-dimensional feature vector, extracted with CNNs for each visual entity, is unnecessary and complexifies the model. The graph component has also negligible impact on performance, which is driven by fine-tuning the CNN to optimise image features for privacy nodes.

Title: DPF-Net: Physical Imaging Model Embedded Data-Driven Underwater Image Enhancement

Authors: Han Mei, Kunqian Li, Shuaixin Liu, Chengzhi Ma, Qianli Jiang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.12470
Pdf URL: https://arxiv.org/pdf/2503.12470
Copy Paste: [[2503.12470]] DPF-Net: Physical Imaging Model Embedded Data-Driven Underwater Image Enhancement(https://arxiv.org/abs/2503.12470)
Keywords: robust
Abstract: Due to the complex interplay of light absorption and scattering in the underwater environment, underwater images experience significant degradation. This research presents a two-stage underwater image enhancement network called the Data-Driven and Physical Parameters Fusion Network (DPF-Net), which harnesses the robustness of physical imaging models alongside the generality and efficiency of data-driven methods. We first train a physical parameter estimate module using synthetic datasets to guarantee the trustworthiness of the physical parameters, rather than solely learning the fitting relationship between raw and reference images by the application of the imaging equation, as is common in prior studies. This module is subsequently trained in conjunction with an enhancement network, where the estimated physical parameters are integrated into a data-driven model within the embedding space. To maintain the uniformity of the restoration process amid underwater imaging degradation, we propose a physics-based degradation consistency loss. Additionally, we suggest an innovative weak reference loss term utilizing the entire dataset, which alleviates our model's reliance on the quality of individual reference images. Our proposed DPF-Net demonstrates superior performance compared to other benchmark methods across multiple test sets, achieving state-of-the-art results. The source code and pre-trained models are available on the project home page: this https URL.

Title: Diffusion-based Synthetic Data Generation for Visible-Infrared Person Re-Identification

Authors: Wenbo Dai, Lijing Lu, Zhihang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12472
Pdf URL: https://arxiv.org/pdf/2503.12472
Copy Paste: [[2503.12472]] Diffusion-based Synthetic Data Generation for Visible-Infrared Person Re-Identification(https://arxiv.org/abs/2503.12472)
Keywords: privacy, protect, diffusion
Abstract: The performance of models is intricately linked to the abundance of training data. In Visible-Infrared person Re-IDentification (VI-ReID) tasks, collecting and annotating large-scale images of each individual under various cameras and modalities is tedious, time-expensive, costly and must comply with data protection laws, posing a severe challenge in meeting dataset requirements. Current research investigates the generation of synthetic data as an efficient and privacy-ensuring alternative to collecting real data in the field. However, a specific data synthesis technique tailored for VI-ReID models has yet to be explored. In this paper, we present a novel data generation framework, dubbed Diffusion-based VI-ReID data Expansion (DiVE), that automatically obtain massive RGB-IR paired images with identity preserving by decoupling identity and modality to improve the performance of VI-ReID models. Specifically, identity representation is acquired from a set of samples sharing the same ID, whereas the modality of images is learned by fine-tuning the Stable Diffusion (SD) on modality-specific data. DiVE extend the text-driven image synthesis to identity-preserving RGB-IR multimodal image synthesis. This approach significantly reduces data collection and annotation costs by directly incorporating synthetic data into ReID model training. Experiments have demonstrated that VI-ReID models trained on synthetic data produced by DiVE consistently exhibit notable enhancements. In particular, the state-of-the-art method, CAJ, trained with synthetic images, achieves an improvement of about $9\%$ in mAP over the baseline on the LLCM dataset. Code: this https URL

Title: GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing

Authors: Zilun Zhang, Haozhan Shen, Tiancheng Zhao, Bin Chen, Zian Guan, Yuhao Wang, Xu Jia, Yuxiang Cai, Yongheng Shang, Jianwei Yin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12490
Pdf URL: https://arxiv.org/pdf/2503.12490
Copy Paste: [[2503.12490]] GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing(https://arxiv.org/abs/2503.12490)
Keywords: large language model, segmentation
Abstract: The application of Vision-Language Models (VLMs) in remote sensing (RS) has demonstrated significant potential in traditional tasks such as scene classification, object detection, and image captioning. However, current models, which excel in Referring Expression Comprehension (REC), struggle with tasks involving complex instructions (e.g., exists multiple conditions) or pixel-level operations like segmentation and change detection. In this white paper, we provide a comprehensive hierarchical summary of vision-language tasks in RS, categorized by the varying levels of cognitive capability required. We introduce the Remote Sensing Vision-Language Task Set (RSVLTS), which includes Open-Vocabulary Tasks (OVT), Referring Expression Tasks (RET), and Described Object Tasks (DOT) with increased difficulty, and Visual Question Answering (VQA) aloneside. Moreover, we propose a novel unified data representation using a set-of-points approach for RSVLTS, along with a condition parser and a self-augmentation strategy based on cyclic referring. These features are integrated into the GeoRSMLLM model, and this enhanced model is designed to handle a broad range of tasks of RSVLTS, paving the way for a more generalized solution for vision-language tasks in geoscience and remote sensing.

Title: CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences

Authors: Ziran Qin, Yuchen Cao, Mingbao Lin, Wen Hu, Shixuan Fan, Ke Cheng, Weiyao Lin, Jianguo Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12491
Pdf URL: https://arxiv.org/pdf/2503.12491
Copy Paste: [[2503.12491]] CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences(https://arxiv.org/abs/2503.12491)
Keywords: large language model
Abstract: Large language models (LLMs) excel at processing long sequences, boosting demand for key-value (KV) caching. While recent efforts to evict KV cache have alleviated the inference burden, they often fail to allocate resources rationally across layers with different attention patterns. In this paper, we introduce Cascading and Adaptive KV cache Eviction (CAKE), a novel approach that frames KV cache eviction as a "cake-slicing problem." CAKE assesses layer-specific preferences by considering attention dynamics in both spatial and temporal dimensions, allocates rational cache size for layers accordingly, and manages memory constraints in a cascading manner. This approach enables a global view of cache allocation, adaptively distributing resources across diverse attention mechanisms while maintaining memory budgets. CAKE also employs a new eviction indicator that considers the shifting importance of tokens over time, addressing limitations in existing methods that overlook temporal dynamics. Comprehensive experiments on LongBench and NeedleBench show that CAKE maintains model performance with only 3.2% of the KV cache and consistently outperforms current baselines across various models and memory constraints, particularly in low-memory settings. Additionally, CAKE achieves over 10x speedup in decoding latency compared to full cache when processing contexts of 128K tokens with FlashAttention-2. Our code is available at this https URL.

Title: Geometry-Aware Face Reconstruction Under Occluded Scenes

Authors: Dapeng Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12492
Pdf URL: https://arxiv.org/pdf/2503.12492
Copy Paste: [[2503.12492]] Geometry-Aware Face Reconstruction Under Occluded Scenes(https://arxiv.org/abs/2503.12492)
Keywords: robust
Abstract: Recently, deep learning-based 3D face reconstruction methods have demonstrated promising advancements in terms of quality and efficiency. Nevertheless, these techniques face challenges in effectively handling occluded scenes and fail to capture intricate geometric facial details. Inspired by the principles of GANs and bump mapping, we have successfully addressed these issues. Our approach aims to deliver comprehensive 3D facial reconstructions, even in the presence of this http URL maintaining the overall shape's robustness, we introduce a mid-level shape refinement to the fundamental structure. Furthermore, we illustrate how our method adeptly extends to generate plausible details for obscured facial regions. We offer numerous examples that showcase the effectiveness of our framework in producing realistic results, where traditional methods often struggle. To substantiate the superior adaptability of our approach, we have conducted extensive experiments in the context of general 3D face reconstruction tasks, serving as concrete evidence of its regulatory prowess compared to manual occlusion removal methods.

Title: Learning Contour-Guided 3D Face Reconstruction with Occlusions

Authors: Dapeng Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12494
Pdf URL: https://arxiv.org/pdf/2503.12494
Copy Paste: [[2503.12494]] Learning Contour-Guided 3D Face Reconstruction with Occlusions(https://arxiv.org/abs/2503.12494)
Keywords: robust
Abstract: Recently, deep learning-based 3D face reconstruction methods have demonstrated promising advancements in terms of quality and efficiency. Nevertheless, these techniques face challenges in effectively handling occluded scenes and fail to capture intricate geometric facial details. Inspired by the principles of GANs and bump mapping, we have successfully addressed these issues. Our approach aims to deliver comprehensive 3D facial reconstructions, even in the presence of this http URL maintaining the overall shape's robustness, we introduce a mid-level shape refinement to the fundamental structure. Furthermore, we illustrate how our method adeptly extends to generate plausible details for obscured facial regions. We offer numerous examples that showcase the effectiveness of our framework in producing realistic results, where traditional methods often struggle. To substantiate the superior adaptability of our approach, we have conducted extensive experiments in the context of general 3D face reconstruction tasks, serving as concrete evidence of its regulatory prowess compared to manual occlusion removal methods.

Title: Defense Against Model Stealing Based on Account-Aware Distribution Discrepancy

Authors: Jian-Ping Mei, Weibin Zhang, Jie Chen, Xuyun Zhang, Tiantian Zhu
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12497
Pdf URL: https://arxiv.org/pdf/2503.12497
Copy Paste: [[2503.12497]] Defense Against Model Stealing Based on Account-Aware Distribution Discrepancy(https://arxiv.org/abs/2503.12497)
Keywords: protect, defense, attack, steal
Abstract: Malicious users attempt to replicate commercial models functionally at low cost by training a clone model with query responses. It is challenging to timely prevent such model-stealing attacks to achieve strong protection and maintain utility. In this paper, we propose a novel non-parametric detector called Account-aware Distribution Discrepancy (ADD) to recognize queries from malicious users by leveraging account-wise local dependency. We formulate each class as a Multivariate Normal distribution (MVN) in the feature space and measure the malicious score as the sum of weighted class-wise distribution discrepancy. The ADD detector is combined with random-based prediction poisoning to yield a plug-and-play defense module named D-ADD for image classification models. Results of extensive experimental studies show that D-ADD achieves strong defense against different types of attacks with little interference in serving benign users for both soft and hard-label settings.

Title: Segment Any-Quality Images with Generative Latent Space Enhancement

Authors: Guangqian Guo, Yoong Guo, Xuehui Yu, Wenbo Li, Yaoxing Wang, Shan Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12507
Pdf URL: https://arxiv.org/pdf/2503.12507
Copy Paste: [[2503.12507]] Segment Any-Quality Images with Generative Latent Space Enhancement(https://arxiv.org/abs/2503.12507)
Keywords: robust, diffusion, generative, segmentation
Abstract: Despite their success, Segment Anything Models (SAMs) experience significant performance drops on severely degraded, low-quality images, limiting their effectiveness in real-world scenarios. To address this, we propose GleSAM, which utilizes Generative Latent space Enhancement to boost robustness on low-quality images, thus enabling generalization across various image qualities. Specifically, we adapt the concept of latent diffusion to SAM-based segmentation frameworks and perform the generative diffusion process in the latent space of SAM to reconstruct high-quality representation, thereby improving segmentation. Additionally, we introduce two techniques to improve compatibility between the pre-trained diffusion model and the segmentation framework. Our method can be applied to pre-trained SAM and SAM2 with only minimal additional learnable parameters, allowing for efficient optimization. We also construct the LQSeg dataset with a greater diversity of degradation types and levels for training and evaluating the model. Extensive experiments demonstrate that GleSAM significantly improves segmentation robustness on complex degradations while maintaining generalization to clear images. Furthermore, GleSAM also performs well on unseen degradations, underscoring the versatility of our approach and dataset.

Title: AI-Powered Automated Model Construction for Patient-Specific CFD Simulations of Aortic Flows

Authors: Pan Du, Delin An, Chaoli Wang, Jian-Xun Wang
Subjects: cs.CV, cs.LG, physics.med-ph
Abstract URL: https://arxiv.org/abs/2503.12515
Pdf URL: https://arxiv.org/pdf/2503.12515
Copy Paste: [[2503.12515]] AI-Powered Automated Model Construction for Patient-Specific CFD Simulations of Aortic Flows(https://arxiv.org/abs/2503.12515)
Keywords: segmentation
Abstract: Image-based modeling is essential for understanding cardiovascular hemodynamics and advancing the diagnosis and treatment of cardiovascular diseases. Constructing patient-specific vascular models remains labor-intensive, error-prone, and time-consuming, limiting their clinical applications. This study introduces a deep-learning framework that automates the creation of simulation-ready vascular models from medical images. The framework integrates a segmentation module for accurate voxel-based vessel delineation with a surface deformation module that performs anatomically consistent and unsupervised surface refinements guided by medical image data. By unifying voxel segmentation and surface deformation into a single cohesive pipeline, the framework addresses key limitations of existing methods, enhancing geometric accuracy and computational efficiency. Evaluated on publicly available datasets, the proposed approach demonstrates state-of-the-art performance in segmentation and mesh quality while significantly reducing manual effort and processing time. This work advances the scalability and reliability of image-based computational modeling, facilitating broader applications in clinical and research settings.

Title: HyConEx: Hypernetwork classifier with counterfactual explanations

Authors: Patryk Marszałek, Ulvi Movsum-zada, Oleksii Furman, Kamil Książek, Przemysław Spurek, Marek Śmieja
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12525
Pdf URL: https://arxiv.org/pdf/2503.12525
Copy Paste: [[2503.12525]] HyConEx: Hypernetwork classifier with counterfactual explanations(https://arxiv.org/abs/2503.12525)
Keywords: attack, interpretability
Abstract: In recent years, there has been a growing interest in explainable AI methods. We want not only to make accurate predictions using sophisticated neural networks but also to understand what the model's decision is based on. One of the fundamental levels of interpretability is to provide counterfactual examples explaining the rationale behind the decision and identifying which features, and to what extent, must be modified to alter the model's outcome. To address these requirements, we introduce HyConEx, a classification model based on deep hypernetworks specifically designed for tabular data. Owing to its unique architecture, HyConEx not only provides class predictions but also delivers local interpretations for individual data samples in the form of counterfactual examples that steer a given sample toward an alternative class. While many explainable methods generated counterfactuals for external models, there have been no interpretable classifiers simultaneously producing counterfactual samples so far. HyConEx achieves competitive performance on several metrics assessing classification accuracy and fulfilling the criteria of a proper counterfactual attack. This makes HyConEx a distinctive deep learning model, which combines predictions and explainers as an all-in-one neural network. The code is available at this https URL.

Title: EditID: Training-Free Editable ID Customization for Text-to-Image Generation

Authors: Guandong Li, Zhaobin Chu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12526
Pdf URL: https://arxiv.org/pdf/2503.12526
Copy Paste: [[2503.12526]] EditID: Training-Free Editable ID Customization for Text-to-Image Generation(https://arxiv.org/abs/2503.12526)
Keywords: extraction
Abstract: We propose EditID, a training-free approach based on the DiT architecture, which achieves highly editable customized IDs for text to image generation. Existing text-to-image models for customized IDs typically focus more on ID consistency while neglecting editability. It is challenging to alter facial orientation, character attributes, and other features through prompts. EditID addresses this by deconstructing the text-to-image model for customized IDs into an image generation branch and a character feature branch. The character feature branch is further decoupled into three modules: feature extraction, feature fusion, and feature integration. By introducing a combination of mapping features and shift features, along with controlling the intensity of ID feature integration, EditID achieves semantic compression of local features across network depths, forming an editable feature space. This enables the successful generation of high-quality images with editable IDs while maintaining ID consistency, achieving excellent results in the IBench evaluation, which is an editability evaluation framework for the field of customized ID text-to-image generation that quantitatively demonstrates the superior performance of EditID. EditID is the first text-to-image solution to propose customizable ID editability on the DiT architecture, meeting the demands of long prompts and high quality image generation.

Title: A Plug-and-Play Learning-based IMU Bias Factor for Robust Visual-Inertial Odometry

Authors: Yang Yi, Kunqing Wang, Jinpu Zhang, Zhen Tan, Xiangke Wang, Hui Shen, Dewen Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12527
Pdf URL: https://arxiv.org/pdf/2503.12527
Copy Paste: [[2503.12527]] A Plug-and-Play Learning-based IMU Bias Factor for Robust Visual-Inertial Odometry(https://arxiv.org/abs/2503.12527)
Keywords: robust
Abstract: The bias of low-cost Inertial Measurement Units (IMU) is a critical factor affecting the performance of Visual-Inertial Odometry (VIO). In particular, when visual tracking encounters errors, the optimized bias results may deviate significantly from the true values, adversely impacting the system's stability and localization precision. In this paper, we propose a novel plug-and-play framework featuring the Inertial Prior Network (IPNet), which is designed to accurately estimate IMU bias. Recognizing the substantial impact of initial bias errors in low-cost inertial devices on system performance, our network directly leverages raw IMU data to estimate the mean bias, eliminating the dependency on historical estimates in traditional recursive predictions and effectively preventing error propagation. Furthermore, we introduce an iterative approach to calculate the mean value of the bias for network training, addressing the lack of bias labels in many visual-inertial datasets. The framework is evaluated on two public datasets and one self-collected dataset. Extensive experiments demonstrate that our method significantly enhances both localization precision and robustness, with the ATE-RMSE metric improving on average by 46\%. The source code and video will be available at \textcolor{red}{this https URL}.

Title: Investigating Human-Aligned Large Language Model Uncertainty

Authors: Kyle Moore, Jesse Roberts, Daryl Watson, Pamela Wisniewski
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12528
Pdf URL: https://arxiv.org/pdf/2503.12528
Copy Paste: [[2503.12528]] Investigating Human-Aligned Large Language Model Uncertainty(https://arxiv.org/abs/2503.12528)
Keywords: large language model
Abstract: Recent work has sought to quantify large language model uncertainty to facilitate model control and modulate user trust. Previous works focus on measures of uncertainty that are theoretically grounded or reflect the average overt behavior of the model. In this work, we investigate a variety of uncertainty measures, in order to identify measures that correlate with human group-level uncertainty. We find that Bayesian measures and a variation on entropy measures, top-k entropy, tend to agree with human behavior as a function of model size. We find that some strong measures decrease in human-similarity with model size, but, by multiple linear regression, we find that combining multiple uncertainty measures provide comparable human-alignment with reduced size-dependency.

Title: Towards Suturing World Models: Learning Predictive Models for Robotic Surgical Tasks

Authors: Mehmet Kerem Turkcan, Mattia Ballo, Filippo Filicori, Zoran Kostic
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12531
Pdf URL: https://arxiv.org/pdf/2503.12531
Copy Paste: [[2503.12531]] Towards Suturing World Models: Learning Predictive Models for Robotic Surgical Tasks(https://arxiv.org/abs/2503.12531)
Keywords: diffusion, generative
Abstract: We introduce specialized diffusion-based generative models that capture the spatiotemporal dynamics of fine-grained robotic surgical sub-stitch actions through supervised learning on annotated laparoscopic surgery footage. The proposed models form a foundation for data-driven world models capable of simulating the biomechanical interactions and procedural dynamics of surgical suturing with high temporal fidelity. Annotating a dataset of $\sim2K$ clips extracted from simulation videos, we categorize surgical actions into fine-grained sub-stitch classes including ideal and non-ideal executions of needle positioning, targeting, driving, and withdrawal. We fine-tune two state-of-the-art video diffusion models, LTX-Video and HunyuanVideo, to generate high-fidelity surgical action sequences at $\ge$768x512 resolution and $\ge$49 frames. For training our models, we explore both Low-Rank Adaptation (LoRA) and full-model fine-tuning approaches. Our experimental results demonstrate that these world models can effectively capture the dynamics of suturing, potentially enabling improved training simulators, surgical skill assessment tools, and autonomous surgical systems. The models also display the capability to differentiate between ideal and non-ideal technique execution, providing a foundation for building surgical training and evaluation systems. We release our models for testing and as a foundation for future research. Project Page: this https URL

Title: Time-EAPCR-T: A Universal Deep Learning Approach for Anomaly Detection in Industrial Equipment

Authors: Huajie Liang, Di Wang, Yuchao Lu, Mengke Song, Lei Liu, Ling An, Ying Liang, Xingjie Ma, Zhenyu Zhang, Chichun Zhou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.12534
Pdf URL: https://arxiv.org/pdf/2503.12534
Copy Paste: [[2503.12534]] Time-EAPCR-T: A Universal Deep Learning Approach for Anomaly Detection in Industrial Equipment(https://arxiv.org/abs/2503.12534)
Keywords: extraction, transformer
Abstract: With the advancement of Industry 4.0, intelligent manufacturing extensively employs sensors for real-time multidimensional data collection, playing a crucial role in equipment monitoring, process optimisation, and efficiency enhancement. Industrial data exhibit characteristics such as multi-source heterogeneity, nonlinearity, strong coupling, and temporal interactions, while also being affected by noise interference. These complexities make it challenging for traditional anomaly detection methods to extract key features, impacting detection accuracy and stability. Traditional machine learning approaches often struggle with such complex data due to limitations in processing capacity and generalisation ability, making them inadequate for practical applications. While deep learning feature extraction modules have demonstrated remarkable performance in image and text processing, they remain ineffective when applied to multi-source heterogeneous industrial data lacking explicit correlations. Moreover, existing multi-source heterogeneous data processing techniques still rely on dimensionality reduction and feature selection, which can lead to information loss and difficulty in capturing high-order interactions. To address these challenges, this study applies the EAPCR and Time-EAPCR models proposed in previous research and introduces a new model, Time-EAPCR-T, where Transformer replaces the LSTM module in the time-series processing component of Time-EAPCR. This modification effectively addresses multi-source data heterogeneity, facilitates efficient multi-source feature fusion, and enhances the temporal feature extraction capabilities of multi-source industrial this http URL results demonstrate that the proposed method outperforms existing approaches across four industrial datasets, highlighting its broad application potential.

Title: SPC-GS: Gaussian Splatting with Semantic-Prompt Consistency for Indoor Open-World Free-view Synthesis from Sparse Inputs

Authors: Guibiao Liao, Qing Li, Zhenyu Bao, Guoping Qiu, Kanglin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12535
Pdf URL: https://arxiv.org/pdf/2503.12535
Copy Paste: [[2503.12535]] SPC-GS: Gaussian Splatting with Semantic-Prompt Consistency for Indoor Open-World Free-view Synthesis from Sparse Inputs(https://arxiv.org/abs/2503.12535)
Keywords: segmentation
Abstract: 3D Gaussian Splatting-based indoor open-world free-view synthesis approaches have shown significant performance with dense input images. However, they exhibit poor performance when confronted with sparse inputs, primarily due to the sparse distribution of Gaussian points and insufficient view supervision. To relieve these challenges, we propose SPC-GS, leveraging Scene-layout-based Gaussian Initialization (SGI) and Semantic-Prompt Consistency (SPC) Regularization for open-world free view synthesis with sparse inputs. Specifically, SGI provides a dense, scene-layout-based Gaussian distribution by utilizing view-changed images generated from the video generation model and view-constraint Gaussian points densification. Additionally, SPC mitigates limited view supervision by employing semantic-prompt-based consistency constraints developed by SAM2. This approach leverages available semantics from training views, serving as instructive prompts, to optimize visually overlapping regions in novel views with 2D and 3D consistency constraints. Extensive experiments demonstrate the superior performance of SPC-GS across Replica and ScanNet benchmarks. Notably, our SPC-GS achieves a 3.06 dB gain in PSNR for reconstruction quality and a 7.3% improvement in mIoU for open-world semantic segmentation.

Title: Debiasing Diffusion Model: Enhancing Fairness through Latent Representation Learning in Stable Diffusion Model

Authors: Lin-Chun Huang, Ching Chieh Tsao, Fang-Yi Su, Jung-Hsien Chiang
Subjects: cs.LG, cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2503.12536
Pdf URL: https://arxiv.org/pdf/2503.12536
Copy Paste: [[2503.12536]] Debiasing Diffusion Model: Enhancing Fairness through Latent Representation Learning in Stable Diffusion Model(https://arxiv.org/abs/2503.12536)
Keywords: fair, diffusion, generative, large language model
Abstract: Image generative models, particularly diffusion-based models, have surged in popularity due to their remarkable ability to synthesize highly realistic images. However, since these models are data-driven, they inherit biases from the training datasets, frequently leading to disproportionate group representations that exacerbate societal inequities. Traditionally, efforts to debiase these models have relied on predefined sensitive attributes, classifiers trained on such attributes, or large language models to steer outputs toward fairness. However, these approaches face notable drawbacks: predefined attributes do not adequately capture complex and continuous variations among groups. To address these issues, we introduce the Debiasing Diffusion Model (DDM), which leverages an indicator to learn latent representations during training, promoting fairness through balanced representations without requiring predefined sensitive attributes. This approach not only demonstrates its effectiveness in scenarios previously addressed by conventional techniques but also enhances fairness without relying on predefined sensitive attributes as conditions. In this paper, we discuss the limitations of prior bias mitigation techniques in diffusion-based models, elaborate on the architecture of the DDM, and validate the effectiveness of our approach through experiments.

Title: BFANet: Revisiting 3D Semantic Segmentation with Boundary Feature Analysis

Authors: Weiguang Zhao, Rui Zhang, Qiufeng Wang, Guangliang Cheng, Kaizhu Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12539
Pdf URL: https://arxiv.org/pdf/2503.12539
Copy Paste: [[2503.12539]] BFANet: Revisiting 3D Semantic Segmentation with Boundary Feature Analysis(https://arxiv.org/abs/2503.12539)
Keywords: segmentation
Abstract: 3D semantic segmentation plays a fundamental and crucial role to understand 3D scenes. While contemporary state-of-the-art techniques predominantly concentrate on elevating the overall performance of 3D semantic segmentation based on general metrics (e.g. mIoU, mAcc, and oAcc), they unfortunately leave the exploration of challenging regions for segmentation mostly neglected. In this paper, we revisit 3D semantic segmentation through a more granular lens, shedding light on subtle complexities that are typically overshadowed by broader performance metrics. Concretely, we have delineated 3D semantic segmentation errors into four comprehensive categories as well as corresponding evaluation metrics tailored to each. Building upon this categorical framework, we introduce an innovative 3D semantic segmentation network called BFANet that incorporates detailed analysis of semantic boundary features. First, we design the boundary-semantic module to decouple point cloud features into semantic and boundary features, and fuse their query queue to enhance semantic features with attention. Second, we introduce a more concise and accelerated boundary pseudo-label calculation algorithm, which is 3.9 times faster than the state-of-the-art, offering compatibility with data augmentation and enabling efficient computation in training. Extensive experiments on benchmark data indicate the superiority of our BFANet model, confirming the significance of emphasizing the four uniquely designed metrics. Code is available at this https URL.

Title: ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos

Authors: Peiran Wu, Yunze Liu, Chonghan Liu, Miao Liu, Junxiao Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12542
Pdf URL: https://arxiv.org/pdf/2503.12542
Copy Paste: [[2503.12542]] ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos(https://arxiv.org/abs/2503.12542)
Keywords: large language model
Abstract: Humans excel at spatio-temporal reasoning, effortlessly interpreting dynamic visual events from an egocentric viewpoint. However, whether multimodal large language models (MLLMs) can similarly comprehend the 4D world remains uncertain. This paper explores multimodal spatio-temporal reasoning from an egocentric perspective, aiming to equip MLLMs with human-like reasoning capabilities. To support this objective, we introduce Ego-ST Bench, a novel benchmark containing over 5,000 question-answer pairs across four categories, systematically evaluating spatial, temporal, and integrated spatio-temporal reasoning. Additionally, we propose the ST-R1 Video model, a video-based reasoning model that incorporates reverse thinking into its reinforcement learning process, significantly enhancing performance. We combine long-chain-of-thought (long-CoT) supervised fine-tuning with Group Relative Policy Optimization (GRPO) reinforcement learning, achieving notable improvements with limited high-quality data. Ego-ST Bench and ST-R1 provide valuable insights and resources for advancing video-based spatio-temporal reasoning research.

Title: PEBench: A Fictitious Dataset to Benchmark Machine Unlearning for Multimodal Large Language Models

Authors: Zhaopan Xu, Pengfei Zhou, Weidong Tang, Jiaxin Ai, Wangbo Zhao, Xiaojiang Peng, Kai Wang, Yang You, Wenqi Shao, Hongxun Yao, Kaipeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12545
Pdf URL: https://arxiv.org/pdf/2503.12545
Copy Paste: [[2503.12545]] PEBench: A Fictitious Dataset to Benchmark Machine Unlearning for Multimodal Large Language Models(https://arxiv.org/abs/2503.12545)
Keywords: secure, security, privacy, robust, large language model
Abstract: In recent years, Multimodal Large Language Models (MLLMs) have demonstrated remarkable advancements in tasks such as visual question answering, visual understanding, and reasoning. However, this impressive progress relies on vast amounts of data collected from the internet, raising significant concerns about privacy and security. To address these issues, machine unlearning (MU) has emerged as a promising solution, enabling the removal of specific knowledge from an already trained model without requiring retraining from scratch. Although MU for MLLMs has gained attention, current evaluations of its efficacy remain incomplete, and the underlying problem is often poorly defined, which hinders the development of strategies for creating more secure and trustworthy systems. To bridge this gap, we introduce a benchmark, named PEBench, which includes a dataset of personal entities and corresponding general event scenes, designed to comprehensively assess the performance of MU for MLLMs. Through PEBench, we aim to provide a standardized and robust framework to advance research in secure and privacy-preserving multimodal models. We benchmarked 6 MU methods, revealing their strengths and limitations, and shedding light on key challenges and opportunities for MU in MLLMs.

Title: From Guessing to Asking: An Approach to Resolving the Persona Knowledge Gap in LLMs during Multi-Turn Conversations

Authors: Sarvesh Baskar, Tanmay Tulsidas Verelakar, Srinivasan Parthasarathy, Manas Gaur
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12556
Pdf URL: https://arxiv.org/pdf/2503.12556
Copy Paste: [[2503.12556]] From Guessing to Asking: An Approach to Resolving the Persona Knowledge Gap in LLMs during Multi-Turn Conversations(https://arxiv.org/abs/2503.12556)
Keywords: extraction, large language model
Abstract: In multi-turn dialogues, large language models (LLM) face a critical challenge of ensuring coherence while adapting to user-specific information. This study introduces the persona knowledge gap, the discrepancy between a model's internal understanding and the knowledge required for coherent, personalized conversations. While prior research has recognized these gaps, computational methods for their identification and resolution remain underexplored. We propose Conversation Preference Elicitation and Recommendation (CPER), a novel framework that dynamically detects and resolves persona knowledge gaps using intrinsic uncertainty quantification and feedback-driven refinement. CPER consists of three key modules: a Contextual Understanding Module for preference extraction, a Dynamic Feedback Module for measuring uncertainty and refining persona alignment, and a Persona-Driven Response Generation module for adapting responses based on accumulated user context. We evaluate CPER on two real-world datasets: CCPE-M for preferential movie recommendations and ESConv for mental health support. Using A/B testing, human evaluators preferred CPER's responses 42% more often than baseline models in CCPE-M and 27% more often in ESConv. A qualitative human evaluation confirms that CPER's responses are preferred for maintaining contextual relevance and coherence, particularly in longer (12+ turn) conversations.

Title: AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding

Authors: Xiao Wang, Qingyi Si, Jianlong Wu, Shiyu Zhu, Li Cao, Liqiang Nie
Subjects: cs.CV, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2503.12559
Pdf URL: https://arxiv.org/pdf/2503.12559
Copy Paste: [[2503.12559]] AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding(https://arxiv.org/abs/2503.12559)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) have revolutionized video understanding, yet are still limited by context length when processing long videos. Recent methods compress videos by leveraging visual redundancy uniformly, yielding promising results. Nevertheless, our quantitative analysis shows that redundancy varies significantly across time and model layers, necessitating a more flexible compression strategy. We propose AdaReTaKe, a training-free method that flexibly reduces visual redundancy by allocating compression ratios among time and layers with theoretical guarantees. Integrated into state-of-the-art MLLMs, AdaReTaKe improves processing capacity from 256 to 2048 frames while preserving critical information. Experiments on VideoMME, MLVU, LongVideoBench, and LVBench datasets demonstrate that AdaReTaKe outperforms existing methods by 2.3% and 2.8% for 7B and 72B models, respectively, with even greater improvements of 5.9% and 6.0% on the longest LVBench. Our code is available at this https URL.

Title: Multi-Granular Multimodal Clue Fusion for Meme Understanding

Authors: Li Zheng, Hao Fei, Ting Dai, Zuquan Peng, Fei Li, Huisheng Ma, Chong Teng, Donghong Ji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12560
Pdf URL: https://arxiv.org/pdf/2503.12560
Copy Paste: [[2503.12560]] Multi-Granular Multimodal Clue Fusion for Meme Understanding(https://arxiv.org/abs/2503.12560)
Keywords: extraction
Abstract: With the continuous emergence of various social media platforms frequently used in daily life, the multimodal meme understanding (MMU) task has been garnering increasing attention. MMU aims to explore and comprehend the meanings of memes from various perspectives by performing tasks such as metaphor recognition, sentiment analysis, intention detection, and offensiveness detection. Despite making progress, limitations persist due to the loss of fine-grained metaphorical visual clue and the neglect of multimodal text-image weak correlation. To overcome these limitations, we propose a multi-granular multimodal clue fusion model (MGMCF) to advance MMU. Firstly, we design an object-level semantic mining module to extract object-level image feature clues, achieving fine-grained feature clue extraction and enhancing the model's ability to capture metaphorical details and semantics. Secondly, we propose a brand-new global-local cross-modal interaction model to address the weak correlation between text and images. This model facilitates effective interaction between global multimodal contextual clues and local unimodal feature clues, strengthening their representations through a bidirectional cross-modal attention mechanism. Finally, we devise a dual-semantic guided training strategy to enhance the model's understanding and alignment of multimodal representations in the semantic space. Experiments conducted on the widely-used MET-MEME bilingual dataset demonstrate significant improvements over state-of-the-art baselines. Specifically, there is an 8.14% increase in precision for offensiveness detection task, and respective accuracy enhancements of 3.53%, 3.89%, and 3.52% for metaphor recognition, sentiment analysis, and intention detection tasks. These results, underpinned by in-depth analyses, underscore the effectiveness and potential of our approach for advancing MMU.

Title: Diffusion on Graph: Augmentation of Graph Structure for Node Classification

Authors: Yancheng Wang, Changyu Liu, Yingzhen Yang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.12563
Pdf URL: https://arxiv.org/pdf/2503.12563
Copy Paste: [[2503.12563]] Diffusion on Graph: Augmentation of Graph Structure for Node Classification(https://arxiv.org/abs/2503.12563)
Keywords: diffusion
Abstract: Graph diffusion models have recently been proposed to synthesize entire graphs, such as molecule graphs. Although existing methods have shown great performance in generating entire graphs for graph-level learning tasks, no graph diffusion models have been developed to generate synthetic graph structures, that is, synthetic nodes and associated edges within a given graph, for node-level learning tasks. Inspired by the research in the computer vision literature using synthetic data for enhanced performance, we propose Diffusion on Graph (DoG), which generates synthetic graph structures to boost the performance of GNNs. The synthetic graph structures generated by DoG are combined with the original graph to form an augmented graph for the training of node-level learning tasks, such as node classification and graph contrastive learning (GCL). To improve the efficiency of the generation process, a Bi-Level Neighbor Map Decoder (BLND) is introduced in DoG. To mitigate the adverse effect of the noise introduced by the synthetic graph structures, a low-rank regularization method is proposed for the training of graph neural networks (GNNs) on the augmented graphs. Extensive experiments on various graph datasets for semi-supervised node classification and graph contrastive learning have been conducted to demonstrate the effectiveness of DoG with low-rank regularization. The code of DoG is available at this https URL.

Title: GAN-Based Single-Stage Defense for Traffic Sign Classification Under Adversarial Patch Attack

Authors: Abyad Enan, Mashrur Chowdhury
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12567
Pdf URL: https://arxiv.org/pdf/2503.12567
Copy Paste: [[2503.12567]] GAN-Based Single-Stage Defense for Traffic Sign Classification Under Adversarial Patch Attack(https://arxiv.org/abs/2503.12567)
Keywords: security, defense, attack, generative
Abstract: Computer Vision plays a critical role in ensuring the safe navigation of autonomous vehicles (AVs). An AV perception module is responsible for capturing and interpreting the surrounding environment to facilitate safe navigation. This module enables AVs to recognize traffic signs, traffic lights, and various road users. However, the perception module is vulnerable to adversarial attacks, which can compromise their accuracy and reliability. One such attack is the adversarial patch attack (APA), a physical attack in which an adversary strategically places a specially crafted sticker on an object to deceive object classifiers. In APA, an adversarial patch is positioned on a target object, leading the classifier to misidentify it. Such an APA can cause AVs to misclassify traffic signs, leading to catastrophic incidents. To enhance the security of an AV perception system against APAs, this study develops a Generative Adversarial Network (GAN)-based single-stage defense strategy for traffic sign classification. This approach is tailored to defend against APAs on different classes of traffic signs without prior knowledge of a patch's design. This study found this approach to be effective against patches of varying sizes. Our experimental analysis demonstrates that the defense strategy presented in this paper improves the classifier's accuracy under APA conditions by up to 80.8% and enhances overall classification accuracy for all the traffic signs considered in this study by 58%, compared to a classifier without any defense mechanism. Our defense strategy is model-agnostic, making it applicable to any traffic sign classifier, regardless of the underlying classification model.

Title: Deblur Gaussian Splatting SLAM

Authors: Francesco Girlanda, Denys Rozumnyi, Marc Pollefeys, Martin R. Oswald
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.12572
Pdf URL: https://arxiv.org/pdf/2503.12572
Copy Paste: [[2503.12572]] Deblur Gaussian Splatting SLAM(https://arxiv.org/abs/2503.12572)
Keywords: robust
Abstract: We present Deblur-SLAM, a robust RGB SLAM pipeline designed to recover sharp reconstructions from motion-blurred inputs. The proposed method bridges the strengths of both frame-to-frame and frame-to-model approaches to model sub-frame camera trajectories that lead to high-fidelity reconstructions in motion-blurred settings. Moreover, our pipeline incorporates techniques such as online loop closure and global bundle adjustment to achieve a dense and precise global trajectory. We model the physical image formation process of motion-blurred images and minimize the error between the observed blurry images and rendered blurry images obtained by averaging sharp virtual sub-frame images. Additionally, by utilizing a monocular depth estimator alongside the online deformation of Gaussians, we ensure precise mapping and enhanced image deblurring. The proposed SLAM pipeline integrates all these components to improve the results. We achieve state-of-the-art results for sharp map estimation and sub-frame trajectory recovery both on synthetic and real-world blurry input data.

Title: BalancedDPO: Adaptive Multi-Metric Alignment

Authors: Dipesh Tamboli, Souradip Chakraborty, Aditya Malusare, Biplab Banerjee, Amrit Singh Bedi, Vaneet Aggarwal
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12575
Pdf URL: https://arxiv.org/pdf/2503.12575
Copy Paste: [[2503.12575]] BalancedDPO: Adaptive Multi-Metric Alignment(https://arxiv.org/abs/2503.12575)
Keywords: robust, diffusion
Abstract: Text-to-image (T2I) diffusion models have made remarkable advancements, yet aligning them with diverse preferences remains a persistent challenge. Current methods often optimize single metrics or depend on narrowly curated datasets, leading to overfitting and limited generalization across key visual quality metrics. We present BalancedDPO, a novel extension of Direct Preference Optimization (DPO) that addresses these limitations by simultaneously aligning T2I diffusion models with multiple metrics, including human preference, CLIP score, and aesthetic quality. Our key novelty lies in aggregating consensus labels from diverse metrics in the preference distribution space as compared to existing reward mixing approaches, enabling robust and scalable multi-metric alignment while maintaining the simplicity of the standard DPO pipeline that we refer to as BalancedDPO. Our evaluations on the Pick-a-Pic, PartiPrompt and HPD datasets show that BalancedDPO achieves state-of-the-art results, outperforming existing approaches across all major metrics. BalancedDPO improves the average win rates by 15%, 7.1%, and 10.3% on Pick-a-pic, PartiPrompt and HPD, respectively, from the DiffusionDPO.

Title: RaSA: Rank-Sharing Low-Rank Adaptation

Authors: Zhiwei He, Zhaopeng Tu, Xing Wang, Xingyu Chen, Zhijie Wang, Jiahao Xu, Tian Liang, Wenxiang Jiao, Zhuosheng Zhang, Rui Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12576
Pdf URL: https://arxiv.org/pdf/2503.12576
Copy Paste: [[2503.12576]] RaSA: Rank-Sharing Low-Rank Adaptation(https://arxiv.org/abs/2503.12576)
Keywords: large language model
Abstract: Low-rank adaptation (LoRA) has been prominently employed for parameter-efficient fine-tuning of large language models (LLMs). However, the limited expressive capacity of LoRA, stemming from the low-rank constraint, has been recognized as a bottleneck, particularly in rigorous tasks like code generation and mathematical reasoning. To address this limitation, we introduce Rank-Sharing Low-Rank Adaptation (RaSA), an innovative extension that enhances the expressive capacity of LoRA by leveraging partial rank sharing across layers. By forming a shared rank pool and applying layer-specific weighting, RaSA effectively increases the number of ranks without augmenting parameter overhead. Our theoretically grounded and empirically validated approach demonstrates that RaSA not only maintains the core advantages of LoRA but also significantly boosts performance in challenging code and math tasks. Code, data and scripts are available at: this https URL.

Title: Personalize Anything for Free with Diffusion Transformer

Authors: Haoran Feng, Zehuan Huang, Lin Li, Hairong Lv, Lu Sheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12590
Pdf URL: https://arxiv.org/pdf/2503.12590
Copy Paste: [[2503.12590]] Personalize Anything for Free with Diffusion Transformer(https://arxiv.org/abs/2503.12590)
Keywords: diffusion, transformer
Abstract: Personalized image generation aims to produce images of user-specified concepts while enabling flexible editing. Recent training-free approaches, while exhibit higher computational efficiency than training-based methods, struggle with identity preservation, applicability, and compatibility with diffusion transformers (DiTs). In this paper, we uncover the untapped potential of DiT, where simply replacing denoising tokens with those of a reference subject achieves zero-shot subject reconstruction. This simple yet effective feature injection technique unlocks diverse scenarios, from personalization to image editing. Building upon this observation, we propose \textbf{Personalize Anything}, a training-free framework that achieves personalized image generation in DiT through: 1) timestep-adaptive token replacement that enforces subject consistency via early-stage injection and enhances flexibility through late-stage regularization, and 2) patch perturbation strategies to boost structural diversity. Our method seamlessly supports layout-guided generation, multi-subject personalization, and mask-controlled editing. Evaluations demonstrate state-of-the-art performance in identity preservation and versatility. Our work establishes new insights into DiTs while delivering a practical paradigm for efficient personalization.

Title: MoECollab: Democratizing LLM Development Through Collaborative Mixture of Experts

Authors: Harshit
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.12592
Pdf URL: https://arxiv.org/pdf/2503.12592
Copy Paste: [[2503.12592]] MoECollab: Democratizing LLM Development Through Collaborative Mixture of Experts(https://arxiv.org/abs/2503.12592)
Keywords: large language model
Abstract: Large Language Model (LLM) development has become increasingly centralized, limiting participation to well-resourced organizations. This paper introduces MoECollab, a novel framework leveraging Mixture of Experts (MoE) architecture to enable distributed, collaborative LLM development. By decomposing monolithic models into specialized expert modules coordinated by a trainable gating network, our framework allows diverse contributors to participate regardless of computational resources. We provide a complete technical implementation with mathematical foundations for expert dynamics, gating mechanisms, and integration strategies. Experiments on multiple datasets demonstrate that our approach achieves accuracy improvements of 3-7% over baseline models while reducing computational requirements by 34%. Expert specialization yields significant domain-specific gains, with improvements from 51% to 88% F1 score in general classification and from 23% to 44% accuracy in news categorization. We formalize the routing entropy optimization problem and demonstrate how proper regularization techniques lead to 14% higher expert utilization rates. These results validate MoECollab as an effective approach for democratizing LLM development through architecturally-supported collaboration.

Title: Point Cloud Based Scene Segmentation: A Survey

Authors: Dan Halperin, Niklas Eisl
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12595
Pdf URL: https://arxiv.org/pdf/2503.12595
Copy Paste: [[2503.12595]] Point Cloud Based Scene Segmentation: A Survey(https://arxiv.org/abs/2503.12595)
Keywords: segmentation
Abstract: Autonomous driving is a safety-critical application, and it is therefore a top priority that the accompanying assistance systems are able to provide precise information about the surrounding environment of the vehicle. Tasks such as 3D Object Detection deliver an insufficiently detailed understanding of the surrounding scene because they only predict a bounding box for foreground objects. In contrast, 3D Semantic Segmentation provides richer and denser information about the environment by assigning a label to each individual point, which is of paramount importance for autonomous driving tasks, such as navigation or lane changes. To inspire future research, in this review paper, we provide a comprehensive overview of the current state-of-the-art methods in the field of Point Cloud Semantic Segmentation for autonomous driving. We categorize the approaches into projection-based, 3D-based and hybrid methods. Moreover, we discuss the most important and commonly used datasets for this task and also emphasize the importance of synthetic data to support research when real-world data is limited. We further present the results of the different methods and compare them with respect to their segmentation accuracy and efficiency.

Title: GraphEval: A Lightweight Graph-Based LLM Framework for Idea Evaluation

Authors: Tao Feng, Yihang Sun, Jiaxuan You
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.12600
Pdf URL: https://arxiv.org/pdf/2503.12600
Copy Paste: [[2503.12600]] GraphEval: A Lightweight Graph-Based LLM Framework for Idea Evaluation(https://arxiv.org/abs/2503.12600)
Keywords: robust, extraction, large language model
Abstract: The powerful capabilities of Large Language Models (LLMs) have led to their growing use in evaluating human-generated content, particularly in evaluating research ideas within academic settings. Existing solutions primarily rely on prompt-based LLM methods or fine-tuned lightweight language models for idea evaluation. However, these methods are often unstable and struggle to comprehend the complex semantic information embedded in the ideas, impeding their ability to perform high-quality evaluations. To address the above challenges, we propose GraphEval, a lightweight graph-based LLM framework for idea evaluation. Our insight is that a complex idea can be broken down into comprehensible viewpoint nodes using prompts from small LLMs. These viewpoint nodes can then be linked together through edges created from LLM-based relation extraction and/or BERT similarity scores. The created viewpoint-graph can be used to conveniently propagate scores across view-nodes to improve the robustness of the idea evaluations. In particular, we propose two lightweight graph-based methods for idea evaluation: (1) GraphEval-LP: a training-free label propagation algorithm that propagates evaluation scores from known view-nodes to unknown nodes; (2) GraphEval-GNN: a Graph Neural Networks (GNN) that is trained to predict the evaluation scores given the observed graph with minimal computation resources. Moreover, to overcome LLM's limitation in objectively assessing the novelty of ideas, we further propose a novelty detection model to GraphEval-GNN to enhance its capability in judging idea novelty. Experiments on two datasets show GraphEval improves F1 scores by at least 14% with low computation and API costs. Additionally, GraphEval can effectively detect plagiarized ideas.

Title: SynLlama: Generating Synthesizable Molecules and Their Analogs with Large Language Models

Authors: Kunyang Sun, Dorian Bagni, Joseph M. Cavanagh, Yingze Wang, Jacob M. Sawyer, Andrew Gritsevskiy, Teresa Head-Gordon
Subjects: cs.LG, physics.bio-ph
Abstract URL: https://arxiv.org/abs/2503.12602
Pdf URL: https://arxiv.org/pdf/2503.12602
Copy Paste: [[2503.12602]] SynLlama: Generating Synthesizable Molecules and Their Analogs with Large Language Models(https://arxiv.org/abs/2503.12602)
Keywords: robust, generative, large language model
Abstract: Generative machine learning models for small molecule drug discovery have shown immense promise, but many molecules generated by this approach are too difficult to synthesize to be worth further investigation or further development. We present a novel approach by fine-tuning Meta's Llama3 large language models (LLMs) to create SynLlama, which generates full synthetic pathways made of commonly accessible Enamine building blocks and robust organic reaction templates. SynLlama explores a large synthesizable space using significantly less data compared to other state-of-the-art methods, and offers strong performance in bottom-up synthesis, synthesizable analog generation, and hit expansion, offering medicinal chemists a valuable tool for drug discovery developments. We find that SynLlama can effectively generalize to unseen yet purchasable building blocks, meaning that its reconstruction capabilities extend to a broader synthesizable chemical space than the training data.

Title: Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

Authors: Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, William Wang, Ziwei Liu, Jiebo Luo, Hao Fei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12605
Pdf URL: https://arxiv.org/pdf/2503.12605
Copy Paste: [[2503.12605]] Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey(https://arxiv.org/abs/2503.12605)
Keywords: large language model
Abstract: By extending the advantage of chain-of-thought (CoT) reasoning in human-like step-by-step processes to multimodal contexts, multimodal CoT (MCoT) reasoning has recently garnered significant research attention, especially in the integration with multimodal large language models (MLLMs). Existing MCoT studies design various methodologies and innovative reasoning paradigms to address the unique challenges of image, video, speech, audio, 3D, and structured data across different modalities, achieving extensive success in applications such as robotics, healthcare, autonomous driving, and multimodal generation. However, MCoT still presents distinct challenges and opportunities that require further focus to ensure consistent thriving in this field, where, unfortunately, an up-to-date review of this domain is lacking. To bridge this gap, we present the first systematic survey of MCoT reasoning, elucidating the relevant foundational concepts and definitions. We offer a comprehensive taxonomy and an in-depth analysis of current methodologies from diverse perspectives across various application scenarios. Furthermore, we provide insights into existing challenges and future research directions, aiming to foster innovation toward multimodal AGI.

Title: UniBERTs: Adversarial Training for Language-Universal Representations

Authors: Andrei-Marius Avram, Marian Lupaşcu, Dumitru-Clementin Cercel, Ionuţ Mironică, Ştefan Trăuşan-Matu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12608
Pdf URL: https://arxiv.org/pdf/2503.12608
Copy Paste: [[2503.12608]] UniBERTs: Adversarial Training for Language-Universal Representations(https://arxiv.org/abs/2503.12608)
Keywords: robust
Abstract: This paper presents UniBERT, a compact multilingual language model that leverages an innovative training framework integrating three components: masked language modeling, adversarial training, and knowledge distillation. Pre-trained on a meticulously curated Wikipedia corpus spanning 107 languages, UniBERT is designed to reduce the computational demands of large-scale models while maintaining competitive performance across various natural language processing tasks. Comprehensive evaluations on four tasks -- named entity recognition, natural language inference, question answering, and semantic textual similarity -- demonstrate that our multilingual training strategy enhanced by an adversarial objective significantly improves cross-lingual generalization. Specifically, UniBERT models show an average relative improvement of 7.72% over traditional baselines, which achieved an average relative improvement of only 1.17%, with statistical analysis confirming the significance of these gains (p-value = 0.0181). This work highlights the benefits of combining adversarial training and knowledge distillation to build scalable and robust language models, thereby advancing the field of multilingual and cross-lingual natural language processing.

Title: LATINO-PRO: LAtent consisTency INverse sOlver with PRompt Optimization

Authors: Alessio Spagnoletti, Jean Prost, Andrés Almansa, Nicolas Papadakis, Marcelo Pereyra
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.12615
Pdf URL: https://arxiv.org/pdf/2503.12615
Copy Paste: [[2503.12615]] LATINO-PRO: LAtent consisTency INverse sOlver with PRompt Optimization(https://arxiv.org/abs/2503.12615)
Keywords: diffusion, generative
Abstract: Text-to-image latent diffusion models (LDMs) have recently emerged as powerful generative models with great potential for solving inverse problems in imaging. However, leveraging such models in a Plug & Play (PnP), zero-shot manner remains challenging because it requires identifying a suitable text prompt for the unknown image of interest. Also, existing text-to-image PnP approaches are highly computationally expensive. We herein address these challenges by proposing a novel PnP inference paradigm specifically designed for embedding generative models within stochastic inverse solvers, with special attention to Latent Consistency Models (LCMs), which distill LDMs into fast generators. We leverage our framework to propose LAtent consisTency INverse sOlver (LATINO), the first zero-shot PnP framework to solve inverse problems with priors encoded by LCMs. Our conditioning mechanism avoids automatic differentiation and reaches SOTA quality in as little as 8 neural function evaluations. As a result, LATINO delivers remarkably accurate solutions and is significantly more memory and computationally efficient than previous approaches. We then embed LATINO within an empirical Bayesian framework that automatically calibrates the text prompt from the observed measurements by marginal maximum likelihood estimation. Extensive experiments show that prompt self-calibration greatly improves estimation, allowing LATINO with PRompt Optimization to define new SOTAs in image reconstruction quality and computational efficiency.

Title: Scaling Semantic Categories: Investigating the Impact on Vision Transformer Labeling Performance

Authors: Anthony Lamelas, Harrison Muchnic
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.12617
Pdf URL: https://arxiv.org/pdf/2503.12617
Copy Paste: [[2503.12617]] Scaling Semantic Categories: Investigating the Impact on Vision Transformer Labeling Performance(https://arxiv.org/abs/2503.12617)
Keywords: transformer
Abstract: This study explores the impact of scaling semantic categories on the image classification performance of vision transformers (ViTs). In this specific case, the CLIP server provided by Jina AI is used for experimentation. The research hypothesizes that as the number of ground truth and artificially introduced semantically equivalent categories increases, the labeling accuracy of ViTs improves until a theoretical maximum or limit is reached. A wide variety of image datasets were chosen to test this hypothesis. These datasets were processed through a custom function in Python designed to evaluate the model's accuracy, with adjustments being made to account for format differences between datasets. By exponentially introducing new redundant categories, the experiment assessed accuracy trends until they plateaued, decreased, or fluctuated inconsistently. The findings show that while semantic scaling initially increases model performance, the benefits diminish or reverse after surpassing a critical threshold, providing insight into the limitations and possible optimization of category labeling strategies for ViTs.

Title: SCOOP: CoSt-effective COngestiOn Attacks in Payment Channel Networks

Authors: Mohammed Ababneh, Kartick Kolachala, Roopa Vishwanathan
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.12625
Pdf URL: https://arxiv.org/pdf/2503.12625
Copy Paste: [[2503.12625]] SCOOP: CoSt-effective COngestiOn Attacks in Payment Channel Networks(https://arxiv.org/abs/2503.12625)
Keywords: security, attack
Abstract: Payment channel networks (PCNs) are a promising solution to address blockchain scalability and throughput challenges, However, the security of PCNs and their vulnerability to attacks are not sufficiently studied. In this paper, we introduce SCOOP, a framework that includes two novel congestion attacks on PCNs. These attacks consider the minimum transferable amount along a path (path capacity) and the number of channels involved (path length), formulated as linear optimization problems. The first attack allocates the attacker's budget to achieve a specific congestion threshold, while the second maximizes congestion under budget constraints. Simulation results show the effectiveness of the proposed attack formulations in comparison to other attack strategies. Specifically, the results indicate that the first attack provides around a 40\% improvement in congestion performance, while the second attack offers approximately a 50\% improvement in comparison to the state-of-the-art. Moreover, in terms of payment to congestion efficiency, the first attack is about 60\% more efficient, and the second attack is around 90\% more efficient in comparison to state-of-the-art

Title: FW-Merging: Scaling Model Merging with Frank-Wolfe Optimization

Authors: Hao Mark Chen, Shell Xu Hu, Wayne Luk, Timothy Hospedales, Hongxiang Fan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12649
Pdf URL: https://arxiv.org/pdf/2503.12649
Copy Paste: [[2503.12649]] FW-Merging: Scaling Model Merging with Frank-Wolfe Optimization(https://arxiv.org/abs/2503.12649)
Keywords: data-free
Abstract: Model merging has emerged as a promising approach for multi-task learning (MTL), offering a data-efficient alternative to conventional fine-tuning. However, with the rapid development of the open-source AI ecosystem and the increasing availability of fine-tuned foundation models, existing model merging methods face two key limitations: (i) They are primarily designed for in-house fine-tuned models, making them less adaptable to diverse model sources with partially unknown model and task information, (ii) They struggle to scale effectively when merging numerous model checkpoints. To address these challenges, we formulate model merging as a constrained optimization problem and introduce a novel approach: Frank-Wolfe Merging (FW-Merging). Inspired by Frank-Wolfe optimization, our approach iteratively selects the most relevant model in the pool to minimize a linear approximation of the objective function and then executes a local merging similar to the Frank-Wolfe update. The objective function is designed to capture the desired behavior of the target-merged model, while the fine-tuned candidate models define the constraint set. More importantly, FW-Merging serves as an orthogonal technique for existing merging methods, seamlessly integrating with them to further enhance accuracy performance. Our experiments show that FW-Merging scales across diverse model sources, remaining stable with 16 irrelevant models and improving by 15.3% with 16 relevant models on 20 CV tasks, while maintaining constant memory overhead, unlike the linear overhead of data-informed merging methods. Compared with the state-of-the-art approaches, FW-Merging surpasses the data-free merging method by 32.8% and outperforms the data-informed Adamerging by 8.39% when merging 20 ViT models.

Title: UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing

Authors: Tsu-Jui Fu, Yusu Qian, Chen Chen, Wenze Hu, Zhe Gan, Yinfei Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12652
Pdf URL: https://arxiv.org/pdf/2503.12652
Copy Paste: [[2503.12652]] UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing(https://arxiv.org/abs/2503.12652)
Keywords: diffusion, segmentation
Abstract: Text-to-Image (T2I) diffusion models have shown impressive results in generating visually compelling images following user prompts. Building on this, various methods further fine-tune the pre-trained T2I model for specific tasks. However, this requires separate model architectures, training designs, and multiple parameter sets to handle different tasks. In this paper, we introduce UniVG, a generalist diffusion model capable of supporting a diverse range of image generation tasks with a single set of weights. UniVG treats multi-modal inputs as unified conditions to enable various downstream applications, ranging from T2I generation, inpainting, instruction-based editing, identity-preserving generation, and layout-guided generation, to depth estimation and referring segmentation. Through comprehensive empirical studies on data mixing and multi-task training, we provide detailed insights into the training processes and decisions that inform our final designs. For example, we show that T2I generation and other tasks, such as instruction-based editing, can coexist without performance trade-offs, while auxiliary tasks like depth estimation and referring segmentation enhance image editing. Notably, our model can even outperform some task-specific models on their respective benchmarks, marking a significant step towards a unified image generation model.

Title: TuneNSearch: a hybrid transfer learning and local search approach for solving vehicle routing problems

Authors: Arthur Corrêa, Cristóvão Silva, Liming Xu, Alexandra Brintrup, Samuel Moniz
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.12662
Pdf URL: https://arxiv.org/pdf/2503.12662
Copy Paste: [[2503.12662]] TuneNSearch: a hybrid transfer learning and local search approach for solving vehicle routing problems(https://arxiv.org/abs/2503.12662)
Keywords: transformer
Abstract: This paper introduces TuneNSearch, a hybrid transfer learning and local search approach for addressing different variants of vehicle routing problems (VRP). Recently, multi-task learning has gained much attention for solving VRP variants. However, this adaptability often compromises the performance of the models. To address this challenge, we first pre-train a reinforcement learning model on the multi-depot VRP, followed by a short fine-tuning phase to adapt it to different variants. By leveraging the complexity of the multi-depot VRP, the pre-trained model learns richer node representations and gains more transferable knowledge compared to models trained on simpler routing problems, such as the traveling salesman problem. TuneNSearch employs, in the first stage, a Transformer-based architecture, augmented with a residual edge-graph attention network to capture the impact of edge distances and residual connections between layers. This architecture allows for a more precise capture of graph-structured data, improving the encoding of VRP's features. After inference, our model is also coupled with a second stage composed of a local search algorithm, which yields substantial performance gains with minimal computational overhead added. Results show that TuneNSearch outperforms many existing state-of-the-art models trained for each VRP variant, requiring only one-fifth of the training epochs. Our approach demonstrates strong generalization, achieving high performance across different tasks, distributions and problem sizes, thus addressing a long-standing gap in the literature.

Title: Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene Understanding

Authors: Imran Kabir, Md Alimoor Reza, Syed Billah
Subjects: cs.CV, cs.CL, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2503.12663
Pdf URL: https://arxiv.org/pdf/2503.12663
Copy Paste: [[2503.12663]] Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene Understanding(https://arxiv.org/abs/2503.12663)
Keywords: interpretability
Abstract: Large multimodal models (LMMs) are increasingly integrated into autonomous driving systems for user interaction. However, their limitations in fine-grained spatial reasoning pose challenges for system interpretability and user trust. We introduce Logic-RAG, a novel Retrieval-Augmented Generation (RAG) framework that improves LMMs' spatial understanding in driving scenarios. Logic-RAG constructs a dynamic knowledge base (KB) about object-object relationships in first-order logic (FOL) using a perception module, a query-to-logic embedder, and a logical inference engine. We evaluated Logic-RAG on visual-spatial queries using both synthetic and real-world driving videos. When using popular LMMs (GPT-4V, Claude 3.5) as proxies for an autonomous driving system, these models achieved only 55% accuracy on synthetic driving scenes and under 75% on real-world driving scenes. Augmenting them with Logic-RAG increased their accuracies to over 80% and 90%, respectively. An ablation study showed that even without logical inference, the fact-based context constructed by Logic-RAG alone improved accuracy by 15%. Logic-RAG is extensible: it allows seamless replacement of individual components with improved versions and enables domain experts to compose new knowledge in both FOL and natural language. In sum, Logic-RAG addresses critical spatial reasoning deficiencies in LMMs for autonomous driving applications. Code and data are available at this https URL.

Title: Plausibility Vaccine: Injecting LLM Knowledge for Event Plausibility

Authors: Jacob Chmura, Jonah Dauvet, Sebastian Sabry
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12667
Pdf URL: https://arxiv.org/pdf/2503.12667
Copy Paste: [[2503.12667]] Plausibility Vaccine: Injecting LLM Knowledge for Event Plausibility(https://arxiv.org/abs/2503.12667)
Keywords: large language model
Abstract: Despite advances in language modelling, distributional methods that build semantic representations from co-occurrences fail to discriminate between plausible and implausible events. In this work, we investigate how plausibility prediction can be improved by injecting latent knowledge prompted from large language models using parameter-efficient fine-tuning. We train 12 task adapters to learn various physical properties and association measures and perform adapter fusion to compose latent semantic knowledge from each task on top of pre-trained AlBERT embeddings. We automate auxiliary task data generation, which enables us to scale our approach and fine-tune our learned representations across two plausibility datasets. Our code is available at this https URL.

Title: ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory

Authors: Liangyu Wang, Jie Ren, Hang Xu, Junxiao Wang, Huanyi Xie, David E. Keyes, Di Wang
Subjects: cs.LG, cs.PF
Abstract URL: https://arxiv.org/abs/2503.12668
Pdf URL: https://arxiv.org/pdf/2503.12668
Copy Paste: [[2503.12668]] ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory(https://arxiv.org/abs/2503.12668)
Keywords: large language model
Abstract: Fine-tuning large pre-trained LLMs generally demands extensive GPU memory. Traditional first-order optimizers like SGD encounter substantial difficulties due to increased memory requirements from storing activations and gradients during both the forward and backward phases as the model size expands. Alternatively, zeroth-order (ZO) techniques can compute gradients using just forward operations, eliminating the need to store activations. Furthermore, by leveraging CPU capabilities, it's feasible to enhance both the memory and processing power available to a single GPU. We propose a novel framework, ZO2 (Zeroth-Order Offloading), for efficient zeroth-order fine-tuning of LLMs with only limited GPU memory. Our framework dynamically shifts model parameters between the CPU and GPU as required, optimizing computation flow and maximizing GPU usage by minimizing downtime. This integration of parameter adjustments with ZO's double forward operations reduces unnecessary data movement, enhancing the fine-tuning efficacy. Additionally, our framework supports an innovative low-bit precision approach in AMP mode to streamline data exchanges between the CPU and GPU. Employing this approach allows us to fine-tune extraordinarily large models, such as the OPT-175B with more than 175 billion parameters, on a mere 18GB GPU--achievements beyond the reach of traditional methods. Moreover, our framework achieves these results with almost no additional time overhead and absolutely no accuracy loss compared to standard zeroth-order methods. ZO2's code has been open-sourced in this https URL.

Title: Domain Generalization for Improved Human Activity Recognition in Office Space Videos Using Adaptive Pre-processing

Authors: Partho Ghosh, Raisa Bentay Hossain, Mohammad Zunaed, Taufiq Hasan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12678
Pdf URL: https://arxiv.org/pdf/2503.12678
Copy Paste: [[2503.12678]] Domain Generalization for Improved Human Activity Recognition in Office Space Videos Using Adaptive Pre-processing(https://arxiv.org/abs/2503.12678)
Keywords: robust
Abstract: Automatic video activity recognition is crucial across numerous domains like surveillance, healthcare, and robotics. However, recognizing human activities from video data becomes challenging when training and test data stem from diverse domains. Domain generalization, adapting to unforeseen domains, is thus essential. This paper focuses on office activity recognition amidst environmental variability. We propose three pre-processing techniques applicable to any video encoder, enhancing robustness against environmental variations. Our study showcases the efficacy of MViT, a leading state-of-the-art video classification model, and other video encoders combined with our techniques, outperforming state-of-the-art domain adaptation methods. Our approach significantly boosts accuracy, precision, recall and F1 score on unseen domains, emphasizing its adaptability in real-world scenarios with diverse video data sources. This method lays a foundation for more reliable video activity recognition systems across heterogeneous data domains.

Title: Algebraic Adversarial Attacks on Explainability Models

Authors: Lachlan Simpson, Federico Costanza, Kyle Millar, Adriel Cheng, Cheng-Chew Lim, Hong Gunn Chew
Subjects: cs.LG, math.GR
Abstract URL: https://arxiv.org/abs/2503.12683
Pdf URL: https://arxiv.org/pdf/2503.12683
Copy Paste: [[2503.12683]] Algebraic Adversarial Attacks on Explainability Models(https://arxiv.org/abs/2503.12683)
Keywords: attack, explainability
Abstract: Classical adversarial attacks are phrased as a constrained optimisation problem. Despite the efficacy of a constrained optimisation approach to adversarial attacks, one cannot trace how an adversarial point was generated. In this work, we propose an algebraic approach to adversarial attacks and study the conditions under which one can generate adversarial examples for post-hoc explainability models. Phrasing neural networks in the framework of geometric deep learning, algebraic adversarial attacks are constructed through analysis of the symmetry groups of neural networks. Algebraic adversarial examples provide a mathematically tractable approach to adversarial examples. We validate our approach of algebraic adversarial examples on two well-known and one real-world dataset.

Title: GenStereo: Towards Open-World Generation of Stereo Images and Unsupervised Matching

Authors: Feng Qiao, Zhexiao Xiong, Eric Xing, Nathan Jacobs
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12720
Pdf URL: https://arxiv.org/pdf/2503.12720
Copy Paste: [[2503.12720]] GenStereo: Towards Open-World Generation of Stereo Images and Unsupervised Matching(https://arxiv.org/abs/2503.12720)
Keywords: diffusion
Abstract: Stereo images are fundamental to numerous applications, including extended reality (XR) devices, autonomous driving, and robotics. Unfortunately, acquiring high-quality stereo images remains challenging due to the precise calibration requirements of dual-camera setups and the complexity of obtaining accurate, dense disparity maps. Existing stereo image generation methods typically focus on either visual quality for viewing or geometric accuracy for matching, but not both. We introduce GenStereo, a diffusion-based approach, to bridge this gap. The method includes two primary innovations (1) conditioning the diffusion process on a disparity-aware coordinate embedding and a warped input image, allowing for more precise stereo alignment than previous methods, and (2) an adaptive fusion mechanism that intelligently combines the diffusion-generated image with a warped image, improving both realism and disparity consistency. Through extensive training on 11 diverse stereo datasets, GenStereo demonstrates strong generalization ability. GenStereo achieves state-of-the-art performance in both stereo image generation and unsupervised stereo matching tasks. Our framework eliminates the need for complex hardware setups while enabling high-quality stereo image generation, making it valuable for both real-world applications and unsupervised learning scenarios. Project page is available at this https URL

Title: TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research

Authors: Philip Quirke, Clement Neo, Abir Harrasse, Dhruv Nathawani, Amir Abdullah
Subjects: cs.LG, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2503.12730
Pdf URL: https://arxiv.org/pdf/2503.12730
Copy Paste: [[2503.12730]] TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research(https://arxiv.org/abs/2503.12730)
Keywords: interpretability
Abstract: Mechanistic interpretability research faces a gap between analyzing simple circuits in toy tasks and discovering features in large models. To bridge this gap, we propose text-to-SQL generation as an ideal task to study, as it combines the formal structure of toy tasks with real-world complexity. We introduce TinySQL, a synthetic dataset progressing from basic to advanced SQL operations, and train models ranging from 33M to 1B parameters to establish a comprehensive testbed for interpretability. We apply multiple complementary interpretability techniques, including edge attribution patching and sparse autoencoders, to identify minimal circuits and components supporting SQL generation. Our analysis reveals both the potential and limitations of current interpretability methods, showing how circuits can vary even across similar queries. Lastly, we demonstrate how mechanistic interpretability can identify flawed heuristics in models and improve synthetic dataset design. Our work provides a comprehensive framework for evaluating and advancing interpretability techniques while establishing clear boundaries for their reliable application.

Title: A Linearized Alternating Direction Multiplier Method for Federated Matrix Completion Problems

Authors: Patrick Hytla, ran T. A. Nghia, Duy Nhat Phan, Andrew Rice
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.12733
Pdf URL: https://arxiv.org/pdf/2503.12733
Copy Paste: [[2503.12733]] A Linearized Alternating Direction Multiplier Method for Federated Matrix Completion Problems(https://arxiv.org/abs/2503.12733)
Keywords: privacy, federate
Abstract: Matrix completion is fundamental for predicting missing data with a wide range of applications in personalized healthcare, e-commerce, recommendation systems, and social network analysis. Traditional matrix completion approaches typically assume centralized data storage, which raises challenges in terms of computational efficiency, scalability, and user privacy. In this paper, we address the problem of federated matrix completion, focusing on scenarios where user-specific data is distributed across multiple clients, and privacy constraints are uncompromising. Federated learning provides a promising framework to address these challenges by enabling collaborative learning across distributed datasets without sharing raw data. We propose \texttt{FedMC-ADMM} for solving federated matrix completion problems, a novel algorithmic framework that combines the Alternating Direction Method of Multipliers with a randomized block-coordinate strategy and alternating proximal gradient steps. Unlike existing federated approaches, \texttt{FedMC-ADMM} effectively handles multi-block nonconvex and nonsmooth optimization problems, allowing efficient computation while preserving user privacy. We analyze the theoretical properties of our algorithm, demonstrating subsequential convergence and establishing a convergence rate of $\mathcal{O}(K^{-1/2})$, leading to a communication complexity of $\mathcal{O}(\epsilon^{-2})$ for reaching an $\epsilon$-stationary point. This work is the first to establish these theoretical guarantees for federated matrix completion in the presence of multi-block variables. To validate our approach, we conduct extensive experiments on real-world datasets, including MovieLens 1M, 10M, and Netflix. The results demonstrate that \texttt{FedMC-ADMM} outperforms existing methods in terms of convergence speed and testing accuracy.

Title: In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention

Authors: Jianliang He, Xintian Pan, Siyu Chen, Zhuoran Yang
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.12734
Pdf URL: https://arxiv.org/pdf/2503.12734
Copy Paste: [[2503.12734]] In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention(https://arxiv.org/abs/2503.12734)
Keywords: interpretability, transformer
Abstract: We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Through extensive empirical experiments and rigorous theoretical analysis, we demystify the emergence of elegant attention patterns: a diagonal and homogeneous pattern in the key-query (KQ) weights, and a last-entry-only and zero-sum pattern in the output-value (OV) weights. Remarkably, these patterns consistently appear from gradient-based training starting from random initialization. Our analysis reveals that such emergent structures enable multi-head attention to approximately implement a debiased gradient descent predictor -- one that outperforms single-head attention and nearly achieves Bayesian optimality up to proportional factor. Furthermore, compared to linear transformers, the softmax attention readily generalizes to sequences longer than those seen during training. We also extend our study to scenarios with non-isotropic covariates and multi-task linear regression. In the former, multi-head attention learns to implement a form of pre-conditioned gradient descent. In the latter, we uncover an intriguing regime where the interplay between head number and task number triggers a superposition phenomenon that efficiently resolves multi-task in-context learning. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution, paving the way for deeper understanding and broader applications of in-context learning.

Title: Cohort-attention Evaluation Metric against Tied Data: Studying Performance of Classification Models in Cancer Detection

Authors: Longfei Wei, Fang Sheng, Jianfei Zhang
Subjects: cs.LG, cs.CE, stat.ML
Abstract URL: https://arxiv.org/abs/2503.12755
Pdf URL: https://arxiv.org/pdf/2503.12755
Copy Paste: [[2503.12755]] Cohort-attention Evaluation Metric against Tied Data: Studying Performance of Classification Models in Cancer Detection(https://arxiv.org/abs/2503.12755)
Keywords: robust, fair, interpretability
Abstract: Artificial intelligence (AI) has significantly improved medical screening accuracy, particularly in cancer detection and risk assessment. However, traditional classification metrics often fail to account for imbalanced data, varying performance across cohorts, and patient-level inconsistencies, leading to biased evaluations. We propose the Cohort-Attention Evaluation Metrics (CAT) framework to address these challenges. CAT introduces patient-level assessment, entropy-based distribution weighting, and cohort-weighted sensitivity and specificity. Key metrics like CATSensitivity (CATSen), CATSpecificity (CATSpe), and CATMean ensure balanced and fair evaluation across diverse populations. This approach enhances predictive reliability, fairness, and interpretability, providing a robust evaluation method for AI-driven medical screening models.

Title: VasTSD: Learning 3D Vascular Tree-state Space Diffusion Model for Angiography Synthesis

Authors: Zhifeng Wang, Renjiao Yi, Xin Wen, Chenyang Zhu, Kai Xu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.12758
Pdf URL: https://arxiv.org/pdf/2503.12758
Copy Paste: [[2503.12758]] VasTSD: Learning 3D Vascular Tree-state Space Diffusion Model for Angiography Synthesis(https://arxiv.org/abs/2503.12758)
Keywords: diffusion, generative
Abstract: Angiography imaging is a medical imaging technique that enhances the visibility of blood vessels within the body by using contrast agents. Angiographic images can effectively assist in the diagnosis of vascular diseases. However, contrast agents may bring extra radiation exposure which is harmful to patients with health risks. To mitigate these concerns, in this paper, we aim to automatically generate angiography from non-angiographic inputs, by leveraging and enhancing the inherent physical properties of vascular structures. Previous methods relying on 2D slice-based angiography synthesis struggle with maintaining continuity in 3D vascular structures and exhibit limited effectiveness across different imaging modalities. We propose VasTSD, a 3D vascular tree-state space diffusion model to synthesize angiography from 3D non-angiographic volumes, with a novel state space serialization approach that dynamically constructs vascular tree topologies, integrating these with a diffusion-based generative model to ensure the generation of anatomically continuous vasculature in 3D volumes. A pre-trained vision embedder is employed to construct vascular state space representations, enabling consistent modeling of vascular structures across multiple modalities. Extensive experiments on various angiographic datasets demonstrate the superiority of VasTSD over prior works, achieving enhanced continuity of blood vessels in synthesized angiographic synthesis for multiple modalities and anatomical regions.

Title: RAG-RL: Advancing Retrieval-Augmented Generation via RL and Curriculum Learning

Authors: Jerry Huang, Siddarth Madala, Risham Sidhu, Cheng Niu, Julia Hockenmaier, Tong Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12759
Pdf URL: https://arxiv.org/pdf/2503.12759
Copy Paste: [[2503.12759]] RAG-RL: Advancing Retrieval-Augmented Generation via RL and Curriculum Learning(https://arxiv.org/abs/2503.12759)
Keywords: generative
Abstract: Recent research highlights the challenges retrieval models face in retrieving useful contexts and the limitations of generation models in effectively utilizing those contexts in retrieval-augmented generation (RAG) settings. To address these challenges, we introduce RAG-RL, the first reasoning language model (RLM) specifically trained for RAG. RAG-RL demonstrates that stronger answer generation models can identify relevant contexts within larger sets of retrieved information -- thereby alleviating the burden on retrievers -- while also being able to utilize those contexts more effectively. Moreover, we show that curriculum design in the reinforcement learning (RL) post-training process is a powerful approach to enhancing model performance. We benchmark our method on two open-domain question-answering datasets and achieve state-of-the-art results, surpassing previous SOTA generative reader models. In addition, we offers empirical insights into various curriculum learning strategies, providing a deeper understanding of their impact on model performance.

Title: A Survey on Human Interaction Motion Generation

Authors: Kewei Sui, Anindita Ghosh, Inwoo Hwang, Jian Wang, Chuan Guo
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.12763
Pdf URL: https://arxiv.org/pdf/2503.12763
Copy Paste: [[2503.12763]] A Survey on Human Interaction Motion Generation(https://arxiv.org/abs/2503.12763)
Keywords: generative
Abstract: Humans inhabit a world defined by interactions -- with other humans, objects, and environments. These interactive movements not only convey our relationships with our surroundings but also demonstrate how we perceive and communicate with the real world. Therefore, replicating these interaction behaviors in digital systems has emerged as an important topic for applications in robotics, virtual reality, and animation. While recent advances in deep generative models and new datasets have accelerated progress in this field, significant challenges remain in modeling the intricate human dynamics and their interactions with entities in the external world. In this survey, we present, for the first time, a comprehensive overview of the literature in human interaction motion generation. We begin by establishing foundational concepts essential for understanding the research background. We then systematically review existing solutions and datasets across three primary interaction tasks -- human-human, human-object, and human-scene interactions -- followed by evaluation metrics. Finally, we discuss open research directions and future opportunities.

Title: Asynchronous Predictive Counterfactual Regret Minimization$^+$ Algorithm in Solving Extensive-Form Games

Authors: Linjian Meng, Youzhi Zhang, Zhenxing Ge, Tianpei Yang, Yang Gao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.12770
Pdf URL: https://arxiv.org/pdf/2503.12770
Copy Paste: [[2503.12770]] Asynchronous Predictive Counterfactual Regret Minimization$^+$ Algorithm in Solving Extensive-Form Games(https://arxiv.org/abs/2503.12770)
Keywords: robust
Abstract: Counterfactual Regret Minimization (CFR) algorithms are widely used to compute a Nash equilibrium (NE) in two-player zero-sum imperfect-information extensive-form games (IIGs). Among them, Predictive CFR$^+$ (PCFR$^+$) is particularly powerful, achieving an exceptionally fast empirical convergence rate via the prediction in many games. However, the empirical convergence rate of PCFR$^+$ would significantly degrade if the prediction is inaccurate, leading to unstable performance on certain IIGs. To enhance the robustness of PCFR$^+$, we propose a novel variant, Asynchronous PCFR$^+$ (APCFR$^+$), which employs an adaptive asynchronization of step-sizes between the updates of implicit and explicit accumulated counterfactual regrets to mitigate the impact of the prediction inaccuracy on convergence. We present a theoretical analysis demonstrating why APCFR$^+$ can enhance the robustness. Finally, we propose a simplified version of APCFR$^+$ called Simple APCFR$^+$ (SAPCFR$^+$), which uses a fixed asynchronization of step-sizes to simplify the implementation that only needs a single-line modification of the original PCFR+. Interestingly, SAPCFR$^+$ achieves a constant-factor lower theoretical regret bound than PCFR$^+$ in the worst case. Experimental results demonstrate that (i) both APCFR$^+$ and SAPCFR$^+$ outperform PCFR$^+$ in most of the tested games, as well as (ii) SAPCFR$^+$ achieves a comparable empirical convergence rate with APCFR$^+$.

Title: NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models

Authors: Sung-Yeon Park, Can Cui, Yunsheng Ma, Ahmadreza Moradipari, Rohit Gupta, Kyungtae Han, Ziran Wang
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2503.12772
Pdf URL: https://arxiv.org/pdf/2503.12772
Copy Paste: [[2503.12772]] NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models(https://arxiv.org/abs/2503.12772)
Keywords: large language model
Abstract: Recent advances in multi-modal large language models (MLLMs) have demonstrated strong performance across various domains; however, their ability to comprehend driving scenes remains less proven. The complexity of driving scenarios, which includes multi-view information, poses significant challenges for existing MLLMs. In this paper, we introduce NuPlanQA-Eval, a multi-view, multi-modal evaluation benchmark for driving scene understanding. To further support generalization to multi-view driving scenarios, we also propose NuPlanQA-1M, a large-scale dataset comprising 1M real-world visual question-answering (VQA) pairs. For context-aware analysis of traffic scenes, we categorize our dataset into nine subtasks across three core skills: Road Environment Perception, Spatial Relations Recognition, and Ego-Centric Reasoning. Furthermore, we present BEV-LLM, integrating Bird's-Eye-View (BEV) features from multi-view images into MLLMs. Our evaluation results reveal key challenges that existing MLLMs face in driving scene-specific perception and spatial reasoning from ego-centric perspectives. In contrast, BEV-LLM demonstrates remarkable adaptability to this domain, outperforming other models in six of the nine subtasks. These findings highlight how BEV integration enhances multi-view MLLMs while also identifying key areas that require further refinement for effective adaptation to driving scenes. To facilitate further research, we publicly release NuPlanQA at this https URL.

Title: TransDiff: Diffusion-Based Method for Manipulating Transparent Objects Using a Single RGB-D Image

Authors: Haoxiao Wang, Kaichen Zhou, Binrui Gu, Zhiyuan Feng, Weijie Wang, Peilin Sun, Yicheng Xiao, Jianhua Zhang, Hao Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12779
Pdf URL: https://arxiv.org/pdf/2503.12779
Copy Paste: [[2503.12779]] TransDiff: Diffusion-Based Method for Manipulating Transparent Objects Using a Single RGB-D Image(https://arxiv.org/abs/2503.12779)
Keywords: diffusion, segmentation
Abstract: Manipulating transparent objects presents significant challenges due to the complexities introduced by their reflection and refraction properties, which considerably hinder the accurate estimation of their 3D shapes. To address these challenges, we propose a single-view RGB-D-based depth completion framework, TransDiff, that leverages the Denoising Diffusion Probabilistic Models(DDPM) to achieve material-agnostic object grasping in desktop. Specifically, we leverage features extracted from RGB images, including semantic segmentation, edge maps, and normal maps, to condition the depth map generation process. Our method learns an iterative denoising process that transforms a random depth distribution into a depth map, guided by initially refined depth information, ensuring more accurate depth estimation in scenarios involving transparent objects. Additionally, we propose a novel training method to better align the noisy depth and RGB image features, which are used as conditions to refine depth estimation step by step. Finally, we utilized an improved inference process to accelerate the denoising procedure. Through comprehensive experimental validation, we demonstrate that our method significantly outperforms the baselines in both synthetic and real-world benchmarks with acceptable inference time. The demo of our method can be found on this https URL

Title: LangDA: Building Context-Awareness via Language for Domain Adaptive Semantic Segmentation

Authors: Chang Liu, Bavesh Balaji, Saad Hossain, C Thomas, Kwei-Herng Lai, Raviteja Vemulapalli, Alexander Wong, Sirisha Rambhatla
Subjects: cs.CV, cs.AI, cs.LG, eess.IV, stat.ML
Abstract URL: https://arxiv.org/abs/2503.12780
Pdf URL: https://arxiv.org/pdf/2503.12780
Copy Paste: [[2503.12780]] LangDA: Building Context-Awareness via Language for Domain Adaptive Semantic Segmentation(https://arxiv.org/abs/2503.12780)
Keywords: segmentation
Abstract: Unsupervised domain adaptation for semantic segmentation (DASS) aims to transfer knowledge from a label-rich source domain to a target domain with no labels. Two key approaches in DASS are (1) vision-only approaches using masking or multi-resolution crops, and (2) language-based approaches that use generic class-wise prompts informed by target domain (e.g. "a {snowy} photo of a {class}"). However, the former is susceptible to noisy pseudo-labels that are biased to the source domain. The latter does not fully capture the intricate spatial relationships of objects -- key for dense prediction tasks. To this end, we propose LangDA. LangDA addresses these challenges by, first, learning contextual relationships between objects via VLM-generated scene descriptions (e.g. "a pedestrian is on the sidewalk, and the street is lined with buildings."). Second, LangDA aligns the entire image features with text representation of this context-aware scene caption and learns generalized representations via text. With this, LangDA sets the new state-of-the-art across three DASS benchmarks, outperforming existing methods by 2.6%, 1.4% and 3.9%.

Title: SAM2 for Image and Video Segmentation: A Comprehensive Survey

Authors: Zhang Jiaxing, Tang Hao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12781
Pdf URL: https://arxiv.org/pdf/2503.12781
Copy Paste: [[2503.12781]] SAM2 for Image and Video Segmentation: A Comprehensive Survey(https://arxiv.org/abs/2503.12781)
Keywords: segmentation
Abstract: Despite significant advances in deep learning for image and video segmentation, existing models continue to face challenges in cross-domain adaptability and generalization. Image and video segmentation are fundamental tasks in computer vision with wide-ranging applications in healthcare, agriculture, industrial inspection, and autonomous driving. With the advent of large-scale foundation models, SAM2 - an improved version of SAM (Segment Anything Model)has been optimized for segmentation tasks, demonstrating enhanced performance in complex scenarios. However, SAM2's adaptability and limitations in specific domains require further investigation. This paper systematically analyzes the application of SAM2 in image and video segmentation and evaluates its performance in various fields. We begin by introducing the foundational concepts of image segmentation, categorizing foundation models, and exploring the technical characteristics of SAM and SAM2. Subsequently, we delve into SAM2's applications in static image and video segmentation, emphasizing its performance in specialized areas such as medical imaging and the challenges of cross-domain adaptability. As part of our research, we reviewed over 200 related papers to provide a comprehensive analysis of the topic. Finally, the paper highlights the strengths and weaknesses of SAM2 in segmentation tasks, identifies the technical challenges it faces, and proposes future development directions. This review provides valuable insights and practical recommendations for optimizing and applying SAM2 in real-world scenarios.

Title: Mixed-granularity Implicit Representation for Continuous Hyperspectral Compressive Reconstruction

Authors: Jianan Li, Huan Chen, Wangcai Zhao, Rui Chen, Tingfa Xu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.12783
Pdf URL: https://arxiv.org/pdf/2503.12783
Copy Paste: [[2503.12783]] Mixed-granularity Implicit Representation for Continuous Hyperspectral Compressive Reconstruction(https://arxiv.org/abs/2503.12783)
Keywords: extraction
Abstract: Hyperspectral Images (HSIs) are crucial across numerous fields but are hindered by the long acquisition times associated with traditional spectrometers. The Coded Aperture Snapshot Spectral Imaging (CASSI) system mitigates this issue through a compression technique that accelerates the acquisition process. However, reconstructing HSIs from compressed data presents challenges due to fixed spatial and spectral resolution constraints. This study introduces a novel method using implicit neural representation for continuous hyperspectral image reconstruction. We propose the Mixed Granularity Implicit Representation (MGIR) framework, which includes a Hierarchical Spectral-Spatial Implicit Encoder for efficient multi-scale implicit feature extraction. This is complemented by a Mixed-Granularity Local Feature Aggregator that adaptively integrates local features across scales, combined with a decoder that merges coordinate information for precise reconstruction. By leveraging implicit neural representations, the MGIR framework enables reconstruction at any desired spatial-spectral resolution, significantly enhancing the flexibility and adaptability of the CASSI system. Extensive experimental evaluations confirm that our model produces reconstructed images at arbitrary resolutions and matches state-of-the-art methods across varying spectral-spatial compression ratios. The code will be released at this https URL.

Title: Privacy-Preserving Biometric Verification with Handwritten Random Digit String

Authors: Peirong Zhang, Yuliang Liu, Songxuan Lai, Hongliang Li, Lianwen Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12786
Pdf URL: https://arxiv.org/pdf/2503.12786
Copy Paste: [[2503.12786]] Privacy-Preserving Biometric Verification with Handwritten Random Digit String(https://arxiv.org/abs/2503.12786)
Keywords: privacy, protect, attack, biometric
Abstract: Handwriting verification has stood as a steadfast identity authentication method for decades. However, this technique risks potential privacy breaches due to the inclusion of personal information in handwritten biometrics such as signatures. To address this concern, we propose using the Random Digit String (RDS) for privacy-preserving handwriting verification. This approach allows users to authenticate themselves by writing an arbitrary digit sequence, effectively ensuring privacy protection. To evaluate the effectiveness of RDS, we construct a new HRDS4BV dataset composed of online naturally handwritten RDS. Unlike conventional handwriting, RDS encompasses unconstrained and variable content, posing significant challenges for modeling consistent personal writing style. To surmount this, we propose the Pattern Attentive VErification Network (PAVENet), along with a Discriminative Pattern Mining (DPM) module. DPM adaptively enhances the recognition of consistent and discriminative writing patterns, thus refining handwriting style representation. Through comprehensive evaluations, we scrutinize the applicability of online RDS verification and showcase a pronounced outperformance of our model over existing methods. Furthermore, we discover a noteworthy forgery phenomenon that deviates from prior findings and discuss its positive impact in countering malicious impostor attacks. Substantially, our work underscores the feasibility of privacy-preserving biometric verification and propels the prospects of its broader acceptance and application.

Title: Improving Generalization of Universal Adversarial Perturbation via Dynamic Maximin Optimization

Authors: Yechao Zhang, Yingzhe Xu, Junyu Shi, Leo Yu Zhang, Shengshan Hu, Minghui Li, Yanjun Zhang
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.12793
Pdf URL: https://arxiv.org/pdf/2503.12793
Copy Paste: [[2503.12793]] Improving Generalization of Universal Adversarial Perturbation via Dynamic Maximin Optimization(https://arxiv.org/abs/2503.12793)
Keywords: attack
Abstract: Deep neural networks (DNNs) are susceptible to universal adversarial perturbations (UAPs). These perturbations are meticulously designed to fool the target model universally across all sample classes. Unlike instance-specific adversarial examples (AEs), generating UAPs is more complex because they must be generalized across a wide range of data samples and models. Our research reveals that existing universal attack methods, which optimize UAPs using DNNs with static model parameter snapshots, do not fully leverage the potential of DNNs to generate more effective UAPs. Rather than optimizing UAPs against static DNN models with a fixed training set, we suggest using dynamic model-data pairs to generate UAPs. In particular, we introduce a dynamic maximin optimization strategy, aiming to optimize the UAP across a variety of optimal model-data pairs. We term this approach DM-UAP. DM-UAP utilizes an iterative max-min-min optimization framework that refines the model-data pairs, coupled with a curriculum UAP learning algorithm to examine the combined space of model parameters and data thoroughly. Comprehensive experiments on the ImageNet dataset demonstrate that the proposed DM-UAP markedly enhances both cross-sample universality and cross-model transferability of UAPs. Using only 500 samples for UAP generation, DM-UAP outperforms the state-of-the-art approach with an average increase in fooling ratio of 12.108%.

Title: A Reinforcement Learning-Driven Transformer GAN for Molecular Generation

Authors: Chen Li, Huidong Tang, Ye Zhu, Yoshihiro Yamanishi
Subjects: cs.LG, cs.CL, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2503.12796
Pdf URL: https://arxiv.org/pdf/2503.12796
Copy Paste: [[2503.12796]] A Reinforcement Learning-Driven Transformer GAN for Molecular Generation(https://arxiv.org/abs/2503.12796)
Keywords: transformer, generative
Abstract: Generating molecules with desired chemical properties presents a critical challenge in fields such as chemical synthesis and drug discovery. Recent advancements in artificial intelligence (AI) and deep learning have significantly contributed to data-driven molecular generation. However, challenges persist due to the inherent sensitivity of simplified molecular input line entry system (SMILES) representations and the difficulties in applying generative adversarial networks (GANs) to discrete data. This study introduces RL-MolGAN, a novel Transformer-based discrete GAN framework designed to address these challenges. Unlike traditional Transformer architectures, RL-MolGAN utilizes a first-decoder-then-encoder structure, facilitating the generation of drug-like molecules from both $de~novo$ and scaffold-based designs. In addition, RL-MolGAN integrates reinforcement learning (RL) and Monte Carlo tree search (MCTS) techniques to enhance the stability of GAN training and optimize the chemical properties of the generated molecules. To further improve the model's performance, RL-MolWGAN, an extension of RL-MolGAN, incorporates Wasserstein distance and mini-batch discrimination, which together enhance the stability of the GAN. Experimental results on two widely used molecular datasets, QM9 and ZINC, validate the effectiveness of our models in generating high-quality molecular structures with diverse and desirable chemical properties.

Title: DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding

Authors: Xinyu Ma, Ziyang Ding, Zhicong Luo, Chi Chen, Zonghao Guo, Derek F. Wong, Xiaoyi Feng, Maosong Sun
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.12797
Pdf URL: https://arxiv.org/pdf/2503.12797
Copy Paste: [[2503.12797]] DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding(https://arxiv.org/abs/2503.12797)
Keywords: large language model
Abstract: Human experts excel at fine-grained visual discrimination by leveraging domain knowledge to refine perceptual features, a capability that remains underdeveloped in current Multimodal Large Language Models (MLLMs). Despite possessing vast expert-level knowledge, MLLMs struggle to integrate reasoning into visual perception, often generating direct responses without deeper analysis. To bridge this gap, we introduce knowledge-intensive visual grounding (KVG), a novel visual grounding task that requires both fine-grained perception and domain-specific knowledge integration. To address the challenges of KVG, we propose DeepPerception, an MLLM enhanced with cognitive visual perception capabilities. Our approach consists of (1) an automated data synthesis pipeline that generates high-quality, knowledge-aligned training samples, and (2) a two-stage training framework combining supervised fine-tuning for cognitive reasoning scaffolding and reinforcement learning to optimize perception-cognition synergy. To benchmark performance, we introduce KVG-Bench a comprehensive dataset spanning 10 domains with 1.3K manually curated test cases. Experimental results demonstrate that DeepPerception significantly outperforms direct fine-tuning, achieving +8.08\% accuracy improvements on KVG-Bench and exhibiting +4.60\% superior cross-domain generalization over baseline approaches. Our findings highlight the importance of integrating cognitive processes into MLLMs for human-like visual perception and open new directions for multimodal reasoning research. The data, codes, and models are released at this https URL.

Title: Grounded Chain-of-Thought for Multimodal Large Language Models

Authors: Qiong Wu, Xiangcong Yang, Yiyi Zhou, Chenxin Fang, Baiyang Song, Xiaoshuai Sun, Rongrong Ji
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2503.12799
Pdf URL: https://arxiv.org/pdf/2503.12799
Copy Paste: [[2503.12799]] Grounded Chain-of-Thought for Multimodal Large Language Models(https://arxiv.org/abs/2503.12799)
Keywords: large language model
Abstract: Despite great progress, existing multimodal large language models (MLLMs) are prone to visual hallucination, greatly impeding their trustworthy applications. In this paper, we study this problem from the perspective of visual-spatial reasoning, and propose a new learning task for MLLMs, termed Grounded Chain-of-Thought (GCoT). Different from recent visual CoT studies, which focus more on visual knowledge reasoning, GCoT is keen to helping MLLMs to recognize and ground the relevant visual cues step by step, thereby predicting the correct answer with grounding coordinates as the intuitive basis. To facilitate this task, we also carefully design and construct a dataset called multimodal grounded chain-of-thought (MM-GCoT) consisting of 24,022 GCoT examples for 5,033 images. Besides, a comprehensive consistency evaluation system is also introduced, including the metrics of answer accuracy, grounding accuracy and answer-grounding consistency. We further design and conduct a bunch of experiments on 12 advanced MLLMs, and reveal some notable findings: i. most MLLMs performs poorly on the consistency evaluation, indicating obvious visual hallucination; ii. visual hallucination is not directly related to the parameter size and general multimodal performance, i.e., a larger and stronger MLLM is not less affected by this issue. Lastly, we also demonstrate that the proposed dataset can help existing MLLMs to well cultivate their GCoT capability and reduce the inconsistent answering significantly. Moreover, their GCoT can be also generalized to exiting multimodal tasks, such as open-world QA and REC.

Title: Pairwise Similarity Regularization for Semi-supervised Graph Medical Image Segmentation

Authors: Jialu Zhou, Dianxi Shi, Shaowu Yang, Chunping Qiu, Luoxi Jing, Mengzhu Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12800
Pdf URL: https://arxiv.org/pdf/2503.12800
Copy Paste: [[2503.12800]] Pairwise Similarity Regularization for Semi-supervised Graph Medical Image Segmentation(https://arxiv.org/abs/2503.12800)
Keywords: segmentation
Abstract: With fully leveraging the value of unlabeled data, semi-supervised medical image segmentation algorithms significantly reduces the limitation of limited labeled data, achieving a significant improvement in accuracy. However, the distributional shift between labeled and unlabeled data weakens the utilization of information from the labeled data. To alleviate the problem, we propose a graph network feature alignment method based on pairwise similarity regularization (PaSR) for semi-supervised medical image segmentation. PaSR aligns the graph structure of images in different domains by maintaining consistency in the pairwise structural similarity of feature graphs between the target domain and the source domain, reducing distribution shift issues in medical images. Meanwhile, further improving the accuracy of pseudo-labels in the teacher network by aligning graph clustering information to enhance the semi-supervised efficiency of the model. The experimental part was verified on three medical image segmentation benchmark datasets, with results showing improvements over advanced methods in various metrics. On the ACDC dataset, it achieved an average improvement of more than 10.66%.

Title: BLIA: Detect model memorization in binary classification model through passive Label Inference attack

Authors: Mohammad Wahiduzzaman Khan, Sheng Chen, Ilya Mironov, Leizhen Zhang, Rabib Noor
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2503.12801
Pdf URL: https://arxiv.org/pdf/2503.12801
Copy Paste: [[2503.12801]] BLIA: Detect model memorization in binary classification model through passive Label Inference attack(https://arxiv.org/abs/2503.12801)
Keywords: privacy, attack
Abstract: Model memorization has implications for both the generalization capacity of machine learning models and the privacy of their training data. This paper investigates label memorization in binary classification models through two novel passive label inference attacks (BLIA). These attacks operate passively, relying solely on the outputs of pre-trained models, such as confidence scores and log-loss values, without interacting with or modifying the training process. By intentionally flipping 50% of the labels in controlled subsets, termed "canaries," we evaluate the extent of label memorization under two conditions: models trained without label differential privacy (Label-DP) and those trained with randomized response-based Label-DP. Despite the application of varying degrees of Label-DP, the proposed attacks consistently achieve success rates exceeding 50%, surpassing the baseline of random guessing and conclusively demonstrating that models memorize training labels, even when these labels are deliberately uncorrelated with the features.

Title: Leveraging Deep Neural Networks for Aspect-Based Sentiment Classification

Authors: Chen Li, Debo Cheng, Yasuhiko Morimoto
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.12803
Pdf URL: https://arxiv.org/pdf/2503.12803
Copy Paste: [[2503.12803]] Leveraging Deep Neural Networks for Aspect-Based Sentiment Classification(https://arxiv.org/abs/2503.12803)
Keywords: extraction, transformer
Abstract: Aspect-based sentiment analysis seeks to determine sentiment with a high level of detail. While graph convolutional networks (GCNs) are commonly used for extracting sentiment features, their straightforward use in syntactic feature extraction can lead to a loss of crucial information. This paper presents a novel edge-enhanced GCN, called EEGCN, which improves performance by preserving feature integrity as it processes syntactic graphs. We incorporate a bidirectional long short-term memory (Bi-LSTM) network alongside a self-attention-based transformer for effective text encoding, ensuring the retention of long-range dependencies. A bidirectional GCN (Bi-GCN) with message passing then captures the relationships between entities, while an aspect-specific masking technique removes extraneous information. Extensive evaluations and ablation studies on four benchmark datasets show that EEGCN significantly enhances aspect-based sentiment analysis, overcoming issues with syntactic feature extraction and advancing the field's methodologies.

Title: A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules

Authors: Kairong Luo, Haodong Wen, Shengding Hu, Zhenbo Sun, Zhiyuan Liu, Maosong Sun, Kaifeng Lyu, Wenguang Chen
Subjects: cs.LG, cs.AI, cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2503.12811
Pdf URL: https://arxiv.org/pdf/2503.12811
Copy Paste: [[2503.12811]] A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules(https://arxiv.org/abs/2503.12811)
Keywords: large language model
Abstract: Training large models is both resource-intensive and time-consuming, making it crucial to understand the quantitative relationship between model performance and hyperparameters. In this paper, we present an empirical law that describes how the pretraining loss of large language models evolves under different learning rate schedules, such as constant, cosine, and step decay schedules. Our proposed law takes a multi-power form, combining a power law based on the sum of learning rates and additional power laws to account for a loss reduction effect induced by learning rate decay. We extensively validate this law on various model sizes and architectures, and demonstrate that after fitting on a few learning rate schedules, the law accurately predicts the loss curves for unseen schedules of different shapes and horizons. Moreover, by minimizing the predicted final pretraining loss across learning rate schedules, we are able to find a schedule that outperforms the widely used cosine learning rate schedule. Interestingly, this automatically discovered schedule bears some resemblance to the recently proposed Warmup-Stable-Decay (WSD) schedule (Hu et al, 2024) but achieves a slightly lower final loss. We believe these results could offer valuable insights for understanding the dynamics of pretraining and designing learning rate schedules to improve efficiency.

Title: From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration

Authors: Mingyang Song, Xiaoye Qu, Jiawei Zhou, Yu Cheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12821
Pdf URL: https://arxiv.org/pdf/2503.12821
Copy Paste: [[2503.12821]] From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration(https://arxiv.org/abs/2503.12821)
Keywords: diffusion
Abstract: Large Vision-Language Models (LVLMs) have achieved significant progress in combining visual comprehension with language generation. Despite this success, the training data of LVLMs still suffers from Long-Tail (LT) problems, where the data distribution is highly imbalanced. Previous works have mainly focused on traditional VLM architectures, i.e., CLIP or ViT, and specific tasks such as recognition and classification. Nevertheless, the exploration of LVLM (e.g. LLaVA) and more general tasks (e.g. Visual Question Answering and Visual Reasoning) remains under-explored. In this paper, we first conduct an in-depth analysis of the LT issues in LVLMs and identify two core causes: the overrepresentation of head concepts and the underrepresentation of tail concepts. Based on the above observation, we propose an $\textbf{A}$daptive $\textbf{D}$ata $\textbf{R}$efinement Framework ($\textbf{ADR}$), which consists of two stages: $\textbf{D}$ata $\textbf{R}$ebalancing ($\textbf{DR}$) and $\textbf{D}$ata $\textbf{S}$ynthesis ($\textbf{DS}$). In the DR stage, we adaptively rebalance the redundant data based on entity distributions, while in the DS stage, we leverage Denoising Diffusion Probabilistic Models (DDPMs) and scarce images to supplement underrepresented portions. Through comprehensive evaluations across eleven benchmarks, our proposed ADR effectively mitigates the long-tail problem in the training data, improving the average performance of LLaVA 1.5 relatively by 4.36%, without increasing the training data volume.

Title: An Optimization Framework for Differentially Private Sparse Fine-Tuning

Authors: Mehdi Makni, Kayhan Behdin, Gabriel Afriat, Zheng Xu, Sergei Vassilvitskii, Natalia Ponomareva, Hussein Hazimeh, Rahul Mazumder
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.12822
Pdf URL: https://arxiv.org/pdf/2503.12822
Copy Paste: [[2503.12822]] An Optimization Framework for Differentially Private Sparse Fine-Tuning(https://arxiv.org/abs/2503.12822)
Keywords: privacy
Abstract: Differentially private stochastic gradient descent (DP-SGD) is broadly considered to be the gold standard for training and fine-tuning neural networks under differential privacy (DP). With the increasing availability of high-quality pre-trained model checkpoints (e.g., vision and language models), fine-tuning has become a popular strategy. However, despite recent progress in understanding and applying DP-SGD for private transfer learning tasks, significant challenges remain -- most notably, the performance gap between models fine-tuned with DP-SGD and their non-private counterparts. Sparse fine-tuning on private data has emerged as an alternative to full-model fine-tuning; recent work has shown that privately fine-tuning only a small subset of model weights and keeping the rest of the weights fixed can lead to better performance. In this work, we propose a new approach for sparse fine-tuning of neural networks under DP. Existing work on private sparse finetuning often used fixed choice of trainable weights (e.g., updating only the last layer), or relied on public model's weights to choose the subset of weights to modify. Such choice of weights remains suboptimal. In contrast, we explore an optimization-based approach, where our selection method makes use of the private gradient information, while using off the shelf privacy accounting techniques. Our numerical experiments on several computer vision models and datasets show that our selection method leads to better prediction accuracy, compared to full-model private fine-tuning or existing private sparse fine-tuning approaches.

Title: GSBAK$^K$: $top$-$K$ Geometric Score-based Black-box Attack

Authors: Md Farhamdur Reza, Richeng Jin, Tianfu Wu, Huaiyu Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12827
Pdf URL: https://arxiv.org/pdf/2503.12827
Copy Paste: [[2503.12827]] GSBAK$^K$: $top$-$K$ Geometric Score-based Black-box Attack(https://arxiv.org/abs/2503.12827)
Keywords: attack
Abstract: Existing score-based adversarial attacks mainly focus on crafting $top$-1 adversarial examples against classifiers with single-label classification. Their attack success rate and query efficiency are often less than satisfactory, particularly under small perturbation requirements; moreover, the vulnerability of classifiers with multi-label learning is yet to be studied. In this paper, we propose a comprehensive surrogate free score-based attack, named \b geometric \b score-based \b black-box \b attack (GSBAK$^K$), to craft adversarial examples in an aggressive $top$-$K$ setting for both untargeted and targeted attacks, where the goal is to change the $top$-$K$ predictions of the target classifier. We introduce novel gradient-based methods to find a good initial boundary point to attack. Our iterative method employs novel gradient estimation techniques, particularly effective in $top$-$K$ setting, on the decision boundary to effectively exploit the geometry of the decision boundary. Additionally, GSBAK$^K$ can be used to attack against classifiers with $top$-$K$ multi-label learning. Extensive experimental results on ImageNet and PASCAL VOC datasets validate the effectiveness of GSBAK$^K$ in crafting $top$-$K$ adversarial examples.

Title: CompMarkGS: Robust Watermarking for Compression 3D Gaussian Splatting

Authors: Sumin In, Youngdong Jang, Utae Jeong, MinHyuk Jang, Hyeongcheol Park, Eunbyung Park, Sangpil Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12836
Pdf URL: https://arxiv.org/pdf/2503.12836
Copy Paste: [[2503.12836]] CompMarkGS: Robust Watermarking for Compression 3D Gaussian Splatting(https://arxiv.org/abs/2503.12836)
Keywords: secure, protect, robust, watermark
Abstract: 3D Gaussian Splatting (3DGS) enables rapid differentiable rendering for 3D reconstruction and novel view synthesis, leading to its widespread commercial use. Consequently, copyright protection via watermarking has become critical. However, because 3DGS relies on millions of Gaussians, which require gigabytes of storage, efficient transfer and storage require compression. Existing 3DGS watermarking methods are vulnerable to quantization-based compression, often resulting in the loss of the embedded watermark. To address this challenge, we propose a novel watermarking method that ensures watermark robustness after model compression while maintaining high rendering quality. In detail, we incorporate a quantization distortion layer that simulates compression during training, preserving the watermark under quantization-based compression. Also, we propose a learnable watermark embedding feature that embeds the watermark into the anchor feature, ensuring structural consistency and seamless integration into the 3D scene. Furthermore, we present a frequency-aware anchor growing mechanism to enhance image quality in high-frequency regions by effectively identifying Guassians within these regions. Experimental results confirm that our method preserves the watermark and maintains superior image quality under high compression, validating it as a promising approach for a secure 3DGS model.

Title: DreamLayer: Simultaneous Multi-Layer Generation via Diffusion Mode

Authors: Junjia Huang, Pengxiang Yan, Jinhang Cai, Jiyang Liu, Zhao Wang, Yitong Wang, Xinglong Wu, Guanbin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12838
Pdf URL: https://arxiv.org/pdf/2503.12838
Copy Paste: [[2503.12838]] DreamLayer: Simultaneous Multi-Layer Generation via Diffusion Mode(https://arxiv.org/abs/2503.12838)
Keywords: robust, diffusion
Abstract: Text-driven image generation using diffusion models has recently gained significant attention. To enable more flexible image manipulation and editing, recent research has expanded from single image generation to transparent layer generation and multi-layer compositions. However, existing approaches often fail to provide a thorough exploration of multi-layer structures, leading to inconsistent inter-layer interactions, such as occlusion relationships, spatial layout, and shadowing. In this paper, we introduce DreamLayer, a novel framework that enables coherent text-driven generation of multiple image layers, by explicitly modeling the relationship between transparent foreground and background layers. DreamLayer incorporates three key components, i.e., Context-Aware Cross-Attention (CACA) for global-local information exchange, Layer-Shared Self-Attention (LSSA) for establishing robust inter-layer connections, and Information Retained Harmonization (IRH) for refining fusion details at the latent level. By leveraging a coherent full-image context, DreamLayer builds inter-layer connections through attention mechanisms and applies a harmonization step to achieve seamless layer fusion. To facilitate research in multi-layer generation, we construct a high-quality, diverse multi-layer dataset including 400k samples. Extensive experiments and user studies demonstrate that DreamLayer generates more coherent and well-aligned layers, with broad applicability, including latent-space image editing and image-to-layer decomposition.

Title: Towards Scalable Foundation Model for Multi-modal and Hyperspectral Geospatial Data

Authors: Haozhe Si, Yuxuan Wan, Minh Do, Deepak Vasisht, Han Zhao, Hendrik F. Hamann
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12843
Pdf URL: https://arxiv.org/pdf/2503.12843
Copy Paste: [[2503.12843]] Towards Scalable Foundation Model for Multi-modal and Hyperspectral Geospatial Data(https://arxiv.org/abs/2503.12843)
Keywords: transformer
Abstract: Geospatial raster (imagery) data, such as that collected by satellite-based imaging systems at different times and spectral bands, hold immense potential for enabling a wide range of high-impact applications. This potential stems from the rich information that is spatially and temporally contextualized across multiple channels and sensing modalities. Recent work has adapted existing self-supervised learning approaches for such geospatial data. However, they fall short of scalable model architectures, leading to inflexibility and computational inefficiencies when faced with an increasing number of channels and modalities. To address these limitations, we introduce Low-rank Efficient Spatial-Spectral Vision Transformer (LESS ViT) with three key innovations: i) the LESS Attention Block that approximates high-dimensional spatial-spectral attention through Kronecker's product of the low-dimensional spatial and spectral attention components; ii) the Continuous Positional-Channel Embedding Layer that preserves both spatial and spectral continuity and physical characteristics of each patch; and iii) the Perception Field Mask that exploits local spatial dependencies by constraining attention to neighboring patches. To evaluate the proposed innovations, we construct a benchmark, GFM-Bench, which serves as a comprehensive benchmark for such geospatial raster data. We pretrain LESS ViT using a Hyperspectral Masked Autoencoder framework with integrated positional and channel masking strategies. Experimental results demonstrate that our proposed method surpasses current state-of-the-art multi-modal geospatial foundation models, achieving superior performance with less computation and fewer parameters. The flexibility and extensibility of our framework make it a promising direction for future geospatial data analysis tasks that involve a wide range of modalities and channels.

Title: GuideDog: A Real-World Egocentric Multimodal Dataset for Blind and Low-Vision Accessibility-Aware Guidance

Authors: Junhyeok Kim, Jaewoo Park, Junhee Park, Sangeyl Lee, Jiwan Chung, Jisung Kim, Ji Hoon Joung, Youngjae Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12844
Pdf URL: https://arxiv.org/pdf/2503.12844
Copy Paste: [[2503.12844]] GuideDog: A Real-World Egocentric Multimodal Dataset for Blind and Low-Vision Accessibility-Aware Guidance(https://arxiv.org/abs/2503.12844)
Keywords: large language model
Abstract: Mobility remains a significant challenge for the 2.2 billion people worldwide affected by blindness and low vision (BLV), with 7% of visually impaired individuals experiencing falls at least once a month. While recent advances in Multimodal Large Language Models (MLLMs) offer promising opportunities for BLV assistance, their development has been hindered by limited datasets. This limitation stems from the fact that BLV-aware annotation requires specialized domain knowledge and intensive labor. To address this gap, we introduce GuideDog, a novel accessibility-aware guide dataset containing 22K image-description pairs (including 2K human-annotated pairs) that capture diverse real-world scenes from a pedestrian's viewpoint. Our approach shifts the annotation burden from generation to verification through a collaborative human-AI framework grounded in established accessibility standards, significantly improving efficiency while maintaining high-quality annotations. We also develop GuideDogQA, a subset of 818 samples featuring multiple-choice questions designed to evaluate fine-grained visual perception capabilities, specifically object recognition and relative depth perception. Our experimental results highlight the importance of accurate spatial understanding for effective BLV guidance. GuideDog and GuideDogQA will advance research in MLLM-based assistive technologies for BLV individuals while contributing to broader applications in understanding egocentric scenes for robotics and augmented reality. The code and dataset will be publicly available.

Title: ACT360: An Efficient 360-Degree Action Detection and Summarization Framework for Mission-Critical Training and Debriefing

Authors: Aditi Tiwari, Klara Nahrstedt
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2503.12852
Pdf URL: https://arxiv.org/pdf/2503.12852
Copy Paste: [[2503.12852]] ACT360: An Efficient 360-Degree Action Detection and Summarization Framework for Mission-Critical Training and Debriefing(https://arxiv.org/abs/2503.12852)
Keywords: robust, large language model
Abstract: Effective training and debriefing are critical in high-stakes, mission-critical environments such as disaster response, military simulations, and industrial safety, where precision and minimizing errors are paramount. The traditional post-training analysis relies on manually reviewing 2D videos, a time-consuming process that lacks comprehensive situational awareness. To address these limitations, we introduce ACT360, a system that leverages 360-degree videos and machine learning for automated action detection and structured debriefing. ACT360 integrates 360YOWO, an enhanced You Only Watch Once (YOWO) model with spatial attention and equirectangular-aware convolution (EAC) to mitigate panoramic video distortions. To enable deployment in resource-constrained environments, we apply quantization and model pruning, reducing the model size by 74% while maintaining robust accuracy (mAP drop of only 1.5%, from 0.865 to 0.850) and improving inference speed. We validate our approach on a publicly available dataset of 55 labeled 360-degree videos covering seven key operational actions, recorded across various real-world training sessions and environmental conditions. Additionally, ACT360 integrates 360AIE (Action Insight Explorer), a web-based interface for automatic action detection, retrieval, and textual summarization using large language models (LLMs), significantly enhancing post-incident analysis efficiency. ACT360 serves as a generalized framework for mission-critical debriefing, incorporating EAC, spatial attention, summarization, and model optimization. These innovations apply to any training environment requiring lightweight action detection and structured post-exercise analysis.

Title: Adaptive Transformer Attention and Multi-Scale Fusion for Spine 3D Segmentation

Authors: Yanlin Xiang, Qingyuan He, Ting Xu, Ran Hao, Jiacheng Hu, Hanchao Zhang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.12853
Pdf URL: https://arxiv.org/pdf/2503.12853
Copy Paste: [[2503.12853]] Adaptive Transformer Attention and Multi-Scale Fusion for Spine 3D Segmentation(https://arxiv.org/abs/2503.12853)
Keywords: robust, extraction, transformer, segmentation
Abstract: This study proposes a 3D semantic segmentation method for the spine based on the improved SwinUNETR to improve segmentation accuracy and robustness. Aiming at the complex anatomical structure of spinal images, this paper introduces a multi-scale fusion mechanism to enhance the feature extraction capability by using information of different scales, thereby improving the recognition accuracy of the model for the target area. In addition, the introduction of the adaptive attention mechanism enables the model to dynamically adjust the attention to the key area, thereby optimizing the boundary segmentation effect. The experimental results show that compared with 3D CNN, 3D U-Net, and 3D U-Net + Transformer, the model of this study has achieved significant improvements in mIoU, mDice, and mAcc indicators, and has better segmentation performance. The ablation experiment further verifies the effectiveness of the proposed improved method, proving that multi-scale fusion and adaptive attention mechanism have a positive effect on the segmentation task. Through the visualization analysis of the inference results, the model can better restore the real anatomical structure of the spinal image. Future research can further optimize the Transformer structure and expand the data scale to improve the generalization ability of the model. This study provides an efficient solution for the task of medical image segmentation, which is of great significance to intelligent medical image analysis.

Title: Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation

Authors: Songjun Tu, Jiahao Lin, Xiangyu Tian, Qichao Zhang, Linjing Li, Yuqian Fu, Nan Xu, Wei He, Xiangyuan Lan, Dongmei Jiang, Dongbin Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12854
Pdf URL: https://arxiv.org/pdf/2503.12854
Copy Paste: [[2503.12854]] Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation(https://arxiv.org/abs/2503.12854)
Keywords: large language model
Abstract: Recent advancements in post-training methodologies for large language models (LLMs) have highlighted reinforcement learning (RL) as a critical component for enhancing reasoning. However, the substantial computational costs associated with RL-based approaches have led to growing interest in alternative paradigms, such as Direct Preference Optimization (DPO). In this study, we investigate the effectiveness of DPO in facilitating self-improvement for LLMs through iterative preference-based learning. We demonstrate that a single round of DPO with coarse filtering significantly enhances mathematical reasoning performance, particularly for strong base model. Furthermore, we design an iterative enhancement framework for both the generator and the reward model (RM), enabling their mutual improvement through online interaction across multiple rounds of DPO. Finally, with simple verifiable rewards, our model DPO-VP achieves RL-level performance with significantly lower computational overhead. These findings highlight DPO as a scalable and cost-effective alternative to RL, offering a practical solution for enhancing LLM reasoning in resource-constrained situations.

Title: Evolution-based Region Adversarial Prompt Learning for Robustness Enhancement in Vision-Language Models

Authors: Xiaojun Jia, Sensen Gao, Simeng Qin, Ke Ma, Xinfeng Li, Yihao Huang, Wei Dong, Yang Liu, Xiaochun Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12874
Pdf URL: https://arxiv.org/pdf/2503.12874
Copy Paste: [[2503.12874]] Evolution-based Region Adversarial Prompt Learning for Robustness Enhancement in Vision-Language Models(https://arxiv.org/abs/2503.12874)
Keywords: robust
Abstract: Large pre-trained vision-language models (VLMs), such as CLIP, demonstrate impressive generalization but remain highly vulnerable to adversarial examples (AEs). Previous work has explored robust text prompts through adversarial training, achieving some improvement in both robustness and generalization. However, they primarily rely on singlegradient direction perturbations (e.g., PGD) to generate AEs, which lack diversity, resulting in limited improvement in adversarial robustness. To address these limitations, we propose an evolution-based region adversarial prompt tuning method called ER-APT, which combines gradient methods with genetic evolution to generate more diverse and challenging AEs. In each training iteration, we first generate AEs using traditional gradient-based methods. Subsequently, a genetic evolution mechanism incorporating selection, mutation, and crossover is applied to optimize the AEs, ensuring a broader and more aggressive perturbation this http URL final evolved AEs are used for prompt tuning, achieving region-based adversarial optimization instead of conventional single-point adversarial prompt tuning. We also propose a dynamic loss weighting method to adjust prompt learning efficiency for accuracy and robustness. Experimental evaluations on various benchmark datasets demonstrate the superiority of our proposed method, outperforming stateof-the-art APT methods. The code is released at this https URL.

Title: An interpretable approach to automating the assessment of biofouling in video footage

Authors: Evelyn J. Mannix, Bartholomew A. Woodham
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12875
Pdf URL: https://arxiv.org/pdf/2503.12875
Copy Paste: [[2503.12875]] An interpretable approach to automating the assessment of biofouling in video footage(https://arxiv.org/abs/2503.12875)
Keywords: transformer
Abstract: Biofouling$\unicode{x2013}$communities of organisms that grow on hard surfaces immersed in water$\unicode{x2013}$provides a pathway for the spread of invasive marine species and diseases. To address this risk, international vessels are increasingly being obligated to provide evidence of their biofouling management practices. Verification that these activities are effective requires underwater inspections, using divers or underwater remotely operated vehicles (ROVs), and the collection and analysis of large amounts of imagery and footage. Automated assessment using computer vision techniques can significantly streamline this process, and this work shows how this challenge can be addressed efficiently and effectively using the interpretable Component Features (ComFe) approach with a DINOv2 Vision Transformer (ViT) foundation model. ComFe is able to obtain improved performance in comparison to previous non-interpretable Convolutional Neural Network (CNN) methods, with significantly fewer weights and greater transparency$\unicode{x2013}$through identifying which regions of the image contribute to the classification, and which images in the training data lead to that conclusion. All code, data and model weights are publicly released.

Title: nvBench 2.0: A Benchmark for Natural Language to Visualization under Ambiguity

Authors: Tianqi Luo, Chuhan Huang, Leixian Shen, Boyan Li, Shuyu Shen, Wei Zeng, Nan Tang, Yuyu Luo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12880
Pdf URL: https://arxiv.org/pdf/2503.12880
Copy Paste: [[2503.12880]] nvBench 2.0: A Benchmark for Natural Language to Visualization under Ambiguity(https://arxiv.org/abs/2503.12880)
Keywords: large language model
Abstract: Natural Language to Visualization (NL2VIS) enables users to create visualizations from natural language queries, making data insights more accessible. However, NL2VIS faces challenges in interpreting ambiguous queries, as users often express their visualization needs in imprecise language. To address this challenge, we introduce nvBench 2.0, a new benchmark designed to evaluate NL2VIS systems in scenarios involving ambiguous queries. nvBench 2.0 includes 7,878 natural language queries and 24,076 corresponding visualizations, derived from 780 tables across 153 domains. It is built using a controlled ambiguity-injection pipeline that generates ambiguous queries through a reverse-generation workflow. By starting with unambiguous seed visualizations and selectively injecting ambiguities, the pipeline yields multiple valid interpretations for each query, with each ambiguous query traceable to its corresponding visualization through step-wise reasoning paths. We evaluate various Large Language Models (LLMs) on their ability to perform ambiguous NL2VIS tasks using nvBench 2.0. We also propose Step-NL2VIS, an LLM-based model trained on nvBench 2.0, which enhances performance in ambiguous scenarios through step-wise preference optimization. Our results show that Step-NL2VIS outperforms all baselines, setting a new state-of-the-art for ambiguous NL2VIS tasks.

Title: Early Detection of Forest Calamities in Homogeneous Stands -- Deep Learning Applied to Bark-Beetle Outbreaks

Authors: Maximilian Kirsch, Jakob Wernicke, Pawan Datta, Christine Preisach
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.12883
Pdf URL: https://arxiv.org/pdf/2503.12883
Copy Paste: [[2503.12883]] Early Detection of Forest Calamities in Homogeneous Stands -- Deep Learning Applied to Bark-Beetle Outbreaks(https://arxiv.org/abs/2503.12883)
Keywords: robust
Abstract: Climate change has increased the vulnerability of forests to insect-related damage, resulting in widespread forest loss in Central Europe and highlighting the need for effective, continuous monitoring systems. Remote sensing based forest health monitoring, oftentimes, relies on supervised machine learning algorithms that require labeled training data. Monitoring temporal patterns through time series analysis offers a potential alternative for earlier detection of disturbance but requires substantial storage resources. This study investigates the potential of a Deep Learning algorithm based on a Long Short Term Memory (LSTM) Autoencoder for the detection of anomalies in forest health (e.g. bark beetle outbreaks), utilizing Sentinel-2 time series data. This approach is an alternative to supervised machine learning methods, avoiding the necessity for labeled training data. Furthermore, it is more memory-efficient than other time series analysis approaches, as a robust model can be created using only a 26-week-long time series as input. In this study, we monitored pure stands of spruce in Thuringia, Germany, over a 7-year period from 2018 to the end of 2024. Our best model achieved a detection accuracy of 87% on test data and was able to detect 61% of all anomalies at a very early stage (more than a month before visible signs of forest degradation). Compared to another widely used time series break detection algorithm - BFAST (Breaks For Additive Season and Trend), our approach consistently detected higher percentage of anomalies at an earlier stage. These findings suggest that LSTM-based Autoencoders could provide a promising, resource-efficient approach to forest health monitoring, enabling more timely responses to emerging threats.

Title: UncTrack: Reliable Visual Object Tracking with Uncertainty-Aware Prototype Memory Network

Authors: Siyuan Yao, Yang Guo, Yanyang Yan, Wenqi Ren, Xiaochun Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12888
Pdf URL: https://arxiv.org/pdf/2503.12888
Copy Paste: [[2503.12888]] UncTrack: Reliable Visual Object Tracking with Uncertainty-Aware Prototype Memory Network(https://arxiv.org/abs/2503.12888)
Keywords: robust, transformer
Abstract: Transformer-based trackers have achieved promising success and become the dominant tracking paradigm due to their accuracy and efficiency. Despite the substantial progress, most of the existing approaches tackle object tracking as a deterministic coordinate regression problem, while the target localization uncertainty has been greatly overlooked, which hampers trackers' ability to maintain reliable target state prediction in challenging scenarios. To address this issue, we propose UncTrack, a novel uncertainty-aware transformer tracker that predicts the target localization uncertainty and incorporates this uncertainty information for accurate target state inference. Specifically, UncTrack utilizes a transformer encoder to perform feature interaction between template and search images. The output features are passed into an uncertainty-aware localization decoder (ULD) to coarsely predict the corner-based localization and the corresponding localization uncertainty. Then the localization uncertainty is sent into a prototype memory network (PMN) to excavate valuable historical information to identify whether the target state prediction is reliable or not. To enhance the template representation, the samples with high confidence are fed back into the prototype memory bank for memory updating, making the tracker more robust to challenging appearance variations. Extensive experiments demonstrate that our method outperforms other state-of-the-art methods. Our code is available at this https URL.

Title: Safeguarding LLM Embeddings in End-Cloud Collaboration via Entropy-Driven Perturbation

Authors: Shuaifan Jin, Xiaoyi Pang, Zhibo Wang, He Wang, Jiacheng Du, Jiahui Hu, Kui Ren
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.12896
Pdf URL: https://arxiv.org/pdf/2503.12896
Copy Paste: [[2503.12896]] Safeguarding LLM Embeddings in End-Cloud Collaboration via Entropy-Driven Perturbation(https://arxiv.org/abs/2503.12896)
Keywords: privacy, protect, attack
Abstract: Recent studies improve on-device language model (LM) inference through end-cloud collaboration, where the end device retrieves useful information from cloud databases to enhance local processing, known as Retrieval-Augmented Generation (RAG). Typically, to retrieve information from the cloud while safeguarding privacy, the end device transforms original data into embeddings with a local embedding model. However, the recently emerging Embedding Inversion Attacks (EIAs) can still recover the original data from text embeddings (e.g., training a recovery model to map embeddings back to original texts), posing a significant threat to user privacy. To address this risk, we propose EntroGuard, an entropy-driven perturbation-based embedding privacy protection method, which can protect the privacy of text embeddings while maintaining retrieval accuracy during the end-cloud collaboration. Specifically, to defeat various EIAs, we perturb the embeddings to increase the entropy of the recovered text in the common structure of recovery models, thus steering the embeddings toward meaningless texts rather than original sensitive texts during the recovery process. To maintain retrieval performance in the cloud, we constrain the perturbations within a bound, applying the strategy of reducing them where redundant and increasing them where sparse. Moreover, EntroGuard can be directly integrated into end devices without requiring any modifications to the embedding model. Extensive experimental results demonstrate that EntroGuard can reduce the risk of privacy leakage by up to 8 times at most with negligible loss of retrieval performance compared to existing privacy-preserving methods.

Title: Federated Continual Instruction Tuning

Authors: Haiyang Guo, Fanhu Zeng, Fei Zhu, Wenzhuo Liu, Da-Han Wang, Jian Xu, Xu-Yao Zhang, Cheng-Lin Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12897
Pdf URL: https://arxiv.org/pdf/2503.12897
Copy Paste: [[2503.12897]] Federated Continual Instruction Tuning(https://arxiv.org/abs/2503.12897)
Keywords: federate
Abstract: A vast amount of instruction tuning data is crucial for the impressive performance of Large Multimodal Models (LMMs), but the associated computational costs and data collection demands during supervised fine-tuning make it impractical for most researchers. Federated learning (FL) has the potential to leverage all distributed data and training resources to reduce the overhead of joint training. However, most existing methods assume a fixed number of tasks, while in real-world scenarios, clients continuously encounter new knowledge and often struggle to retain old tasks due to memory constraints. In this work, we introduce the Federated Continual Instruction Tuning (FCIT) benchmark to model this real-world challenge. Our benchmark includes two realistic scenarios, encompassing four different settings and twelve carefully curated instruction tuning datasets. To address the challenges posed by FCIT, we propose dynamic knowledge organization to effectively integrate updates from different tasks during training and subspace selective activation to allocate task-specific output during inference. Extensive experimental results demonstrate that our proposed method significantly enhances model performance across varying levels of data heterogeneity and catastrophic forgetting. Our source code and dataset will be made publicly available.

Title: Experiments with Optimal Model Trees

Authors: Sabino Francesco Roselli, Eibe Frank
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.12902
Pdf URL: https://arxiv.org/pdf/2503.12902
Copy Paste: [[2503.12902]] Experiments with Optimal Model Trees(https://arxiv.org/abs/2503.12902)
Keywords: interpretability
Abstract: Model trees provide an appealing way to perform interpretable machine learning for both classification and regression problems. In contrast to ``classic'' decision trees with constant values in their leaves, model trees can use linear combinations of predictor variables in their leaf nodes to form predictions, which can help achieve higher accuracy and smaller trees. Typical algorithms for learning model trees from training data work in a greedy fashion, growing the tree in a top-down manner by recursively splitting the data into smaller and smaller subsets. Crucially, the selected splits are only locally optimal, potentially rendering the tree overly complex and less accurate than a tree whose structure is globally optimal for the training data. In this paper, we empirically investigate the effect of constructing globally optimal model trees for classification and regression with linear support vector machines at the leaf nodes. To this end, we present mixed-integer linear programming formulations to learn optimal trees, compute such trees for a large collection of benchmark data sets, and compare their performance against greedily grown model trees in terms of interpretability and accuracy. We also compare to classic optimal and greedily grown decision trees, random forests, and support vector machines. Our results show that optimal model trees can achieve competitive accuracy with very small trees. We also investigate the effect on the accuracy of replacing axis-parallel splits with multivariate ones, foregoing interpretability while potentially obtaining greater accuracy.

Title: HICD: Hallucination-Inducing via Attention Dispersion for Contrastive Decoding to Mitigate Hallucinations in Large Language Models

Authors: Xinyan Jiang, Hang Ye, Yongxin Zhu, Xiaoying Zheng, Zikang Chen, Jun Gong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12908
Pdf URL: https://arxiv.org/pdf/2503.12908
Copy Paste: [[2503.12908]] HICD: Hallucination-Inducing via Attention Dispersion for Contrastive Decoding to Mitigate Hallucinations in Large Language Models(https://arxiv.org/abs/2503.12908)
Keywords: large language model
Abstract: Large Language Models (LLMs) often generate hallucinations, producing outputs that are contextually inaccurate or factually incorrect. We introduce HICD, a novel method designed to induce hallucinations for contrastive decoding to mitigate hallucinations. Unlike existing contrastive decoding methods, HICD selects attention heads crucial to the model's prediction as inducing heads, then induces hallucinations by dispersing attention of these inducing heads and compares the hallucinated outputs with the original outputs to obtain the final result. Our approach significantly improves performance on tasks requiring contextual faithfulness, such as context completion, reading comprehension, and question answering. It also improves factuality in tasks requiring accurate knowledge recall. We demonstrate that our inducing heads selection and attention dispersion method leads to more "contrast-effective" hallucinations for contrastive decoding, outperforming other hallucination-inducing methods. Our findings provide a promising strategy for reducing hallucinations by inducing hallucinations in a controlled manner, enhancing the performance of LLMs in a wide range of tasks.

Title: ThinkPatterns-21k: A Systematic Study on the Impact of Thinking Patterns in LLMs

Authors: Pengcheng Wen, Jiaming Ji, Chi-Min Chan, Juntao Dai, Donghai Hong, Yaodong Yang, Sirui Han, Yike Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12918
Pdf URL: https://arxiv.org/pdf/2503.12918
Copy Paste: [[2503.12918]] ThinkPatterns-21k: A Systematic Study on the Impact of Thinking Patterns in LLMs(https://arxiv.org/abs/2503.12918)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated enhanced performance through the \textit{Thinking then Responding} paradigm, where models generate internal thoughts before final responses (aka, System 2 thinking). However, existing research lacks a systematic understanding of the mechanisms underlying how thinking patterns affect performance across model sizes. In this work, we conduct a comprehensive analysis of the impact of various thinking types on model performance and introduce ThinkPatterns-21k, a curated dataset comprising 21k instruction-response pairs (QA) collected from existing instruction-following datasets with five thinking types. For each pair, we augment it with five distinct internal thinking patterns: one unstructured thinking (monologue) and four structured variants (decomposition, self-ask, self-debate and self-critic), while maintaining the same instruction and response. Through extensive evaluation across different model sizes (3B-32B parameters), we have two key findings: (1) smaller models (<30B parameters) can benefit from most of structured thinking patterns, while larger models (32B) with structured thinking like decomposition would degrade performance and (2) unstructured monologue demonstrates broad effectiveness across different model sizes. Finally, we released all of our datasets, checkpoints, training logs of diverse thinking patterns to reproducibility, aiming to facilitate further research in this direction.

Title: MMLNB: Multi-Modal Learning for Neuroblastoma Subtyping Classification Assisted with Textual Description Generation

Authors: Huangwei Chen, Zhu Zhu, Zhenyu Yan, Yifei Chen, Mingyang Ding, Chenlei Li, Feiwei Qin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12927
Pdf URL: https://arxiv.org/pdf/2503.12927
Copy Paste: [[2503.12927]] MMLNB: Multi-Modal Learning for Neuroblastoma Subtyping Classification Assisted with Textual Description Generation(https://arxiv.org/abs/2503.12927)
Keywords: robust, interpretability
Abstract: Neuroblastoma (NB), a leading cause of childhood cancer mortality, exhibits significant histopathological variability, necessitating precise subtyping for accurate prognosis and treatment. Traditional diagnostic methods rely on subjective evaluations that are time-consuming and inconsistent. To address these challenges, we introduce MMLNB, a multi-modal learning (MML) model that integrates pathological images with generated textual descriptions to improve classification accuracy and interpretability. The approach follows a two-stage process. First, we fine-tune a Vision-Language Model (VLM) to enhance pathology-aware text generation. Second, the fine-tuned VLM generates textual descriptions, using a dual-branch architecture to independently extract visual and textual features. These features are fused via Progressive Robust Multi-Modal Fusion (PRMF) Block for stable training. Experimental results show that the MMLNB model is more accurate than the single modal model. Ablation studies demonstrate the importance of multi-modal fusion, fine-tuning, and the PRMF mechanism. This research creates a scalable AI-driven framework for digital pathology, enhancing reliability and interpretability in NB subtyping classification. Our source code is available at this https URL.

Title: AR-1-to-3: Single Image to Consistent 3D Object Generation via Next-View Prediction

Authors: Xuying Zhang, Yupeng Zhou, Kai Wang, Yikai Wang, Zhen Li, Xiuli Shao, Daquan Zhou, Qibin Hou, Ming-Ming Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12929
Pdf URL: https://arxiv.org/pdf/2503.12929
Copy Paste: [[2503.12929]] AR-1-to-3: Single Image to Consistent 3D Object Generation via Next-View Prediction(https://arxiv.org/abs/2503.12929)
Keywords: diffusion
Abstract: Novel view synthesis (NVS) is a cornerstone for image-to-3d creation. However, existing works still struggle to maintain consistency between the generated views and the input views, especially when there is a significant camera pose difference, leading to poor-quality 3D geometries and textures. We attribute this issue to their treatment of all target views with equal priority according to our empirical observation that the target views closer to the input views exhibit higher fidelity. With this inspiration, we propose AR-1-to-3, a novel next-view prediction paradigm based on diffusion models that first generates views close to the input views, which are then utilized as contextual information to progressively synthesize farther views. To encode the generated view subsequences as local and global conditions for the next-view prediction, we accordingly develop a stacked local feature encoding strategy (Stacked-LE) and an LSTM-based global feature encoding strategy (LSTM-GE). Extensive experiments demonstrate that our method significantly improves the consistency between the generated views and the input views, producing high-fidelity 3D assets.

Title: MirrorGuard: Adaptive Defense Against Jailbreaks via Entropy-Guided Mirror Crafting

Authors: Rui Pu, Chaozhuo Li, Rui Ha, Litian Zhang, Lirong Qiu, Xi Zhang
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12931
Pdf URL: https://arxiv.org/pdf/2503.12931
Copy Paste: [[2503.12931]] MirrorGuard: Adaptive Defense Against Jailbreaks via Entropy-Guided Mirror Crafting(https://arxiv.org/abs/2503.12931)
Keywords: defense, attack, large language model
Abstract: Defending large language models (LLMs) against jailbreak attacks is crucial for ensuring their safe deployment. Existing defense strategies generally rely on predefined static criteria to differentiate between harmful and benign prompts. However, such rigid rules are incapable of accommodating the inherent complexity and dynamic nature of real jailbreak attacks. In this paper, we propose a novel concept of ``mirror'' to enable dynamic and adaptive defense. A mirror refers to a dynamically generated prompt that mirrors the syntactic structure of the input while ensuring semantic safety. The personalized discrepancies between the input prompts and their corresponding mirrors serve as the guiding principles for defense. A new defense paradigm, MirrorGuard, is further proposed to detect and calibrate risky inputs based on such mirrors. An entropy-based detection metric, Relative Input Uncertainty (RIU), is integrated into MirrorGuard to quantify the discrepancies between input prompts and mirrors. MirrorGuard is evaluated on several popular datasets, demonstrating state-of-the-art defense performance while maintaining general effectiveness.

Title: Efficient Action-Constrained Reinforcement Learning via Acceptance-Rejection Method and Augmented MDPs

Authors: Wei Hung, Shao-Hua Sun, Ping-Chun Hsieh
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.12932
Pdf URL: https://arxiv.org/pdf/2503.12932
Copy Paste: [[2503.12932]] Efficient Action-Constrained Reinforcement Learning via Acceptance-Rejection Method and Augmented MDPs(https://arxiv.org/abs/2503.12932)
Keywords: generative
Abstract: Action-constrained reinforcement learning (ACRL) is a generic framework for learning control policies with zero action constraint violation, which is required by various safety-critical and resource-constrained applications. The existing ACRL methods can typically achieve favorable constraint satisfaction but at the cost of either high computational burden incurred by the quadratic programs (QP) or increased architectural complexity due to the use of sophisticated generative models. In this paper, we propose a generic and computationally efficient framework that can adapt a standard unconstrained RL method to ACRL through two modifications: (i) To enforce the action constraints, we leverage the classic acceptance-rejection method, where we treat the unconstrained policy as the proposal distribution and derive a modified policy with feasible actions. (ii) To improve the acceptance rate of the proposal distribution, we construct an augmented two-objective Markov decision process (MDP), which include additional self-loop state transitions and a penalty signal for the rejected actions. This augmented MDP incentives the learned policy to stay close to the feasible action sets. Through extensive experiments in both robot control and resource allocation domains, we demonstrate that the proposed framework enjoys faster training progress, better constraint satisfaction, and a lower action inference time simultaneously than the state-of-the-art ACRL methods. We have made the source code publicly available to encourage further research in this direction.

Title: HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model

Authors: Haiyang Guo, Fanhu Zeng, Ziwei Xiang, Fei Zhu, Da-Han Wang, Xu-Yao Zhang, Cheng-Lin Liu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.12941
Pdf URL: https://arxiv.org/pdf/2503.12941
Copy Paste: [[2503.12941]] HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model(https://arxiv.org/abs/2503.12941)
Keywords: large language model
Abstract: Instruction tuning is widely used to improve a pre-trained Multimodal Large Language Model (MLLM) by training it on curated task-specific datasets, enabling better comprehension of human instructions. However, it is infeasible to collect all possible instruction datasets simultaneously in real-world scenarios. Thus, enabling MLLM with continual instruction tuning is essential for maintaining their adaptability. However, existing methods often trade off memory efficiency for performance gains, significantly compromising overall efficiency. In this paper, we propose a task-specific expansion and task-general fusion framework based on the variations in Centered Kernel Alignment (CKA) similarity across different model layers when trained on diverse datasets. Furthermore, we analyze the information leakage present in the existing benchmark and propose a new and more challenging benchmark to rationally evaluate the performance of different methods. Comprehensive experiments showcase a significant performance improvement of our method compared to existing state-of-the-art methods. Our code will be public available.

Title: GIFT: Generated Indoor video frames for Texture-less point tracking

Authors: Jianzheng Huang, Xianyu Mo, Ziling Liu, Jinyu Yang, Feng Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12944
Pdf URL: https://arxiv.org/pdf/2503.12944
Copy Paste: [[2503.12944]] GIFT: Generated Indoor video frames for Texture-less point tracking(https://arxiv.org/abs/2503.12944)
Keywords: robust
Abstract: Point tracking is becoming a powerful solver for motion estimation and video editing. Compared to classical feature matching, point tracking methods have the key advantage of robustly tracking points under complex camera motion trajectories and over extended periods. However, despite certain improvements in methodologies, current point tracking methods still struggle to track any position in video frames, especially in areas that are texture-less or weakly textured. In this work, we first introduce metrics for evaluating the texture intensity of a 3D object. Using these metrics, we classify the 3D models in ShapeNet into three levels of texture intensity and create GIFT, a challenging synthetic benchmark comprising 1800 indoor video sequences with rich annotations. Unlike existing datasets that assign ground truth points arbitrarily, GIFT precisely anchors ground truth on classified target objects, ensuring that each video corresponds to a specific texture intensity level. Furthermore, we comprehensively evaluate current methods on GIFT to assess their performance across different texture intensity levels and analyze the impact of texture on point tracking.

Title: Performance Analysis and Industry Deployment of Post-Quantum Cryptography Algorithms

Authors: Elif Dicle Demir, Buse Bilgin, Mehmet Cengiz Onbasli
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.12952
Pdf URL: https://arxiv.org/pdf/2503.12952
Copy Paste: [[2503.12952]] Performance Analysis and Industry Deployment of Post-Quantum Cryptography Algorithms(https://arxiv.org/abs/2503.12952)
Keywords: secure, security, protect
Abstract: As quantum computing advances, modern cryptographic standards face an existential threat, necessitating a transition to post-quantum cryptography (PQC). The National Institute of Standards and Technology (NIST) has selected CRYSTALS-Kyber and CRYSTALS-Dilithium as standardized PQC algorithms for secure key exchange and digital signatures, respectively. This study conducts a comprehensive performance analysis of these algorithms by benchmarking execution times across cryptographic operations such as key generation, encapsulation, decapsulation, signing, and verification. Additionally, the impact of AVX2 optimizations is evaluated to assess hardware acceleration benefits. Our findings demonstrate that Kyber and Dilithium achieve efficient execution times, outperforming classical cryptographic schemes such as RSA and ECDSA at equivalent security levels. Beyond technical performance, the real-world deployment of PQC introduces challenges in telecommunications networks, where large-scale infrastructure upgrades, interoperability with legacy systems, and regulatory constraints must be addressed. This paper examines the feasibility of PQC adoption in telecom environments, highlighting key transition challenges, security risks, and implementation strategies. Through industry case studies, we illustrate how telecom operators are integrating PQC into 5G authentication, subscriber identity protection, and secure communications. Our analysis provides insights into the computational trade-offs, deployment considerations, and standardization efforts shaping the future of quantum-safe cryptographic infrastructure.

Title: Frame-wise Conditioning Adaptation for Fine-Tuning Diffusion Models in Text-to-Video Prediction

Authors: Zheyuan Liu, Junyan Wang, Zicheng Duan, Cristian Rodriguez-Opazo, Anton van den Hengel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12953
Pdf URL: https://arxiv.org/pdf/2503.12953
Copy Paste: [[2503.12953]] Frame-wise Conditioning Adaptation for Fine-Tuning Diffusion Models in Text-to-Video Prediction(https://arxiv.org/abs/2503.12953)
Keywords: diffusion
Abstract: Text-video prediction (TVP) is a downstream video generation task that requires a model to produce subsequent video frames given a series of initial video frames and text describing the required motion. In practice TVP methods focus on a particular category of videos depicting manipulations of objects carried out by human beings or robot arms. Previous methods adapt models pre-trained on text-to-image tasks, and thus tend to generate video that lacks the required continuity. A natural progression would be to leverage more recent pre-trained text-to-video (T2V) models. This approach is rendered more challenging by the fact that the most common fine-tuning technique, low-rank adaptation (LoRA), yields undesirable results. In this work, we propose an adaptation-based strategy we label Frame-wise Conditioning Adaptation (FCA). Within the module, we devise a sub-module that produces frame-wise text embeddings from the input text, which acts as an additional text condition to aid generation. We use FCA to fine-tune the T2V model, which incorporates the initial frame(s) as an extra condition. We compare and discuss the more effective strategy for injecting such embeddings into the T2V model. We conduct extensive ablation studies on our design choices with quantitative and qualitative performance analysis. Our approach establishes a new state-of-the-art for the task of TVP. The project page is at this https URL .

Title: HIS-GPT: Towards 3D Human-In-Scene Multimodal Understanding

Authors: Jiahe Zhao, Ruibing Hou, Zejie Tian, Hong Chang, Shiguang Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12955
Pdf URL: https://arxiv.org/pdf/2503.12955
Copy Paste: [[2503.12955]] HIS-GPT: Towards 3D Human-In-Scene Multimodal Understanding(https://arxiv.org/abs/2503.12955)
Keywords: large language model
Abstract: We propose a new task to benchmark human-in-scene understanding for embodied agents: Human-In-Scene Question Answering (HIS-QA). Given a human motion within a 3D scene, HIS-QA requires the agent to comprehend human states and behaviors, reason about its surrounding environment, and answer human-related questions within the scene. To support this new task, we present HIS-Bench, a multimodal benchmark that systematically evaluates HIS understanding across a broad spectrum, from basic perception to commonsense reasoning and planning. Our evaluation of various vision-language models on HIS-Bench reveals significant limitations in their ability to handle HIS-QA tasks. To this end, we propose HIS-GPT, the first foundation model for HIS understanding. HIS-GPT integrates 3D scene context and human motion dynamics into large language models while incorporating specialized mechanisms to capture human-scene interactions. Extensive experiments demonstrate that HIS-GPT sets a new state-of-the-art on HIS-QA tasks. We hope this work inspires future research on human behavior analysis in 3D scenes, advancing embodied AI and world models.

Title: FedSDP: Explainable Differential Privacy in Federated Learning via Shapley Values

Authors: Yunbo Li, Jiaping Gui, Yue Wu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.12958
Pdf URL: https://arxiv.org/pdf/2503.12958
Copy Paste: [[2503.12958]] FedSDP: Explainable Differential Privacy in Federated Learning via Shapley Values(https://arxiv.org/abs/2503.12958)
Keywords: privacy, protect, attack, federate, interpretability
Abstract: Federated learning (FL) enables participants to store data locally while collaborating in training, yet it remains vulnerable to privacy attacks, such as data reconstruction. Existing differential privacy (DP) technologies inject noise dynamically into the training process to mitigate the impact of excessive noise. However, this dynamic scheduling is often grounded in factors indirectly related to privacy, making it difficult to clearly explain the intricate relationship between dynamic noise adjustments and privacy requirements. To address this issue, we propose FedSDP, a novel and explainable DP-based privacy protection mechanism that guides noise injection based on privacy contribution. Specifically, FedSDP leverages Shapley values to assess the contribution of private attributes to local model training and dynamically adjusts the amount of noise injected accordingly. By providing theoretical insights into the injection of varying scales of noise into local training, FedSDP enhances interpretability. Extensive experiments demonstrate that FedSDP can achieve a superior balance between privacy preservation and model performance, surpassing state-of-the-art (SOTA) solutions.

Title: Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait

Authors: Chaolong Yang, Kai Yao, Yuyao Yan, Chenru Jiang, Weiguang Zhao, Jie Sun, Guangliang Cheng, Yifei Zhang, Bin Dong, Kaizhu Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12963
Pdf URL: https://arxiv.org/pdf/2503.12963
Copy Paste: [[2503.12963]] Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait(https://arxiv.org/abs/2503.12963)
Keywords: diffusion, generative
Abstract: Audio-driven single-image talking portrait generation plays a crucial role in virtual reality, digital human creation, and filmmaking. Existing approaches are generally categorized into keypoint-based and image-based methods. Keypoint-based methods effectively preserve character identity but struggle to capture fine facial details due to the fixed points limitation of the 3D Morphable Model. Moreover, traditional generative networks face challenges in establishing causality between audio and keypoints on limited datasets, resulting in low pose diversity. In contrast, image-based approaches produce high-quality portraits with diverse details using the diffusion network but incur identity distortion and expensive computational costs. In this work, we propose KDTalker, the first framework to combine unsupervised implicit 3D keypoint with a spatiotemporal diffusion model. Leveraging unsupervised implicit 3D keypoints, KDTalker adapts facial information densities, allowing the diffusion process to model diverse head poses and capture fine facial details flexibly. The custom-designed spatiotemporal attention mechanism ensures accurate lip synchronization, producing temporally consistent, high-quality animations while enhancing computational efficiency. Experimental results demonstrate that KDTalker achieves state-of-the-art performance regarding lip synchronization accuracy, head pose diversity, and execution this http URL codes are available at this https URL.

Title: Training Video Foundation Models with NVIDIA NeMo

Authors: Zeeshan Patel, Ethan He, Parth Mannan, Xiaowei Ren, Ryan Wolf, Niket Agarwal, Jacob Huffman, Zhuoyao Wang, Carl Wang, Jack Chang, Yan Bai, Tommy Huang, Linnan Wang, Sahil Jain, Shanmugam Ramasamy, Joseph Jennings, Ekaterina Sirazitdinova, Oleg Sudakov, Mingyuan Ma, Bobby Chen, Forrest Lin, Hao Wang, Vasanth Rao Naik Sabavat, Sriharsha Niverty, Rong Ou, Pallab Bhattacharya, David Page, Nima Tajbakhsh, Ashwath Aithal
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.12964
Pdf URL: https://arxiv.org/pdf/2503.12964
Copy Paste: [[2503.12964]] Training Video Foundation Models with NVIDIA NeMo(https://arxiv.org/abs/2503.12964)
Keywords: diffusion
Abstract: Video Foundation Models (VFMs) have recently been used to simulate the real world to train physical AI systems and develop creative visual experiences. However, there are significant challenges in training large-scale, high quality VFMs that can generate high-quality videos. We present a scalable, open-source VFM training pipeline with NVIDIA NeMo, providing accelerated video dataset curation, multimodal data loading, and parallelized video diffusion model training and inference. We also provide a comprehensive performance analysis highlighting best practices for efficient VFM training and inference.

Title: Optimal Denoising in Score-Based Generative Models: The Role of Data Regularity

Authors: Eliot Beyler (SIERRA), Francis Bach (SIERRA)
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.12966
Pdf URL: https://arxiv.org/pdf/2503.12966
Copy Paste: [[2503.12966]] Optimal Denoising in Score-Based Generative Models: The Role of Data Regularity(https://arxiv.org/abs/2503.12966)
Keywords: generative
Abstract: Score-based generative models achieve state-of-the-art sampling performance by denoising a distribution perturbed by Gaussian noise. In this paper, we focus on a single deterministic denoising step, and compare the optimal denoiser for the quadratic loss, we name ''full-denoising'', to the alternative ''half-denoising'' introduced by Hyv{ä}rinen (2024). We show that looking at the performances in term of distance between distribution tells a more nuanced story, with different assumptions on the data leading to very different this http URL prove that half-denoising is better than full-denoising for regular enough densities, while full-denoising is better for singular densities such as mixtures of Dirac measures or densities supported on a low-dimensional subspace. In the latter case, we prove that full-denoising can alleviate the curse of dimensionality under a linear manifold hypothesis.

Title: OptiPMB: Enhancing 3D Multi-Object Tracking with Optimized Poisson Multi-Bernoulli Filtering

Authors: Guanhua Ding, Yuxuan Xia, Runwei Guan, Qinchen Wu, Tao Huang, Weiping Ding, Jinping Sun, Guoqiang Mao
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.12968
Pdf URL: https://arxiv.org/pdf/2503.12968
Copy Paste: [[2503.12968]] OptiPMB: Enhancing 3D Multi-Object Tracking with Optimized Poisson Multi-Bernoulli Filtering(https://arxiv.org/abs/2503.12968)
Keywords: robust, extraction, interpretability
Abstract: Accurate 3D multi-object tracking (MOT) is crucial for autonomous driving, as it enables robust perception, navigation, and planning in complex environments. While deep learning-based solutions have demonstrated impressive 3D MOT performance, model-based approaches remain appealing for their simplicity, interpretability, and data efficiency. Conventional model-based trackers typically rely on random vector-based Bayesian filters within the tracking-by-detection (TBD) framework but face limitations due to heuristic data association and track management schemes. In contrast, random finite set (RFS)-based Bayesian filtering handles object birth, survival, and death in a theoretically sound manner, facilitating interpretability and parameter tuning. In this paper, we present OptiPMB, a novel RFS-based 3D MOT method that employs an optimized Poisson multi-Bernoulli (PMB) filter while incorporating several key innovative designs within the TBD framework. Specifically, we propose a measurement-driven hybrid adaptive birth model for improved track initialization, employ adaptive detection probability parameters to effectively maintain tracks for occluded objects, and optimize density pruning and track extraction modules to further enhance overall tracking performance. Extensive evaluations on nuScenes and KITTI datasets show that OptiPMB achieves superior tracking accuracy compared with state-of-the-art methods, thereby establishing a new benchmark for model-based 3D MOT and offering valuable insights for future research on RFS-based trackers in autonomous driving.

Title: Aligning Vision to Language: Text-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning

Authors: Junming Liu, Siyuan Meng, Yanting Gao, Song Mao, Pinlong Cai, Guohang Yan, Yirong Chen, Zilin Bian, Botian Shi, Ding Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12972
Pdf URL: https://arxiv.org/pdf/2503.12972
Copy Paste: [[2503.12972]] Aligning Vision to Language: Text-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning(https://arxiv.org/abs/2503.12972)
Keywords: large language model
Abstract: Multimodal reasoning in Large Language Models (LLMs) struggles with incomplete knowledge and hallucination artifacts, challenges that textual Knowledge Graphs (KGs) only partially mitigate due to their modality isolation. While Multimodal Knowledge Graphs (MMKGs) promise enhanced cross-modal understanding, their practical construction is impeded by semantic narrowness of manual text annotations and inherent noise in visual-semantic entity linkages. In this paper, we propose Vision-align-to-Language integrated Knowledge Graph (VaLiK), a novel approach for constructing MMKGs that enhances LLMs reasoning through cross-modal information supplementation. Specifically, we cascade pre-trained Vision-Language Models (VLMs) to align image features with text, transforming them into descriptions that encapsulate image-specific information. Furthermore, we developed a cross-modal similarity verification mechanism to quantify semantic consistency, effectively filtering out noise introduced during feature alignment. Even without manually annotated image captions, the refined descriptions alone suffice to construct the MMKG. Compared to conventional MMKGs construction paradigms, our approach achieves substantial storage efficiency gains while maintaining direct entity-to-image linkage capability. Experimental results on multimodal reasoning tasks demonstrate that LLMs augmented with VaLiK outperform previous state-of-the-art models. Our code is published at this https URL.

Title: Prospects for Mitigating Spectral Variability in Tropical Species Classification Using Self-Supervised Learning

Authors: Colin Prieur, Nassim Ait Ali Braham, Paul Tresson, Grégoire Vincent, Jocelyn Chanussot
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12973
Pdf URL: https://arxiv.org/pdf/2503.12973
Copy Paste: [[2503.12973]] Prospects for Mitigating Spectral Variability in Tropical Species Classification Using Self-Supervised Learning(https://arxiv.org/abs/2503.12973)
Keywords: robust
Abstract: Airborne hyperspectral imaging is a promising method for identifying tropical species, but spectral variability between acquisitions hinders consistent results. This paper proposes using Self-Supervised Learning (SSL) to encode spectral features that are robust to abiotic variability and relevant for species identification. By employing the state-of-the-art Barlow-Twins approach on repeated spectral acquisitions, we demonstrate the ability to develop stable features. For the classification of 40 tropical species, experiments show that these features can outperform typical reflectance products in terms of robustness to spectral variability by 10 points of accuracy across dates.

Title: Exploring 3D Activity Reasoning and Planning: From Implicit Human Intentions to Route-Aware Planning

Authors: Xueying Jiang, Wenhao Li, Xiaoqin Zhang, Ling Shao, Shijian Lu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.12974
Pdf URL: https://arxiv.org/pdf/2503.12974
Copy Paste: [[2503.12974]] Exploring 3D Activity Reasoning and Planning: From Implicit Human Intentions to Route-Aware Planning(https://arxiv.org/abs/2503.12974)
Keywords: segmentation
Abstract: 3D activity reasoning and planning has attracted increasing attention in human-robot interaction and embodied AI thanks to the recent advance in multimodal learning. However, most existing works share two constraints: 1) heavy reliance on explicit instructions with little reasoning on implicit user intention; 2) negligence of inter-step route planning on robot moves. To bridge the gaps, we propose 3D activity reasoning and planning, a novel 3D task that reasons the intended activities from implicit instructions and decomposes them into steps with inter-step routes and planning under the guidance of fine-grained 3D object shapes and locations from scene segmentation. We tackle the new 3D task from two perspectives. First, we construct ReasonPlan3D, a large-scale benchmark that covers diverse 3D scenes with rich implicit instructions and detailed annotations for multi-step task planning, inter-step route planning, and fine-grained segmentation. Second, we design a novel framework that introduces progressive plan generation with contextual consistency across multiple steps, as well as a scene graph that is updated dynamically for capturing critical objects and their spatial relations. Extensive experiments demonstrate the effectiveness of our benchmark and framework in reasoning activities from implicit human instructions, producing accurate stepwise task plans, and seamlessly integrating route planning for multi-step moves. The dataset and code will be released.

Title: Enhancing Job Salary Prediction with Disentangled Composition Effect Modeling: A Neural Prototyping Approach

Authors: Yang Ji, Ying Sun, Hengshu Zhu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.12978
Pdf URL: https://arxiv.org/pdf/2503.12978
Copy Paste: [[2503.12978]] Enhancing Job Salary Prediction with Disentangled Composition Effect Modeling: A Neural Prototyping Approach(https://arxiv.org/abs/2503.12978)
Keywords: explainability
Abstract: In the era of the knowledge economy, understanding how job skills influence salary is crucial for promoting recruitment with competitive salary systems and aligned salary expectations. Despite efforts on salary prediction based on job positions and talent demographics, there still lacks methods to effectively discern the set-structured skills' intricate composition effect on job salary. While recent advances in neural networks have significantly improved accurate set-based quantitative modeling, their lack of explainability hinders obtaining insights into the skills' composition effects. Indeed, model explanation for set data is challenging due to the combinatorial nature, rich semantics, and unique format. To this end, in this paper, we propose a novel intrinsically explainable set-based neural prototyping approach, namely \textbf{LGDESetNet}, for explainable salary prediction that can reveal disentangled skill sets that impact salary from both local and global perspectives. Specifically, we propose a skill graph-enhanced disentangled discrete subset selection layer to identify multi-faceted influential input subsets with varied semantics. Furthermore, we propose a set-oriented prototype learning method to extract globally influential prototypical sets. The resulting output is transparently derived from the semantic interplay between these input subsets and global prototypes. Extensive experiments on four real-world datasets demonstrate that our method achieves superior performance than state-of-the-art baselines in salary prediction while providing explainable insights into salary-influencing patterns.

Title: SparseAlign: A Fully Sparse Framework for Cooperative Object Detection

Authors: Yunshuang Yuan, Yan Xia, Daniel Cremers, Monika Sester
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12982
Pdf URL: https://arxiv.org/pdf/2503.12982
Copy Paste: [[2503.12982]] SparseAlign: A Fully Sparse Framework for Cooperative Object Detection(https://arxiv.org/abs/2503.12982)
Keywords: robust
Abstract: Cooperative perception can increase the view field and decrease the occlusion of an ego vehicle, hence improving the perception performance and safety of autonomous driving. Despite the success of previous works on cooperative object detection, they mostly operate on dense Bird's Eye View (BEV) feature maps, which are computationally demanding and can hardly be extended to long-range detection problems. More efficient fully sparse frameworks are rarely explored. In this work, we design a fully sparse framework, SparseAlign, with three key features: an enhanced sparse 3D backbone, a query-based temporal context learning module, and a robust detection head specially tailored for sparse features. Extensive experimental results on both OPV2V and DairV2X datasets show that our framework, despite its sparsity, outperforms the state of the art with less communication bandwidth requirements. In addition, experiments on the OPV2Vt and DairV2Xt datasets for time-aligned cooperative object detection also show a significant performance gain compared to the baseline works.

Title: A Multi-Stage Framework with Taxonomy-Guided Reasoning for Occupation Classification Using Large Language Models

Authors: Palakorn Achananuparp, Ee-Peng Lim
Subjects: cs.CL, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2503.12989
Pdf URL: https://arxiv.org/pdf/2503.12989
Copy Paste: [[2503.12989]] A Multi-Stage Framework with Taxonomy-Guided Reasoning for Occupation Classification Using Large Language Models(https://arxiv.org/abs/2503.12989)
Keywords: large language model
Abstract: Automatically annotating job data with standardized occupations from taxonomies, known as occupation classification, is crucial for labor market analysis. However, this task is often hindered by data scarcity and the challenges of manual annotations. While large language models (LLMs) hold promise due to their extensive world knowledge and in-context learning capabilities, their effectiveness depends on their knowledge of occupational taxonomies, which remains unclear. In this study, we assess the ability of LLMs to generate precise taxonomic entities from taxonomy, highlighting their limitations. To address these challenges, we propose a multi-stage framework consisting of inference, retrieval, and reranking stages, which integrates taxonomy-guided reasoning examples to enhance performance by aligning outputs with taxonomic knowledge. Evaluations on a large-scale dataset show significant improvements in classification accuracy. Furthermore, we demonstrate the framework's adaptability for multi-label skill classification. Our results indicate that the framework outperforms existing LLM-based methods, offering a practical and scalable solution for occupation classification and related tasks across LLMs.

Title: TFDM: Time-Variant Frequency-Based Point Cloud Diffusion with Mamba

Authors: Jiaxu Liu, Li Li, Hubert P. H. Shum, Toby P. Breckon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13004
Pdf URL: https://arxiv.org/pdf/2503.13004
Copy Paste: [[2503.13004]] TFDM: Time-Variant Frequency-Based Point Cloud Diffusion with Mamba(https://arxiv.org/abs/2503.13004)
Keywords: diffusion, generative
Abstract: Diffusion models currently demonstrate impressive performance over various generative tasks. Recent work on image diffusion highlights the strong capabilities of Mamba (state space models) due to its efficient handling of long-range dependencies and sequential data modeling. Unfortunately, joint consideration of state space models with 3D point cloud generation remains limited. To harness the powerful capabilities of the Mamba model for 3D point cloud generation, we propose a novel diffusion framework containing dual latent Mamba block (DM-Block) and a time-variant frequency encoder (TF-Encoder). The DM-Block apply a space-filling curve to reorder points into sequences suitable for Mamba state-space modeling, while operating in a latent space to mitigate the computational overhead that arises from direct 3D data processing. Meanwhile, the TF-Encoder takes advantage of the ability of the diffusion model to refine fine details in later recovery stages by prioritizing key points within the U-Net architecture. This frequency-based mechanism ensures enhanced detail quality in the final stages of generation. Experimental results on the ShapeNet-v2 dataset demonstrate that our method achieves state-of-the-art performance (ShapeNet-v2: 0.14\% on 1-NNA-Abs50 EMD and 57.90\% on COV EMD) on certain metrics for specific categories while reducing computational parameters and inference time by up to 10$\times$ and 9$\times$, respectively. Source code is available in Supplementary Materials and will be released upon accpetance.

Title: Knowledge Distillation: Enhancing Neural Network Compression with Integrated Gradients

Authors: David E. Hernandez, Jose Ramon Chang, Torbjörn E. M. Nordling
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.13008
Pdf URL: https://arxiv.org/pdf/2503.13008
Copy Paste: [[2503.13008]] Knowledge Distillation: Enhancing Neural Network Compression with Integrated Gradients(https://arxiv.org/abs/2503.13008)
Keywords: explainability
Abstract: Efficient deployment of deep neural networks on resource-constrained devices demands advanced compression techniques that preserve accuracy and interoperability. This paper proposes a machine learning framework that augments Knowledge Distillation (KD) with Integrated Gradients (IG), an attribution method, to optimise the compression of convolutional neural networks. We introduce a novel data augmentation strategy where IG maps, precomputed from a teacher model, are overlaid onto training images to guide a compact student model toward critical feature representations. This approach leverages the teacher's decision-making insights, enhancing the student's ability to replicate complex patterns with reduced parameters. Experiments on CIFAR-10 demonstrate the efficacy of our method: a student model, compressed 4.1-fold from the MobileNet-V2 teacher, achieves 92.5% classification accuracy, surpassing the baseline student's 91.4% and traditional KD approaches, while reducing inference latency from 140 ms to 13 ms--a tenfold speedup. We perform hyperparameter optimisation for efficient learning. Comprehensive ablation studies dissect the contributions of KD and IG, revealing synergistic effects that boost both performance and model explainability. Our method's emphasis on feature-level guidance via IG distinguishes it from conventional KD, offering a data-driven solution for mining transferable knowledge in neural architectures. This work contributes to machine learning by providing a scalable, interpretable compression technique, ideal for edge computing applications where efficiency and transparency are paramount.

Title: Test-Time Domain Generalization via Universe Learning: A Multi-Graph Matching Approach for Medical Image Segmentation

Authors: Xingguo Lv, Xingbo Dong, Liwen Wang, Jiewen Yang, Lei Zhao, Bin Pu, Zhe Jin, Xuejun Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13012
Pdf URL: https://arxiv.org/pdf/2503.13012
Copy Paste: [[2503.13012]] Test-Time Domain Generalization via Universe Learning: A Multi-Graph Matching Approach for Medical Image Segmentation(https://arxiv.org/abs/2503.13012)
Keywords: segmentation
Abstract: Despite domain generalization (DG) has significantly addressed the performance degradation of pre-trained models caused by domain shifts, it often falls short in real-world deployment. Test-time adaptation (TTA), which adjusts a learned model using unlabeled test data, presents a promising solution. However, most existing TTA methods struggle to deliver strong performance in medical image segmentation, primarily because they overlook the crucial prior knowledge inherent to medical images. To address this challenge, we incorporate morphological information and propose a framework based on multi-graph matching. Specifically, we introduce learnable universe embeddings that integrate morphological priors during multi-source training, along with novel unsupervised test-time paradigms for domain adaptation. This approach guarantees cycle-consistency in multi-matching while enabling the model to more effectively capture the invariant priors of unseen data, significantly mitigating the effects of domain shifts. Extensive experiments demonstrate that our method outperforms other state-of-the-art approaches on two medical image segmentation benchmarks for both multi-source and single-source domain generalization tasks. The source code is available at this https URL.

Title: PoseSyn: Synthesizing Diverse 3D Pose Data from In-the-Wild 2D Data

Authors: ChangHee Yang, Hyeonseop Song, Seokhun Choi, Seungwoo Lee, Jaechul Kim, Hoseok Do
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13025
Pdf URL: https://arxiv.org/pdf/2503.13025
Copy Paste: [[2503.13025]] PoseSyn: Synthesizing Diverse 3D Pose Data from In-the-Wild 2D Data(https://arxiv.org/abs/2503.13025)
Keywords: extraction
Abstract: Despite considerable efforts to enhance the generalization of 3D pose estimators without costly 3D annotations, existing data augmentation methods struggle in real world scenarios with diverse human appearances and complex poses. We propose PoseSyn, a novel data synthesis framework that transforms abundant in the wild 2D pose dataset into diverse 3D pose image pairs. PoseSyn comprises two key components: Error Extraction Module (EEM), which identifies challenging poses from the 2D pose datasets, and Motion Synthesis Module (MSM), which synthesizes motion sequences around the challenging poses. Then, by generating realistic 3D training data via a human animation model aligned with challenging poses and appearances PoseSyn boosts the accuracy of various 3D pose estimators by up to 14% across real world benchmarks including various backgrounds and occlusions, challenging poses, and multi view scenarios. Extensive experiments further confirm that PoseSyn is a scalable and effective approach for improving generalization without relying on expensive 3D annotations, regardless of the pose estimator's model size or design.

Title: HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model

Authors: Tao Wang, Changxu Cheng, Lingfeng Wang, Senda Chen, Wuyue Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13026
Pdf URL: https://arxiv.org/pdf/2503.13026
Copy Paste: [[2503.13026]] HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model(https://arxiv.org/abs/2503.13026)
Keywords: segmentation
Abstract: The remarkable performance of large multimodal models (LMMs) has attracted significant interest from the image segmentation community. To align with the next-token-prediction paradigm, current LMM-driven segmentation methods either use object boundary points to represent masks or introduce special segmentation tokens, whose hidden states are decoded by a segmentation model requiring the original image as input. However, these approaches often suffer from inadequate mask representation and complex architectures, limiting the potential of LMMs. In this work, we propose the Hierarchical Mask Tokenizer (HiMTok), which represents segmentation masks with up to 32 tokens and eliminates the need for the original image during mask de-tokenization. HiMTok allows for compact and coarse-to-fine mask representations, aligning well with the LLM next-token-prediction paradigm and facilitating the direct acquisition of segmentation capabilities. We develop a 3-stage training recipe for progressive learning of segmentation and visual capabilities, featuring a hierarchical mask loss for effective coarse-to-fine learning. Additionally, we enable bidirectional information flow, allowing conversion between bounding boxes and mask tokens to fully leverage multi-task training potential. Extensive experiments demonstrate that our method achieves state-of-the-art performance across various segmentation tasks,while also enhancing visual grounding and maintaining overall visual understanding.

Title: Overview of the NTCIR-18 Automatic Evaluation of LLMs (AEOLLM) Task

Authors: Junjie Chen, Haitao Li, Zhumin Chu, Yiqun Liu, Qingyao Ai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.13038
Pdf URL: https://arxiv.org/pdf/2503.13038
Copy Paste: [[2503.13038]] Overview of the NTCIR-18 Automatic Evaluation of LLMs (AEOLLM) Task(https://arxiv.org/abs/2503.13038)
Keywords: generative, large language model
Abstract: In this paper, we provide an overview of the NTCIR-18 Automatic Evaluation of LLMs (AEOLLM) task. As large language models (LLMs) grow popular in both academia and industry, how to effectively evaluate the capacity of LLMs becomes an increasingly critical but still challenging issue. Existing methods can be divided into two types: manual evaluation, which is expensive, and automatic evaluation, which faces many limitations including task format (the majority belong to multiple-choice questions) and evaluation criteria (occupied by reference-based metrics). To advance the innovation of automatic evaluation, we propose the AEOLLM task which focuses on generative tasks and encourages reference-free methods. Besides, we set up diverse subtasks such as dialogue generation, text expansion, summary generation and non-factoid question answering to comprehensively test different methods. This year, we received 48 runs from 4 teams in total. This paper will describe the background of the task, the data set, the evaluation measures and the evaluation results, respectively.

Title: InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving

Authors: Ruiqi Song, Xianda Guo, Hangbin Wu, Qinggong Wei, Long Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13047
Pdf URL: https://arxiv.org/pdf/2503.13047
Copy Paste: [[2503.13047]] InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving(https://arxiv.org/abs/2503.13047)
Keywords: robust
Abstract: Directly generating planning results from raw sensors has become increasingly prevalent due to its adaptability and robustness in complex scenarios. Scene representation, as a key module in the pipeline, has traditionally relied on conventional perception, which focus on the global scene. However, in driving scenarios, human drivers typically focus only on regions that directly impact driving, which often coincide with those required for end-to-end autonomous driving. In this paper, a novel end-to-end autonomous driving method called InsightDrive is proposed, which organizes perception by language-guided scene representation. We introduce an instance-centric scene tokenizer that transforms the surrounding environment into map- and object-aware instance tokens. Scene attention language descriptions, which highlight key regions and obstacles affecting the ego vehicle's movement, are generated by a vision-language model that leverages the cognitive reasoning capabilities of foundation models. We then align scene descriptions with visual features using the vision-language model, guiding visual attention through these descriptions to give effectively scene representation. Furthermore, we employ self-attention and cross-attention mechanisms to model the ego-agents and ego-map relationships to comprehensively build the topological relationships of the scene. Finally, based on scene understanding, we jointly perform motion prediction and planning. Extensive experiments on the widely used nuScenes benchmark demonstrate that the proposed InsightDrive achieves state-of-the-art performance in end-to-end autonomous driving. The code is available at this https URL

Title: Uncertainty-Aware Knowledge Distillation for Compact and Efficient 6DoF Pose Estimation

Authors: Nassim Ali Ousalah, Anis Kacem, Enjie Ghorbel, Emmanuel Koumandakis, Djamila Aouada
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13053
Pdf URL: https://arxiv.org/pdf/2503.13053
Copy Paste: [[2503.13053]] Uncertainty-Aware Knowledge Distillation for Compact and Efficient 6DoF Pose Estimation(https://arxiv.org/abs/2503.13053)
Keywords: robust
Abstract: Compact and efficient 6DoF object pose estimation is crucial in applications such as robotics, augmented reality, and space autonomous navigation systems, where lightweight models are critical for real-time accurate performance. This paper introduces a novel uncertainty-aware end-to-end Knowledge Distillation (KD) framework focused on keypoint-based 6DoF pose estimation. Keypoints predicted by a large teacher model exhibit varying levels of uncertainty that can be exploited within the distillation process to enhance the accuracy of the student model while ensuring its compactness. To this end, we propose a distillation strategy that aligns the student and teacher predictions by adjusting the knowledge transfer based on the uncertainty associated with each teacher keypoint prediction. Additionally, the proposed KD leverages this uncertainty-aware alignment of keypoints to transfer the knowledge at key locations of their respective feature maps. Experiments on the widely-used LINEMOD benchmark demonstrate the effectiveness of our method, achieving superior 6DoF object pose estimation with lightweight models compared to state-of-the-art approaches. Further validation on the SPEED+ dataset for spacecraft pose estimation highlights the robustness of our approach under diverse 6DoF pose estimation scenarios.

Title: MaskSDM with Shapley values to improve flexibility, robustness, and explainability in species distribution modeling

Authors: Robin Zbinden, Nina van Tiel, Gencer Sumbul, Chiara Vanalli, Benjamin Kellenberger, Devis Tuia
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.13057
Pdf URL: https://arxiv.org/pdf/2503.13057
Copy Paste: [[2503.13057]] MaskSDM with Shapley values to improve flexibility, robustness, and explainability in species distribution modeling(https://arxiv.org/abs/2503.13057)
Keywords: robust, explainability
Abstract: Species Distribution Models (SDMs) play a vital role in biodiversity research, conservation planning, and ecological niche modeling by predicting species distributions based on environmental conditions. The selection of predictors is crucial, strongly impacting both model accuracy and how well the predictions reflect ecological patterns. To ensure meaningful insights, input variables must be carefully chosen to match the study objectives and the ecological requirements of the target species. However, existing SDMs, including both traditional and deep learning-based approaches, often lack key capabilities for variable selection: (i) flexibility to choose relevant predictors at inference without retraining; (ii) robustness to handle missing predictor values without compromising accuracy; and (iii) explainability to interpret and accurately quantify each predictor's contribution. To overcome these limitations, we introduce MaskSDM, a novel deep learning-based SDM that enables flexible predictor selection by employing a masked training strategy. This approach allows the model to make predictions with arbitrary subsets of input variables while remaining robust to missing data. It also provides a clearer understanding of how adding or removing a given predictor affects model performance and predictions. Additionally, MaskSDM leverages Shapley values for precise predictor contribution assessments, improving upon traditional approximations. We evaluate MaskSDM on the global sPlotOpen dataset, modeling the distributions of 12,738 plant species. Our results show that MaskSDM outperforms imputation-based methods and approximates models trained on specific subsets of variables. These findings underscore MaskSDM's potential to increase the applicability and adoption of SDMs, laying the groundwork for developing foundation models in SDMs that can be readily applied to diverse ecological applications.

Title: Do Vision Models Develop Human-Like Progressive Difficulty Understanding?

Authors: Zeyi Huang, Utkarsh Ojha, Yuyang Ji, Donghyun Lee, Yong Jae Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13058
Pdf URL: https://arxiv.org/pdf/2503.13058
Copy Paste: [[2503.13058]] Do Vision Models Develop Human-Like Progressive Difficulty Understanding?(https://arxiv.org/abs/2503.13058)
Keywords: generative
Abstract: When a human undertakes a test, their responses likely follow a pattern: if they answered an easy question $(2 \times 3)$ incorrectly, they would likely answer a more difficult one $(2 \times 3 \times 4)$ incorrectly; and if they answered a difficult question correctly, they would likely answer the easy one correctly. Anything else hints at memorization. Do current visual recognition models exhibit a similarly structured learning capacity? In this work, we consider the task of image classification and study if those models' responses follow that pattern. Since real images aren't labeled with difficulty, we first create a dataset of 100 categories, 10 attributes, and 3 difficulty levels using recent generative models: for each category (e.g., dog) and attribute (e.g., occlusion), we generate images of increasing difficulty (e.g., a dog without occlusion, a dog only partly visible). We find that most of the models do in fact behave similarly to the aforementioned pattern around 80-90% of the time. Using this property, we then explore a new way to evaluate those models. Instead of testing the model on every possible test image, we create an adaptive test akin to GRE, in which the model's performance on the current round of images determines the test images in the next round. This allows the model to skip over questions too easy/hard for itself, and helps us get its overall performance in fewer steps.

Title: Federated Learning with Domain Shift Eraser

Authors: Zheng Wang, Zihui Wang, Zheng Wang, Xiaoliang Fan, Cheng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13063
Pdf URL: https://arxiv.org/pdf/2503.13063
Copy Paste: [[2503.13063]] Federated Learning with Domain Shift Eraser(https://arxiv.org/abs/2503.13063)
Keywords: federate, fair
Abstract: Federated learning (FL) is emerging as a promising technique for collaborative learning without local data leaving their devices. However, clients' data originating from diverse domains may degrade model performance due to domain shifts, preventing the model from learning consistent representation space. In this paper, we propose a novel FL framework, Federated Domain Shift Eraser (FDSE), to improve model performance by differently erasing each client's domain skew and enhancing their consensus. First, we formulate the model forward passing as an iterative deskewing process that extracts and then deskews features alternatively. This is efficiently achieved by decomposing each original layer in the neural network into a Domain-agnostic Feature Extractor (DFE) and a Domain-specific Skew Eraser (DSE). Then, a regularization term is applied to promise the effectiveness of feature deskewing by pulling local statistics of DSE's outputs close to the globally consistent ones. Finally, DFE modules are fairly aggregated and broadcast to all the clients to maximize their consensus, and DSE modules are personalized for each client via similarity-aware aggregation to erase their domain skew differently. Comprehensive experiments were conducted on three datasets to confirm the advantages of our method in terms of accuracy, efficiency, and generalizability.

Title: Rewards Are Enough for Fast Photo-Realistic Text-to-image Generation

Authors: Yihong Luo, Tianyang Hu, Weijian Luo, Kenji Kawaguchi, Jing Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13070
Pdf URL: https://arxiv.org/pdf/2503.13070
Copy Paste: [[2503.13070]] Rewards Are Enough for Fast Photo-Realistic Text-to-image Generation(https://arxiv.org/abs/2503.13070)
Keywords: diffusion, generative
Abstract: Aligning generated images to complicated text prompts and human preferences is a central challenge in Artificial Intelligence-Generated Content (AIGC). With reward-enhanced diffusion distillation emerging as a promising approach that boosts controllability and fidelity of text-to-image models, we identify a fundamental paradigm shift: as conditions become more specific and reward signals stronger, the rewards themselves become the dominant force in generation. In contrast, the diffusion losses serve as an overly expensive form of regularization. To thoroughly validate our hypothesis, we introduce R0, a novel conditional generation approach via regularized reward maximization. Instead of relying on tricky diffusion distillation losses, R0 proposes a new perspective that treats image generations as an optimization problem in data space which aims to search for valid images that have high compositional rewards. By innovative designs of the generator parameterization and proper regularization techniques, we train state-of-the-art few-step text-to-image generative models with R0 at scales. Our results challenge the conventional wisdom of diffusion post-training and conditional generation by demonstrating that rewards play a dominant role in scenarios with complex conditions. We hope our findings can contribute to further research into human-centric and reward-centric generation paradigms across the broader field of AIGC. Code is available at this https URL.

Title: DehazeMamba: SAR-guided Optical Remote Sensing Image Dehazing with Adaptive State Space Model

Authors: Zhicheng Zhao, Jinquan Yan, Chenglong Li, Xiao Wang, Jin Tang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.13073
Pdf URL: https://arxiv.org/pdf/2503.13073
Copy Paste: [[2503.13073]] DehazeMamba: SAR-guided Optical Remote Sensing Image Dehazing with Adaptive State Space Model(https://arxiv.org/abs/2503.13073)
Keywords: segmentation
Abstract: Optical remote sensing image dehazing presents significant challenges due to its extensive spatial scale and highly non-uniform haze distribution, which traditional single-image dehazing methods struggle to address effectively. While Synthetic Aperture Radar (SAR) imagery offers inherently haze-free reference information for large-scale scenes, existing SAR-guided dehazing approaches face two critical limitations: the integration of SAR information often diminishes the quality of haze-free regions, and the instability of feature quality further exacerbates cross-modal domain shift. To overcome these challenges, we introduce DehazeMamba, a novel SAR-guided dehazing network built on a progressive haze decoupling fusion strategy. Our approach incorporates two key innovations: a Haze Perception and Decoupling Module (HPDM) that dynamically identifies haze-affected regions through optical-SAR difference analysis, and a Progressive Fusion Module (PFM) that mitigates domain shift through a two-stage fusion process based on feature quality assessment. To facilitate research in this domain, we present MRSHaze, a large-scale benchmark dataset comprising 8,000 pairs of temporally synchronized, precisely geo-registered SAR-optical images with high resolution and diverse haze conditions. Extensive experiments demonstrate that DehazeMamba significantly outperforms state-of-the-art methods, achieving a 0.73 dB improvement in PSNR and substantial enhancements in downstream tasks such as semantic segmentation. The dataset is available at this https URL.

Title: Rethinking Image Evaluation in Super-Resolution

Authors: Shaolin Su, Josep M. Rocafort, Danna Xue, David Serrano-Lozano, Lei Sun, Javier Vazquez-Corral
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13074
Pdf URL: https://arxiv.org/pdf/2503.13074
Copy Paste: [[2503.13074]] Rethinking Image Evaluation in Super-Resolution(https://arxiv.org/abs/2503.13074)
Keywords: fair
Abstract: While recent advancing image super-resolution (SR) techniques are continually improving the perceptual quality of their outputs, they can usually fail in quantitative evaluations. This inconsistency leads to a growing distrust in existing image metrics for SR evaluations. Though image evaluation depends on both the metric and the reference ground truth (GT), researchers typically do not inspect the role of GTs, as they are generally accepted as `perfect' references. However, due to the data being collected in the early years and the ignorance of controlling other types of distortions, we point out that GTs in existing SR datasets can exhibit relatively poor quality, which leads to biased evaluations. Following this observation, in this paper, we are interested in the following questions: Are GT images in existing SR datasets 100\% trustworthy for model evaluations? How does GT quality affect this evaluation? And how to make fair evaluations if there exist imperfect GTs? To answer these questions, this paper presents two main contributions. First, by systematically analyzing seven state-of-the-art SR models across three real-world SR datasets, we show that SR performances can be consistently affected across models by low-quality GTs, and models can perform quite differently when GT quality is controlled. Second, we propose a novel perceptual quality metric, Relative Quality Index (RQI), that measures the relative quality discrepancy of image pairs, thus issuing the biased evaluations caused by unreliable GTs. Our proposed model achieves significantly better consistency with human opinions. We expect our work to provide insights for the SR community on how future datasets, models, and metrics should be developed.

Title: A Framework to Assess Multilingual Vulnerabilities of LLMs

Authors: Likai Tang, Niruth Bogahawatta, Yasod Ginige, Jiarui Xu, Shixuan Sun, Surangika Ranathunga, Suranga Seneviratne
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13081
Pdf URL: https://arxiv.org/pdf/2503.13081
Copy Paste: [[2503.13081]] A Framework to Assess Multilingual Vulnerabilities of LLMs(https://arxiv.org/abs/2503.13081)
Keywords: attack, large language model
Abstract: Large Language Models (LLMs) are acquiring a wider range of capabilities, including understanding and responding in multiple languages. While they undergo safety training to prevent them from answering illegal questions, imbalances in training data and human evaluation resources can make these models more susceptible to attacks in low-resource languages (LRL). This paper proposes a framework to automatically assess the multilingual vulnerabilities of commonly used LLMs. Using our framework, we evaluated six LLMs across eight languages representing varying levels of resource availability. We validated the assessments generated by our automated framework through human evaluation in two languages, demonstrating that the framework's results align with human judgments in most cases. Our findings reveal vulnerabilities in LRL; however, these may pose minimal risk as they often stem from the model's poor performance, resulting in incoherent responses.

Title: Gaussian On-the-Fly Splatting: A Progressive Framework for Robust Near Real-Time 3DGS Optimization

Authors: Yiwei Xu, Yifei Yu, Wentian Gan, Tengfei Wang, Zongqian Zhan, Hao Cheng, Xin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13086
Pdf URL: https://arxiv.org/pdf/2503.13086
Copy Paste: [[2503.13086]] Gaussian On-the-Fly Splatting: A Progressive Framework for Robust Near Real-Time 3DGS Optimization(https://arxiv.org/abs/2503.13086)
Keywords: robust
Abstract: 3D Gaussian Splatting (3DGS) achieves high-fidelity rendering with fast real-time performance, but existing methods rely on offline training after full Structure-from-Motion (SfM) processing. In contrast, this work introduces On-the-Fly GS, a progressive framework enabling near real-time 3DGS optimization during image capture. As each image arrives, its pose and sparse points are updated via on-the-fly SfM, and newly optimized Gaussians are immediately integrated into the 3DGS field. We propose a progressive local optimization strategy to prioritize new images and their neighbors by their corresponding overlapping relationship, allowing the new image and its overlapping images to get more training. To further stabilize training across old and new images, an adaptive learning rate schedule balances the iterations and the learning rate. Moreover, to maintain overall quality of the 3DGS field, an efficient global optimization scheme prevents overfitting to the newly added images. Experiments on multiple benchmark datasets show that our On-the-Fly GS reduces training time significantly, optimizing each new image in seconds with minimal rendering loss, offering the first practical step toward rapid, progressive 3DGS reconstruction.

Title: ClusComp: A Simple Paradigm for Model Compression and Efficient Finetuning

Authors: Baohao Liao, Christian Herold, Seyyed Hadi Hashemi, Stefan Vasilev, Shahram Khadivi, Christof Monz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13089
Pdf URL: https://arxiv.org/pdf/2503.13089
Copy Paste: [[2503.13089]] ClusComp: A Simple Paradigm for Model Compression and Efficient Finetuning(https://arxiv.org/abs/2503.13089)
Keywords: large language model
Abstract: As large language models (LLMs) scale, model compression is crucial for edge deployment and accessibility. Weight-only quantization reduces model size but suffers from performance degradation at lower bit widths. Moreover, standard finetuning is incompatible with quantized models, and alternative methods often fall short of full finetuning. In this paper, we propose ClusComp, a simple yet effective compression paradigm that clusters weight matrices into codebooks and finetunes them block-by-block. ClusComp (1) achieves superior performance in 2-4 bit quantization, (2) pushes compression to 1-bit while outperforming ultra-low-bit methods with minimal finetuning, and (3) enables efficient finetuning, even surpassing existing quantization-based approaches and rivaling full FP16 finetuning. Notably, ClusComp supports compression and finetuning of 70B LLMs on a single A6000-48GB GPU.

Title: Who Wrote This? Identifying Machine vs Human-Generated Text in Hausa

Authors: Babangida Sani, Aakansha Soy, Sukairaj Hafiz Imam, Ahmad Mustapha, Lukman Jibril Aliyu, Idris Abdulmumin, Ibrahim Said Ahmad, Shamsuddeen Hassan Muhammad
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.13101
Pdf URL: https://arxiv.org/pdf/2503.13101
Copy Paste: [[2503.13101]] Who Wrote This? Identifying Machine vs Human-Generated Text in Hausa(https://arxiv.org/abs/2503.13101)
Keywords: large language model
Abstract: The advancement of large language models (LLMs) has allowed them to be proficient in various tasks, including content generation. However, their unregulated usage can lead to malicious activities such as plagiarism and generating and spreading fake news, especially for low-resource languages. Most existing machine-generated text detectors are trained on high-resource languages like English, French, etc. In this study, we developed the first large-scale detector that can distinguish between human- and machine-generated content in Hausa. We scrapped seven Hausa-language media outlets for the human-generated text and the Gemini-2.0 flash model to automatically generate the corresponding Hausa-language articles based on the human-generated article headlines. We fine-tuned four pre-trained Afri-centric models (AfriTeVa, AfriBERTa, AfroXLMR, and AfroXLMR-76L) on the resulting dataset and assessed their performance using accuracy and F1-score metrics. AfroXLMR achieved the highest performance with an accuracy of 99.23% and an F1 score of 99.21%, demonstrating its effectiveness for Hausa text detection. Our dataset is made publicly available to enable further research.

Title: REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities

Authors: Alexander Pugachev, Alena Fenogenova, Vladislav Mikhailov, Ekaterina Artemova
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.13102
Pdf URL: https://arxiv.org/pdf/2503.13102
Copy Paste: [[2503.13102]] REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities(https://arxiv.org/abs/2503.13102)
Keywords: generative, large language model
Abstract: Recent advances in large language models (LLMs) have introduced the novel paradigm of using LLMs as judges, where an LLM evaluates and scores the outputs of another LLM, which often correlates highly with human preferences. However, the use of LLM-as-a-judge has been primarily studied in English. In this paper, we evaluate this framework in Russian by introducing the Russian Error tyPes Annotation dataset (REPA), a dataset of 1k user queries and 2k LLM-generated responses. Human annotators labeled each response pair expressing their preferences across ten specific error types, as well as selecting an overall preference. We rank six generative LLMs across the error types using three rating systems based on human preferences. We also evaluate responses using eight LLM judges in zero-shot and few-shot settings. We describe the results of analyzing the judges and position and length biases. Our findings reveal a notable gap between LLM judge performance in Russian and English. However, rankings based on human and LLM preferences show partial alignment, suggesting that while current LLM judges struggle with fine-grained evaluation in Russian, there is potential for improvement.

Title: ClearSight: Visual Signal Enhancement for Object Hallucination Mitigation in Multimodal Large language Models

Authors: Hao Yin, Guangzong Si, Zilei Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13107
Pdf URL: https://arxiv.org/pdf/2503.13107
Copy Paste: [[2503.13107]] ClearSight: Visual Signal Enhancement for Object Hallucination Mitigation in Multimodal Large language Models(https://arxiv.org/abs/2503.13107)
Keywords: large language model
Abstract: Contrastive decoding strategies are widely used to mitigate object hallucinations in multimodal large language models (MLLMs). By reducing over-reliance on language priors, these strategies ensure that generated content remains closely grounded in visual inputs, producing contextually accurate outputs. Since contrastive decoding requires no additional training or external tools, it offers both computational efficiency and versatility, making it highly attractive. However, these methods present two main limitations: (1) bluntly suppressing language priors can compromise coherence and accuracy of generated content, and (2) processing contrastive inputs adds computational load, significantly slowing inference speed. To address these challenges, we propose Visual Amplification Fusion (VAF), a plug-and-play technique that enhances attention to visual signals within the model's middle layers, where modality fusion predominantly occurs. This approach enables more effective capture of visual features, reducing the model's bias toward language modality. Experimental results demonstrate that VAF significantly reduces hallucinations across various MLLMs without affecting inference speed, while maintaining coherence and accuracy in generated outputs.

Title: Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference

Authors: Hao Yin, Guangzong Si, Zilei Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13108
Pdf URL: https://arxiv.org/pdf/2503.13108
Copy Paste: [[2503.13108]] Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference(https://arxiv.org/abs/2503.13108)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) improve performance on vision-language tasks by integrating visual features from pre-trained vision encoders into large language models (LLMs). However, how MLLMs process and utilize visual information remains unclear. In this paper, a shift in the dominant flow of visual information is uncovered: (1) in shallow layers, strong interactions are observed between image tokens and instruction tokens, where most visual information is injected into instruction tokens to form cross-modal semantic representations; (2) in deeper layers, image tokens primarily interact with each other, aggregating the remaining visual information to optimize semantic representations within visual modality. Based on these insights, we propose Hierarchical Modality-Aware Pruning (HiMAP), a plug-and-play inference acceleration method that dynamically prunes image tokens at specific layers, reducing computational costs by approximately 65% without sacrificing performance. Our findings offer a new understanding of visual information processing in MLLMs and provide a state-of-the-art solution for efficient inference.

Title: Code-Driven Inductive Synthesis: Enhancing Reasoning Abilities of Large Language Models with Sequences

Authors: Kedi Chen, Zhikai Lei, Fan Zhang, Yinqi Zhang, Qin Chen, Jie Zhou, Liang He, Qipeng Guo, Kai Chen, Wei Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.13109
Pdf URL: https://arxiv.org/pdf/2503.13109
Copy Paste: [[2503.13109]] Code-Driven Inductive Synthesis: Enhancing Reasoning Abilities of Large Language Models with Sequences(https://arxiv.org/abs/2503.13109)
Keywords: large language model
Abstract: Large language models make remarkable progress in reasoning capabilities. Existing works focus mainly on deductive reasoning tasks (e.g., code and math), while another type of reasoning mode that better aligns with human learning, inductive reasoning, is not well studied. We attribute the reason to the fact that obtaining high-quality process supervision data is challenging for inductive reasoning. Towards this end, we novelly employ number sequences as the source of inductive reasoning data. We package sequences into algorithmic problems to find the general term of each sequence through a code solution. In this way, we can verify whether the code solution holds for any term in the current sequence, and inject case-based supervision signals by using code unit tests. We build a sequence synthetic data pipeline and form a training dataset CodeSeq. Experimental results show that the models tuned with CodeSeq improve on both code and comprehensive reasoning benchmarks.

Title: DTGBrepGen: A Novel B-rep Generative Model through Decoupling Topology and Geometry

Authors: Jing Li, Yihang Fu, Falai Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13110
Pdf URL: https://arxiv.org/pdf/2503.13110
Copy Paste: [[2503.13110]] DTGBrepGen: A Novel B-rep Generative Model through Decoupling Topology and Geometry(https://arxiv.org/abs/2503.13110)
Keywords: diffusion, transformer, generative
Abstract: Boundary representation (B-rep) of geometric models is a fundamental format in Computer-Aided Design (CAD). However, automatically generating valid and high-quality B-rep models remains challenging due to the complex interdependence between the topology and geometry of the models. Existing methods tend to prioritize geometric representation while giving insufficient attention to topological constraints, making it difficult to maintain structural validity and geometric accuracy. In this paper, we propose DTGBrepGen, a novel topology-geometry decoupled framework for B-rep generation that explicitly addresses both aspects. Our approach first generates valid topological structures through a two-stage process that independently models edge-face and edge-vertex adjacency relationships. Subsequently, we employ Transformer-based diffusion models for sequential geometry generation, progressively generating vertex coordinates, followed by edge geometries and face geometries which are represented as B-splines. Extensive experiments on diverse CAD datasets show that DTGBrepGen significantly outperforms existing methods in both topological validity and geometric accuracy, achieving higher validity rates and producing more diverse and realistic B-reps. Our code is publicly available at this https URL.

Title: MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs

Authors: Erik Daxberger, Nina Wenzel, David Griffiths, Haiming Gang, Justin Lazarow, Gefen Kohavi, Kai Kang, Marcin Eichner, Yinfei Yang, Afshin Dehghan, Peter Grasch
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.13111
Pdf URL: https://arxiv.org/pdf/2503.13111
Copy Paste: [[2503.13111]] MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs(https://arxiv.org/abs/2503.13111)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) excel at 2D visual understanding but remain limited in their ability to reason about 3D space. In this work, we leverage large-scale high-quality 3D scene data with open-set annotations to introduce 1) a novel supervised fine-tuning dataset and 2) a new evaluation benchmark, focused on indoor scenes. Our Cubify Anything VQA (CA-VQA) data covers diverse spatial tasks including spatial relationship prediction, metric size and distance estimation, and 3D grounding. We show that CA-VQA enables us to train MM-Spatial, a strong generalist MLLM that also achieves state-of-the-art performance on 3D spatial understanding benchmarks, including our own. We show how incorporating metric depth and multi-view inputs (provided in CA-VQA) can further improve 3D understanding, and demonstrate that data alone allows our model to achieve depth perception capabilities comparable to dedicated monocular depth estimation models. We will publish our SFT dataset and benchmark.

Title: VeriLeaky: Navigating IP Protection vs Utility in Fine-Tuning for LLM-Driven Verilog Coding

Authors: Zeng Wang, Minghao Shao, Mohammed Nabeel, Prithwish Basu Roy, Likhitha Mankali, Jitendra Bhandari, Ramesh Karri, Ozgur Sinanoglu, Muhammad Shafique, Johann Knechtel
Subjects: cs.CR, cs.AR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.13116
Pdf URL: https://arxiv.org/pdf/2503.13116
Copy Paste: [[2503.13116]] VeriLeaky: Navigating IP Protection vs Utility in Fine-Tuning for LLM-Driven Verilog Coding(https://arxiv.org/abs/2503.13116)
Keywords: protect, defense, large language model
Abstract: Large language models (LLMs) offer significant potential for coding, yet fine-tuning (FT) with curated data is essential for niche languages like Verilog. Using proprietary intellectual property (IP) for FT presents a serious risk, as FT data can be leaked through LLM inference. This leads to a critical dilemma for design houses: seeking to build externally accessible LLMs offering competitive Verilog coding, how can they leverage in-house IP to enhance FT utility while ensuring IP protection? For the first time in the literature, we study this dilemma. Using LLaMA 3.1-8B, we conduct in-house FT on a baseline Verilog dataset (RTLCoder) supplemented with our own in-house IP, which is validated through multiple tape-outs. To rigorously assess IP leakage, we quantify structural similarity (AST/Dolos) and functional equivalence (Synopsys Formality) between generated codes and our in-house IP. We show that our IP can indeed be leaked, confirming the threat. As defense, we evaluate logic locking of Verilog codes (ASSURE). This offers some level of protection, yet reduces the IP's utility for FT and degrades the LLM's performance. Our study shows the need for novel strategies that are both effective and minimally disruptive to FT, an essential effort for enabling design houses to fully utilize their proprietary IP toward LLM-driven Verilog coding.

Title: 3D Human Interaction Generation: A Survey

Authors: Siyuan Fan, Wenke Huang, Xiantao Cai, Bo Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13120
Pdf URL: https://arxiv.org/pdf/2503.13120
Copy Paste: [[2503.13120]] 3D Human Interaction Generation: A Survey(https://arxiv.org/abs/2503.13120)
Keywords: generative
Abstract: 3D human interaction generation has emerged as a key research area, focusing on producing dynamic and contextually relevant interactions between humans and various interactive entities. Recent rapid advancements in 3D model representation methods, motion capture technologies, and generative models have laid a solid foundation for the growing interest in this domain. Existing research in this field can be broadly categorized into three areas: human-scene interaction, human-object interaction, and human-human interaction. Despite the rapid advancements in this area, challenges remain due to the need for naturalness in human motion generation and the accurate interaction between humans and interactive entities. In this survey, we present a comprehensive literature review of human interaction generation, which, to the best of our knowledge, is the first of its kind. We begin by introducing the foundational technologies, including model representations, motion capture methods, and generative models. Subsequently, we introduce the approaches proposed for the three sub-tasks, along with their corresponding datasets and evaluation metrics. Finally, we discuss potential future research directions in this area and conclude the survey. Through this survey, we aim to offer a comprehensive overview of the current advancements in the field, highlight key challenges, and inspire future research works.

Title: ChainHOI: Joint-based Kinematic Chain Modeling for Human-Object Interaction Generation

Authors: Ling-An Zeng, Guohong Huang, Yi-Lin Wei, Shengbo Gu, Yu-Ming Tang, Jingke Meng, Wei-Shi Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13130
Pdf URL: https://arxiv.org/pdf/2503.13130
Copy Paste: [[2503.13130]] ChainHOI: Joint-based Kinematic Chain Modeling for Human-Object Interaction Generation(https://arxiv.org/abs/2503.13130)
Keywords: generative
Abstract: We propose ChainHOI, a novel approach for text-driven human-object interaction (HOI) generation that explicitly models interactions at both the joint and kinetic chain levels. Unlike existing methods that implicitly model interactions using full-body poses as tokens, we argue that explicitly modeling joint-level interactions is more natural and effective for generating realistic HOIs, as it directly captures the geometric and semantic relationships between joints, rather than modeling interactions in the latent pose space. To this end, ChainHOI introduces a novel joint graph to capture potential interactions with objects, and a Generative Spatiotemporal Graph Convolution Network to explicitly model interactions at the joint level. Furthermore, we propose a Kinematics-based Interaction Module that explicitly models interactions at the kinetic chain level, ensuring more realistic and biomechanically coherent motions. Evaluations on two public datasets demonstrate that ChainHOI significantly outperforms previous methods, generating more realistic, and semantically consistent HOIs. Code is available \href{this https URL}{here}.

Title: Patient-specific radiomic feature selection with reconstructed healthy persona of knee MR images

Authors: Yaxi Chen, Simin Ni, Aleksandra Ivanova, Shaheer U. Saeed, Rikin Hargunani, Jie Huang, Chaozong Liu, Yipeng Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13131
Pdf URL: https://arxiv.org/pdf/2503.13131
Copy Paste: [[2503.13131]] Patient-specific radiomic feature selection with reconstructed healthy persona of knee MR images(https://arxiv.org/abs/2503.13131)
Keywords: interpretability, diffusion, generative
Abstract: Classical radiomic features have been designed to describe image appearance and intensity patterns. These features are directly interpretable and readily understood by radiologists. Compared with end-to-end deep learning (DL) models, lower dimensional parametric models that use such radiomic features offer enhanced interpretability but lower comparative performance in clinical tasks. In this study, we propose an approach where a standard logistic regression model performance is substantially improved by learning to select radiomic features for individual patients, from a pool of candidate features. This approach has potentials to maintain the interpretability of such approaches while offering comparable performance to DL. We also propose to expand the feature pool by generating a patient-specific healthy persona via mask-inpainting using a denoising diffusion model trained on healthy subjects. Such a pathology-free baseline feature set allows further opportunity in novel feature discovery and improved condition classification. We demonstrate our method on multiple clinical tasks of classifying general abnormalities, anterior cruciate ligament tears, and meniscus tears. Experimental results demonstrate that our approach achieved comparable or even superior performance than state-of-the-art DL approaches while offering added interpretability by using radiomic features extracted from images and supplemented by generating healthy personas. Example clinical cases are discussed in-depth to demonstrate the intepretability-enabled utilities such as human-explainable feature discovery and patient-specific location/view selection. These findings highlight the potentials of the combination of subject-specific feature selection with generative models in augmenting radiomic analysis for more interpretable decision-making. The codes are available at: this https URL

Title: DynSTG-Mamba: Dynamic Spatio-Temporal Graph Mamba with Cross-Graph Knowledge Distillation for Gait Disorders Recognition

Authors: Zakariae Zrimek, Youssef Mourchid, Mohammed El Hassouni
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13156
Pdf URL: https://arxiv.org/pdf/2503.13156
Copy Paste: [[2503.13156]] DynSTG-Mamba: Dynamic Spatio-Temporal Graph Mamba with Cross-Graph Knowledge Distillation for Gait Disorders Recognition(https://arxiv.org/abs/2503.13156)
Keywords: robust
Abstract: Gait disorder recognition plays a crucial role in the early diagnosis and monitoring of movement disorders. Existing approaches, including spatio-temporal graph convolutional networks (ST-GCNs), often face high memory demands and struggle to capture complex spatio-temporal dependencies, limiting their efficiency in clinical applications. To address these challenges, we introduce DynSTG-Mamba (Dynamic Spatio-Temporal Graph Mamba), a novel framework that combines DF-STGNN and STG-Mamba to enhance motion sequence modeling. The DF-STGNN incorporates a dynamic spatio-temporal filter that adaptively adjusts spatial connections between skeletal joints and temporal interactions across different movement phases. This approach ensures better feature propagation through dynamic graph structures by considering the hierarchical nature and dynamics of skeletal gait data. Meanwhile, STG-Mamba, an extension of Mamba adapted for skeletal motion data, ensures a continuous propagation of states, facilitating the capture of long-term dependencies while reducing computational complexity. To reduce the number of model parameters and computational costs while maintaining consistency, we propose Cross-Graph Relational Knowledge Distillation, a novel knowledge transfer mechanism that aligns relational information between teacher (large architecture) and student models (small architecture) while using shared memory. This ensures that the interactions and movement patterns of the joints are accurately preserved in the motion sequences. We validate our DynSTG-Mamba on KOA-NM, PD-WALK, and ATAXIA datasets, where it outperforms state-of-the-art approaches by achieving in terms of Accuracy, F1-score, and Recall. Our results highlight the efficiency and robustness of our approach, offering a lightweight yet highly accurate solution for automated gait analysis and movement disorder assessment.

Title: Laplace-Net: Learning Dynamical Systems with External Forcing

Authors: Bernd Zimmering, Cecília Coelho, Vaibhav Gupta, Maria Maleshkova, Oliver Niggemann
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2503.13158
Pdf URL: https://arxiv.org/pdf/2503.13158
Copy Paste: [[2503.13158]] Laplace-Net: Learning Dynamical Systems with External Forcing(https://arxiv.org/abs/2503.13158)
Keywords: robust, interpretability
Abstract: Modelling forced dynamical systems - where an external input drives the system state - is critical across diverse domains such as engineering, finance, and the natural sciences. In this work, we propose Laplace-Net, a decoupled, solver-free neural framework for learning forced and delay-aware systems. It leverages a Laplace transform-based approach to decompose internal dynamics, external inputs, and initial values into established theoretical concepts, enhancing interpretability. Laplace-Net promotes transferability since the system can be rapidly re-trained or fine-tuned for new forcing signals, providing flexibility in applications ranging from controller adaptation to long-horizon forecasting. Experimental results on eight benchmark datasets - including linear, non-linear, and delayed systems - demonstrate the method's improved accuracy and robustness compared to state-of-the-art approaches, particularly in handling complex and previously unseen inputs.

Title: Language-guided Open-world Video Anomaly Detection

Authors: Zihao Liu, Xiaoyu Wu, Jianqin Wu, Xuxu Wang, Linlin Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13160
Pdf URL: https://arxiv.org/pdf/2503.13160
Copy Paste: [[2503.13160]] Language-guided Open-world Video Anomaly Detection(https://arxiv.org/abs/2503.13160)
Keywords: robust
Abstract: Video anomaly detection models aim to detect anomalies that deviate from what is expected. In open-world scenarios, the expected events may change as requirements change. For example, not wearing a mask is considered abnormal during a flu outbreak but normal otherwise. However, existing methods assume that the definition of anomalies is invariable, and thus are not applicable to the open world. To address this, we propose a novel open-world VAD paradigm with variable definitions, allowing guided detection through user-provided natural language at inference time. This paradigm necessitates establishing a robust mapping from video and textual definition to anomaly score. Therefore, we propose LaGoVAD (Language-guided Open-world VAD), a model that dynamically adapts anomaly definitions through two regularization strategies: diversifying the relative durations of anomalies via dynamic video synthesis, and enhancing feature robustness through contrastive learning with negative mining. Training such adaptable models requires diverse anomaly definitions, but existing datasets typically provide given labels without semantic descriptions. To bridge this gap, we collect PreVAD (Pre-training Video Anomaly Dataset), the largest and most diverse video anomaly dataset to date, featuring 35,279 annotated videos with multi-level category labels and descriptions that explicitly define anomalies. Zero-shot experiments on seven datasets demonstrate SOTA performance. Data and code will be released.

Title: PAUSE: Low-Latency and Privacy-Aware Active User Selection for Federated Learning

Authors: Ori Peleg, Natalie Lang, Stefano Rini, Nir Shlezinger, Kobi Cohen
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2503.13173
Pdf URL: https://arxiv.org/pdf/2503.13173
Copy Paste: [[2503.13173]] PAUSE: Low-Latency and Privacy-Aware Active User Selection for Federated Learning(https://arxiv.org/abs/2503.13173)
Keywords: privacy, federate
Abstract: Federated learning (FL) enables multiple edge devices to collaboratively train a machine learning model without the need to share potentially private data. Federated learning proceeds through iterative exchanges of model updates, which pose two key challenges: First, the accumulation of privacy leakage over time, and second, communication latency. These two limitations are typically addressed separately: The former via perturbed updates to enhance privacy and the latter using user selection to mitigate latency - both at the expense of accuracy. In this work, we propose a method that jointly addresses the accumulation of privacy leakage and communication latency via active user selection, aiming to improve the trade-off among privacy, latency, and model performance. To achieve this, we construct a reward function that accounts for these three objectives. Building on this reward, we propose a multi-armed bandit (MAB)-based algorithm, termed Privacy-aware Active User SElection (PAUSE) which dynamically selects a subset of users each round while ensuring bounded overall privacy leakage. We establish a theoretical analysis, systematically showing that the reward growth rate of PAUSE follows that of the best-known rate in MAB literature. To address the complexity overhead of active user selection, we propose a simulated annealing-based relaxation of PAUSE and analyze its ability to approximate the reward-maximizing policy under reduced complexity. We numerically validate the privacy leakage, associated improved latency, and accuracy gains of our methods for the federated training in various scenarios.

Title: DeGauss: Dynamic-Static Decomposition with Gaussian Splatting for Distractor-free 3D Reconstruction

Authors: Rui Wang, Quentin Lohmeyer, Mirko Meboldt, Siyu Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13176
Pdf URL: https://arxiv.org/pdf/2503.13176
Copy Paste: [[2503.13176]] DeGauss: Dynamic-Static Decomposition with Gaussian Splatting for Distractor-free 3D Reconstruction(https://arxiv.org/abs/2503.13176)
Keywords: robust
Abstract: Reconstructing clean, distractor-free 3D scenes from real-world captures remains a significant challenge, particularly in highly dynamic and cluttered settings such as egocentric videos. To tackle this problem, we introduce DeGauss, a simple and robust self-supervised framework for dynamic scene reconstruction based on a decoupled dynamic-static Gaussian Splatting design. DeGauss models dynamic elements with foreground Gaussians and static content with background Gaussians, using a probabilistic mask to coordinate their composition and enable independent yet complementary optimization. DeGauss generalizes robustly across a wide range of real-world scenarios, from casual image collections to long, dynamic egocentric videos, without relying on complex heuristics or extensive supervision. Experiments on benchmarks including NeRF-on-the-go, ADT, AEA, Hot3D, and EPIC-Fields demonstrate that DeGauss consistently outperforms existing methods, establishing a strong baseline for generalizable, distractor-free 3D reconstructionin highly dynamic, interaction-rich environments.

Title: GC-Fed: Gradient Centralized Federated Learning with Partial Client Participation

Authors: Jungwon Seo, Ferhat Ozgur Catak, Chunming Rong, Kibeom Hong, Minhoe Kim
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2503.13180
Pdf URL: https://arxiv.org/pdf/2503.13180
Copy Paste: [[2503.13180]] GC-Fed: Gradient Centralized Federated Learning with Partial Client Participation(https://arxiv.org/abs/2503.13180)
Keywords: privacy, extraction, federate
Abstract: Multi-source information fusion (MSIF) leverages diverse data streams to enhance decision-making, situational awareness, and system resilience. Federated Learning (FL) enables MSIF while preserving privacy but suffers from client drift under high data heterogeneity, leading to performance degradation. Traditional mitigation strategies rely on reference-based gradient adjustments, which can be unstable in partial participation settings. To address this, we propose Gradient Centralized Federated Learning (GC-Fed), a reference-free gradient correction method inspired by Gradient Centralization (GC). We introduce Local GC and Global GC, applying GC during local training and global aggregation, respectively. Our hybrid GC-Fed approach selectively applies GC at the feature extraction layer locally and at the classifier layer globally, improving training stability and model performance. Theoretical analysis and empirical results demonstrate that GC-Fed mitigates client drift and achieves state-of-the-art accuracy gains of up to 20% in heterogeneous settings.

Title: 3DAxisPrompt: Promoting the 3D Grounding and Reasoning in GPT-4o

Authors: Dingning Liu, Cheng Wang, Peng Gao, Renrui Zhang, Xinzhu Ma, Yuan Meng, Zhihui Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13185
Pdf URL: https://arxiv.org/pdf/2503.13185
Copy Paste: [[2503.13185]] 3DAxisPrompt: Promoting the 3D Grounding and Reasoning in GPT-4o(https://arxiv.org/abs/2503.13185)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) exhibit impressive capabilities across a variety of tasks, especially when equipped with carefully designed visual prompts. However, existing studies primarily focus on logical reasoning and visual understanding, while the capability of MLLMs to operate effectively in 3D vision remains an ongoing area of exploration. In this paper, we introduce a novel visual prompting method, called 3DAxisPrompt, to elicit the 3D understanding capabilities of MLLMs in real-world scenes. More specifically, our method leverages the 3D coordinate axis and masks generated from the Segment Anything Model (SAM) to provide explicit geometric priors to MLLMs and then extend their impressive 2D grounding and reasoning ability to real-world 3D scenarios. Besides, we first provide a thorough investigation of the potential visual prompting formats and conclude our findings to reveal the potential and limits of 3D understanding capabilities in GPT-4o, as a representative of MLLMs. Finally, we build evaluation environments with four datasets, i.e., ScanRefer, ScanNet, FMB, and nuScene datasets, covering various 3D tasks. Based on this, we conduct extensive quantitative and qualitative experiments, which demonstrate the effectiveness of the proposed method. Overall, our study reveals that MLLMs, with the help of 3DAxisPrompt, can effectively perceive an object's 3D position in real-world scenarios. Nevertheless, a single prompt engineering approach does not consistently achieve the best outcomes for all 3D tasks. This study highlights the feasibility of leveraging MLLMs for 3D vision grounding/reasoning with prompt engineering techniques.

Title: 3D Hierarchical Panoptic Segmentation in Real Orchard Environments Across Different Sensors

Authors: Matteo Sodano, Federico Magistri, Elias Marks, Fares Hosn, Aibek Zurbayev, Rodrigo Marcuzzi, Meher V. R. Malladi, Jens Behley, Cyrill Stachniss
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.13188
Pdf URL: https://arxiv.org/pdf/2503.13188
Copy Paste: [[2503.13188]] 3D Hierarchical Panoptic Segmentation in Real Orchard Environments Across Different Sensors(https://arxiv.org/abs/2503.13188)
Keywords: segmentation
Abstract: Crop yield estimation is a relevant problem in agriculture, because an accurate crop yield estimate can support farmers' decisions on harvesting or precision intervention. Robots can help to automate this process. To do so, they need to be able to perceive the surrounding environment to identify target objects. In this paper, we introduce a novel approach to address the problem of hierarchical panoptic segmentation of apple orchards on 3D data from different sensors. Our approach is able to simultaneously provide semantic segmentation, instance segmentation of trunks and fruits, and instance segmentation of plants (a single trunk with its fruits). This allows us to identify relevant information such as individual plants, fruits, and trunks, and capture the relationship among them, such as precisely estimate the number of fruits associated to each tree in an orchard. Additionally, to efficiently evaluate our approach for hierarchical panoptic segmentation, we provide a dataset designed specifically for this task. Our dataset is recorded in Bonn in a real apple orchard with a variety of sensors, spanning from a terrestrial laser scanner to a RGB-D camera mounted on different robotic platforms. The experiments show that our approach surpasses state-of-the-art approaches in 3D panoptic segmentation in the agricultural domain, while also providing full hierarchical panoptic segmentation. Our dataset has been made publicly available at this https URL. We will provide the open-source implementation of our approach and public competiton for hierarchical panoptic segmentation on the hidden test sets upon paper acceptance.

Title: Deep Learning Advancements in Anomaly Detection: A Comprehensive Survey

Authors: Haoqi Huang, Ping Wang, Jianhua Pei, Jiacheng Wang, Shahen Alexanian, Dusit Niyato
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.13195
Pdf URL: https://arxiv.org/pdf/2503.13195
Copy Paste: [[2503.13195]] Deep Learning Advancements in Anomaly Detection: A Comprehensive Survey(https://arxiv.org/abs/2503.13195)
Keywords: security, interpretability
Abstract: The rapid expansion of data from diverse sources has made anomaly detection (AD) increasingly essential for identifying unexpected observations that may signal system failures, security breaches, or fraud. As datasets become more complex and high-dimensional, traditional detection methods struggle to effectively capture intricate patterns. Advances in deep learning have made AD methods more powerful and adaptable, improving their ability to handle high-dimensional and unstructured data. This survey provides a comprehensive review of over 180 recent studies, focusing on deep learning-based AD techniques. We categorize and analyze these methods into reconstruction-based and prediction-based approaches, highlighting their effectiveness in modeling complex data distributions. Additionally, we explore the integration of traditional and deep learning methods, highlighting how hybrid approaches combine the interpretability of traditional techniques with the flexibility of deep learning to enhance detection accuracy and model transparency. Finally, we identify open issues and propose future research directions to advance the field of AD. This review bridges gaps in existing literature and serves as a valuable resource for researchers and practitioners seeking to enhance AD techniques using deep learning.

Title: Clustering is back: Reaching state-of-the-art LiDAR instance segmentation without training

Authors: Corentin Sautier, Gilles Puy, Alexandre Boulch, Renaud Marlet, Vincent Lepetit
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13203
Pdf URL: https://arxiv.org/pdf/2503.13203
Copy Paste: [[2503.13203]] Clustering is back: Reaching state-of-the-art LiDAR instance segmentation without training(https://arxiv.org/abs/2503.13203)
Keywords: segmentation
Abstract: Panoptic segmentation of LiDAR point clouds is fundamental to outdoor scene understanding, with autonomous driving being a primary application. While state-of-the-art approaches typically rely on end-to-end deep learning architectures and extensive manual annotations of instances, the significant cost and time investment required for labeling large-scale point cloud datasets remains a major bottleneck in this field. In this work, we demonstrate that competitive panoptic segmentation can be achieved using only semantic labels, with instances predicted without any training or annotations. Our method achieves performance comparable to current state-of-the-art supervised methods on standard benchmarks including SemanticKITTI and nuScenes, and outperforms every publicly available method on SemanticKITTI as a drop-in instance head replacement, while running in real-time on a single-threaded CPU and requiring no instance labels. Our method is fully explainable, and requires no learning or parameter tuning. Code is available at this https URL

Title: Improving Complex Reasoning with Dynamic Prompt Corruption: A soft prompt Optimization Approach

Authors: Sinan Fan, Liang Xie, Chen Shen, Ge Teng, Xiaosong Yuan, Xiaofeng Zhang, Chenxi Huang, Wenxiao Wang, Xiaofei He, Jieping Ye
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13208
Pdf URL: https://arxiv.org/pdf/2503.13208
Copy Paste: [[2503.13208]] Improving Complex Reasoning with Dynamic Prompt Corruption: A soft prompt Optimization Approach(https://arxiv.org/abs/2503.13208)
Keywords: large language model
Abstract: Prompt-tuning (PT) for large language models (LLMs) can facilitate the performance on various conventional NLP tasks with significantly fewer trainable parameters. However, our investigation reveals that PT provides limited improvement and may even degrade the primitive performance of LLMs on complex reasoning tasks. Such a phenomenon suggests that soft prompts can positively impact certain instances while negatively affecting others, particularly during the later phases of reasoning. To address these challenges, We first identify an information accumulation within the soft prompts. Through detailed analysis, we demonstrate that this phenomenon is often accompanied by erroneous information flow patterns in the deeper layers of the model, which ultimately lead to incorrect reasoning outcomes. we propose a novel method called \textbf{D}ynamic \textbf{P}rompt \textbf{C}orruption (DPC) to take better advantage of soft prompts in complex reasoning tasks, which dynamically adjusts the influence of soft prompts based on their impact on the reasoning process. Specifically, DPC consists of two stages: Dynamic Trigger and Dynamic Corruption. First, Dynamic Trigger measures the impact of soft prompts, identifying whether beneficial or detrimental. Then, Dynamic Corruption mitigates the negative effects of soft prompts by selectively masking key tokens that interfere with the reasoning process. We validate the proposed approach through extensive experiments on various LLMs and reasoning tasks, including GSM8K, MATH, and AQuA. Experimental results demonstrate that DPC can consistently enhance the performance of PT, achieving 4\%-8\% accuracy gains compared to vanilla prompt tuning, highlighting the effectiveness of our approach and its potential to enhance complex reasoning in LLMs.

Title: MedLoRD: A Medical Low-Resource Diffusion Model for High-Resolution 3D CT Image Synthesis

Authors: Marvin Seyfarth, Salman Ul Hassan Dar, Isabelle Ayx, Matthias Alexander Fink, Stefan O. Schoenberg, Hans-Ulrich Kauczor, Sandy Engelhardt
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13211
Pdf URL: https://arxiv.org/pdf/2503.13211
Copy Paste: [[2503.13211]] MedLoRD: A Medical Low-Resource Diffusion Model for High-Resolution 3D CT Image Synthesis(https://arxiv.org/abs/2503.13211)
Keywords: privacy, diffusion, generative, segmentation
Abstract: Advancements in AI for medical imaging offer significant potential. However, their applications are constrained by the limited availability of data and the reluctance of medical centers to share it due to patient privacy concerns. Generative models present a promising solution by creating synthetic data as a substitute for real patient data. However, medical images are typically high-dimensional, and current state-of-the-art methods are often impractical for computational resource-constrained healthcare environments. These models rely on data sub-sampling, raising doubts about their feasibility and real-world applicability. Furthermore, many of these models are evaluated on quantitative metrics that alone can be misleading in assessing the image quality and clinical meaningfulness of the generated images. To address this, we introduce MedLoRD, a generative diffusion model designed for computational resource-constrained environments. MedLoRD is capable of generating high-dimensional medical volumes with resolutions up to 512$\times$512$\times$256, utilizing GPUs with only 24GB VRAM, which are commonly found in standard desktop workstations. MedLoRD is evaluated across multiple modalities, including Coronary Computed Tomography Angiography and Lung Computed Tomography datasets. Extensive evaluations through radiological evaluation, relative regional volume analysis, adherence to conditional masks, and downstream tasks show that MedLoRD generates high-fidelity images closely adhering to segmentation mask conditions, surpassing the capabilities of current state-of-the-art generative models for medical image synthesis in computational resource-constrained environments.

Title: Can Language Models Follow Multiple Turns of Entangled Instructions?

Authors: Chi Han
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13222
Pdf URL: https://arxiv.org/pdf/2503.13222
Copy Paste: [[2503.13222]] Can Language Models Follow Multiple Turns of Entangled Instructions?(https://arxiv.org/abs/2503.13222)
Keywords: privacy, protect, large language model
Abstract: Despite significant achievements in improving the instruction-following capabilities of large language models (LLMs), the ability to process multiple potentially entangled or conflicting instructions remains a considerable challenge. Real-world scenarios often require consistency across multiple instructions over time, such as secret privacy, personal preferences, and prioritization, which demand sophisticated abilities to integrate multiple turns and carefully balance competing objectives when instructions intersect or conflict. This work presents a systematic investigation of LLMs' capabilities in handling multiple turns of instructions, covering three levels of difficulty: (1) retrieving information from instructions, (2) tracking and reasoning across turns, and (3) resolving conflicts among instructions. We construct MultiTurnInstruct with around 1.1K high-quality multi-turn conversations through the human-in-the-loop approach and result in nine capability categories, including statics and dynamics, reasoning, and multitasking. Our finding reveals an intriguing trade-off between different capabilities. While GPT models demonstrate superior memorization, they show reduced effectiveness in privacy-protection tasks requiring selective information withholding. Larger models exhibit stronger reasoning capabilities but still struggle with resolving conflicting instructions. Importantly, these performance gaps cannot be attributed solely to information loss, as models demonstrate strong BLEU scores on memorization tasks but their attention mechanisms fail to integrate multiple related instructions effectively. These findings highlight critical areas for improvement in complex real-world tasks involving multi-turn instructions.

Title: ProDiF: Protecting Domain-Invariant Features to Secure Pre-Trained Models Against Extraction

Authors: Tong Zhou, Shijin Duan, Gaowen Liu, Charles Fleming, Ramana Rao Kompella, Shaolei Ren, Xiaolin Xu
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.13224
Pdf URL: https://arxiv.org/pdf/2503.13224
Copy Paste: [[2503.13224]] ProDiF: Protecting Domain-Invariant Features to Secure Pre-Trained Models Against Extraction(https://arxiv.org/abs/2503.13224)
Keywords: secure, security, protect, attack, robust, extraction
Abstract: Pre-trained models are valuable intellectual property, capturing both domain-specific and domain-invariant features within their weight spaces. However, model extraction attacks threaten these assets by enabling unauthorized source-domain inference and facilitating cross-domain transfer via the exploitation of domain-invariant features. In this work, we introduce **ProDiF**, a novel framework that leverages targeted weight space manipulation to secure pre-trained models against extraction attacks. **ProDiF** quantifies the transferability of filters and perturbs the weights of critical filters in unsecured memory, while preserving actual critical weights in a Trusted Execution Environment (TEE) for authorized users. A bi-level optimization further ensures resilience against adaptive fine-tuning attacks. Experimental results show that **ProDiF** reduces source-domain accuracy to near-random levels and decreases cross-domain transferability by 74.65\%, providing robust protection for pre-trained models. This work offers comprehensive protection for pre-trained DNN models and highlights the potential of weight space manipulation as a novel approach to model security.

Title: Mind the Gap: Confidence Discrepancy Can Guide Federated Semi-Supervised Learning Across Pseudo-Mismatch

Authors: Yijie Liu, Xinyi Shang, Yiqun Zhang, Yang Lu, Chen Gong, Jing-Hao Xue, Hanzi Wang
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.13227
Pdf URL: https://arxiv.org/pdf/2503.13227
Copy Paste: [[2503.13227]] Mind the Gap: Confidence Discrepancy Can Guide Federated Semi-Supervised Learning Across Pseudo-Mismatch(https://arxiv.org/abs/2503.13227)
Keywords: federate
Abstract: Federated Semi-Supervised Learning (FSSL) aims to leverage unlabeled data across clients with limited labeled data to train a global model with strong generalization ability. Most FSSL methods rely on consistency regularization with pseudo-labels, converting predictions from local or global models into hard pseudo-labels as supervisory signals. However, we discover that the quality of pseudo-label is largely deteriorated by data heterogeneity, an intrinsic facet of federated learning. In this paper, we study the problem of FSSL in-depth and show that (1) heterogeneity exacerbates pseudo-label mismatches, further degrading model performance and convergence, and (2) local and global models' predictive tendencies diverge as heterogeneity increases. Motivated by these findings, we propose a simple and effective method called Semi-supervised Aggregation for Globally-Enhanced Ensemble (SAGE), that can flexibly correct pseudo-labels based on confidence discrepancies. This strategy effectively mitigates performance degradation caused by incorrect pseudo-labels and enhances consensus between local and global models. Experimental results demonstrate that SAGE outperforms existing FSSL methods in both performance and convergence. Our code is available at this https URL

Title: HoloGest: Decoupled Diffusion and Motion Priors for Generating Holisticly Expressive Co-speech Gestures

Authors: Yongkang Cheng, Shaoli Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13229
Pdf URL: https://arxiv.org/pdf/2503.13229
Copy Paste: [[2503.13229]] HoloGest: Decoupled Diffusion and Motion Priors for Generating Holisticly Expressive Co-speech Gestures(https://arxiv.org/abs/2503.13229)
Keywords: robust, diffusion
Abstract: Animating virtual characters with holistic co-speech gestures is a challenging but critical task. Previous systems have primarily focused on the weak correlation between audio and gestures, leading to physically unnatural outcomes that degrade the user experience. To address this problem, we introduce HoleGest, a novel neural network framework based on decoupled diffusion and motion priors for the automatic generation of high-quality, expressive co-speech gestures. Our system leverages large-scale human motion datasets to learn a robust prior with low audio dependency and high motion reliance, enabling stable global motion and detailed finger movements. To improve the generation efficiency of diffusion-based models, we integrate implicit joint constraints with explicit geometric and conditional constraints, capturing complex motion distributions between large strides. This integration significantly enhances generation speed while maintaining high-quality motion. Furthermore, we design a shared embedding space for gesture-transcription text alignment, enabling the generation of semantically correct gesture actions. Extensive experiments and user feedback demonstrate the effectiveness and potential applications of our model, with our method achieving a level of realism close to the ground truth, providing an immersive user experience. Our code, model, and demo are are available at this https URL.

Title: Sampling Innovation-Based Adaptive Compressive Sensing

Authors: Zhifu Tian, Tao Hu, Chaoyang Niu, Di Wu, Shu Wang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.13241
Pdf URL: https://arxiv.org/pdf/2503.13241
Copy Paste: [[2503.13241]] Sampling Innovation-Based Adaptive Compressive Sensing(https://arxiv.org/abs/2503.13241)
Keywords: robust
Abstract: Scene-aware Adaptive Compressive Sensing (ACS) has attracted significant interest due to its promising capability for efficient and high-fidelity acquisition of scene images. ACS typically prescribes adaptive sampling allocation (ASA) based on previous samples in the absence of ground truth. However, when confronting unknown scenes, existing ACS methods often lack accurate judgment and robust feedback mechanisms for ASA, thus limiting the high-fidelity sensing of the scene. In this paper, we introduce a Sampling Innovation-Based ACS (SIB-ACS) method that can effectively identify and allocate sampling to challenging image reconstruction areas, culminating in high-fidelity image reconstruction. An innovation criterion is proposed to judge ASA by predicting the decrease in image reconstruction error attributable to sampling increments, thereby directing more samples towards regions where the reconstruction error diminishes significantly. A sampling innovation-guided multi-stage adaptive sampling (AS) framework is proposed, which iteratively refines the ASA through a multi-stage feedback process. For image reconstruction, we propose a Principal Component Compressed Domain Network (PCCD-Net), which efficiently and faithfully reconstructs images under AS scenarios. Extensive experiments demonstrate that the proposed SIB-ACS method significantly outperforms the state-of-the-art methods in terms of image reconstruction fidelity and visual effects. Codes are available at this https URL.

Title: TablePilot; Recommending Human-Preferred Tabular Data Analysis with Large Language Models

Authors: Deyin Yi, Yihao Liu, Lang Cao, Mengyu Zhou, Haoyu Dong, Shi Han, Dongmei Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.13262
Pdf URL: https://arxiv.org/pdf/2503.13262
Copy Paste: [[2503.13262]] TablePilot; Recommending Human-Preferred Tabular Data Analysis with Large Language Models(https://arxiv.org/abs/2503.13262)
Keywords: large language model
Abstract: Tabular data analysis is crucial in many scenarios, yet efficiently identifying the most relevant data analysis queries and results for a new table remains a significant challenge. The complexity of tabular data, diverse analytical operations, and the demand for high-quality analysis make the process tedious. To address these challenges, we aim to recommend query-code-result triplets tailored for new tables in tabular data analysis workflows. In this paper, we present TablePilot, a pioneering tabular data analysis framework leveraging large language models to autonomously generate comprehensive and superior analytical results without relying on user profiles or prior interactions. The framework incorporates key designs in analysis preparation and analysis optimization to enhance accuracy. Additionally, we propose Rec-Align, a novel method to further improve recommendation quality and better align with human preferences. Experiments on DART, a dataset specifically designed for comprehensive tabular data analysis recommendation, demonstrate the effectiveness of our framework. Based on GPT-4o, the tuned TablePilot achieves 77.0% top-5 recommendation recall. Human evaluations further highlight its effectiveness in optimizing tabular data analysis workflows.

Title: FlexWorld: Progressively Expanding 3D Scenes for Flexiable-View Synthesis

Authors: Luxi Chen, Zihan Zhou, Min Zhao, Yikai Wang, Ge Zhang, Wenhao Huang, Hao Sun, Ji-Rong Wen, Chongxuan Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13265
Pdf URL: https://arxiv.org/pdf/2503.13265
Copy Paste: [[2503.13265]] FlexWorld: Progressively Expanding 3D Scenes for Flexiable-View Synthesis(https://arxiv.org/abs/2503.13265)
Keywords: diffusion
Abstract: Generating flexible-view 3D scenes, including 360° rotation and zooming, from single images is challenging due to a lack of 3D data. To this end, we introduce FlexWorld, a novel framework consisting of two key components: (1) a strong video-to-video (V2V) diffusion model to generate high-quality novel view images from incomplete input rendered from a coarse scene, and (2) a progressive expansion process to construct a complete 3D scene. In particular, leveraging an advanced pre-trained video model and accurate depth-estimated training pairs, our V2V model can generate novel views under large camera pose variations. Building upon it, FlexWorld progressively generates new 3D content and integrates it into the global scene through geometry-aware scene fusion. Extensive experiments demonstrate the effectiveness of FlexWorld in generating high-quality novel view videos and flexible-view 3D scenes from single images, achieving superior visual quality under multiple popular metrics and datasets compared to existing state-of-the-art methods. Qualitatively, we highlight that FlexWorld can generate high-fidelity scenes with flexible views like 360° rotations and zooming. Project page: this https URL.

Title: Graph Generative Models Evaluation with Masked Autoencoder

Authors: Chengen Wang, Murat Kantarcioglu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.13271
Pdf URL: https://arxiv.org/pdf/2503.13271
Copy Paste: [[2503.13271]] Graph Generative Models Evaluation with Masked Autoencoder(https://arxiv.org/abs/2503.13271)
Keywords: generative
Abstract: In recent years, numerous graph generative models (GGMs) have been proposed. However, evaluating these models remains a considerable challenge, primarily due to the difficulty in extracting meaningful graph features that accurately represent real-world graphs. The traditional evaluation techniques, which rely on graph statistical properties like node degree distribution, clustering coefficients, or Laplacian spectrum, overlook node features and lack scalability. There are newly proposed deep learning-based methods employing graph random neural networks or contrastive learning to extract graph features, demonstrating superior performance compared to traditional statistical methods, but their experimental results also demonstrate that these methods do not always working well across different metrics. Although there are overlaps among these metrics, they are generally not interchangeable, each evaluating generative models from a different perspective. In this paper, we propose a novel method that leverages graph masked autoencoders to effectively extract graph features for GGM evaluations. We conduct extensive experiments on graphs and empirically demonstrate that our method can be more reliable and effective than previously proposed methods across a number of GGM evaluation metrics, such as "Fréchet Distance (FD)" and "MMD Linear". However, no single method stands out consistently across all metrics and datasets. Therefore, this study also aims to raise awareness of the significance and challenges associated with GGM evaluation techniques, especially in light of recent advances in generative models.

Title: Generative Gaussian Splatting: Generating 3D Scenes with Video Diffusion Priors

Authors: Katja Schwarz, Norman Mueller, Peter Kontschieder
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13272
Pdf URL: https://arxiv.org/pdf/2503.13272
Copy Paste: [[2503.13272]] Generative Gaussian Splatting: Generating 3D Scenes with Video Diffusion Priors(https://arxiv.org/abs/2503.13272)
Keywords: diffusion, generative
Abstract: Synthesizing consistent and photorealistic 3D scenes is an open problem in computer vision. Video diffusion models generate impressive videos but cannot directly synthesize 3D representations, i.e., lack 3D consistency in the generated sequences. In addition, directly training generative 3D models is challenging due to a lack of 3D training data at scale. In this work, we present Generative Gaussian Splatting (GGS) -- a novel approach that integrates a 3D representation with a pre-trained latent video diffusion model. Specifically, our model synthesizes a feature field parameterized via 3D Gaussian primitives. The feature field is then either rendered to feature maps and decoded into multi-view images, or directly upsampled into a 3D radiance field. We evaluate our approach on two common benchmark datasets for scene synthesis, RealEstate10K and ScanNet+, and find that our proposed GGS model significantly improves both the 3D consistency of the generated multi-view images, and the quality of the generated 3D scenes over all relevant baselines. Compared to a similar model without 3D representation, GGS improves FID on the generated 3D scenes by ~20% on both RealEstate10K and ScanNet+. Project page: this https URL

Title: LLM-Match: An Open-Sourced Patient Matching Model Based on Large Language Models and Retrieval-Augmented Generation

Authors: Xiaodi Li, Shaika Chowdhury, Chung Il Wi, Maria Vassilaki, Ken Liu, Terence T Sio, Owen Garrick, Young J Juhn, James R Cerhan, Cui Tao, Nansu Zong
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.13281
Pdf URL: https://arxiv.org/pdf/2503.13281
Copy Paste: [[2503.13281]] LLM-Match: An Open-Sourced Patient Matching Model Based on Large Language Models and Retrieval-Augmented Generation(https://arxiv.org/abs/2503.13281)
Keywords: large language model
Abstract: Patient matching is the process of linking patients to appropriate clinical trials by accurately identifying and matching their medical records with trial eligibility criteria. We propose LLM-Match, a novel framework for patient matching leveraging fine-tuned open-source large language models. Our approach consists of four key components. First, a retrieval-augmented generation (RAG) module extracts relevant patient context from a vast pool of electronic health records (EHRs). Second, a prompt generation module constructs input prompts by integrating trial eligibility criteria (both inclusion and exclusion criteria), patient context, and system instructions. Third, a fine-tuning module with a classification head optimizes the model parameters using structured prompts and ground-truth labels. Fourth, an evaluation module assesses the fine-tuned model's performance on the testing datasets. We evaluated LLM-Match on four open datasets, n2c2, SIGIR, TREC 2021, and TREC 2022, using open-source models, comparing it against TrialGPT, Zero-Shot, and GPT-4-based closed models. LLM-Match outperformed all baselines.

Title: A Survey on Transformer Context Extension: Approaches and Evaluation

Authors: Yijun Liu, Jinzheng Yu, Yang Xu, Zhongyang Li, Qingfu Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13299
Pdf URL: https://arxiv.org/pdf/2503.13299
Copy Paste: [[2503.13299]] A Survey on Transformer Context Extension: Approaches and Evaluation(https://arxiv.org/abs/2503.13299)
Keywords: transformer, large language model
Abstract: Large language models (LLMs) based on Transformer have been widely applied in the filed of natural language processing (NLP), demonstrating strong performance, particularly in handling short text tasks. However, when it comes to long context scenarios, the performance of LLMs degrades due to some challenges. To alleviate this phenomenon, there is a number of work proposed recently. In this survey, we first list the challenges of applying pre-trained LLMs to process long contexts. Then systematically review the approaches related to long context and propose our taxonomy categorizing them into four main types: positional encoding, context compression, retrieval augmented, and attention pattern. In addition to the approaches, we focus on the evaluation of long context, organizing relevant data, tasks, and metrics based on existing long context benchmarks. Finally, we summarize unresolved issues in the long context domain and put forward our views on future developments.

Title: UniHOPE: A Unified Approach for Hand-Only and Hand-Object Pose Estimation

Authors: Yinqiao Wang, Hao Xu, Pheng-Ann Heng, Chi-Wing Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13303
Pdf URL: https://arxiv.org/pdf/2503.13303
Copy Paste: [[2503.13303]] UniHOPE: A Unified Approach for Hand-Only and Hand-Object Pose Estimation(https://arxiv.org/abs/2503.13303)
Keywords: robust
Abstract: Estimating the 3D pose of hand and potential hand-held object from monocular images is a longstanding challenge. Yet, existing methods are specialized, focusing on either bare-hand or hand interacting with object. No method can flexibly handle both scenarios and their performance degrades when applied to the other scenario. In this paper, we propose UniHOPE, a unified approach for general 3D hand-object pose estimation, flexibly adapting both scenarios. Technically, we design a grasp-aware feature fusion module to integrate hand-object features with an object switcher to dynamically control the hand-object pose estimation according to grasping status. Further, to uplift the robustness of hand pose estimation regardless of object presence, we generate realistic de-occluded image pairs to train the model to learn object-induced hand occlusions, and formulate multi-level feature enhancement techniques for learning occlusion-invariant features. Extensive experiments on three commonly-used benchmarks demonstrate UniHOPE's SOTA performance in addressing hand-only and hand-object scenarios. Code will be released on this https URL.

Title: GFSNetwork: Differentiable Feature Selection via Gumbel-Sigmoid Relaxation

Authors: Witold Wydmański, Marek Śmieja
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.13304
Pdf URL: https://arxiv.org/pdf/2503.13304
Copy Paste: [[2503.13304]] GFSNetwork: Differentiable Feature Selection via Gumbel-Sigmoid Relaxation(https://arxiv.org/abs/2503.13304)
Keywords: interpretability
Abstract: Feature selection in deep learning remains a critical challenge, particularly for high-dimensional tabular data where interpretability and computational efficiency are paramount. We present GFSNetwork, a novel neural architecture that performs differentiable feature selection through temperature-controlled Gumbel-Sigmoid sampling. Unlike traditional methods, where the user has to define the requested number of features, GFSNetwork selects it automatically during an end-to-end process. Moreover, GFSNetwork maintains constant computational overhead regardless of the number of input features. We evaluate GFSNetwork on a series of classification and regression benchmarks, where it consistently outperforms recent methods including DeepLasso, attention maps, as well as traditional feature selectors, while using significantly fewer features. Furthermore, we validate our approach on real-world metagenomic datasets, demonstrating its effectiveness in high-dimensional biological data. Concluding, our method provides a scalable solution that bridges the gap between neural network flexibility and traditional feature selection interpretability. We share our python implementation of GFSNetwork at this https URL, as well as a PyPi package (gfs_network).

Title: Computation Mechanism Behind LLM Position Generalization

Authors: Chi Han, Heng Ji
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13305
Pdf URL: https://arxiv.org/pdf/2503.13305
Copy Paste: [[2503.13305]] Computation Mechanism Behind LLM Position Generalization(https://arxiv.org/abs/2503.13305)
Keywords: large language model
Abstract: Most written natural languages are composed of sequences of words and sentences. Similar to humans, large language models (LLMs) exhibit flexibility in handling textual positions - a phenomenon we term position generalization. They can understand texts with position perturbations and generalize to longer texts than those encountered during training with the latest techniques. These phenomena suggest that LLMs handle positions tolerantly, but how LLMs computationally process positional relevance remains largely unexplored. This work connects the linguistic phenomenon with LLMs' computational mechanisms. We show how LLMs enforce certain computational mechanisms for the aforementioned tolerance in position perturbations. Despite the complex design of the self-attention mechanism, this work reveals that LLMs learn a counterintuitive disentanglement of attention logits. Their values show a 0.959 linear correlation with an approximation of the arithmetic sum of positional relevance and semantic importance. Furthermore, we identify a prevalent pattern in intermediate features, which we prove theoretically enables this effect. The pattern, which is different from how randomly initialized parameters would behave, suggests that it is a learned behavior rather than a natural result of the model architecture. Based on these findings, we provide computational explanations and criteria for LLMs' position flexibilities. This work takes a pioneering step in linking position generalization with modern LLMs' internal mechanisms.

Title: MagicDistillation: Weak-to-Strong Video Distillation for Large-Scale Portrait Few-Step Synthesis

Authors: Shitong Shao, Hongwei Yi, Hanzhong Guo, Tian Ye, Daquan Zhou, Michael Lingelbach, Zhiqiang Xu, Zeke Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13319
Pdf URL: https://arxiv.org/pdf/2503.13319
Copy Paste: [[2503.13319]] MagicDistillation: Weak-to-Strong Video Distillation for Large-Scale Portrait Few-Step Synthesis(https://arxiv.org/abs/2503.13319)
Keywords: diffusion, transformer
Abstract: Fine-tuning open-source large-scale VDMs for the portrait video synthesis task can result in significant improvements across multiple dimensions, such as visual quality and natural facial motion dynamics. Despite their advancements, how to achieve step distillation and reduce the substantial computational overhead of large-scale VDMs remains unexplored. To fill this gap, this paper proposes Weak-to-Strong Video Distillation (W2SVD) to mitigate both the issue of insufficient training memory and the problem of training collapse observed in vanilla DMD during the training process. Specifically, we first leverage LoRA to fine-tune the fake diffusion transformer (DiT) to address the out-of-memory issue. Then, we employ the W2S distribution matching to adjust the real DiT's parameter, subtly shifting it toward the fake DiT's parameter. This adjustment is achieved by utilizing the weak weight of the low-rank branch, effectively alleviate the conundrum where the video synthesized by the few-step generator deviates from the real data distribution, leading to inaccuracies in the KL divergence approximation. Additionally, we minimize the distance between the fake data distribution and the ground truth distribution to further enhance the visual quality of the synthesized videos. As experimentally demonstrated on HunyuanVideo, W2SVD surpasses the standard Euler, LCM, DMD and even the 28-step standard sampling in FID/FVD and VBench in 1/4-step video synthesis. The project page is in this https URL.

Title: Edit Transfer: Learning Image Editing via Vision In-Context Relations

Authors: Lan Chen, Qi Mao, Yuchao Gu, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13327
Pdf URL: https://arxiv.org/pdf/2503.13327
Copy Paste: [[2503.13327]] Edit Transfer: Learning Image Editing via Vision In-Context Relations(https://arxiv.org/abs/2503.13327)
Keywords: large language model
Abstract: We introduce a new setting, Edit Transfer, where a model learns a transformation from just a single source-target example and applies it to a new query image. While text-based methods excel at semantic manipulations through textual prompts, they often struggle with precise geometric details (e.g., poses and viewpoint changes). Reference-based editing, on the other hand, typically focuses on style or appearance and fails at non-rigid transformations. By explicitly learning the editing transformation from a source-target pair, Edit Transfer mitigates the limitations of both text-only and appearance-centric references. Drawing inspiration from in-context learning in large language models, we propose a visual relation in-context learning paradigm, building upon a DiT-based text-to-image model. We arrange the edited example and the query image into a unified four-panel composite, then apply lightweight LoRA fine-tuning to capture complex spatial transformations from minimal examples. Despite using only 42 training samples, Edit Transfer substantially outperforms state-of-the-art TIE and RIE methods on diverse non-rigid scenarios, demonstrating the effectiveness of few-shot visual relation learning.

Title: Valid Text-to-SQL Generation with Unification-based DeepStochLog

Authors: Ying Jiao, Luc De Raedt, Giuseppe Marra
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13342
Pdf URL: https://arxiv.org/pdf/2503.13342
Copy Paste: [[2503.13342]] Valid Text-to-SQL Generation with Unification-based DeepStochLog(https://arxiv.org/abs/2503.13342)
Keywords: large language model
Abstract: Large language models have been used to translate natural language questions to SQL queries. Without hard constraints on syntax and database schema, they occasionally produce invalid queries that are not executable. These failures limit the usage of these systems in real-life scenarios. We propose a neurosymbolic framework that imposes SQL syntax and schema constraints with unification-based definite clause grammars and thus guarantees the generation of valid queries. Our framework also builds a bi-directional interface to language models to leverage their natural language understanding abilities. The evaluation results on a subset of SQL grammars show that all our output queries are valid. This work is the first step towards extending language models with unification-based grammars. We demonstrate this extension enhances the validity, execution accuracy, and ground truth alignment of the underlying language model by a large margin. Our code is available at this https URL.

Title: STEP: Simultaneous Tracking and Estimation of Pose for Animals and Humans

Authors: Shashikant Verma, Harish Katti, Soumyaratna Debnath, Yamuna Swamy, Shanmuganathan Raman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13344
Pdf URL: https://arxiv.org/pdf/2503.13344
Copy Paste: [[2503.13344]] STEP: Simultaneous Tracking and Estimation of Pose for Animals and Humans(https://arxiv.org/abs/2503.13344)
Keywords: transformer
Abstract: We introduce STEP, a novel framework utilizing Transformer-based discriminative model prediction for simultaneous tracking and estimation of pose across diverse animal species and humans. We are inspired by the fact that the human brain exploits spatiotemporal continuity and performs concurrent localization and pose estimation despite the specialization of brain areas for form and motion processing. Traditional discriminative models typically require predefined target states for determining model weights, a challenge we address through Gaussian Map Soft Prediction (GMSP) and Offset Map Regression Adapter (OMRA) Modules. These modules remove the necessity of keypoint target states as input, streamlining the process. Our method starts with a known target state initialized through a pre-trained detector or manual initialization in the initial frame of a given video sequence. It then seamlessly tracks the target and estimates keypoints of anatomical importance as output for subsequent frames. Unlike prevalent top-down pose estimation methods, our approach doesn't rely on per-frame target detections due to its tracking capability. This facilitates a significant advancement in inference efficiency and potential applications. We train and validate our approach on datasets encompassing diverse species. Our experiments demonstrate superior results compared to existing methods, opening doors to various applications, including but not limited to action recognition and behavioral analysis.

Title: Agents Play Thousands of 3D Video Games

Authors: Zhongwen Xu, Xianliang Wang, Siyi Li, Tao Yu, Liang Wang, Qiang Fu, Wei Yang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.13356
Pdf URL: https://arxiv.org/pdf/2503.13356
Copy Paste: [[2503.13356]] Agents Play Thousands of 3D Video Games(https://arxiv.org/abs/2503.13356)
Keywords: large language model
Abstract: We present PORTAL, a novel framework for developing artificial intelligence agents capable of playing thousands of 3D video games through language-guided policy generation. By transforming decision-making problems into language modeling tasks, our approach leverages large language models (LLMs) to generate behavior trees represented in domain-specific language (DSL). This method eliminates the computational burden associated with traditional reinforcement learning approaches while preserving strategic depth and rapid adaptability. Our framework introduces a hybrid policy structure that combines rule-based nodes with neural network components, enabling both high-level strategic reasoning and precise low-level control. A dual-feedback mechanism incorporating quantitative game metrics and vision-language model analysis facilitates iterative policy improvement at both tactical and strategic levels. The resulting policies are instantaneously deployable, human-interpretable, and capable of generalizing across diverse gaming environments. Experimental results demonstrate PORTAL's effectiveness across thousands of first-person shooter (FPS) games, showcasing significant improvements in development efficiency, policy generalization, and behavior diversity compared to traditional approaches. PORTAL represents a significant advancement in game AI development, offering a practical solution for creating sophisticated agents that can operate across thousands of commercial video games with minimal development overhead. Experiment results on the 3D video games are best viewed on this https URL .

Title: One-Step Residual Shifting Diffusion for Image Super-Resolution via Distillation

Authors: Daniil Selikhanovych, David Li, Aleksei Leonov, Nikita Gushchin, Sergei Kushneriuk, Alexander Filippov, Evgeny Burnaev, Iaroslav Koshelev, Alexander Korotin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13358
Pdf URL: https://arxiv.org/pdf/2503.13358
Copy Paste: [[2503.13358]] One-Step Residual Shifting Diffusion for Image Super-Resolution via Distillation(https://arxiv.org/abs/2503.13358)
Keywords: diffusion
Abstract: Diffusion models for super-resolution (SR) produce high-quality visual results but require expensive computational costs. Despite the development of several methods to accelerate diffusion-based SR models, some (e.g., SinSR) fail to produce realistic perceptual details, while others (e.g., OSEDiff) may hallucinate non-existent structures. To overcome these issues, we present RSD, a new distillation method for ResShift, one of the top diffusion-based SR models. Our method is based on training the student network to produce such images that a new fake ResShift model trained on them will coincide with the teacher model. RSD achieves single-step restoration and outperforms the teacher by a large margin. We show that our distillation method can surpass the other distillation-based method for ResShift - SinSR - making it on par with state-of-the-art diffusion-based SR distillation methods. Compared to SR methods based on pre-trained text-to-image models, RSD produces competitive perceptual quality, provides images with better alignment to degraded input images, and requires fewer parameters and GPU memory. We provide experimental results on various real-world and synthetic datasets, including RealSR, RealSet65, DRealSR, ImageNet, and DIV2K.

Title: Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning

Authors: Hai-Long Sun, Zhun Sun, Houwen Peng, Han-Jia Ye
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.13360
Pdf URL: https://arxiv.org/pdf/2503.13360
Copy Paste: [[2503.13360]] Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning(https://arxiv.org/abs/2503.13360)
Keywords: large language model
Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated enhanced reasoning capabilities, evolving from Chain-of-Thought (CoT) prompting to advanced, product-oriented solutions like OpenAI o1. During our re-implementation of this model, we noticed that in multimodal tasks requiring visual input (e.g., geometry problems), Multimodal LLMs (MLLMs) struggle to maintain focus on the visual information, in other words, MLLMs suffer from a gradual decline in attention to visual information as reasoning progresses, causing text-over-relied outputs. To investigate this, we ablate image inputs during long-chain reasoning. Concretely, we truncate the reasoning process midway, then re-complete the reasoning process with the input image removed. We observe only a ~2% accuracy drop on MathVista's test-hard subset, revealing the model's textual outputs dominate the following reasoning process. Motivated by this, we propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages and compresses redundant visual tokens via dynamic pruning. This methodology helps the model retain attention to the visual components throughout the reasoning. Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks (+3.4% vs previous sota), demonstrating the effectiveness of TVC in enhancing multimodal reasoning systems.

Title: SyncDiff: Diffusion-based Talking Head Synthesis with Bottlenecked Temporal Visual Prior for Improved Synchronization

Authors: Xulin Fan, Heting Gao, Ziyi Chen, Peng Chang, Mei Han, Mark Hasegawa-Johnson
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.13371
Pdf URL: https://arxiv.org/pdf/2503.13371
Copy Paste: [[2503.13371]] SyncDiff: Diffusion-based Talking Head Synthesis with Bottlenecked Temporal Visual Prior for Improved Synchronization(https://arxiv.org/abs/2503.13371)
Keywords: diffusion
Abstract: Talking head synthesis, also known as speech-to-lip synthesis, reconstructs the facial motions that align with the given audio tracks. The synthesized videos are evaluated on mainly two aspects, lip-speech synchronization and image fidelity. Recent studies demonstrate that GAN-based and diffusion-based models achieve state-of-the-art (SOTA) performance on this task, with diffusion-based models achieving superior image fidelity but experiencing lower synchronization compared to their GAN-based counterparts. To this end, we propose SyncDiff, a simple yet effective approach to improve diffusion-based models using a temporal pose frame with information bottleneck and facial-informative audio features extracted from AVHuBERT, as conditioning input into the diffusion process. We evaluate SyncDiff on two canonical talking head datasets, LRS2 and LRS3 for direct comparison with other SOTA models. Experiments on LRS2/LRS3 datasets show that SyncDiff achieves a synchronization score 27.7%/62.3% relatively higher than previous diffusion-based methods, while preserving their high-fidelity characteristics.

Title: Cream of the Crop: Harvesting Rich, Scalable and Transferable Multi-Modal Data for Instruction Fine-Tuning

Authors: Mengyao Lyu, Yan Li, Huasong Zhong, Wenhao Yang, Hui Chen, Jungong Han, Guiguang Ding, Zhenheng Yang
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.13383
Pdf URL: https://arxiv.org/pdf/2503.13383
Copy Paste: [[2503.13383]] Cream of the Crop: Harvesting Rich, Scalable and Transferable Multi-Modal Data for Instruction Fine-Tuning(https://arxiv.org/abs/2503.13383)
Keywords: robust, large language model
Abstract: The hypothesis that pretrained large language models (LLMs) necessitate only minimal supervision during the fine-tuning (SFT) stage (Zhou et al., 2024) has been substantiated by recent advancements in data curation and selection research. However, their stability and generalizability are compromised due to the vulnerability to experimental setups and validation protocols, falling short of surpassing random sampling (Diddee & Ippolito, 2024; Xia et al., 2024b). Built upon LLMs, multi-modal LLMs (MLLMs), combined with the sheer token volume and heightened heterogeneity of data sources, amplify both the significance and complexity of data selection. To harvest multi-modal instructional data in a robust and efficient manner, we re-define the granularity of the quality metric by decomposing it into 14 vision-language-related capabilities, and introduce multi-modal rich scorers to evaluate the capabilities of each data candidate. To promote diversity, in light of the inherent objective of the alignment stage, we take interaction style as diversity indicator and use a multi-modal rich styler to identify data instruction patterns. In doing so, our multi-modal rich scorers and styler (mmSSR) guarantee that high-scoring information is conveyed to users in diversified forms. Free from embedding-based clustering or greedy sampling, mmSSR efficiently scales to millions of data with varying budget constraints, supports customization for general or specific capability acquisition, and facilitates training-free generalization to new domains for curation. Across 10+ experimental settings, validated by 14 multi-modal benchmarks, we demonstrate consistent improvements over random sampling, baseline strategies and state-of-the-art selection methods, achieving 99.1% of full performance with only 30% of the 2.6M data.

Title: Scale Efficient Training for Large Datasets

Authors: Qing Zhou, Junyu Gao, Qi Wang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.13385
Pdf URL: https://arxiv.org/pdf/2503.13385
Copy Paste: [[2503.13385]] Scale Efficient Training for Large Datasets(https://arxiv.org/abs/2503.13385)
Keywords: transformer, segmentation
Abstract: The rapid growth of dataset scales has been a key driver in advancing deep learning research. However, as dataset scale increases, the training process becomes increasingly inefficient due to the presence of low-value samples, including excessive redundant samples, overly challenging samples, and inefficient easy samples that contribute little to model this http URL address this challenge, we propose Scale Efficient Training (SeTa) for large datasets, a dynamic sample pruning approach that losslessly reduces training time. To remove low-value samples, SeTa first performs random pruning to eliminate redundant samples, then clusters the remaining samples according to their learning difficulty measured by loss. Building upon this clustering, a sliding window strategy is employed to progressively remove both overly challenging and inefficient easy clusters following an easy-to-hard this http URL conduct extensive experiments on large-scale synthetic datasets, including ToCa, SS1M, and ST+MJ, each containing over 3 million this http URL reduces training costs by up to 50\% while maintaining or improving performance, with minimal degradation even at 70\% cost reduction. Furthermore, experiments on various scale real datasets across various backbones (CNNs, Transformers, and Mambas) and diverse tasks (instruction tuning, multi-view stereo, geo-localization, composed image retrieval, referring image segmentation) demonstrate the powerful effectiveness and universality of our approach. Code is available at this https URL.

Title: Aligned Probing: Relating Toxic Behavior and Model Internals

Authors: Andreas Waldis, Vagrant Gautam, Anne Lauscher, Dietrich Klakow, Iryna Gurevych
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.13390
Pdf URL: https://arxiv.org/pdf/2503.13390
Copy Paste: [[2503.13390]] Aligned Probing: Relating Toxic Behavior and Model Internals(https://arxiv.org/abs/2503.13390)
Keywords: interpretability
Abstract: We introduce aligned probing, a novel interpretability framework that aligns the behavior of language models (LMs), based on their outputs, and their internal representations (internals). Using this framework, we examine over 20 OLMo, Llama, and Mistral models, bridging behavioral and internal perspectives for toxicity for the first time. Our results show that LMs strongly encode information about the toxicity level of inputs and subsequent outputs, particularly in lower layers. Focusing on how unique LMs differ offers both correlative and causal evidence that they generate less toxic output when strongly encoding information about the input toxicity. We also highlight the heterogeneity of toxicity, as model behavior and internals vary across unique attributes such as Threat. Finally, four case studies analyzing detoxification, multi-prompt evaluations, model quantization, and pre-training dynamics underline the practical impact of aligned probing with further concrete insights. Our findings contribute to a more holistic understanding of LMs, both within and beyond the context of toxicity.

Title: MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research

Authors: James Burgess, Jeffrey J Nirschl, Laura Bravo-Sánchez, Alejandro Lozano, Sanket Rajan Gupte, Jesus G. Galaz-Montoya, Yuhui Zhang, Yuchang Su, Disha Bhowmik, Zachary Coman, Sarina M. Hasan, Alexandra Johannesson, William D. Leineweber, Malvika G Nair, Ridhi Yarlagadda, Connor Zuraski, Wah Chiu, Sarah Cohen, Jan N. Hansen, Manuel D Leonetti, Chad Liu, Emma Lundberg, Serena Yeung-Levy
Subjects: cs.CV, cs.AI, cs.CL, cs.LG, q-bio.CB
Abstract URL: https://arxiv.org/abs/2503.13399
Pdf URL: https://arxiv.org/pdf/2503.13399
Copy Paste: [[2503.13399]] MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research(https://arxiv.org/abs/2503.13399)
Keywords: large language model
Abstract: Scientific research demands sophisticated reasoning over multimodal data, a challenge especially prevalent in biology. Despite recent advances in multimodal large language models (MLLMs) for AI-assisted research, existing multimodal reasoning benchmarks only target up to college-level difficulty, while research-level benchmarks emphasize lower-level perception, falling short of the complex multimodal reasoning needed for scientific discovery. To bridge this gap, we introduce MicroVQA, a visual-question answering (VQA) benchmark designed to assess three reasoning capabilities vital in research workflows: expert image understanding, hypothesis generation, and experiment proposal. MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities, ensuring VQA samples represent real scientific practice. In constructing the benchmark, we find that standard MCQ generation methods induce language shortcuts, motivating a new two-stage pipeline: an optimized LLM prompt structures question-answer pairs into MCQs; then, an agent-based `RefineBot' updates them to remove shortcuts. Benchmarking on state-of-the-art MLLMs reveal a peak performance of 53\%; models with smaller LLMs only slightly underperform top models, suggesting that language-based reasoning is less challenging than multimodal reasoning; and tuning with scientific articles enhances performance. Expert analysis of chain-of-thought responses shows that perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors. These insights highlight the challenges in multimodal scientific reasoning, showing MicroVQA is a valuable resource advancing AI-driven biomedical research. MicroVQA is available at this https URL, and project page at this https URL.

Title: Using the Tools of Cognitive Science to Understand Large Language Models at Different Levels of Analysis

Authors: Alexander Ku, Declan Campbell, Xuechunzi Bai, Jiayi Geng, Ryan Liu, Raja Marjieh, R. Thomas McCoy, Andrew Nam, Ilia Sucholutsky, Veniamin Veselovsky, Liyi Zhang, Jian-Qiao Zhu, Thomas L. Griffiths
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13401
Pdf URL: https://arxiv.org/pdf/2503.13401
Copy Paste: [[2503.13401]] Using the Tools of Cognitive Science to Understand Large Language Models at Different Levels of Analysis(https://arxiv.org/abs/2503.13401)
Keywords: large language model
Abstract: Modern artificial intelligence systems, such as large language models, are increasingly powerful but also increasingly hard to understand. Recognizing this problem as analogous to the historical difficulties in understanding the human mind, we argue that methods developed in cognitive science can be useful for understanding large language models. We propose a framework for applying these methods based on Marr's three levels of analysis. By revisiting established cognitive science techniques relevant to each level and illustrating their potential to yield insights into the behavior and internal organization of large language models, we aim to provide a toolkit for making sense of these new kinds of minds.

Title: DLPO: Towards a Robust, Efficient, and Generalizable Prompt Optimization Framework from a Deep-Learning Perspective

Authors: Dengyun Peng, Yuhang Zhou, Qiguang Chen, Jinhao Liu, Jingjing Chen, Libo Qin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13413
Pdf URL: https://arxiv.org/pdf/2503.13413
Copy Paste: [[2503.13413]] DLPO: Towards a Robust, Efficient, and Generalizable Prompt Optimization Framework from a Deep-Learning Perspective(https://arxiv.org/abs/2503.13413)
Keywords: robust, large language model
Abstract: Large Language Models (LLMs) have achieved remarkable success across diverse tasks, largely driven by well-designed prompts. However, crafting and selecting such prompts often requires considerable human effort, significantly limiting its scalability. To mitigate this, recent studies have explored automated prompt optimization as a promising solution. Despite these efforts, existing methods still face critical challenges in robustness, efficiency, and generalization. To systematically address these challenges, we first conduct an empirical analysis to identify the limitations of current reflection-based prompt optimization paradigm. Building on these insights, we propose 7 innovative approaches inspired by traditional deep learning paradigms for prompt optimization (DLPO), seamlessly integrating these concepts into text-based gradient optimization. Through these advancements, we progressively tackle the aforementioned challenges and validate our methods through extensive experimentation. We hope our study not only provides valuable guidance for future research but also offers a comprehensive understanding of the challenges and potential solutions in prompt optimization. Our code is available at this https URL.

Title: Securing Virtual Reality Experiences: Unveiling and Tackling Cybersickness Attacks with Explainable AI

Authors: Ripan Kumar Kundu, Matthew Denton, Genova Mongalo, Prasad Calyam, Khaza Anuarul Hoque
Subjects: cs.CR, cs.AI, cs.ET, cs.HC
Abstract URL: https://arxiv.org/abs/2503.13419
Pdf URL: https://arxiv.org/pdf/2503.13419
Copy Paste: [[2503.13419]] Securing Virtual Reality Experiences: Unveiling and Tackling Cybersickness Attacks with Explainable AI(https://arxiv.org/abs/2503.13419)
Keywords: attack
Abstract: The synergy between virtual reality (VR) and artificial intelligence (AI), specifically deep learning (DL)-based cybersickness detection models, has ushered in unprecedented advancements in immersive experiences by automatically detecting cybersickness severity and adaptively various mitigation techniques, offering a smooth and comfortable VR experience. While this DL-enabled cybersickness detection method provides promising solutions for enhancing user experiences, it also introduces new risks since these models are vulnerable to adversarial attacks; a small perturbation of the input data that is visually undetectable to human observers can fool the cybersickness detection model and trigger unexpected mitigation, thus disrupting user immersive experiences (UIX) and even posing safety risks. In this paper, we present a new type of VR attack, i.e., a cybersickness attack, which successfully stops the triggering of cybersickness mitigation by fooling DL-based cybersickness detection models and dramatically hinders the UIX. Next, we propose a novel explainable artificial intelligence (XAI)-guided cybersickness attack detection framework to detect such attacks in VR to ensure UIX and a comfortable VR experience. We evaluate the proposed attack and the detection framework using two state-of-the-art open-source VR cybersickness datasets: Simulation 2021 and Gameplay dataset. Finally, to verify the effectiveness of our proposed method, we implement the attack and the XAI-based detection using a testbed with a custom-built VR roller coaster simulation with an HTC Vive Pro Eye headset and perform a user study. Our study shows that such an attack can dramatically hinder the UIX. However, our proposed XAI-guided cybersickness attack detection can successfully detect cybersickness attacks and trigger the proper mitigation, effectively reducing VR cybersickness.

Title: SuperBPE: Space Travel for Language Models

Authors: Alisa Liu, Jonathan Hayase, Valentin Hofmann, Sewoong Oh, Noah A. Smith, Yejin Choi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.13423
Pdf URL: https://arxiv.org/pdf/2503.13423
Copy Paste: [[2503.13423]] SuperBPE: Space Travel for Language Models(https://arxiv.org/abs/2503.13423)
Keywords: transformer, segmentation
Abstract: The assumption across nearly all language model (LM) tokenization schemes is that tokens should be subwords, i.e., contained within word boundaries. While providing a seemingly reasonable inductive bias, is this common practice limiting the potential of modern LMs? Whitespace is not a reliable delimiter of meaning, as evidenced by multi-word expressions (e.g., "by the way"), crosslingual variation in the number of words needed to express a concept (e.g., "spacesuit helmet" in German is "raumanzughelm"), and languages that do not use whitespace at all (e.g., Chinese). To explore the potential of tokenization beyond subwords, we introduce a "superword" tokenizer, SuperBPE, which incorporates a simple pretokenization curriculum into the byte-pair encoding (BPE) algorithm to first learn subwords, then superwords that bridge whitespace. This brings dramatic improvements in encoding efficiency: when fixing the vocabulary size to 200k, SuperBPE encodes a fixed piece of text with up to 33% fewer tokens than BPE on average. In experiments, we pretrain 8B transformer LMs from scratch while fixing the model size, vocabulary size, and train compute, varying *only* the algorithm for learning the vocabulary. Our model trained with SuperBPE achieves an average +4.0% absolute improvement over the BPE baseline across 30 downstream tasks (including +8.2% on MMLU), while simultaneously requiring 27% less compute at inference time. In analysis, we find that SuperBPE results in segmentations of text that are more uniform in per-token difficulty. Qualitatively, this may be because SuperBPE tokens often capture common multi-word expressions that function semantically as a single unit. SuperBPE is a straightforward, local modification to tokenization that improves both encoding efficiency and downstream performance, yielding better language models overall.

Title: Infinite Mobility: Scalable High-Fidelity Synthesis of Articulated Objects via Procedural Generation

Authors: Xinyu Lian, Zichao Yu, Ruiming Liang, Yitong Wang, Li Ray Luo, Kaixu Chen, Yuanzhen Zhou, Qihong Tang, Xudong Xu, Zhaoyang Lyu, Bo Dai, Jiangmiao Pang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13424
Pdf URL: https://arxiv.org/pdf/2503.13424
Copy Paste: [[2503.13424]] Infinite Mobility: Scalable High-Fidelity Synthesis of Articulated Objects via Procedural Generation(https://arxiv.org/abs/2503.13424)
Keywords: generative
Abstract: Large-scale articulated objects with high quality are desperately needed for multiple tasks related to embodied AI. Most existing methods for creating articulated objects are either data-driven or simulation based, which are limited by the scale and quality of the training data or the fidelity and heavy labour of the simulation. In this paper, we propose Infinite Mobility, a novel method for synthesizing high-fidelity articulated objects through procedural generation. User study and quantitative evaluation demonstrate that our method can produce results that excel current state-of-the-art methods and are comparable to human-annotated datasets in both physics property and mesh quality. Furthermore, we show that our synthetic data can be used as training data for generative models, enabling next-step scaling up. Code is available at this https URL

Title: xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference

Authors: Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Richard Kurle, Patrick M. Blies, Günter Klambauer, Sebastian Böck, Sepp Hochreiter
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.13427
Pdf URL: https://arxiv.org/pdf/2503.13427
Copy Paste: [[2503.13427]] xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference(https://arxiv.org/abs/2503.13427)
Keywords: transformer, large language model
Abstract: Recent breakthroughs in solving reasoning, math and coding problems with Large Language Models (LLMs) have been enabled by investing substantial computation budgets at inference time. Therefore, inference speed is one of the most critical properties of LLM architectures, and there is a growing need for LLMs that are efficient and fast at inference. Recently, LLMs built on the xLSTM architecture have emerged as a powerful alternative to Transformers, offering linear compute scaling with sequence length and constant memory usage, both highly desirable properties for efficient inference. However, such xLSTM-based LLMs have yet to be scaled to larger models and assessed and compared with respect to inference speed and efficiency. In this work, we introduce xLSTM 7B, a 7-billion-parameter LLM that combines xLSTM's architectural benefits with targeted optimizations for fast and efficient inference. Our experiments demonstrate that xLSTM 7B achieves performance on downstream tasks comparable to other similar-sized LLMs, while providing significantly faster inference speeds and greater efficiency compared to Llama- and Mamba-based LLMs. These results establish xLSTM 7B as the fastest and most efficient 7B LLM, offering a solution for tasks that require large amounts of test-time computation. Our work highlights xLSTM's potential as a foundational architecture for methods building on heavy use of LLM inference. Our model weights, model code and training code are open-source.

Title: Escaping Plato's Cave: Robust Conceptual Reasoning through Interpretable 3D Neural Object Volumes

Authors: Nhi Pham, Bernt Schiele, Adam Kortylewski, Jonas Fischer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13429
Pdf URL: https://arxiv.org/pdf/2503.13429
Copy Paste: [[2503.13429]] Escaping Plato's Cave: Robust Conceptual Reasoning through Interpretable 3D Neural Object Volumes(https://arxiv.org/abs/2503.13429)
Keywords: robust, interpretability
Abstract: With the rise of neural networks, especially in high-stakes applications, these networks need two properties (i) robustness and (ii) interpretability to ensure their safety. Recent advances in classifiers with 3D volumetric object representations have demonstrated a greatly enhanced robustness in out-of-distribution data. However, these 3D-aware classifiers have not been studied from the perspective of interpretability. We introduce CAVE - Concept Aware Volumes for Explanations - a new direction that unifies interpretability and robustness in image classification. We design an inherently-interpretable and robust classifier by extending existing 3D-aware classifiers with concepts extracted from their volumetric representations for classification. In an array of quantitative metrics for interpretability, we compare against different concept-based approaches across the explainable AI literature and show that CAVE discovers well-grounded concepts that are used consistently across images, while achieving superior robustness.

Title: Measuring In-Context Computation Complexity via Hidden State Prediction

Authors: Vincent Herrmann, Róbert Csordás, Jürgen Schmidhuber
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.13431
Pdf URL: https://arxiv.org/pdf/2503.13431
Copy Paste: [[2503.13431]] Measuring In-Context Computation Complexity via Hidden State Prediction(https://arxiv.org/abs/2503.13431)
Keywords: transformer
Abstract: Detecting when a neural sequence model does "interesting" computation is an open problem. The next token prediction loss is a poor indicator: Low loss can stem from trivially predictable sequences that are uninteresting, while high loss may reflect unpredictable but also irrelevant information that can be ignored by the model. We propose a better metric: measuring the model's ability to predict its own future hidden states. We show empirically that this metric -- in contrast to the next token prediction loss -- correlates with the intuitive interestingness of the task. To measure predictability, we introduce the architecture-agnostic "prediction of hidden states" (PHi) layer that serves as an information bottleneck on the main pathway of the network (e.g., the residual stream in Transformers). We propose a novel learned predictive prior that enables us to measure the novel information gained in each computation step, which serves as our metric. We show empirically that our metric predicts the description length of formal languages learned in-context, the complexity of mathematical reasoning problems, and the correctness of self-generated reasoning chains.

Title: Uncovering Utility Functions from Observed Outcomes

Authors: Marta Grzeskiewicz
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.13432
Pdf URL: https://arxiv.org/pdf/2503.13432
Copy Paste: [[2503.13432]] Uncovering Utility Functions from Observed Outcomes(https://arxiv.org/abs/2503.13432)
Keywords: extraction
Abstract: Determining consumer preferences and utility is a foundational challenge in economics. They are central in determining consumer behaviour through the utility-maximising consumer decision-making process. However, preferences and utilities are not observable and may not even be known to the individual making the choice; only the outcome is observed in the form of demand. Without the ability to observe the decision-making mechanism, demand estimation becomes a challenging task and current methods fall short due to lack of scalability or ability to identify causal effects. Estimating these effects is critical when considering changes in policy, such as pricing, the impact of taxes and subsidies, and the effect of a tariff. To address the shortcomings of existing methods, we combine revealed preference theory and inverse reinforcement learning to present a novel algorithm, Preference Extraction and Reward Learning (PEARL) which, to the best of our knowledge, is the only algorithm that can uncover a representation of the utility function that best rationalises observed consumer choice data given a specified functional form. We introduce a flexible utility function, the Input-Concave Neural Network which captures complex relationships across goods, including cross-price elasticities. Results show PEARL outperforms the benchmark on both noise-free and noisy synthetic data.

Title: Less Biased Noise Scale Estimation for Threshold-Robust RANSAC

Authors: Johan Edstedt
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13433
Pdf URL: https://arxiv.org/pdf/2503.13433
Copy Paste: [[2503.13433]] Less Biased Noise Scale Estimation for Threshold-Robust RANSAC(https://arxiv.org/abs/2503.13433)
Keywords: robust
Abstract: The gold-standard for robustly estimating relative pose through image matching is RANSAC. While RANSAC is powerful, it requires setting the inlier threshold that determines whether the error of a correspondence under an estimated model is sufficiently small to be included in its consensus set. Setting this threshold is typically done by hand, and is difficult to tune without a access to ground truth data. Thus, a method capable of automatically determining the optimal threshold would be desirable. In this paper we revisit inlier noise scale estimation, which is an attractive approach as the inlier noise scale is linear to the optimal threshold. We revisit the noise scale estimation method SIMFIT and find bias in the estimate of the noise scale. In particular, we fix underestimates from using the same data for fitting the model as estimating the inlier noise, and from not taking the threshold itself into account. Secondly, since the optimal threshold within a scene is approximately constant we propose a multi-pair extension of SIMFIT++, by filtering of estimates, which improves results. Our approach yields robust performance across a range of thresholds, shown in Figure 1.

Title: BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing

Authors: Yaowei Li, Lingen Li, Zhaoyang Zhang, Xiaoyu Li, Guangzhi Wang, Hongxiang Li, Xiaodong Cun, Ying Shan, Yuexian Zou
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2503.13434
Pdf URL: https://arxiv.org/pdf/2503.13434
Copy Paste: [[2503.13434]] BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing(https://arxiv.org/abs/2503.13434)
Keywords: diffusion
Abstract: Element-level visual manipulation is essential in digital content creation, but current diffusion-based methods lack the precision and flexibility of traditional tools. In this work, we introduce BlobCtrl, a framework that unifies element-level generation and editing using a probabilistic blob-based representation. By employing blobs as visual primitives, our approach effectively decouples and represents spatial location, semantic content, and identity information, enabling precise element-level manipulation. Our key contributions include: 1) a dual-branch diffusion architecture with hierarchical feature fusion for seamless foreground-background integration; 2) a self-supervised training paradigm with tailored data augmentation and score functions; and 3) controllable dropout strategies to balance fidelity and diversity. To support further research, we introduce BlobData for large-scale training and BlobBench for systematic evaluation. Experiments show that BlobCtrl excels in various element-level manipulation tasks while maintaining computational efficiency, offering a practical solution for precise and flexible visual content creation. Project page: this https URL

Title: Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images

Authors: Tianhao Wu, Chuanxia Zheng, Frank Guan, Andrea Vedaldi, Tat-Jen Cham
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13439
Pdf URL: https://arxiv.org/pdf/2503.13439
Copy Paste: [[2503.13439]] Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images(https://arxiv.org/abs/2503.13439)
Keywords: generative
Abstract: Most image-based 3D object reconstructors assume that objects are fully visible, ignoring occlusions that commonly occur in real-world scenarios. In this paper, we introduce Amodal3R, a conditional 3D generative model designed to reconstruct 3D objects from partial observations. We start from a "foundation" 3D generative model and extend it to recover plausible 3D geometry and appearance from occluded objects. We introduce a mask-weighted multi-head cross-attention mechanism followed by an occlusion-aware attention layer that explicitly leverages occlusion priors to guide the reconstruction process. We demonstrate that, by training solely on synthetic data, Amodal3R learns to recover full 3D objects even in the presence of occlusions in real scenes. It substantially outperforms existing methods that independently perform 2D amodal completion followed by 3D reconstruction, thereby establishing a new benchmark for occlusion-aware 3D reconstruction.

Title: MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling

Authors: Yingyue Li, Bencheng Liao, Wenyu Liu, Xinggang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13440
Pdf URL: https://arxiv.org/pdf/2503.13440
Copy Paste: [[2503.13440]] MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling(https://arxiv.org/abs/2503.13440)
Keywords: transformer
Abstract: With the advancement of RNN models with linear complexity, the quadratic complexity challenge of transformers has the potential to be overcome. Notably, the emerging Mamba-2 has demonstrated competitive performance, bridging the gap between RNN models and transformers. However, due to sequential processing and vanishing gradients, RNN models struggle to capture long-range dependencies, limiting contextual understanding. This results in slow convergence, high resource demands, and poor performance on downstream understanding and complex reasoning tasks. In this work, we present a hybrid model MaTVLM by substituting a portion of the transformer decoder layers in a pre-trained VLM with Mamba-2 layers. Leveraging the inherent relationship between attention and Mamba-2, we initialize Mamba-2 with corresponding attention weights to accelerate convergence. Subsequently, we employ a single-stage distillation process, using the pre-trained VLM as the teacher model to transfer knowledge to the MaTVLM, further enhancing convergence speed and performance. Furthermore, we investigate the impact of differential distillation loss within our training framework. We evaluate the MaTVLM on multiple benchmarks, demonstrating competitive performance against the teacher model and existing VLMs while surpassing both Mamba-based VLMs and models of comparable parameter scales. Remarkably, the MaTVLM achieves up to 3.6x faster inference than the teacher model while reducing GPU memory consumption by 27.5%, all without compromising performance. Code and models are released at this http URL.

Title: DPC: Dual-Prompt Collaboration for Tuning Vision-Language Models

Authors: Haoyang Li, Liang Wang, Chao Wang, Jing Jiang, Yan Peng, Guodong Long
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13443
Pdf URL: https://arxiv.org/pdf/2503.13443
Copy Paste: [[2503.13443]] DPC: Dual-Prompt Collaboration for Tuning Vision-Language Models(https://arxiv.org/abs/2503.13443)
Keywords: interpretability
Abstract: The Base-New Trade-off (BNT) problem universally exists during the optimization of CLIP-based prompt tuning, where continuous fine-tuning on base (target) classes leads to a simultaneous decrease of generalization ability on new (unseen) classes. Existing approaches attempt to regulate the prompt tuning process to balance BNT by appending constraints. However, imposed on the same target prompt, these constraints fail to fully avert the mutual exclusivity between the optimization directions for base and new. As a novel solution to this challenge, we propose the plug-and-play Dual-Prompt Collaboration (DPC) framework, the first that decoupling the optimization processes of base and new tasks at the prompt level. Specifically, we clone a learnable parallel prompt based on the backbone prompt, and introduce a variable Weighting-Decoupling framework to independently control the optimization directions of dual prompts specific to base or new tasks, thus avoiding the conflict in generalization. Meanwhile, we propose a Dynamic Hard Negative Optimizer, utilizing dual prompts to construct a more challenging optimization task on base classes for enhancement. For interpretability, we prove the feature channel invariance of the prompt vector during the optimization process, providing theoretical support for the Weighting-Decoupling of DPC. Extensive experiments on multiple backbones demonstrate that DPC can significantly improve base performance without introducing any external knowledge beyond the base classes, while maintaining generalization to new classes. Code is available at: this https URL.

Title: VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning

Authors: Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, Mike Zheng Shou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13444
Pdf URL: https://arxiv.org/pdf/2503.13444
Copy Paste: [[2503.13444]] VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning(https://arxiv.org/abs/2503.13444)
Keywords: large language model
Abstract: Videos, with their unique temporal dimension, demand precise grounded understanding, where answers are directly linked to visual, interpretable evidence. Despite significant breakthroughs in reasoning capabilities within Large Language Models, multi-modal reasoning - especially for videos - remains unexplored. In this work, we introduce VideoMind, a novel video-language agent designed for temporal-grounded video understanding. VideoMind incorporates two key innovations: (i) We identify essential capabilities for video temporal reasoning and develop a role-based agentic workflow, including a planner for coordinating different roles, a grounder for temporal localization, a verifier to assess temporal interval accuracy, and an answerer for question-answering. (ii) To efficiently integrate these diverse roles, we propose a novel Chain-of-LoRA strategy, enabling seamless role-switching via lightweight LoRA adaptors while avoiding the overhead of multiple models, thus balancing efficiency and flexibility. Extensive experiments on 14 public benchmarks demonstrate that our agent achieves state-of-the-art performance on diverse video understanding tasks, including 3 on grounded video question-answering, 6 on video temporal grounding, and 5 on general video question-answering, underscoring its effectiveness in advancing video agent and long-form temporal reasoning.

Title: Faithfulness of LLM Self-Explanations for Commonsense Tasks: Larger Is Better, and Instruction-Tuning Allows Trade-Offs but Not Pareto Dominance

Authors: Noah Y. Siegel, Nicolas Heess, Maria Perez-Ortiz, Oana-Maria Camburu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13445
Pdf URL: https://arxiv.org/pdf/2503.13445
Copy Paste: [[2503.13445]] Faithfulness of LLM Self-Explanations for Commonsense Tasks: Larger Is Better, and Instruction-Tuning Allows Trade-Offs but Not Pareto Dominance(https://arxiv.org/abs/2503.13445)
Keywords: large language model
Abstract: As large language models (LLMs) become increasingly capable, ensuring that their self-generated explanations are faithful to their internal decision-making process is critical for safety and oversight. In this work, we conduct a comprehensive counterfactual faithfulness analysis across 62 models from 8 families, encompassing both pretrained and instruction-tuned variants and significantly extending prior studies of counterfactual tests. We introduce phi-CCT, a simplified variant of the Correlational Counterfactual Test, which avoids the need for token probabilities while explaining most of the variance of the original test. Our findings reveal clear scaling trends: larger models are consistently more faithful on our metrics. However, when comparing instruction-tuned and human-imitated explanations, we find that observed differences in faithfulness can often be attributed to explanation verbosity, leading to shifts along the true-positive/false-positive Pareto frontier. While instruction-tuning and prompting can influence this trade-off, we find limited evidence that they fundamentally expand the frontier of explanatory faithfulness beyond what is achievable with pretrained models of comparable size. Our analysis highlights the nuanced relationship between instruction-tuning, verbosity, and the faithful representation of model decision processes.

Title: MetaScale: Test-Time Scaling with Evolving Meta-Thoughts

Authors: Qin Liu, Wenxuan Zhou, Nan Xu, James Y. Huang, Fei Wang, Sheng Zhang, Hoifung Poon, Muhao Chen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.13447
Pdf URL: https://arxiv.org/pdf/2503.13447
Copy Paste: [[2503.13447]] MetaScale: Test-Time Scaling with Evolving Meta-Thoughts(https://arxiv.org/abs/2503.13447)
Keywords: large language model
Abstract: One critical challenge for large language models (LLMs) for making complex reasoning is their reliance on matching reasoning patterns from training data, instead of proactively selecting the most appropriate cognitive strategy to solve a given task. Existing approaches impose fixed cognitive structures that enhance performance in specific tasks but lack adaptability across diverse scenarios. To address this limitation, we introduce METASCALE, a test-time scaling framework based on meta-thoughts -- adaptive thinking strategies tailored to each task. METASCALE initializes a pool of candidate meta-thoughts, then iteratively selects and evaluates them using a multi-armed bandit algorithm with upper confidence bound selection, guided by a reward model. To further enhance adaptability, a genetic algorithm evolves high-reward meta-thoughts, refining and extending the strategy pool over time. By dynamically proposing and optimizing meta-thoughts at inference time, METASCALE improves both accuracy and generalization across a wide range of tasks. Experimental results demonstrate that MetaScale consistently outperforms standard inference approaches, achieving an 11% performance gain in win rate on Arena-Hard for GPT-4o, surpassing o1-mini by 0.9% under style control. Notably, METASCALE scales more effectively with increasing sampling budgets and produces more structured, expert-level responses.