2025-03-26

Title: A Survey on Structured State Space Sequence (S4) Models

Authors: Shriyank Somvanshi, Md Monzurul Islam, Mahmuda Sultana Mimi, Sazzad Bin Bashar Polock, Gaurab Chhetri, Subasish Das
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.18970
Pdf URL: https://arxiv.org/pdf/2503.18970
Copy Paste: [[2503.18970]] A Survey on Structured State Space Sequence (S4) Models(https://arxiv.org/abs/2503.18970)
Keywords: interpretability, transformer
Abstract: Recent advancements in sequence modeling have led to the emergence of Structured State Space Models (SSMs) as an efficient alternative to Recurrent Neural Networks (RNNs) and Transformers, addressing challenges in long-range dependency modeling and computational efficiency. While RNNs suffer from vanishing gradients and sequential inefficiencies, and Transformers face quadratic complexity, SSMs leverage structured recurrence and state-space representations to achieve superior long-sequence processing with linear or near-linear complexity. This survey provides a comprehensive review of SSMs, tracing their evolution from the foundational S4 model to its successors like Mamba, Simplified Structured State Space Sequence Model (S5), and Jamba, highlighting their improvements in computational efficiency, memory optimization, and inference speed. By comparing SSMs with traditional sequence models across domains such as natural language processing (NLP), speech recognition, vision, and time-series forecasting, we demonstrate their advantages in handling long-range dependencies while reducing computational overhead. Despite their potential, challenges remain in areas such as training optimization, hybrid modeling, and interpretability. This survey serves as a structured guide for researchers and practitioners, detailing the advancements, trade-offs, and future directions of SSM-based architectures in AI and deep learning.

Title: FedSKD: Aggregation-free Model-heterogeneous Federated Learning using Multi-dimensional Similarity Knowledge Distillation

Authors: Ziqiao Weng, Weidong Cai, Bo Zhou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18981
Pdf URL: https://arxiv.org/pdf/2503.18981
Copy Paste: [[2503.18981]] FedSKD: Aggregation-free Model-heterogeneous Federated Learning using Multi-dimensional Similarity Knowledge Distillation(https://arxiv.org/abs/2503.18981)
Keywords: privacy, robust, federate
Abstract: Federated learning (FL) enables privacy-preserving collaborative model training without direct data sharing. Model-heterogeneous FL (MHFL) extends this paradigm by allowing clients to train personalized models with heterogeneous architectures tailored to their computational resources and application-specific needs. However, existing MHFL methods predominantly rely on centralized aggregation, which introduces scalability and efficiency bottlenecks, or impose restrictions requiring partially identical model architectures across clients. While peer-to-peer (P2P) FL removes server dependence, it suffers from model drift and knowledge dilution, limiting its effectiveness in heterogeneous settings. To address these challenges, we propose FedSKD, a novel MHFL framework that facilitates direct knowledge exchange through round-robin model circulation, eliminating the need for centralized aggregation while allowing fully heterogeneous model architectures across clients. FedSKD's key innovation lies in multi-dimensional similarity knowledge distillation, which enables bidirectional cross-client knowledge transfer at batch, pixel/voxel, and region levels for heterogeneous models in FL. This approach mitigates catastrophic forgetting and model drift through progressive reinforcement and distribution alignment while preserving model heterogeneity. Extensive evaluations on fMRI-based autism spectrum disorder diagnosis and skin lesion classification demonstrate that FedSKD outperforms state-of-the-art heterogeneous and homogeneous FL baselines, achieving superior personalization (client-specific accuracy) and generalization (cross-institutional adaptability). These findings underscore FedSKD's potential as a scalable and robust solution for real-world medical federated learning applications.

Title: Generative Data Imputation for Sparse Learner Performance Data Using Generative Adversarial Imputation Networks

Authors: Liang Zhang, Jionghao Lin, John Sabatini, Diego Zapata-Rivera, Carol Forsyth, Yang Jiang, John Hollander, Xiangen Hu, Arthur C. Graesser
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18982
Pdf URL: https://arxiv.org/pdf/2503.18982
Copy Paste: [[2503.18982]] Generative Data Imputation for Sparse Learner Performance Data Using Generative Adversarial Imputation Networks(https://arxiv.org/abs/2503.18982)
Keywords: robust, generative
Abstract: Learner performance data collected by Intelligent Tutoring Systems (ITSs), such as responses to questions, is essential for modeling and predicting learners' knowledge states. However, missing responses due to skips or incomplete attempts create data sparsity, challenging accurate assessment and personalized instruction. To address this, we propose a generative imputation approach using Generative Adversarial Imputation Networks (GAIN). Our method features a three-dimensional (3D) framework (learners, questions, and attempts), flexibly accommodating various sparsity levels. Enhanced by convolutional neural networks and optimized with a least squares loss function, the GAIN-based method aligns input and output dimensions to question-attempt matrices along the learners' dimension. Extensive experiments using datasets from AutoTutor Adult Reading Comprehension (ARC), ASSISTments, and MATHia demonstrate that our approach significantly outperforms tensor factorization and alternative GAN methods in imputation accuracy across different attempt scenarios. Bayesian Knowledge Tracing (BKT) further validates the effectiveness of the imputed data by estimating learning parameters: initial knowledge (P(L0)), learning rate (P(T)), guess rate (P(G)), and slip rate (P(S)). Results indicate the imputed data enhances model fit and closely mirrors original distributions, capturing underlying learning behaviors reliably. Kullback-Leibler (KL) divergence assessments confirm minimal divergence, showing the imputed data preserves essential learning characteristics effectively. These findings underscore GAIN's capability as a robust imputation tool in ITSs, alleviating data sparsity and supporting adaptive, individualized instruction, ultimately leading to more precise and responsive learner assessments and improved educational outcomes.

Title: SplitFrozen: Split Learning with Device-side Model Frozen for Fine-Tuning LLM on Heterogeneous Resource-Constrained Devices

Authors: Jian Ma, Xinchen Lyu, Jun Jiang, Qimei Cui, Haipeng Yao, Xiaofeng Tao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18986
Pdf URL: https://arxiv.org/pdf/2503.18986
Copy Paste: [[2503.18986]] SplitFrozen: Split Learning with Device-side Model Frozen for Fine-Tuning LLM on Heterogeneous Resource-Constrained Devices(https://arxiv.org/abs/2503.18986)
Keywords: large language model
Abstract: Fine-tuning large language models (LLMs) on private, on-device data can empower tailored personalized AI agents. However, fine-tuning LLMs on resource-constrained edge devices faces significant challenges, including excessive computation overhead, device heterogeneity, and data imbalance. This paper proposes SplitFrozen, a split learning framework that enables efficient LLM fine-tuning by strategically freezing device-side model layers while centralizing parameter-efficient fine-tuning on the server. Our framework partitions LLMs into device-side frozen layers and server-side fine-tuning layers, where heterogeneous resource-constrained devices execute only forward propagation. To minimize server-side training costs, we integrate Low-Rank Adaptation (LoRA) into the server-side layers. A pipeline parallelism strategy further optimizes training efficiency by decoupling device-server computations and leveraging decomposed backward propagation. Experiments on GPT-2 with the MRPC, MNLI-matched, and SST-2 datasets demonstrate that SplitFrozen outperforms FedLoRA and SplitLoRA by 69.4\% model accuracy under extremely imbalanced data, while reducing up to 86.8\% device-side computations and 50.2\% total training time. Experiments also validate the scalability of SplitFrozen on content generation task using Llama-3.2 model on GSM8K dataset.

Title: A Novel Hat-Shaped Device-Cloud Collaborative Inference Framework for Large Language Models

Authors: Zuan Xie, Yang Xu, Hongli Xu, Yunming Liao, Zhiwei Yao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.18989
Pdf URL: https://arxiv.org/pdf/2503.18989
Copy Paste: [[2503.18989]] A Novel Hat-Shaped Device-Cloud Collaborative Inference Framework for Large Language Models(https://arxiv.org/abs/2503.18989)
Keywords: privacy, large language model
Abstract: Recent advancements in large language models (LLMs) have catalyzed a substantial surge in demand for LLM services. While traditional cloud-based LLM services satisfy high-accuracy requirements, they fall short in meeting critical demands for low delay and enhanced privacy. To address these limitations, we propose HAT, a novel device-cloud collaborative inference framework that leverages the complementary strengths of U-shaped inference and speculative decoding. HAT partitions the LLM into three submodels, and the input and output submodels, stacked with a lightweight adapter network, are deployed as a small language model (SLM) on each end device. Meanwhile, the middle submodel, encompassing the majority of the LLM's decoder layers, is hosted in the cloud to perform speculative decoding with on-device SLMs. During inference, HAT exchanges hidden states (rather than raw tokens) of input or draft tokens between devices and the cloud, thereby incurring substantial communication delays. Besides, processing hidden states of long prompts will exacerbate computation delays in the cloud, further compromising inference efficiency. To improve efficiency, we introduce a prompt chunking mechanism that segments long prompts into shorter chunks, enabling parallel transmission and processing. Furthermore, HAT is implemented to dynamically determine optimal chunk sizes for devices handling long prompts, thereby improving overall inference speed. Extensive experiments are conducted on a physical testbed comprising 30 NVIDIA Jetson devices and a server with 8 NVIDIA A6000 GPUs. Experimental results demonstrate that HAT achieves promising performance improvements, reducing TTFT by 41% to 54% and TBT by 41% to 77% compared to the baselines.

Title: SRMIR: Shadow Reward Models Based on Introspective Reasoning for LLM Alignment

Authors: Ruoxi Cheng, Shuirong Cao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.18991
Pdf URL: https://arxiv.org/pdf/2503.18991
Copy Paste: [[2503.18991]] SRMIR: Shadow Reward Models Based on Introspective Reasoning for LLM Alignment(https://arxiv.org/abs/2503.18991)
Keywords: attack, membership infer, large language model
Abstract: Aligning large language models (LLMs) with human preferences and values is vital for application. However, current alignment methods face three main limitations: (1) reliance on costly human annotation; (2) alignment tax; (3) shallow alignment vulnerable to jailbreak attacks. Additionally, current alignment datasets often suffer from uneven distributions, leading to overrepresentation of some topics and neglect of others. To address these issues, we propose SRMIR (Shadow Reward Models Based on Introspective Reasoning), inspired by shadow models in membership inference attacks. We first construct a balanced safety Chain of Draft (CoD) dataset across $7$ harmful types with structured prompt leveraging the introspective reasoning capabilities of LLMs, then train a set of specialized reward models to guide policy optimization through Group Relative Policy Optimization (GRPO). We apply two strategies, linear combination and categorized approach, to integrate shadow reward models for policy optimization. By comparison, we find that the latter achieves superior alignment despite higher computational costs. Experiments across several LLMs demonstrate SRMIR significantly outperforms existing methods.

Title: Improving Food Image Recognition with Noisy Vision Transformer

Authors: Tonmoy Ghosh, Edward Sazonov
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.18997
Pdf URL: https://arxiv.org/pdf/2503.18997
Copy Paste: [[2503.18997]] Improving Food Image Recognition with Noisy Vision Transformer(https://arxiv.org/abs/2503.18997)
Keywords: transformer
Abstract: Food image recognition is a challenging task in computer vision due to the high variability and complexity of food images. In this study, we investigate the potential of Noisy Vision Transformers (NoisyViT) for improving food classification performance. By introducing noise into the learning process, NoisyViT reduces task complexity and adjusts the entropy of the system, leading to enhanced model accuracy. We fine-tune NoisyViT on three benchmark datasets: Food2K (2,000 categories, ~1M images), Food-101 (101 categories, ~100K images), and CNFOOD-241 (241 categories, ~190K images). The performance of NoisyViT is evaluated against state-of-the-art food recognition models. Our results demonstrate that NoisyViT achieves Top-1 accuracies of 95%, 99.5%, and 96.6% on Food2K, Food-101, and CNFOOD-241, respectively, significantly outperforming existing approaches. This study underscores the potential of NoisyViT for dietary assessment, nutritional monitoring, and healthcare applications, paving the way for future advancements in vision-based food computing. Code for reproducing NoisyViT for food recognition is available at NoisyViT_Food.

Title: DisentTalk: Cross-lingual Talking Face Generation via Semantic Disentangled Diffusion Model

Authors: Kangwei Liu, Junwu Liu, Yun Cao, Jinlin Guo, Xiaowei Yi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19001
Pdf URL: https://arxiv.org/pdf/2503.19001
Copy Paste: [[2503.19001]] DisentTalk: Cross-lingual Talking Face Generation via Semantic Disentangled Diffusion Model(https://arxiv.org/abs/2503.19001)
Keywords: diffusion
Abstract: Recent advances in talking face generation have significantly improved facial animation synthesis. However, existing approaches face fundamental limitations: 3DMM-based methods maintain temporal consistency but lack fine-grained regional control, while Stable Diffusion-based methods enable spatial manipulation but suffer from temporal inconsistencies. The integration of these approaches is hindered by incompatible control mechanisms and semantic entanglement of facial representations. This paper presents DisentTalk, introducing a data-driven semantic disentanglement framework that decomposes 3DMM expression parameters into meaningful subspaces for fine-grained facial control. Building upon this disentangled representation, we develop a hierarchical latent diffusion architecture that operates in 3DMM parameter space, integrating region-aware attention mechanisms to ensure both spatial precision and temporal coherence. To address the scarcity of high-quality Chinese training data, we introduce CHDTF, a Chinese high-definition talking face dataset. Extensive experiments show superior performance over existing methods across multiple metrics, including lip synchronization, expression quality, and temporal consistency. Project Page: this https URL.

Title: Option Discovery Using LLM-guided Semantic Hierarchical Reinforcement Learning

Authors: Chak Lam Shek, Pratap Tokekar
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19007
Pdf URL: https://arxiv.org/pdf/2503.19007
Copy Paste: [[2503.19007]] Option Discovery Using LLM-guided Semantic Hierarchical Reinforcement Learning(https://arxiv.org/abs/2503.19007)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown remarkable promise in reasoning and decision-making, yet their integration with Reinforcement Learning (RL) for complex robotic tasks remains underexplored. In this paper, we propose an LLM-guided hierarchical RL framework, termed LDSC, that leverages LLM-driven subgoal selection and option reuse to enhance sample efficiency, generalization, and multi-task adaptability. Traditional RL methods often suffer from inefficient exploration and high computational cost. Hierarchical RL helps with these challenges, but existing methods often fail to reuse options effectively when faced with new tasks. To address these limitations, we introduce a three-stage framework that uses LLMs for subgoal generation given natural language description of the task, a reusable option learning and selection method, and an action-level policy, enabling more effective decision-making across diverse tasks. By incorporating LLMs for subgoal prediction and policy guidance, our approach improves exploration efficiency and enhances learning performance. On average, LDSC outperforms the baseline by 55.9\% in average reward, demonstrating its effectiveness in complex RL settings. More details and experiment videos could be found in \href{this https URL}{this link\footnote{this https URL}}.

Title: RomanTex: Decoupling 3D-aware Rotary Positional Embedded Multi-Attention Network for Texture Synthesis

Authors: Yifei Feng, Mingxin Yang, Shuhui Yang, Sheng Zhang, Jiaao Yu, Zibo Zhao, Yuhong Liu, Jie Jiang, Chunchao Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19011
Pdf URL: https://arxiv.org/pdf/2503.19011
Copy Paste: [[2503.19011]] RomanTex: Decoupling 3D-aware Rotary Positional Embedded Multi-Attention Network for Texture Synthesis(https://arxiv.org/abs/2503.19011)
Keywords: robust, diffusion
Abstract: Painting textures for existing geometries is a critical yet labor-intensive process in 3D asset generation. Recent advancements in text-to-image (T2I) models have led to significant progress in texture generation. Most existing research approaches this task by first generating images in 2D spaces using image diffusion models, followed by a texture baking process to achieve UV texture. However, these methods often struggle to produce high-quality textures due to inconsistencies among the generated multi-view images, resulting in seams and ghosting artifacts. In contrast, 3D-based texture synthesis methods aim to address these inconsistencies, but they often neglect 2D diffusion model priors, making them challenging to apply to real-world objects To overcome these limitations, we propose RomanTex, a multiview-based texture generation framework that integrates a multi-attention network with an underlying 3D representation, facilitated by our novel 3D-aware Rotary Positional Embedding. Additionally, we incorporate a decoupling characteristic in the multi-attention block to enhance the model's robustness in image-to-texture task, enabling semantically-correct back-view synthesis. Furthermore, we introduce a geometry-related Classifier-Free Guidance (CFG) mechanism to further improve the alignment with both geometries and images. Quantitative and qualitative evaluations, along with comprehensive user studies, demonstrate that our method achieves state-of-the-art results in texture quality and consistency.

Title: DiffV2IR: Visible-to-Infrared Diffusion Model via Vision-Language Understanding

Authors: Lingyan Ran, Lidong Wang, Guangcong Wang, Peng Wang, Yanning Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19012
Pdf URL: https://arxiv.org/pdf/2503.19012
Copy Paste: [[2503.19012]] DiffV2IR: Visible-to-Infrared Diffusion Model via Vision-Language Understanding(https://arxiv.org/abs/2503.19012)
Keywords: diffusion
Abstract: The task of translating visible-to-infrared images (V2IR) is inherently challenging due to three main obstacles: 1) achieving semantic-aware translation, 2) managing the diverse wavelength spectrum in infrared imagery, and 3) the scarcity of comprehensive infrared datasets. Current leading methods tend to treat V2IR as a conventional image-to-image synthesis challenge, often overlooking these specific issues. To address this, we introduce DiffV2IR, a novel framework for image translation comprising two key elements: a Progressive Learning Module (PLM) and a Vision-Language Understanding Module (VLUM). PLM features an adaptive diffusion model architecture that leverages multi-stage knowledge learning to infrared transition from full-range to target wavelength. To improve V2IR translation, VLUM incorporates unified Vision-Language Understanding. We also collected a large infrared dataset, IR-500K, which includes 500,000 infrared images compiled by various scenes and objects under various environmental conditions. Through the combination of PLM, VLUM, and the extensive IR-500K dataset, DiffV2IR markedly improves the performance of V2IR. Experiments validate DiffV2IR's excellence in producing high-quality translations, establishing its efficacy and broad applicability. The code, dataset, and DiffV2IR model will be available at this https URL.

Title: strideSEA: A STRIDE-centric Security Evaluation Approach

Authors: Alvi Jawad, Jason Jaskolka, Ashraf Matrawy, Mohamed Ibnkahla
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.19030
Pdf URL: https://arxiv.org/pdf/2503.19030
Copy Paste: [[2503.19030]] strideSEA: A STRIDE-centric Security Evaluation Approach(https://arxiv.org/abs/2503.19030)
Keywords: secure, security, attack
Abstract: Microsoft's STRIDE methodology is at the forefront of threat modeling, supporting the increasingly critical quality attribute of security in software-intensive systems. However, in a comprehensive security evaluation process, the general consensus is that the STRIDE classification is only useful for threat elicitation, isolating threat modeling from the other security evaluation activities involved in a secure software development life cycle (SDLC). We present strideSEA, a STRIDE-centric Security Evaluation Approach that integrates STRIDE as the central classification scheme into the security activities of threat modeling, attack scenario analysis, risk analysis, and countermeasure recommendation that are conducted alongside software engineering activities in secure SDLCs. The application of strideSEA is demonstrated in a real-world online immunization system case study. Using STRIDE as a single unifying thread, we bind existing security evaluation approaches in the four security activities of strideSEA to analyze (1) threats using Microsoft threat modeling tool, (2) attack scenarios using attack trees, (3) systemic risk using NASA's defect detection and prevention (DDP) technique, and (4) recommend countermeasures based on their effectiveness in reducing the most critical risks using DDP. The results include a detailed quantitative assessment of the security of the online immunization system with a clear definition of the role and advantages of integrating STRIDE in the evaluation process. Overall, the unified approach in strideSEA enables a more structured security evaluation process, allowing easier identification and recommendation of countermeasures, thus supporting the security requirements and eliciting design considerations, informing the software development life cycle of future software-based information systems.

Title: Color Conditional Generation with Sliced Wasserstein Guidance

Authors: Alexander Lobashev, Maria Larchenko, Dmitry Guskov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19034
Pdf URL: https://arxiv.org/pdf/2503.19034
Copy Paste: [[2503.19034]] Color Conditional Generation with Sliced Wasserstein Guidance(https://arxiv.org/abs/2503.19034)
Keywords: diffusion
Abstract: We propose SW-Guidance, a training-free approach for image generation conditioned on the color distribution of a reference image. While it is possible to generate an image with fixed colors by first creating an image from a text prompt and then applying a color style transfer method, this approach often results in semantically meaningless colors in the generated image. Our method solves this problem by modifying the sampling process of a diffusion model to incorporate the differentiable Sliced 1-Wasserstein distance between the color distribution of the generated image and the reference palette. Our method outperforms state-of-the-art techniques for color-conditional generation in terms of color similarity to the reference, producing images that not only match the reference colors but also maintain semantic coherence with the original text prompt. Our source code is available at this https URL.

Title: LookAhead Tuning: Safer Language Models via Partial Answer Previews

Authors: Kangwei Liu, Mengru Wang, Yujie Luo, Lin Yuan, Mengshu Sun, Ningyu Zhang, Lei Liang, Zhiqiang Zhang, Jun Zhou, Huajun Chen
Subjects: cs.CL, cs.AI, cs.CV, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2503.19041
Pdf URL: https://arxiv.org/pdf/2503.19041
Copy Paste: [[2503.19041]] LookAhead Tuning: Safer Language Models via Partial Answer Previews(https://arxiv.org/abs/2503.19041)
Keywords: robust, large language model
Abstract: Fine-tuning enables large language models (LLMs) to adapt to specific domains, but often undermines their previously established safety alignment. To mitigate the degradation of model safety during fine-tuning, we introduce LookAhead Tuning, which comprises two simple, low-resource, and effective data-driven methods that modify training data by previewing partial answer prefixes. Both methods aim to preserve the model's inherent safety mechanisms by minimizing perturbations to initial token distributions. Comprehensive experiments demonstrate that LookAhead Tuning effectively maintains model safety without sacrificing robust performance on downstream tasks. Our findings position LookAhead Tuning as a reliable and efficient solution for the safe and effective adaptation of LLMs. Code is released at this https URL.

Title: Coding Malware in Fancy Programming Languages for Fun and Profit

Authors: Theodoros Apostolopoulos, Vasilios Koutsokostas, Nikolaos Totosis, Constantinos Patsakis, Georgios Smaragdakis
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.19058
Pdf URL: https://arxiv.org/pdf/2503.19058
Copy Paste: [[2503.19058]] Coding Malware in Fancy Programming Languages for Fun and Profit(https://arxiv.org/abs/2503.19058)
Keywords: defense, robust
Abstract: The continuous increase in malware samples, both in sophistication and number, presents many challenges for organizations and analysts, who must cope with thousands of new heterogeneous samples daily. This requires robust methods to quickly determine whether a file is malicious. Due to its speed and efficiency, static analysis is the first line of defense. In this work, we illustrate how the practical state-of-the-art methods used by antivirus solutions may fail to detect evident malware traces. The reason is that they highly depend on very strict signatures where minor deviations prevent them from detecting shellcodes that otherwise would immediately be flagged as malicious. Thus, our findings illustrate that malware authors may drastically decrease the detections by converting the code base to less-used programming languages. To this end, we study the features that such programming languages introduce in executables and the practical issues that arise for practitioners to detect malicious activity.

Title: Graph-Level Label-Only Membership Inference Attack against Graph Neural Networks

Authors: Jiazhu Dai, Yubing Lu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19070
Pdf URL: https://arxiv.org/pdf/2503.19070
Copy Paste: [[2503.19070]] Graph-Level Label-Only Membership Inference Attack against Graph Neural Networks(https://arxiv.org/abs/2503.19070)
Keywords: attack, robust, membership infer
Abstract: Graph neural networks (GNNs) are widely used for graph-structured data but are vulnerable to membership inference attacks (MIAs) in graph classification tasks, which determine if a graph was part of the training dataset, potentially causing data leakage. Existing MIAs rely on prediction probability vectors, but they become ineffective when only prediction labels are available. We propose a Graph-level Label-Only Membership Inference Attack (GLO-MIA), which is based on the intuition that the target model's predictions on training data are more stable than those on testing data. GLO-MIA generates a set of perturbed graphs for target graph by adding perturbations to its effective features and queries the target model with the perturbed graphs to get their prediction labels, which are then used to calculate robustness score of the target graph. Finally, by comparing the robustness score with a predefined threshold, the membership of the target graph can be inferred correctly with high probability. Our evaluation on three datasets and four GNN models shows that GLO-MIA achieves an attack accuracy of up to 0.825, outperforming baseline work by 8.5% and closely matching the performance of probability-based MIAs, even with only prediction labels.

Title: HingeRLC-GAN: Combating Mode Collapse with Hinge Loss and RLC Regularization

Authors: Osman Goni, Himadri Saha Arka, Mithun Halder, Mir Moynuddin Ahmed Shibly, Swakkhar Shatabda
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19074
Pdf URL: https://arxiv.org/pdf/2503.19074
Copy Paste: [[2503.19074]] HingeRLC-GAN: Combating Mode Collapse with Hinge Loss and RLC Regularization(https://arxiv.org/abs/2503.19074)
Keywords: generative
Abstract: Recent advances in Generative Adversarial Networks (GANs) have demonstrated their capability for producing high-quality images. However, a significant challenge remains mode collapse, which occurs when the generator produces a limited number of data patterns that do not reflect the diversity of the training dataset. This study addresses this issue by proposing a number of architectural changes aimed at increasing the diversity and stability of GAN models. We start by improving the loss function with Wasserstein loss and Gradient Penalty to better capture the full range of data variations. We also investigate various network architectures and conclude that ResNet significantly contributes to increased diversity. Building on these findings, we introduce HingeRLC-GAN, a novel approach that combines RLC Regularization and the Hinge loss function. With a FID Score of 18 and a KID Score of 0.001, our approach outperforms existing methods by effectively balancing training stability and increased diversity.

Title: Paving the way for scientific foundation models: enhancing generalization and robustness in PDEs with constraint-aware pre-training

Authors: Amin Totounferoush, Serge Kotchourko, Michael W. Mahoney, Steffen Staab
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.19081
Pdf URL: https://arxiv.org/pdf/2503.19081
Copy Paste: [[2503.19081]] Paving the way for scientific foundation models: enhancing generalization and robustness in PDEs with constraint-aware pre-training(https://arxiv.org/abs/2503.19081)
Keywords: robust, diffusion
Abstract: Partial differential equations (PDEs) govern a wide range of physical systems, but solving them efficiently remains a major challenge. The idea of a scientific foundation model (SciFM) is emerging as a promising tool for learning transferable representations across diverse domains. However, SciFMs require large amounts of solution data, which may be scarce or computationally expensive to generate. To maximize generalization while reducing data dependence, we propose incorporating PDE residuals into pre-training either as the sole learning signal or in combination with data loss to compensate for limited or infeasible training data. We evaluate this constraint-aware pre-training across three key benchmarks: (i) generalization to new physics, where material properties, e.g., the diffusion coefficient, is shifted with respect to the training distribution; (ii) generalization to entirely new PDEs, requiring adaptation to different operators; and (iii) robustness against noisy fine-tuning data, ensuring stability in real-world applications. Our results show that pre-training with PDE constraints significantly enhances generalization, outperforming models trained solely on solution data across all benchmarks. These findings prove the effectiveness of our proposed constraint-aware pre-training as a crucial component for SciFMs, providing a scalable approach to data-efficient, generalizable PDE solvers.

Title: LLM-Based Insight Extraction for Contact Center Analytics and Cost-Efficient Deployment

Authors: Varsha Embar, Ritvik Shrivastava, Vinay Damodaran, Travis Mehlinger, Yu-Chung Hsiao, Karthik Raghunathan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19090
Pdf URL: https://arxiv.org/pdf/2503.19090
Copy Paste: [[2503.19090]] LLM-Based Insight Extraction for Contact Center Analytics and Cost-Efficient Deployment(https://arxiv.org/abs/2503.19090)
Keywords: extraction, large language model
Abstract: Large Language Models have transformed the Contact Center industry, manifesting in enhanced self-service tools, streamlined administrative processes, and augmented agent productivity. This paper delineates our system that automates call driver generation, which serves as the foundation for tasks such as topic modeling, incoming call classification, trend detection, and FAQ generation, delivering actionable insights for contact center agents and administrators to consume. We present a cost-efficient LLM system design, with 1) a comprehensive evaluation of proprietary, open-weight, and fine-tuned models and 2) cost-efficient strategies, and 3) the corresponding cost analysis when deployed in production environments.

Title: Uncertainty-Aware Decomposed Hybrid Networks

Authors: Sina Ditzel, Achref Jaziri, Iuliia Pliushch, Visvanathan Ramesh
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.19096
Pdf URL: https://arxiv.org/pdf/2503.19096
Copy Paste: [[2503.19096]] Uncertainty-Aware Decomposed Hybrid Networks(https://arxiv.org/abs/2503.19096)
Keywords: robust, interpretability
Abstract: The robustness of image recognition algorithms remains a critical challenge, as current models often depend on large quantities of labeled data. In this paper, we propose a hybrid approach that combines the adaptability of neural networks with the interpretability, transparency, and robustness of domain-specific quasi-invariant operators. Our method decomposes the recognition into multiple task-specific operators that focus on different characteristics, supported by a novel confidence measurement tailored to these operators. This measurement enables the network to prioritize reliable features and accounts for noise. We argue that our design enhances transparency and robustness, leading to improved performance, particularly in low-data regimes. Experimental results in traffic sign detection highlight the effectiveness of the proposed method, especially in semi-supervised and unsupervised scenarios, underscoring its potential for data-constrained applications.

Title: Masks and Mimicry: Strategic Obfuscation and Impersonation Attacks on Authorship Verification

Authors: Kenneth Alperin, Rohan Leekha, Adaku Uchendu, Trang Nguyen, Srilakshmi Medarametla, Carlos Levya Capote, Seth Aycock, Charlie Dagli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19099
Pdf URL: https://arxiv.org/pdf/2503.19099
Copy Paste: [[2503.19099]] Masks and Mimicry: Strategic Obfuscation and Impersonation Attacks on Authorship Verification(https://arxiv.org/abs/2503.19099)
Keywords: security, defense, attack, robust, large language model
Abstract: The increasing use of Artificial Intelligence (AI) technologies, such as Large Language Models (LLMs) has led to nontrivial improvements in various tasks, including accurate authorship identification of documents. However, while LLMs improve such defense techniques, they also simultaneously provide a vehicle for malicious actors to launch new attack vectors. To combat this security risk, we evaluate the adversarial robustness of authorship models (specifically an authorship verification model) to potent LLM-based attacks. These attacks include untargeted methods - \textit{authorship obfuscation} and targeted methods - \textit{authorship impersonation}. For both attacks, the objective is to mask or mimic the writing style of an author while preserving the original texts' semantics, respectively. Thus, we perturb an accurate authorship verification model, and achieve maximum attack success rates of 92\% and 78\% for both obfuscation and impersonation attacks, respectively.

Title: Anomaly Detection Using Computer Vision: A Comparative Analysis of Class Distinction and Performance Metrics

Authors: Md. Barkat Ullah Tusher, Shartaz Khan Akash, Amirul Islam Showmik
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19100
Pdf URL: https://arxiv.org/pdf/2503.19100
Copy Paste: [[2503.19100]] Anomaly Detection Using Computer Vision: A Comparative Analysis of Class Distinction and Performance Metrics(https://arxiv.org/abs/2503.19100)
Keywords: security, robust, extraction
Abstract: This paper showcases an experimental study on anomaly detection using computer vision. The study focuses on class distinction and performance evaluation, combining OpenCV with deep learning techniques while employing a TensorFlow-based convolutional neural network for real-time face recognition and classification. The system effectively distinguishes among three classes: authorized personnel (admin), intruders, and non-human entities. A MobileNetV2-based deep learning model is utilized to optimize real-time performance, ensuring high computational efficiency without compromising accuracy. Extensive dataset preprocessing, including image augmentation and normalization, enhances the models generalization capabilities. Our analysis demonstrates classification accuracies of 90.20% for admin, 98.60% for intruders, and 75.80% for non-human detection, while maintaining an average processing rate of 30 frames per second. The study leverages transfer learning, batch normalization, and Adam optimization to achieve stable and robust learning, and a comparative analysis of class differentiation strategies highlights the impact of feature extraction techniques and training methodologies. The results indicate that advanced feature selection and data augmentation significantly enhance detection performance, particularly in distinguishing human from non-human scenes. As an experimental study, this research provides critical insights into optimizing deep learning-based surveillance systems for high-security environments and improving the accuracy and efficiency of real-time anomaly detection.

Title: Your ViT is Secretly an Image Segmentation Model

Authors: Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, Daan de Geus
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19108
Pdf URL: https://arxiv.org/pdf/2503.19108
Copy Paste: [[2503.19108]] Your ViT is Secretly an Image Segmentation Model(https://arxiv.org/abs/2503.19108)
Keywords: transformer, segmentation
Abstract: Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks. To apply single-scale ViTs to image segmentation, existing methods adopt a convolutional adapter to generate multi-scale features, a pixel decoder to fuse these features, and a Transformer decoder that uses the fused features to make predictions. In this paper, we show that the inductive biases introduced by these task-specific components can instead be learned by the ViT itself, given sufficiently large models and extensive pre-training. Based on these findings, we introduce the Encoder-only Mask Transformer (EoMT), which repurposes the plain ViT architecture to conduct image segmentation. With large-scale models and pre-training, EoMT obtains a segmentation accuracy similar to state-of-the-art models that use task-specific components. At the same time, EoMT is significantly faster than these methods due to its architectural simplicity, e.g., up to 4x faster with ViT-L. Across a range of model sizes, EoMT demonstrates an optimal balance between segmentation accuracy and prediction speed, suggesting that compute resources are better spent on scaling the ViT itself rather than adding architectural complexity. Code: this https URL.

Title: Understanding and Improving Information Preservation in Prompt Compression for LLMs

Authors: Weronika Łajewska, Momchil Hardalov, Laura Aina, Neha Anna John, Hang Su, Lluís Màrquez
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.19114
Pdf URL: https://arxiv.org/pdf/2503.19114
Copy Paste: [[2503.19114]] Understanding and Improving Information Preservation in Prompt Compression for LLMs(https://arxiv.org/abs/2503.19114)
Keywords: large language model
Abstract: Recent advancements in large language models (LLMs) have enabled their successful application to a broad range of tasks. However, in information-intensive tasks, the prompt length can grow fast, leading to increased computational requirements, performance degradation, and induced biases from irrelevant or redundant information. Recently, various prompt compression techniques have been introduced to optimize the trade-off between reducing input length and retaining performance. We propose a holistic evaluation framework that allows for in-depth analysis of prompt compression methods. We focus on three key aspects, besides compression ratio: (i) downstream task performance, (ii) grounding in the input context, and (iii) information preservation. Through this framework, we investigate state-of-the-art soft and hard compression methods, showing that they struggle to preserve key details from the original prompt, limiting their performance on complex tasks. We demonstrate that modifying soft prompting methods to control better the granularity of the compressed information can significantly improve their effectiveness -- up to +23\% in downstream task performance, more than +8 BERTScore points in grounding, and 2.7x more entities preserved in compression.

Title: Where is this coming from? Making groundedness count in the evaluation of Document VQA models

Authors: Armineh Nourbakhsh, Siddharth Parekh, Pranav Shetty, Zhao Jin, Sameena Shah, Carolyn Rose
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19120
Pdf URL: https://arxiv.org/pdf/2503.19120
Copy Paste: [[2503.19120]] Where is this coming from? Making groundedness count in the evaluation of Document VQA models(https://arxiv.org/abs/2503.19120)
Keywords: robust
Abstract: Document Visual Question Answering (VQA) models have evolved at an impressive rate over the past few years, coming close to or matching human performance on some benchmarks. We argue that common evaluation metrics used by popular benchmarks do not account for the semantic and multimodal groundedness of a model's outputs. As a result, hallucinations and major semantic errors are treated the same way as well-grounded outputs, and the evaluation scores do not reflect the reasoning capabilities of the model. In response, we propose a new evaluation methodology that accounts for the groundedness of predictions with regard to the semantic characteristics of the output as well as the multimodal placement of the output within the input document. Our proposed methodology is parameterized in such a way that users can configure the score according to their preferences. We validate our scoring methodology using human judgment and show its potential impact on existing popular leaderboards. Through extensive analyses, we demonstrate that our proposed method produces scores that are a better indicator of a model's robustness and tends to give higher rewards to better-calibrated answers.

Title: Overcoming Vocabulary Mismatch: Vocabulary-agnostic Teacher Guided Language Modeling

Authors: Haebin Shin, Lei Ji, Xiao Liu, Yeyun Gong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19123
Pdf URL: https://arxiv.org/pdf/2503.19123
Copy Paste: [[2503.19123]] Overcoming Vocabulary Mismatch: Vocabulary-agnostic Teacher Guided Language Modeling(https://arxiv.org/abs/2503.19123)
Keywords: robust
Abstract: Using large teacher models to guide the training of smaller student models has become the prevailing paradigm for efficient and effective learning. However, vocabulary mismatches between teacher and student language models pose significant challenges in language modeling, resulting in divergent token sequences and output distributions. To overcome these limitations, we propose Vocabulary-agnostic Teacher Guided Language Modeling (VocAgnoLM), a novel approach that bridges the gap caused by vocabulary mismatch through two key methods: (1) Token-level Lexical Alignment, which aligns token sequences across mismatched vocabularies, and (2) Teacher Guided Loss, which leverages the loss of teacher model to guide effective student training. We demonstrate its effectiveness in language modeling with 1B student model using various 7B teacher models with different vocabularies. Notably, with Qwen2.5-Math-Instruct, a teacher model sharing only about 6% of its vocabulary with TinyLlama, VocAgnoLM achieves a 46% performance improvement compared to naive continual pretraining. Furthermore, we demonstrate that VocAgnoLM consistently benefits from stronger teacher models, providing a robust solution to vocabulary mismatches in language modeling.

Title: MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks

Authors: Wenhao You, Bryan Hooi, Yiwei Wang, Youke Wang, Zong Ke, Ming-Hsuan Yang, Zi Huang, Yujun Cai
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2503.19134
Pdf URL: https://arxiv.org/pdf/2503.19134
Copy Paste: [[2503.19134]] MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks(https://arxiv.org/abs/2503.19134)
Keywords: attack, robust, diffusion, large language model
Abstract: While safety mechanisms have significantly progressed in filtering harmful text inputs, MLLMs remain vulnerable to multimodal jailbreaks that exploit their cross-modal reasoning capabilities. We present MIRAGE, a novel multimodal jailbreak framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models (MLLMs). By systematically decomposing the toxic query into environment, role, and action triplets, MIRAGE constructs a multi-turn visual storytelling sequence of images and text using Stable Diffusion, guiding the target model through an engaging detective narrative. This process progressively lowers the model's defences and subtly guides its reasoning through structured contextual cues, ultimately eliciting harmful responses. In extensive experiments on the selected datasets with six mainstream MLLMs, MIRAGE achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines. Moreover, we demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards. These results highlight critical weaknesses in current multimodal safety mechanisms and underscore the urgent need for more robust defences against cross-modal threats.

Title: Activation Functions Considered Harmful: Recovering Neural Network Weights through Controlled Channels

Authors: Jesse Spielman, David Oswald, Mark Ryan, Jo Van Bulck
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.19142
Pdf URL: https://arxiv.org/pdf/2503.19142
Copy Paste: [[2503.19142]] Activation Functions Considered Harmful: Recovering Neural Network Weights through Controlled Channels(https://arxiv.org/abs/2503.19142)
Keywords: secure, privacy, protect, attack, steal
Abstract: With high-stakes machine learning applications increasingly moving to untrusted end-user or cloud environments, safeguarding pre-trained model parameters becomes essential for protecting intellectual property and user privacy. Recent advancements in hardware-isolated enclaves, notably Intel SGX, hold the promise to secure the internal state of machine learning applications even against compromised operating systems. However, we show that privileged software adversaries can exploit input-dependent memory access patterns in common neural network activation functions to extract secret weights and biases from an SGX enclave. Our attack leverages the SGX-Step framework to obtain a noise-free, instruction-granular page-access trace. In a case study of an 11-input regression network using the Tensorflow Microlite library, we demonstrate complete recovery of all first-layer weights and biases, as well as partial recovery of parameters from deeper layers under specific conditions. Our novel attack technique requires only 20 queries per input per weight to obtain all first-layer weights and biases with an average absolute error of less than 1%, improving over prior model stealing attacks. Additionally, a broader ecosystem analysis reveals the widespread use of activation functions with input-dependent memory access patterns in popular machine learning frameworks (either directly or via underlying math libraries). Our findings highlight the limitations of deploying confidential models in SGX enclaves and emphasise the need for stricter side-channel validation of machine learning implementations, akin to the vetting efforts applied to secure cryptographic libraries.

Title: Compositional Caching for Training-free Open-vocabulary Attribute Detection

Authors: Marco Garosi, Alessandro Conti, Gaowen Liu, Elisa Ricci, Massimiliano Mancini
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19145
Pdf URL: https://arxiv.org/pdf/2503.19145
Copy Paste: [[2503.19145]] Compositional Caching for Training-free Open-vocabulary Attribute Detection(https://arxiv.org/abs/2503.19145)
Keywords: large language model
Abstract: Attribute detection is crucial for many computer vision tasks, as it enables systems to describe properties such as color, texture, and material. Current approaches often rely on labor-intensive annotation processes which are inherently limited: objects can be described at an arbitrary level of detail (e.g., color vs. color shades), leading to ambiguities when the annotators are not instructed carefully. Furthermore, they operate within a predefined set of attributes, reducing scalability and adaptability to unforeseen downstream applications. We present Compositional Caching (ComCa), a training-free method for open-vocabulary attribute detection that overcomes these constraints. ComCa requires only the list of target attributes and objects as input, using them to populate an auxiliary cache of images by leveraging web-scale databases and Large Language Models to determine attribute-object compatibility. To account for the compositional nature of attributes, cache images receive soft attribute labels. Those are aggregated at inference time based on the similarity between the input and cache images, refining the predictions of underlying Vision-Language Models (VLMs). Importantly, our approach is model-agnostic, compatible with various VLMs. Experiments on public datasets demonstrate that ComCa significantly outperforms zero-shot and cache-based baselines, competing with recent training-based methods, proving that a carefully designed training-free approach can successfully address open-vocabulary attribute detection.

Title: Risk-Based Thresholding for Reliable Anomaly Detection in Concentrated Solar Power Plants

Authors: Yorick Estievenart, Sukanya Patra, Souhaib Ben Taieb
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.19146
Pdf URL: https://arxiv.org/pdf/2503.19146
Copy Paste: [[2503.19146]] Risk-Based Thresholding for Reliable Anomaly Detection in Concentrated Solar Power Plants(https://arxiv.org/abs/2503.19146)
Keywords: generative
Abstract: Efficient and reliable operation of Concentrated Solar Power (CSP) plants is essential for meeting the growing demand for sustainable energy. However, high-temperature solar receivers face severe operational risks, such as freezing, deformation, and corrosion, resulting in costly downtime and maintenance. To monitor CSP plants, cameras mounted on solar receivers record infrared images at irregular intervals ranging from one to five minutes throughout the day. Anomalous images can be detected by thresholding an anomaly score, where the threshold is chosen to optimize metrics such as the F1-score on a validation set. This work proposes a framework for generating more reliable decision thresholds with finite-sample coverage guarantees on any chosen risk function. Our framework also incorporates an abstention mechanism, allowing high-risk predictions to be deferred to domain experts. Second, we propose a density forecasting method to estimate the likelihood of an observed image given a sequence of previously observed images, using this likelihood as its anomaly score. Third, we analyze the deployment results of our framework across multiple training scenarios over several months for two CSP plants. This analysis provides valuable insights to our industry partner for optimizing maintenance operations. Finally, given the confidential nature of our dataset, we provide an extended simulated dataset, leveraging recent advancements in generative modeling to create diverse thermal images that simulate multiple CSP plants. Our code is publicly available.

Title: HOIGPT: Learning Long Sequence Hand-Object Interaction with Language Models

Authors: Mingzhen Huang, Fu-Jen Chu, Bugra Tekin, Kevin J Liang, Haoyu Ma, Weiyao Wang, Xingyu Chen, Pierre Gleize, Hongfei Xue, Siwei Lyu, Kris Kitani, Matt Feiszli, Hao Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19157
Pdf URL: https://arxiv.org/pdf/2503.19157
Copy Paste: [[2503.19157]] HOIGPT: Learning Long Sequence Hand-Object Interaction with Language Models(https://arxiv.org/abs/2503.19157)
Keywords: generative, large language model
Abstract: We introduce HOIGPT, a token-based generative method that unifies 3D hand-object interactions (HOI) perception and generation, offering the first comprehensive solution for captioning and generating high-quality 3D HOI sequences from a diverse range of conditional signals (\eg text, objects, partial sequences). At its core, HOIGPT utilizes a large language model to predict the bidrectional transformation between HOI sequences and natural language descriptions. Given text inputs, HOIGPT generates a sequence of hand and object meshes; given (partial) HOI sequences, HOIGPT generates text descriptions and completes the sequences. To facilitate HOI understanding with a large language model, this paper introduces two key innovations: (1) a novel physically grounded HOI tokenizer, the hand-object decomposed VQ-VAE, for discretizing HOI sequences, and (2) a motion-aware language model trained to process and generate both text and HOI tokens. Extensive experiments demonstrate that HOIGPT sets new state-of-the-art performance on both text generation (+2.01% R Precision) and HOI generation (-2.56 FID) across multiple tasks and benchmarks.

Title: Language Model Uncertainty Quantification with Attention Chain

Authors: Yinghao Li, Rushi Qiang, Lama Moukheiber, Chao Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19168
Pdf URL: https://arxiv.org/pdf/2503.19168
Copy Paste: [[2503.19168]] Language Model Uncertainty Quantification with Attention Chain(https://arxiv.org/abs/2503.19168)
Keywords: large language model
Abstract: Accurately quantifying a large language model's (LLM) predictive uncertainty is crucial for judging the reliability of its answers. While most existing research focuses on short, directly answerable questions with closed-form outputs (e.g., multiple-choice), involving intermediate reasoning steps in LLM responses is increasingly important. This added complexity complicates uncertainty quantification (UQ) because the probabilities assigned to answer tokens are conditioned on a vast space of preceding reasoning tokens. Direct marginalization is infeasible, and the dependency inflates probability estimates, causing overconfidence in UQ. To address this, we propose UQAC, an efficient method that narrows the reasoning space to a tractable size for marginalization. UQAC iteratively constructs an "attention chain" of tokens deemed "semantically crucial" to the final answer via a backtracking procedure. Starting from the answer tokens, it uses attention weights to identify the most influential predecessors, then iterates this process until reaching the input tokens. Similarity filtering and probability thresholding further refine the resulting chain, allowing us to approximate the marginal probabilities of the answer tokens, which serve as the LLM's confidence. We validate UQAC on multiple reasoning benchmarks with advanced open-source LLMs, demonstrating that it consistently delivers reliable UQ estimates with high computational efficiency.

Title: Graph neural networks extrapolate out-of-distribution for shortest paths

Authors: Robert R. Nerem, Samantha Chen, Sanjoy Dasgupta, Yusu Wang
Subjects: cs.LG, cs.DS
Abstract URL: https://arxiv.org/abs/2503.19173
Pdf URL: https://arxiv.org/pdf/2503.19173
Copy Paste: [[2503.19173]] Graph neural networks extrapolate out-of-distribution for shortest paths(https://arxiv.org/abs/2503.19173)
Keywords: robust
Abstract: Neural networks (NNs), despite their success and wide adoption, still struggle to extrapolate out-of-distribution (OOD), i.e., to inputs that are not well-represented by their training dataset. Addressing the OOD generalization gap is crucial when models are deployed in environments significantly different from the training set, such as applying Graph Neural Networks (GNNs) trained on small graphs to large, real-world graphs. One promising approach for achieving robust OOD generalization is the framework of neural algorithmic alignment, which incorporates ideas from classical algorithms by designing neural architectures that resemble specific algorithmic paradigms (e.g. dynamic programming). The hope is that trained models of this form would have superior OOD capabilities, in much the same way that classical algorithms work for all instances. We rigorously analyze the role of algorithmic alignment in achieving OOD generalization, focusing on graph neural networks (GNNs) applied to the canonical shortest path problem. We prove that GNNs, trained to minimize a sparsity-regularized loss over a small set of shortest path instances, exactly implement the Bellman-Ford (BF) algorithm for shortest paths. In fact, if a GNN minimizes this loss within an error of $\epsilon$, it implements the BF algorithm with an error of $O(\epsilon)$. Consequently, despite limited training data, these GNNs are guaranteed to extrapolate to arbitrary shortest-path problems, including instances of any size. Our empirical results support our theory by showing that NNs trained by gradient descent are able to minimize this loss and extrapolate in practice.

Title: SoK: How Robust is Audio Watermarking in Generative AI models?

Authors: Yizhu Wen, Ashwin Innuganti, Aaron Bien Ramos, Hanqing Guo, Qiben Yan
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19176
Pdf URL: https://arxiv.org/pdf/2503.19176
Copy Paste: [[2503.19176]] SoK: How Robust is Audio Watermarking in Generative AI models?(https://arxiv.org/abs/2503.19176)
Keywords: protect, attack, robust, watermark, generative
Abstract: Audio watermarking is increasingly used to verify the provenance of AI-generated content, enabling applications such as detecting AI-generated speech, protecting music IP, and defending against voice cloning. To be effective, audio watermarks must resist removal attacks that distort signals to evade detection. While many schemes claim robustness, these claims are typically tested in isolation and against a limited set of attacks. A systematic evaluation against diverse removal attacks is lacking, hindering practical deployment. In this paper, we investigate whether recent watermarking schemes that claim robustness can withstand a broad range of removal attacks. First, we introduce a taxonomy covering 22 audio watermarking schemes. Next, we summarize their underlying technologies and potential vulnerabilities. We then present a large-scale empirical study to assess their robustness. To support this, we build an evaluation framework encompassing 22 types of removal attacks (109 configurations) including signal-level, physical-level, and AI-induced distortions. We reproduce 9 watermarking schemes using open-source code, identify 8 new highly effective attacks, and highlight 11 key findings that expose the fundamental limitations of these methods across 3 public datasets. Our results reveal that none of the surveyed schemes can withstand all tested distortions. This evaluation offers a comprehensive view of how current watermarking methods perform under real-world threats. Our demo and code are available at this https URL.

Title: Evaluating Bias in LLMs for Job-Resume Matching: Gender, Race, and Education

Authors: Hayate Iso, Pouya Pezeshkpour, Nikita Bhutani, Estevam Hruschka
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19182
Pdf URL: https://arxiv.org/pdf/2503.19182
Copy Paste: [[2503.19182]] Evaluating Bias in LLMs for Job-Resume Matching: Gender, Race, and Education(https://arxiv.org/abs/2503.19182)
Keywords: fair, large language model
Abstract: Large Language Models (LLMs) offer the potential to automate hiring by matching job descriptions with candidate resumes, streamlining recruitment processes, and reducing operational costs. However, biases inherent in these models may lead to unfair hiring practices, reinforcing societal prejudices and undermining workplace diversity. This study examines the performance and fairness of LLMs in job-resume matching tasks within the English language and U.S. context. It evaluates how factors such as gender, race, and educational background influence model decisions, providing critical insights into the fairness and reliability of LLMs in HR applications. Our findings indicate that while recent models have reduced biases related to explicit attributes like gender and race, implicit biases concerning educational background remain significant. These results highlight the need for ongoing evaluation and the development of advanced bias mitigation strategies to ensure equitable hiring practices when using LLMs in industry settings.

Title: FDS: Frequency-Aware Denoising Score for Text-Guided Latent Diffusion Image Editing

Authors: Yufan Ren, Zicong Jiang, Tong Zhang, Søren Forchhammer, Sabine Süsstrunk
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19191
Pdf URL: https://arxiv.org/pdf/2503.19191
Copy Paste: [[2503.19191]] FDS: Frequency-Aware Denoising Score for Text-Guided Latent Diffusion Image Editing(https://arxiv.org/abs/2503.19191)
Keywords: diffusion
Abstract: Text-guided image editing using Text-to-Image (T2I) models often fails to yield satisfactory results, frequently introducing unintended modifications, such as the loss of local detail and color changes. In this paper, we analyze these failure cases and attribute them to the indiscriminate optimization across all frequency bands, even though only specific frequencies may require adjustment. To address this, we introduce a simple yet effective approach that enables the selective optimization of specific frequency bands within localized spatial regions for precise edits. Our method leverages wavelets to decompose images into different spatial resolutions across multiple frequency bands, enabling precise modifications at various levels of detail. To extend the applicability of our approach, we provide a comparative analysis of different frequency-domain techniques. Additionally, we extend our method to 3D texture editing by performing frequency decomposition on the triplane representation, enabling frequency-aware adjustments for 3D textures. Quantitative evaluations and user studies demonstrate the effectiveness of our method in producing high-quality and precise edits.

Title: Mining-Gym: A Configurable RL Benchmarking Environment for Truck Dispatch Scheduling

Authors: Chayan Banerjee, Kien Nguyen, Clinton Fookes
Subjects: cs.LG, cs.AI, eess.SY
Abstract URL: https://arxiv.org/abs/2503.19195
Pdf URL: https://arxiv.org/pdf/2503.19195
Copy Paste: [[2503.19195]] Mining-Gym: A Configurable RL Benchmarking Environment for Truck Dispatch Scheduling(https://arxiv.org/abs/2503.19195)
Keywords: fair
Abstract: Mining process optimization particularly truck dispatch scheduling is a critical factor in enhancing the efficiency of open pit mining operations However the dynamic and stochastic nature of mining environments characterized by uncertainties such as equipment failures truck maintenance and variable haul cycle times poses significant challenges for traditional optimization methods While Reinforcement Learning RL has shown promise in adaptive decision making for mining logistics its practical deployment requires rigorous evaluation in realistic and customizable simulation environments The lack of standardized benchmarking environments limits fair algorithm comparisons reproducibility and the real world applicability of RL based approaches in open pit mining settings To address this challenge we introduce Mining Gym a configurable open source benchmarking environment designed for training testing and comparing RL algorithms in mining process optimization Built on Discrete Event Simulation DES and seamlessly integrated with the OpenAI Gym interface Mining Gym provides a structured testbed that enables the direct application of advanced RL algorithms from Stable Baselines The framework models key mining specific uncertainties such as equipment failures queue congestion and the stochasticity of mining processes ensuring a realistic and adaptive learning environment Additionally Mining Gym features a graphical user interface GUI for intuitive mine site configuration a comprehensive data logging system a built in KPI dashboard and real time visual representation of the mine site These capabilities facilitate standardized reproducible evaluations across multiple RL strategies and baseline heuristics

Title: Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces

Authors: Chenyangguang Zhang, Alexandros Delitzas, Fangjinhua Wang, Ruida Zhang, Xiangyang Ji, Marc Pollefeys, Francis Engelmann
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19199
Pdf URL: https://arxiv.org/pdf/2503.19199
Copy Paste: [[2503.19199]] Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces(https://arxiv.org/abs/2503.19199)
Keywords: large language model
Abstract: We introduce the task of predicting functional 3D scene graphs for real-world indoor environments from posed RGB-D images. Unlike traditional 3D scene graphs that focus on spatial relationships of objects, functional 3D scene graphs capture objects, interactive elements, and their functional relationships. Due to the lack of training data, we leverage foundation models, including visual language models (VLMs) and large language models (LLMs), to encode functional knowledge. We evaluate our approach on an extended SceneFun3D dataset and a newly collected dataset, FunGraph3D, both annotated with functional 3D scene graphs. Our method significantly outperforms adapted baselines, including Open3DSG and ConceptGraph, demonstrating its effectiveness in modeling complex scene functionalities. We also demonstrate downstream applications such as 3D question answering and robotic manipulation using functional 3D scene graphs. See our project page at this https URL

Title: A Shared Low-Rank Adaptation Approach to Personalized RLHF

Authors: Renpu Liu, Peng Wang, Donghao Li, Cong Shen, Jing Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19201
Pdf URL: https://arxiv.org/pdf/2503.19201
Copy Paste: [[2503.19201]] A Shared Low-Rank Adaptation Approach to Personalized RLHF(https://arxiv.org/abs/2503.19201)
Keywords: large language model
Abstract: Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal technique for aligning artificial intelligence systems with human values, achieving remarkable success in fine-tuning large language models. However, existing RLHF frameworks often assume that human preferences are relatively homogeneous and can be captured by a single, unified reward model. This assumption overlooks the inherent diversity and heterogeneity across individuals, limiting the adaptability of RLHF to personalized scenarios and risking misalignments that can diminish user satisfaction and trust in AI systems. In this paper, we address these challenges by introducing Low-Rank Adaptation (LoRA) into the personalized RLHF framework. We apply LoRA in the the aggregated parameter space of all personalized reward functions, thereby enabling efficient learning of personalized reward models from potentially limited local datasets. Our approach exploits potential shared structures among the local ground-truth reward models while allowing for individual adaptation, without relying on restrictive assumptions about shared representations as in prior works. We further establish sample complexity guarantees for our method. Theoretical analysis demonstrates the effectiveness of the proposed approach in capturing both shared and individual-specific structures within heterogeneous human preferences, addressing the dual challenge of personalization requirements and practical data constraints. Experimental results on real-world datasets corroborate the efficiency of our algorithm in the personalized RLHF setting.

Title: Benchmarking Object Detectors under Real-World Distribution Shifts in Satellite Imagery

Authors: Sara Al-Emadi, Yin Yang, Ferda Ofli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19202
Pdf URL: https://arxiv.org/pdf/2503.19202
Copy Paste: [[2503.19202]] Benchmarking Object Detectors under Real-World Distribution Shifts in Satellite Imagery(https://arxiv.org/abs/2503.19202)
Keywords: robust
Abstract: Object detectors have achieved remarkable performance in many applications; however, these deep learning models are typically designed under the i.i.d. assumption, meaning they are trained and evaluated on data sampled from the same (source) distribution. In real-world deployment, however, target distributions often differ from source data, leading to substantial performance degradation. Domain Generalisation (DG) seeks to bridge this gap by enabling models to generalise to Out-Of-Distribution (OOD) data without access to target distributions during training, enhancing robustness to unseen conditions. In this work, we examine the generalisability and robustness of state-of-the-art object detectors under real-world distribution shifts, focusing particularly on spatial domain shifts. Despite the need, a standardised benchmark dataset specifically designed for assessing object detection under realistic DG scenarios is currently lacking. To address this, we introduce Real-World Distribution Shifts (RWDS), a suite of three novel DG benchmarking datasets that focus on humanitarian and climate change applications. These datasets enable the investigation of domain shifts across (i) climate zones and (ii) various disasters and geographic regions. To our knowledge, these are the first DG benchmarking datasets tailored for object detection in real-world, high-impact contexts. We aim for these datasets to serve as valuable resources for evaluating the robustness and generalisation of future object detection models. Our datasets and code are available at this https URL.

Title: Overtrained Language Models Are Harder to Fine-Tune

Authors: Jacob Mitchell Springer, Sachin Goyal, Kaiyue Wen, Tanishq Kumar, Xiang Yue, Sadhika Malladi, Graham Neubig, Aditi Raghunathan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19206
Pdf URL: https://arxiv.org/pdf/2503.19206
Copy Paste: [[2503.19206]] Overtrained Language Models Are Harder to Fine-Tune(https://arxiv.org/abs/2503.19206)
Keywords: large language model
Abstract: Large language models are pre-trained on ever-growing token budgets under the assumption that better pre-training performance translates to improved downstream models. In this work, we challenge this assumption and show that extended pre-training can make models harder to fine-tune, leading to degraded final performance. We term this phenomenon catastrophic overtraining. For example, the instruction-tuned OLMo-1B model pre-trained on 3T tokens leads to over 2% worse performance on multiple standard LLM benchmarks than its 2.3T token counterpart. Through controlled experiments and theoretical analysis, we show that catastrophic overtraining arises from a systematic increase in the broad sensitivity of pre-trained parameters to modifications, including but not limited to fine-tuning. Our findings call for a critical reassessment of pre-training design that considers the downstream adaptability of the model.

Title: FRESA:Feedforward Reconstruction of Personalized Skinned Avatars from Few Images

Authors: Rong Wang, Fabian Prada, Ziyan Wang, Zhongshi Jiang, Chengxiang Yin, Junxuan Li, Shunsuke Saito, Igor Santesteban, Javier Romero, Rohan Joshi, Hongdong Li, Jason Saragih, Yaser Sheikh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19207
Pdf URL: https://arxiv.org/pdf/2503.19207
Copy Paste: [[2503.19207]] FRESA:Feedforward Reconstruction of Personalized Skinned Avatars from Few Images(https://arxiv.org/abs/2503.19207)
Keywords: robust
Abstract: We present a novel method for reconstructing personalized 3D human avatars with realistic animation from only a few images. Due to the large variations in body shapes, poses, and cloth types, existing methods mostly require hours of per-subject optimization during inference, which limits their practical applications. In contrast, we learn a universal prior from over a thousand clothed humans to achieve instant feedforward generation and zero-shot generalization. Specifically, instead of rigging the avatar with shared skinning weights, we jointly infer personalized avatar shape, skinning weights, and pose-dependent deformations, which effectively improves overall geometric fidelity and reduces deformation artifacts. Moreover, to normalize pose variations and resolve coupled ambiguity between canonical shapes and skinning weights, we design a 3D canonicalization process to produce pixel-aligned initial conditions, which helps to reconstruct fine-grained geometric details. We then propose a multi-frame feature aggregation to robustly reduce artifacts introduced in canonicalization and fuse a plausible avatar preserving person-specific identities. Finally, we train the model in an end-to-end framework on a large-scale capture dataset, which contains diverse human subjects paired with high-quality 3D scans. Extensive experiments show that our method generates more authentic reconstruction and animation than state-of-the-arts, and can be directly generalized to inputs from casually taken phone photos. Project page and code is available at this https URL.

Title: Byzantine Resilient Federated Multi-Task Representation Learning

Authors: Tuan Le, Shana Moothedath
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.19209
Pdf URL: https://arxiv.org/pdf/2503.19209
Copy Paste: [[2503.19209]] Byzantine Resilient Federated Multi-Task Representation Learning(https://arxiv.org/abs/2503.19209)
Keywords: robust, federate
Abstract: In this paper, we propose BR-MTRL, a Byzantine-resilient multi-task representation learning framework that handles faulty or malicious agents. Our approach leverages representation learning through a shared neural network model, where all clients share fixed layers, except for a client-specific final layer. This structure captures shared features among clients while enabling individual adaptation, making it a promising approach for leveraging client data and computational power in heterogeneous federated settings to learn personalized models. To learn the model, we employ an alternating gradient descent strategy: each client optimizes its local model, updates its final layer, and sends estimates of the shared representation to a central server for aggregation. To defend against Byzantine agents, we employ geometric median aggregation for robust client-server communication. Our method enables personalized learning while maintaining resilience in distributed settings. We implemented the proposed alternating gradient descent algorithm in a federated testbed built using Amazon Web Services (AWS) platform and compared its performance with various benchmark algorithms and their variations. Through extensive experiments using real-world datasets, including CIFAR-10 and FEMINIST, we demonstrated the effectiveness and robustness of our approach and its transferability to new unseen clients with limited data, even in the presence of Byzantine adversaries.

Title: Towards Terminology Management Automation for Arabic

Authors: Mahdi Nasser, Laura Sayyah, Fadi A. Zaraket
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19211
Pdf URL: https://arxiv.org/pdf/2503.19211
Copy Paste: [[2503.19211]] Towards Terminology Management Automation for Arabic(https://arxiv.org/abs/2503.19211)
Keywords: extraction
Abstract: This paper presents a method and supporting tools for automation of terminology management for Arabic. The tools extract lists of parallel terminology matching terms in foreign languages to their Arabic counterparts from field specific texts. This has significant implications as it can be used to improve consistent translation and use of terms in specialized Arabic academic books, and provides automated aid for enhancing cross lingual text processing. This automation of terminology management aims to reduce processing time, and ensure use of consistent and correct terminology. The extraction takes advantage of naturally occurring term translations. It considers several candidate phrases of varying lengths that co-occur next to the foreign terms. Then it computes several similarity metrics, including lexicographic, phonetic, morphological, and semantic ones to decide the problem. We experiment with heuristic, machine learning, and ML with post processing approaches. This paper reports on a novel curated dataset for the task, an existing expert reviewed industry parallel corpora, and on the performance of the three approaches. The best approach achieved 94.9% precision and 92.4% recall.

Title: A Survey of Large Language Model Agents for Question Answering

Authors: Murong Yue
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2503.19213
Pdf URL: https://arxiv.org/pdf/2503.19213
Copy Paste: [[2503.19213]] A Survey of Large Language Model Agents for Question Answering(https://arxiv.org/abs/2503.19213)
Keywords: large language model
Abstract: This paper surveys the development of large language model (LLM)-based agents for question answering (QA). Traditional agents face significant limitations, including substantial data requirements and difficulty in generalizing to new environments. LLM-based agents address these challenges by leveraging LLMs as their core reasoning engine. These agents achieve superior QA results compared to traditional QA pipelines and naive LLM QA systems by enabling interaction with external environments. We systematically review the design of LLM agents in the context of QA tasks, organizing our discussion across key stages: planning, question understanding, information retrieval, and answer generation. Additionally, this paper identifies ongoing challenges and explores future research directions to enhance the performance of LLM agent QA systems.

Title: Face Spoofing Detection using Deep Learning

Authors: Najeebullah, Maaz Salman, Zar Nawab Khan Swati
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19223
Pdf URL: https://arxiv.org/pdf/2503.19223
Copy Paste: [[2503.19223]] Face Spoofing Detection using Deep Learning(https://arxiv.org/abs/2503.19223)
Keywords: security, robust, biometric, transformer
Abstract: Digital image spoofing has emerged as a significant security threat in biometric authentication systems, particularly those relying on facial recognition. This study evaluates the performance of three vision based models, MobileNetV2, ResNET50, and Vision Transformer, ViT, for spoof detection in image classification, utilizing a dataset of 150,986 images divided into training , 140,002, testing, 10,984, and validation ,39,574, sets. Spoof detection is critical for enhancing the security of image recognition systems, and this research compares the models effectiveness through accuracy, precision, recall, and F1 score metrics. Results reveal that MobileNetV2 outperforms other architectures on the test dataset, achieving an accuracy of 91.59%, precision of 91.72%, recall of 91.59%, and F1 score of 91.58%, compared to ViT 86.54%, 88.28%, 86.54%, and 86.39%, respectively. On the validation dataset, MobileNetV2, and ViT excel, with MobileNetV2 slightly ahead at 97.17% accuracy versus ViT 96.36%. MobileNetV2 demonstrates faster convergence during training and superior generalization to unseen data, despite both models showing signs of overfitting. These findings highlight MobileNetV2 balanced performance and robustness, making it the preferred choice for spoof detection applications where reliability on new data is essential. The study underscores the importance of model selection in security sensitive contexts and suggests MobileNetV2 as a practical solution for real world deployment.

Title: SCI-IDEA: Context-Aware Scientific Ideation Using Token and Sentence Embeddings

Authors: Farhana Keya, Gollam Rabby, Prasenjit Mitra, Sahar Vahdati, Sören Auer, Yaser Jaradeh
Subjects: cs.CL, cs.DL
Abstract URL: https://arxiv.org/abs/2503.19257
Pdf URL: https://arxiv.org/pdf/2503.19257
Copy Paste: [[2503.19257]] SCI-IDEA: Context-Aware Scientific Ideation Using Token and Sentence Embeddings(https://arxiv.org/abs/2503.19257)
Keywords: large language model
Abstract: Every scientific discovery starts with an idea inspired by prior work, interdisciplinary concepts, and emerging challenges. Recent advancements in large language models (LLMs) trained on scientific corpora have driven interest in AI-supported idea generation. However, generating context-aware, high-quality, and innovative ideas remains challenging. We introduce SCI-IDEA, a framework that uses LLM prompting strategies and Aha Moment detection for iterative idea refinement. SCI-IDEA extracts essential facets from research publications, assessing generated ideas on novelty, excitement, feasibility, and effectiveness. Comprehensive experiments validate SCI-IDEA's effectiveness, achieving average scores of 6.84, 6.86, 6.89, and 6.84 (on a 1-10 scale) across novelty, excitement, feasibility, and effectiveness, respectively. Evaluations employed GPT-4o, GPT-4.5, DeepSeek-32B (each under 2-shot prompting), and DeepSeek-70B (3-shot prompting), with token-level embeddings used for Aha Moment detection. Similarly, it achieves scores of 6.87, 6.86, 6.83, and 6.87 using GPT-4o under 5-shot prompting, GPT-4.5 under 3-shot prompting, DeepSeek-32B under zero-shot chain-of-thought prompting, and DeepSeek-70B under 5-shot prompting with sentence-level embeddings. We also address ethical considerations such as intellectual credit, potential misuse, and balancing human creativity with AI-driven ideation. Our results highlight SCI-IDEA's potential to facilitate the structured and flexible exploration of context-aware scientific ideas, supporting innovation while maintaining ethical standards.

Title: Adaptive Multi-Order Graph Regularized NMF with Dual Sparsity for Hyperspectral Unmixing

Authors: Hui Chen, Liangyu Liu, Xianchao Xiu, Wanquan Liu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.19258
Pdf URL: https://arxiv.org/pdf/2503.19258
Copy Paste: [[2503.19258]] Adaptive Multi-Order Graph Regularized NMF with Dual Sparsity for Hyperspectral Unmixing(https://arxiv.org/abs/2503.19258)
Keywords: robust
Abstract: Hyperspectral unmixing (HU) is a critical yet challenging task in remote sensing. However, existing nonnegative matrix factorization (NMF) methods with graph learning mostly focus on first-order or second-order nearest neighbor relationships and usually require manual parameter tuning, which fails to characterize intrinsic data structures. To address the above issues, we propose a novel adaptive multi-order graph regularized NMF method (MOGNMF) with three key features. First, multi-order graph regularization is introduced into the NMF framework to exploit global and local information comprehensively. Second, these parameters associated with the multi-order graph are learned adaptively through a data-driven approach. Third, dual sparsity is embedded to obtain better robustness, i.e., $\ell_{1/2}$-norm on the abundance matrix and $\ell_{2,1}$-norm on the noise matrix. To solve the proposed model, we develop an alternating minimization algorithm whose subproblems have explicit solutions, thus ensuring effectiveness. Experiments on simulated and real hyperspectral data indicate that the proposed method delivers better unmixing results.

Title: Linguistic Blind Spots of Large Language Models

Authors: Jiali Cheng, Hadi Amiri
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.19260
Pdf URL: https://arxiv.org/pdf/2503.19260
Copy Paste: [[2503.19260]] Linguistic Blind Spots of Large Language Models(https://arxiv.org/abs/2503.19260)
Keywords: large language model
Abstract: Large language models (LLMs) are the foundation of many AI applications today. However, despite their remarkable proficiency in generating coherent text, questions linger regarding their ability to perform fine-grained linguistic annotation tasks, such as detecting nouns or verbs, or identifying more complex syntactic structures like clauses in input texts. These tasks require precise syntactic and semantic understanding of input text, and when LLMs underperform on specific linguistic structures, it raises concerns about their reliability for detailed linguistic analysis and whether their (even correct) outputs truly reflect an understanding of the inputs. In this paper, we empirically study the performance of recent LLMs on fine-grained linguistic annotation tasks. Through a series of experiments, we find that recent LLMs show limited efficacy in addressing linguistic queries and often struggle with linguistically complex inputs. We show that the most capable LLM (Llama3-70b) makes notable errors in detecting linguistic structures, such as misidentifying embedded clauses, failing to recognize verb phrases, and confusing complex nominals with clauses. Our results provide insights to inform future advancements in LLM design and development.

Title: Learning Hazing to Dehazing: Towards Realistic Haze Generation for Real-World Image Dehazing

Authors: Ruiyi Wang, Yushuo Zheng, Zicheng Zhang, Chunyi Li, Shuaicheng Liu, Guangtao Zhai, Xiaohong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19262
Pdf URL: https://arxiv.org/pdf/2503.19262
Copy Paste: [[2503.19262]] Learning Hazing to Dehazing: Towards Realistic Haze Generation for Real-World Image Dehazing(https://arxiv.org/abs/2503.19262)
Keywords: robust, diffusion, generative
Abstract: Existing real-world image dehazing methods primarily attempt to fine-tune pre-trained models or adapt their inference procedures, thus heavily relying on the pre-trained models and associated training data. Moreover, restoring heavily distorted information under dense haze requires generative diffusion models, whose potential in dehazing remains underutilized partly due to their lengthy sampling processes. To address these limitations, we introduce a novel hazing-dehazing pipeline consisting of a Realistic Hazy Image Generation framework (HazeGen) and a Diffusion-based Dehazing framework (DiffDehaze). Specifically, HazeGen harnesses robust generative diffusion priors of real-world hazy images embedded in a pre-trained text-to-image diffusion model. By employing specialized hybrid training and blended sampling strategies, HazeGen produces realistic and diverse hazy images as high-quality training data for DiffDehaze. To alleviate the inefficiency and fidelity concerns associated with diffusion-based methods, DiffDehaze adopts an Accelerated Fidelity-Preserving Sampling process (AccSamp). The core of AccSamp is the Tiled Statistical Alignment Operation (AlignOp), which can provide a clean and faithful dehazing estimate within a small fraction of sampling steps to reduce complexity and enable effective fidelity guidance. Extensive experiments demonstrate the superior dehazing performance and visual quality of our approach over existing methods. The code is available at this https URL.

Title: DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning

Authors: Fucai Ke, Vijay Kumar B G, Xingjian Leng, Zhixi Cai, Zaid Khan, Weiqing Wang, Pari Delir Haghighi, Hamid Rezatofighi, Manmohan Chandraker
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19263
Pdf URL: https://arxiv.org/pdf/2503.19263
Copy Paste: [[2503.19263]] DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning(https://arxiv.org/abs/2503.19263)
Keywords: large language model
Abstract: Visual reasoning (VR), which is crucial in many fields for enabling human-like visual understanding, remains highly challenging. Recently, compositional visual reasoning approaches, which leverage the reasoning abilities of large language models (LLMs) with integrated tools to solve problems, have shown promise as more effective strategies than end-to-end VR methods. However, these approaches face limitations, as frozen LLMs lack tool awareness in VR, leading to performance bottlenecks. While leveraging LLMs for reasoning is widely used in other domains, they are not directly applicable to VR due to limited training data, imperfect tools that introduce errors and reduce data collection efficiency in VR, and challenging in fine-tuning on noisy workflows. To address these challenges, we propose DWIM: i) Discrepancy-aware training Workflow generation, which assesses tool usage and extracts more viable workflows for training; and ii) Instruct-Masking fine-tuning, which guides the model to only clone effective actions, enabling the generation of more practical solutions. Our experiments demonstrate that DWIM achieves state-of-the-art performance across various VR tasks, exhibiting strong generalization on multiple widely-used datasets.

Title: PHEONA: An Evaluation Framework for Large Language Model-based Approaches to Computational Phenotyping

Authors: Sarah Pungitore, Shashank Yadav, Vignesh Subbian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19265
Pdf URL: https://arxiv.org/pdf/2503.19265
Copy Paste: [[2503.19265]] PHEONA: An Evaluation Framework for Large Language Model-based Approaches to Computational Phenotyping(https://arxiv.org/abs/2503.19265)
Keywords: large language model
Abstract: Computational phenotyping is essential for biomedical research but often requires significant time and resources, especially since traditional methods typically involve extensive manual data review. While machine learning and natural language processing advancements have helped, further improvements are needed. Few studies have explored using Large Language Models (LLMs) for these tasks despite known advantages of LLMs for text-based tasks. To facilitate further research in this area, we developed an evaluation framework, Evaluation of PHEnotyping for Observational Health Data (PHEONA), that outlines context-specific considerations. We applied and demonstrated PHEONA on concept classification, a specific task within a broader phenotyping process for Acute Respiratory Failure (ARF) respiratory support therapies. From the sample concepts tested, we achieved high classification accuracy, suggesting the potential for LLM-based methods to improve computational phenotyping processes.

Title: MARS: Memory-Enhanced Agents with Reflective Self-improvement

Authors: Xuechen Liang, Meiling Tao, Yinghui Xia, Jianhui Wang, Kun Li, Yijin Wang, Jingsong Yang, Tianyu Shi, Yuantao Wang, Miao Zhang, Xueqian Wang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2503.19271
Pdf URL: https://arxiv.org/pdf/2503.19271
Copy Paste: [[2503.19271]] MARS: Memory-Enhanced Agents with Reflective Self-improvement(https://arxiv.org/abs/2503.19271)
Keywords: large language model
Abstract: Large language models (LLMs) have made significant advances in the field of natural language processing, but they still face challenges such as continuous decision-making, lack of long-term memory, and limited context windows in dynamic environments. To address these issues, this paper proposes an innovative framework Memory-Enhanced Agents with Reflective Self-improvement. The MARS framework comprises three agents: the User, the Assistant, and the Checker. By integrating iterative feedback, reflective mechanisms, and a memory optimization mechanism based on the Ebbinghaus forgetting curve, it significantly enhances the agents capabilities in handling multi-tasking and long-span information.

Title: Context-Aware Semantic Segmentation: Enhancing Pixel-Level Understanding with Large Language Models for Advanced Vision Applications

Authors: Ben Rahman
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19276
Pdf URL: https://arxiv.org/pdf/2503.19276
Copy Paste: [[2503.19276]] Context-Aware Semantic Segmentation: Enhancing Pixel-Level Understanding with Large Language Models for Advanced Vision Applications(https://arxiv.org/abs/2503.19276)
Keywords: robust, extraction, transformer, large language model, segmentation
Abstract: Semantic segmentation has made significant strides in pixel-level image understanding, yet it remains limited in capturing contextual and semantic relationships between objects. Current models, such as CNN and Transformer-based architectures, excel at identifying pixel-level features but fail to distinguish semantically similar objects (e.g., "doctor" vs. "nurse" in a hospital scene) or understand complex contextual scenarios (e.g., differentiating a running child from a regular pedestrian in autonomous driving). To address these limitations, we proposed a novel Context-Aware Semantic Segmentation framework that integrates Large Language Models (LLMs) with state-of-the-art vision backbones. Our hybrid model leverages the Swin Transformer for robust visual feature extraction and GPT-4 for enriching semantic understanding through text embeddings. A Cross-Attention Mechanism is introduced to align vision and language features, enabling the model to reason about context more effectively. Additionally, Graph Neural Networks (GNNs) are employed to model object relationships within the scene, capturing dependencies that are overlooked by traditional models. Experimental results on benchmark datasets (e.g., COCO, Cityscapes) demonstrate that our approach outperforms the existing methods in both pixel-level accuracy (mIoU) and contextual understanding (mAP). This work bridges the gap between vision and language, paving the path for more intelligent and context-aware vision systems in applications including autonomous driving, medical imaging, and robotics.

Title: Multiscale Feature Importance-based Bit Allocation for End-to-End Feature Coding for Machines

Authors: Junle Liu, Yun Zhang, Zixi Guo
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2503.19278
Pdf URL: https://arxiv.org/pdf/2503.19278
Copy Paste: [[2503.19278]] Multiscale Feature Importance-based Bit Allocation for End-to-End Feature Coding for Machines(https://arxiv.org/abs/2503.19278)
Keywords: segmentation
Abstract: Feature Coding for Machines (FCM) aims to compress intermediate features effectively for remote intelligent analytics, which is crucial for future intelligent visual applications. In this paper, we propose a Multiscale Feature Importance-based Bit Allocation (MFIBA) for end-to-end FCM. First, we find that the importance of features for machine vision tasks varies with the scales, object size, and image instances. Based on this finding, we propose a Multiscale Feature Importance Prediction (MFIP) module to predict the importance weight for each scale of features. Secondly, we propose a task loss-rate model to establish the relationship between the task accuracy losses of using compressed features and the bitrate of encoding these features. Finally, we develop a MFIBA for end-to-end FCM, which is able to assign coding bits of multiscale features more reasonably based on their importance. Experimental results demonstrate that when combined with a retained Efficient Learned Image Compression (ELIC), the proposed MFIBA achieves an average of 38.202% bitrate savings in object detection compared to the anchor ELIC. Moreover, the proposed MFIBA achieves an average of 17.212% and 36.492% feature bitrate savings for instance segmentation and keypoint detection, respectively. When the proposed MFIBA is applied to the LIC-TCM, it achieves an average of 18.103%, 19.866% and 19.597% bit rate savings on three machine vision tasks, respectively, which validates the proposed MFIBA has good generalizability and adaptability to different machine vision tasks and FCM base codecs.

Title: Machine-assisted writing evaluation: Exploring pre-trained language models in analyzing argumentative moves

Authors: Wenjuan Qin, Weiran Wang, Yuming Yang, Tao Gui
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.19279
Pdf URL: https://arxiv.org/pdf/2503.19279
Copy Paste: [[2503.19279]] Machine-assisted writing evaluation: Exploring pre-trained language models in analyzing argumentative moves(https://arxiv.org/abs/2503.19279)
Keywords: robust
Abstract: The study investigates the efficacy of pre-trained language models (PLMs) in analyzing argumentative moves in a longitudinal learner corpus. Prior studies on argumentative moves often rely on qualitative analysis and manual coding, limiting their efficiency and generalizability. The study aims to: 1) to assess the reliability of PLMs in analyzing argumentative moves; 2) to utilize PLM-generated annotations to illustrate developmental patterns and predict writing quality. A longitudinal corpus of 1643 argumentative texts from 235 English learners in China is collected and annotated into six move types: claim, data, counter-claim, counter-data, rebuttal, and non-argument. The corpus is divided into training, validation, and application sets annotated by human experts and PLMs. We use BERT as one of the implementations of PLMs. The results indicate a robust reliability of PLMs in analyzing argumentative moves, with an overall F1 score of 0.743, surpassing existing models in the field. Additionally, PLM-labeled argumentative moves effectively capture developmental patterns and predict writing quality. Over time, students exhibit an increase in the use of data and counter-claims and a decrease in non-argument moves. While low-quality texts are characterized by a predominant use of claims and data supporting only oneside position, mid- and high-quality texts demonstrate an integrative perspective with a higher ratio of counter-claims, counter-data, and rebuttals. This study underscores the transformative potential of integrating artificial intelligence into language education, enhancing the efficiency and accuracy of evaluating students' writing. The successful application of PLMs can catalyze the development of educational technology, promoting a more data-driven and personalized learning environment that supports diverse educational needs.

Title: ISPDiffuser: Learning RAW-to-sRGB Mappings with Texture-Aware Diffusion Models and Histogram-Guided Color Consistency

Authors: Yang Ren, Hai Jiang, Menglong Yang, Wei Li, Shuaicheng Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19283
Pdf URL: https://arxiv.org/pdf/2503.19283
Copy Paste: [[2503.19283]] ISPDiffuser: Learning RAW-to-sRGB Mappings with Texture-Aware Diffusion Models and Histogram-Guided Color Consistency(https://arxiv.org/abs/2503.19283)
Keywords: diffusion, generative
Abstract: RAW-to-sRGB mapping, or the simulation of the traditional camera image signal processor (ISP), aims to generate DSLR-quality sRGB images from raw data captured by smartphone sensors. Despite achieving comparable results to sophisticated handcrafted camera ISP solutions, existing learning-based methods still struggle with detail disparity and color distortion. In this paper, we present ISPDiffuser, a diffusion-based decoupled framework that separates the RAW-to-sRGB mapping into detail reconstruction in grayscale space and color consistency mapping from grayscale to sRGB. Specifically, we propose a texture-aware diffusion model that leverages the generative ability of diffusion models to focus on local detail recovery, in which a texture enrichment loss is further proposed to prompt the diffusion model to generate more intricate texture details. Subsequently, we introduce a histogram-guided color consistency module that utilizes color histogram as guidance to learn precise color information for grayscale to sRGB color consistency mapping, with a color consistency loss designed to constrain the learned color information. Extensive experimental results show that the proposed ISPDiffuser outperforms state-of-the-art competitors both quantitatively and visually. The code is available at this https URL.

Title: No Black Box Anymore: Demystifying Clinical Predictive Modeling with Temporal-Feature Cross Attention Mechanism

Authors: Yubo Li, Xinyu Yao, Rema Padman
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19285
Pdf URL: https://arxiv.org/pdf/2503.19285
Copy Paste: [[2503.19285]] No Black Box Anymore: Demystifying Clinical Predictive Modeling with Temporal-Feature Cross Attention Mechanism(https://arxiv.org/abs/2503.19285)
Keywords: interpretability, explainability, transformer
Abstract: Despite the outstanding performance of deep learning models in clinical prediction tasks, explainability remains a significant challenge. Inspired by transformer architectures, we introduce the Temporal-Feature Cross Attention Mechanism (TFCAM), a novel deep learning framework designed to capture dynamic interactions among clinical features across time, enhancing both predictive accuracy and interpretability. In an experiment with 1,422 patients with Chronic Kidney Disease, predicting progression to End-Stage Renal Disease, TFCAM outperformed LSTM and RETAIN baselines, achieving an AUROC of 0.95 and an F1-score of 0.69. Beyond performance gains, TFCAM provides multi-level explainability by identifying critical temporal periods, ranking feature importance, and quantifying how features influence each other across time before affecting predictions. Our approach addresses the "black box" limitations of deep learning in healthcare, offering clinicians transparent insights into disease progression mechanisms while maintaining state-of-the-art predictive performance.

Title: Exploring Semantic Feature Discrimination for Perceptual Image Super-Resolution and Opinion-Unaware No-Reference Image Quality Assessment

Authors: Guanglu Dong, Xiangyu Liao, Mingyang Li, Guihuan Guo, Chao Ren
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.19295
Pdf URL: https://arxiv.org/pdf/2503.19295
Copy Paste: [[2503.19295]] Exploring Semantic Feature Discrimination for Perceptual Image Super-Resolution and Opinion-Unaware No-Reference Image Quality Assessment(https://arxiv.org/abs/2503.19295)
Keywords: generative
Abstract: Generative Adversarial Networks (GANs) have been widely applied to image super-resolution (SR) to enhance the perceptual quality. However, most existing GAN-based SR methods typically perform coarse-grained discrimination directly on images and ignore the semantic information of images, making it challenging for the super resolution networks (SRN) to learn fine-grained and semantic-related texture details. To alleviate this issue, we propose a semantic feature discrimination method, SFD, for perceptual SR. Specifically, we first design a feature discriminator (Feat-D), to discriminate the pixel-wise middle semantic features from CLIP, aligning the feature distributions of SR images with that of high-quality images. Additionally, we propose a text-guided discrimination method (TG-D) by introducing learnable prompt pairs (LPP) in an adversarial manner to perform discrimination on the more abstract output feature of CLIP, further enhancing the discriminative ability of our method. With both Feat-D and TG-D, our SFD can effectively distinguish between the semantic feature distributions of low-quality and high-quality images, encouraging SRN to generate more realistic and semantic-relevant textures. Furthermore, based on the trained Feat-D and LPP, we propose a novel opinion-unaware no-reference image quality assessment (OU NR-IQA) method, SFD-IQA, greatly improving OU NR-IQA performance without any additional targeted training. Extensive experiments on classical SISR, real-world SISR, and OU NR-IQA tasks demonstrate the effectiveness of our proposed methods.

Title: UniMoMo: Unified Generative Modeling of 3D Molecules for De Novo Binder Design

Authors: Xiangzhe Kong, Zishen Zhang, Ziting Zhang, Rui Jiao, Jianzhu Ma, Kai Liu, Wenbing Huang, Yang Liu
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2503.19300
Pdf URL: https://arxiv.org/pdf/2503.19300
Copy Paste: [[2503.19300]] UniMoMo: Unified Generative Modeling of 3D Molecules for De Novo Binder Design(https://arxiv.org/abs/2503.19300)
Keywords: diffusion, generative
Abstract: The design of target-specific molecules such as small molecules, peptides, and antibodies is vital for biological research and drug discovery. Existing generative methods are restricted to single-domain molecules, failing to address versatile therapeutic needs or utilize cross-domain transferability to enhance model performance. In this paper, we introduce Unified generative Modeling of 3D Molecules (UniMoMo), the first framework capable of designing binders of multiple molecular domains using a single model. In particular, UniMoMo unifies the representations of different molecules as graphs of blocks, where each block corresponds to either a standard amino acid or a molecular fragment. Based on these unified representations, UniMoMo utilizes a geometric latent diffusion model for 3D molecular generation, featuring an iterative full-atom autoencoder to compress blocks into latent space points, followed by an E(3)-equivariant diffusion process. Extensive benchmarks across peptides, antibodies, and small molecules demonstrate the superiority of our unified framework over existing domain-specific models, highlighting the benefits of multi-domain training.

Title: BIMII-Net: Brain-Inspired Multi-Iterative Interactive Network for RGB-T Road Scene Semantic Segmentation

Authors: Hanshuo Qiu, Jie Jiang, Ruoli Yang, Lixin Zhan, Jizhao Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19303
Pdf URL: https://arxiv.org/pdf/2503.19303
Copy Paste: [[2503.19303]] BIMII-Net: Brain-Inspired Multi-Iterative Interactive Network for RGB-T Road Scene Semantic Segmentation(https://arxiv.org/abs/2503.19303)
Keywords: extraction, segmentation
Abstract: RGB-T road scene semantic segmentation enhances visual scene understanding in complex environments characterized by inadequate illumination or occlusion by fusing information from RGB and thermal images. Nevertheless, existing RGB-T semantic segmentation models typically depend on simple addition or concatenation strategies or ignore the differences between information at different levels. To address these issues, we proposed a novel RGB-T road scene semantic segmentation network called Brain-Inspired Multi-Iteration Interaction Network (BIMII-Net). First, to meet the requirements of accurate texture and local information extraction in road scenarios like autonomous driving, we proposed a deep continuous-coupled neural network (DCCNN) architecture based on a brain-inspired model. Second, to enhance the interaction and expression capabilities among multi-modal information, we designed a cross explicit attention-enhanced fusion module (CEAEF-Module) in the feature fusion stage of BIMII-Net to effectively integrate features at different levels. Finally, we constructed a complementary interactive multi-layer decoder structure, incorporating the shallow-level feature iteration module (SFI-Module), the deep-level feature iteration module (DFI-Module), and the multi-feature enhancement module (MFE-Module) to collaboratively extract texture details and global skeleton information, with multi-module joint supervision further optimizing the segmentation results. Experimental results demonstrate that BIMII-Net achieves state-of-the-art (SOTA) performance in the brain-inspired computing domain and outperforms most existing RGB-T semantic segmentation methods. It also exhibits strong generalization capabilities on multiple RGB-T datasets, proving the effectiveness of brain-inspired computer models in multi-modal image segmentation tasks.

Title: A Comprehensive Analysis of Mamba for 3D Volumetric Medical Image Segmentation

Authors: Chaohan Wang, Yutong Xie, Qi Chen, Yuyin Zhou, Qi Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19308
Pdf URL: https://arxiv.org/pdf/2503.19308
Copy Paste: [[2503.19308]] A Comprehensive Analysis of Mamba for 3D Volumetric Medical Image Segmentation(https://arxiv.org/abs/2503.19308)
Keywords: transformer, segmentation
Abstract: Mamba, with its selective State Space Models (SSMs), offers a more computationally efficient solution than Transformers for long-range dependency modeling. However, there is still a debate about its effectiveness in high-resolution 3D medical image segmentation. In this study, we present a comprehensive investigation into Mamba's capabilities in 3D medical image segmentation by tackling three pivotal questions: Can Mamba replace Transformers? Can it elevate multi-scale representation learning? Is complex scanning necessary to unlock its full potential? We evaluate Mamba's performance across three large public benchmarks-AMOS, TotalSegmentator, and BraTS. Our findings reveal that UlikeMamba, a U-shape Mamba-based network, consistently surpasses UlikeTrans, a U-shape Transformer-based network, particularly when enhanced with custom-designed 3D depthwise convolutions, boosting accuracy and computational efficiency. Further, our proposed multi-scale Mamba block demonstrates superior performance in capturing both fine-grained details and global context, especially in complex segmentation tasks, surpassing Transformer-based counterparts. We also critically assess complex scanning strategies, finding that simpler methods often suffice, while our Tri-scan approach delivers notable advantages in the most challenging scenarios. By integrating these advancements, we introduce a new network for 3D medical image segmentation, positioning Mamba as a transformative force that outperforms leading models such as nnUNet, CoTr, and U-Mamba, offering competitive accuracy with superior computational efficiency. This study provides key insights into Mamba's unique advantages, paving the way for more efficient and accurate approaches to 3D medical imaging.

Title: Iterative Hypothesis Generation for Scientific Discovery with Monte Carlo Nash Equilibrium Self-Refining Trees

Authors: Gollam Rabby, Diyana Muhammed, Prasenjit Mitra, Sören Auer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19309
Pdf URL: https://arxiv.org/pdf/2503.19309
Copy Paste: [[2503.19309]] Iterative Hypothesis Generation for Scientific Discovery with Monte Carlo Nash Equilibrium Self-Refining Trees(https://arxiv.org/abs/2503.19309)
Keywords: large language model
Abstract: Scientific hypothesis generation is a fundamentally challenging task in research, requiring the synthesis of novel and empirically grounded insights. Traditional approaches rely on human intuition and domain expertise, while purely large language model (LLM) based methods often struggle to produce hypotheses that are both innovative and reliable. To address these limitations, we propose the Monte Carlo Nash Equilibrium Self-Refine Tree (MC-NEST), a novel framework that integrates Monte Carlo Tree Search with Nash Equilibrium strategies to iteratively refine and validate hypotheses. MC-NEST dynamically balances exploration and exploitation through adaptive sampling strategies, which prioritize high-potential hypotheses while maintaining diversity in the search space. We demonstrate the effectiveness of MC-NEST through comprehensive experiments across multiple domains, including biomedicine, social science, and computer science. MC-NEST achieves average scores of 2.65, 2.74, and 2.80 (on a 1-3 scale) for novelty, clarity, significance, and verifiability metrics on the social science, computer science, and biomedicine datasets, respectively, outperforming state-of-the-art prompt-based methods, which achieve 2.36, 2.51, and 2.52 on the same datasets. These results underscore MC-NEST's ability to generate high-quality, empirically grounded hypotheses across diverse domains. Furthermore, MC-NEST facilitates structured human-AI collaboration, ensuring that LLMs augment human creativity rather than replace it. By addressing key challenges such as iterative refinement and the exploration-exploitation balance, MC-NEST sets a new benchmark in automated hypothesis generation. Additionally, MC-NEST's ethical design enables responsible AI use, emphasizing transparency and human supervision in hypothesis generation.

Title: LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text

Authors: Weizhi Chen, Jingbo Chen, Yupeng Deng, Jiansheng Chen, Yuman Feng, Zhihao Xi, Diyou Liu, Kai Li, Yu Meng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19311
Pdf URL: https://arxiv.org/pdf/2503.19311
Copy Paste: [[2503.19311]] LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text(https://arxiv.org/abs/2503.19311)
Keywords: large language model
Abstract: This study addresses the technical bottlenecks in handling long text and the "hallucination" issue caused by insufficient short text information in remote sensing vision-language foundation models (VLFM). We propose a novel vision-language foundation model, LRSCLIP, and a multimodal dataset, LRS2M. The main contributions are as follows: (1) By integrating multi-source remote sensing data and adopting a large language model labeling strategy, we construct the LRS2M dataset, which contains 2 million image-text pairs, providing both short and long texts for the first time, thus solving the problem of semantic granularity limitations in existing datasets; (2) The design of the LRSCLIP architecture based on Long-CLIP's KPS module, which extends CLIP's text processing capacity and achieves fine-grained cross-modal feature alignment through a dual-text loss weighting mechanism. Experimental results show that LRSCLIP improves retrieval accuracy by 10\%-20\% over the Long-CLIP baseline in the zero-shot long-text cross-modal retrieval task. For the zero-shot short-text cross-modal retrieval task, LRSCLIP achieves improvements over the current best model, GeoRSCLIP, with increases of 0.17\%, 0.67\%, and 0.92\% in Text to Image R@1, Image to Text R@1, and mR on RSITMD, respectively, and 0.04\%, 2.93\%, and 1.28\% on RSICD. In the zero-shot image classification task (average accuracy=75.75\%) and semantic localization task (Rmi=0.7653), LRSCLIP achieves state-of-the-art performance. These results validate the dual advantages of fine-grained semantic understanding and global feature matching in LRSCLIP. This work provides a new benchmark model and data support for remote sensing multimodal learning. The related code has been open source and is available at this https URL.

Title: Efficient Adversarial Detection Frameworks for Vehicle-to-Microgrid Services in Edge Computing

Authors: Ahmed Omara, Burak Kantarci
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.19318
Pdf URL: https://arxiv.org/pdf/2503.19318
Copy Paste: [[2503.19318]] Efficient Adversarial Detection Frameworks for Vehicle-to-Microgrid Services in Edge Computing(https://arxiv.org/abs/2503.19318)
Keywords: attack, generative
Abstract: As Artificial Intelligence (AI) becomes increasingly integrated into microgrid control systems, the risk of malicious actors exploiting vulnerabilities in Machine Learning (ML) algorithms to disrupt power generation and distribution grows. Detection models to identify adversarial attacks need to meet the constraints of edge environments, where computational power and memory are often limited. To address this issue, we propose a novel strategy that optimizes detection models for Vehicle-to-Microgrid (V2M) edge environments without compromising performance against inference and evasion attacks. Our approach integrates model design and compression into a unified process and results in a highly compact detection model that maintains high accuracy. We evaluated our method against four benchmark evasion attacks-Fast Gradient Sign Method (FGSM), Basic Iterative Method (BIM), Carlini & Wagner method (C&W) and Conditional Generative Adversarial Network (CGAN) method-and two knowledge-based attacks, white-box and gray-box. Our optimized model reduces memory usage from 20MB to 1.3MB, inference time from 3.2 seconds to 0.9 seconds, and GPU utilization from 5% to 2.68%.

Title: How to optimize K-means?

Authors: Qi Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.19324
Pdf URL: https://arxiv.org/pdf/2503.19324
Copy Paste: [[2503.19324]] How to optimize K-means?(https://arxiv.org/abs/2503.19324)
Keywords: robust
Abstract: Center-based clustering algorithms (e.g., K-means) are popular for clustering tasks, but they usually struggle to achieve high accuracy on complex datasets. We believe the main reason is that traditional center-based clustering algorithms identify only one clustering center in each cluster. Once the distribution of the dataset is complex, a single clustering center cannot strongly represent distant objects within the cluster. How to optimize the existing center-based clustering algorithms will be valuable research. In this paper, we propose a general optimization method called ECAC, and it can optimize different center-based clustering algorithms. ECAC is independent of the clustering principle and is embedded as a component between the center process and the category assignment process of center-based clustering algorithms. Specifically, ECAC identifies several extended-centers for each clustering center. The extended-centers will act as relays to expand the representative capability of the clustering center in the complex cluster, thus improving the accuracy of center-based clustering algorithms. We conducted numerous experiments to verify the robustness and effectiveness of ECAC. ECAC is robust to diverse datasets and diverse clustering centers. After ECAC optimization, the accuracy (NMI as well as RI) of center-based clustering algorithms improves by an average of 33.4% and 64.1%, respectively, and even K-means accurately identifies complex-shaped clusters.

Title: Long-Context Autoregressive Video Modeling with Next-Frame Prediction

Authors: Yuchao Gu, Weijia Mao, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19325
Pdf URL: https://arxiv.org/pdf/2503.19325
Copy Paste: [[2503.19325]] Long-Context Autoregressive Video Modeling with Next-Frame Prediction(https://arxiv.org/abs/2503.19325)
Keywords: diffusion, transformer
Abstract: Long-context autoregressive modeling has significantly advanced language generation, but video generation still struggles to fully utilize extended temporal contexts. To investigate long-context video modeling, we introduce Frame AutoRegressive (FAR), a strong baseline for video autoregressive modeling. Just as language models learn causal dependencies between tokens (i.e., Token AR), FAR models temporal causal dependencies between continuous frames, achieving better convergence than Token AR and video diffusion transformers. Building on FAR, we observe that long-context vision modeling faces challenges due to visual redundancy. Existing RoPE lacks effective temporal decay for remote context and fails to extrapolate well to long video sequences. Additionally, training on long videos is computationally expensive, as vision tokens grow much faster than language tokens. To tackle these issues, we propose balancing locality and long-range dependency. We introduce FlexRoPE, an test-time technique that adds flexible temporal decay to RoPE, enabling extrapolation to 16x longer vision contexts. Furthermore, we propose long short-term context modeling, where a high-resolution short-term context window ensures fine-grained temporal consistency, while an unlimited long-term context window encodes long-range information using fewer tokens. With this approach, we can train on long video sequences with a manageable token context length. We demonstrate that FAR achieves state-of-the-art performance in both short- and long-video generation, providing a simple yet effective baseline for video autoregressive modeling.

Title: ChA-MAEViT: Unifying Channel-Aware Masked Autoencoders and Multi-Channel Vision Transformers for Improved Cross-Channel Learning

Authors: Chau Pham, Juan C. Caicedo, Bryan A. Plummer
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.19331
Pdf URL: https://arxiv.org/pdf/2503.19331
Copy Paste: [[2503.19331]] ChA-MAEViT: Unifying Channel-Aware Masked Autoencoders and Multi-Channel Vision Transformers for Improved Cross-Channel Learning(https://arxiv.org/abs/2503.19331)
Keywords: robust, transformer
Abstract: Prior work using Masked Autoencoders (MAEs) typically relies on random patch masking based on the assumption that images have significant redundancies across different channels, allowing for the reconstruction of masked content using cross-channel correlations. However, this assumption does not hold in Multi-Channel Imaging (MCI), where channels may provide complementary information with minimal feature overlap. Thus, these MAEs primarily learn local structures within individual channels from patch reconstruction, failing to fully leverage cross-channel interactions and limiting their MCI effectiveness. In this paper, we present ChA-MAEViT, an MAE-based method that enhances feature learning across MCI channels via four key strategies: (1) dynamic channel-patch masking, which compels the model to reconstruct missing channels in addition to masked patches, thereby enhancing cross-channel dependencies and improving robustness to varying channel configurations; (2) memory tokens, which serve as long-term memory aids to promote information sharing across channels, addressing the challenges of reconstructing structurally diverse channels; (3) hybrid token fusion module, which merges fine-grained patch tokens with a global class token to capture richer representations; and (4) Channel-Aware Decoder, a lightweight decoder utilizes channel tokens to effectively reconstruct image patches. Experiments on satellite and microscopy datasets, CHAMMI, JUMP-CP, and So2Sat, show that ChA-MAEViT significantly outperforms state-of-the-art MCI-ViTs by 3.0-21.5%, highlighting the importance of cross-channel interactions in MCI.

Title: Membership Inference Attacks on Large-Scale Models: A Survey

Authors: Hengyu Wu, Yang Cao
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2503.19338
Pdf URL: https://arxiv.org/pdf/2503.19338
Copy Paste: [[2503.19338]] Membership Inference Attacks on Large-Scale Models: A Survey(https://arxiv.org/abs/2503.19338)
Keywords: privacy, attack, membership infer, large language model
Abstract: The adoption of the Large Language Model (LLM) has accelerated dramatically since the ChatGPT from OpenAI went online in November 2022. Recent advances in Large Multimodal Models (LMMs), which process diverse data types and enable interaction through various channels, have expanded beyond the text-to-text limitations of early LLMs, attracting significant and concurrent attention from both researchers and industry. While LLMs and LMMs are starting to spread widely, concerns about their privacy risks are increasing as well. Membership Inference Attacks (MIAs), techniques used to determine whether a particular data point was part of a model's training set, serve as a key metric for assessing the privacy vulnerabilities of machine learning models. Hu et al. show that various machine learning algorithms are vulnerable to MIA. Despite extensive studies on MIAs in traditional models, there remains a lack of systematic surveys addressing their effectiveness and implications in modern large-scale models like LLMs and LMMs. In this paper, we systematically reviewed recent studies of MIA against LLMs and LMMs. We analyzed and categorized each attack based on their methodology and scenario and discussed the limitations in existing research. Additionally, we examine privacy concerns associated with the fine-tuning process. Finally, we provided some suggestions for future research in this direction.

Title: Efficient IoT Intrusion Detection with an Improved Attention-Based CNN-BiLSTM Architecture

Authors: Amna Naeem, Muazzam A. Khan, Nada Alasbali, Jawad Ahmad, Aizaz Ahmad Khattak, Muhammad Shahbaz Khan
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19339
Pdf URL: https://arxiv.org/pdf/2503.19339
Copy Paste: [[2503.19339]] Efficient IoT Intrusion Detection with an Improved Attention-Based CNN-BiLSTM Architecture(https://arxiv.org/abs/2503.19339)
Keywords: security, defense, attack, extraction
Abstract: The ever-increasing security vulnerabilities in the Internet-of-Things (IoT) systems require improved threat detection approaches. This paper presents a compact and efficient approach to detect botnet attacks by employing an integrated approach that consists of traffic pattern analysis, temporal support learning, and focused feature extraction. The proposed attention-based model benefits from a hybrid CNN-BiLSTM architecture and achieves 99% classification accuracy in detecting botnet attacks utilizing the N-BaIoT dataset, while maintaining high precision and recall across various scenarios. The proposed model's performance is further validated by key parameters, such as Mathews Correlation Coefficient and Cohen's kappa Correlation Coefficient. The close-to-ideal results for these parameters demonstrate the proposed model's ability to detect botnet attacks accurately and efficiently in practical settings and on unseen data. The proposed model proved to be a powerful defense mechanism for IoT networks to face emerging security challenges.

Title: BADGR: Bundle Adjustment Diffusion Conditioned by GRadients for Wide-Baseline Floor Plan Reconstruction

Authors: Yuguang Li, Ivaylo Boyadzhiev, Zixuan Liu, Linda Shapiro, Alex Colburn
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19340
Pdf URL: https://arxiv.org/pdf/2503.19340
Copy Paste: [[2503.19340]] BADGR: Bundle Adjustment Diffusion Conditioned by GRadients for Wide-Baseline Floor Plan Reconstruction(https://arxiv.org/abs/2503.19340)
Keywords: robust, diffusion
Abstract: Reconstructing precise camera poses and floor plan layouts from wide-baseline RGB panoramas is a difficult and unsolved problem. We introduce BADGR, a novel diffusion model that jointly performs reconstruction and bundle adjustment (BA) to refine poses and layouts from a coarse state, using 1D floor boundary predictions from dozens of images of varying input densities. Unlike a guided diffusion model, BADGR is conditioned on dense per-entity outputs from a single-step Levenberg Marquardt (LM) optimizer and is trained to predict camera and wall positions while minimizing reprojection errors for view-consistency. The objective of layout generation from denoising diffusion process complements BA optimization by providing additional learned layout-structural constraints on top of the co-visible features across images. These constraints help BADGR to make plausible guesses on spatial relations which help constrain pose graph, such as wall adjacency, collinearity, and learn to mitigate errors from dense boundary observations with global contexts. BADGR trains exclusively on 2D floor plans, simplifying data acquisition, enabling robust augmentation, and supporting variety of input densities. Our experiments and analysis validate our method, which significantly outperforms the state-of-the-art pose and floor plan layout reconstruction with different input densities.

Title: Stop Walking in Circles! Bailing Out Early in Projected Gradient Descent

Authors: Philip Doldo, Derek Everett, Amol Khanna, Andre T Nguyen, Edward Raff
Subjects: cs.CV, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.19347
Pdf URL: https://arxiv.org/pdf/2503.19347
Copy Paste: [[2503.19347]] Stop Walking in Circles! Bailing Out Early in Projected Gradient Descent(https://arxiv.org/abs/2503.19347)
Keywords: attack, robust
Abstract: Projected Gradient Descent (PGD) under the $L_\infty$ ball has become one of the defacto methods used in adversarial robustness evaluation for computer vision (CV) due to its reliability and efficacy, making a strong and easy-to-implement iterative baseline. However, PGD is computationally demanding to apply, especially when using thousands of iterations is the current best-practice recommendation to generate an adversarial example for a single image. In this work, we introduce a simple novel method for early termination of PGD based on cycle detection by exploiting the geometry of how PGD is implemented in practice and show that it can produce large speedup factors while providing the \emph{exact} same estimate of model robustness as standard PGD. This method substantially speeds up PGD without sacrificing any attack strength, enabling evaluations of robustness that were previously computationally intractable.

Title: QUAD: Quantization and Parameter-Efficient Tuning of LLM with Activation Decomposition

Authors: Yuxuan Hu, Xiaodong Chen, Cuiping Li, Hong Chen, Jing Zhang
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.19353
Pdf URL: https://arxiv.org/pdf/2503.19353
Copy Paste: [[2503.19353]] QUAD: Quantization and Parameter-Efficient Tuning of LLM with Activation Decomposition(https://arxiv.org/abs/2503.19353)
Keywords: large language model
Abstract: Large Language Models (LLMs) excel in diverse applications but suffer inefficiency due to massive scale. While quantization reduces computational costs, existing methods degrade accuracy in medium-sized LLMs (e.g., Llama-3-8B) due to activation outliers. To address this, we propose QUAD (Quantization with Activation Decomposition), a framework leveraging Singular Value Decomposition (SVD) to suppress activation outliers for effective 4-bit quantization. QUAD estimates activation singular vectors offline using calibration data to construct an orthogonal transformation matrix P, shifting outliers to additional dimensions in full precision while quantizing rest components to 4-bit. Additionally, QUAD enables parameter-efficient fine-tuning via adaptable full-precision outlier weights, narrowing the accuracy gap between quantized and full-precision models. Experiments demonstrate that QUAD achieves 94% ~ 96% accuracy under W4A4 quantization and 98% accuracy with W4A4/A8 and parameter-efficient fine-tuning for Llama-3 and Qwen-2.5 models. Our code is available at \href{this https URL}{repository}.

Title: Data-driven Mesoscale Weather Forecasting Combining Swin-Unet and Diffusion Models

Authors: Yuta Hirabayashi, Daisuke Matsuoka
Subjects: cs.LG, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2503.19354
Pdf URL: https://arxiv.org/pdf/2503.19354
Copy Paste: [[2503.19354]] Data-driven Mesoscale Weather Forecasting Combining Swin-Unet and Diffusion Models(https://arxiv.org/abs/2503.19354)
Keywords: diffusion
Abstract: Data-driven weather prediction models exhibit promising performance and advance continuously. In particular, diffusion models represent fine-scale details without spatial smoothing, which is crucial for mesoscale predictions, such as heavy rainfall forecasting. However, the applications of diffusion models to mesoscale prediction remain limited. To address this gap, this study proposes an architecture that combines a diffusion model with Swin-Unet as a deterministic model, achieving mesoscale predictions while maintaining flexibility. The proposed architecture trains the two models independently, allowing the diffusion model to remain unchanged when the deterministic model is updated. Comparisons using the Fractions Skill Score and power spectral analysis demonstrate that incorporating the diffusion model leads to improved accuracy compared to predictions without it. These findings underscore the potential of the proposed architecture to enhance mesoscale predictions, particularly for strong rainfall events, while maintaining flexibility.

Title: ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models

Authors: Dohwan Ko, Sihyeon Kim, Yumin Suh, Vijay Kumar B.G, Minseo Yoon, Manmohan Chandraker, Hyunwoo J. Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19355
Pdf URL: https://arxiv.org/pdf/2503.19355
Copy Paste: [[2503.19355]] ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models(https://arxiv.org/abs/2503.19355)
Keywords: robust
Abstract: Spatio-temporal reasoning is essential in understanding real-world environments in various fields, eg, autonomous driving and sports analytics. Recent advances have improved the spatial reasoning ability of Vision-Language Models (VLMs) by introducing large-scale data, but these models still struggle to analyze kinematic elements like traveled distance and speed of moving objects. To bridge this gap, we construct a spatio-temporal reasoning dataset and benchmark involving kinematic instruction tuning, referred to as STKit and STKit-Bench. They consist of real-world videos with 3D annotations, detailing object motion dynamics: traveled distance, speed, movement direction, inter-object distance comparisons, and relative movement direction. To further scale such data construction to videos without 3D labels, we propose an automatic pipeline to generate pseudo-labels using 4D reconstruction in real-world scale. With our kinematic instruction tuning data for spatio-temporal reasoning, we present ST-VLM, a VLM enhanced for spatio-temporal reasoning, which exhibits outstanding performance on STKit-Bench. Furthermore, we show that ST-VLM generalizes robustly across diverse domains and tasks, outperforming baselines on other spatio-temporal benchmarks (eg, ActivityNet, TVQA+). Finally, by integrating learned spatio-temporal reasoning with existing abilities, ST-VLM enables complex multi-step reasoning. Project page: this https URL.

Title: Correcting Deviations from Normality: A Reformulated Diffusion Model for Multi-Class Unsupervised Anomaly Detection

Authors: Farzad Beizaee, Gregory A. Lodygensky, Christian Desrosiers, Jose Dolz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19357
Pdf URL: https://arxiv.org/pdf/2503.19357
Copy Paste: [[2503.19357]] Correcting Deviations from Normality: A Reformulated Diffusion Model for Multi-Class Unsupervised Anomaly Detection(https://arxiv.org/abs/2503.19357)
Keywords: diffusion
Abstract: Recent advances in diffusion models have spurred research into their application for Reconstruction-based unsupervised anomaly detection. However, these methods may struggle with maintaining structural integrity and recovering the anomaly-free content of abnormal regions, especially in multi-class scenarios. Furthermore, diffusion models are inherently designed to generate images from pure noise and struggle to selectively alter anomalous regions of an image while preserving normal ones. This leads to potential degradation of normal regions during reconstruction, hampering the effectiveness of anomaly detection. This paper introduces a reformulation of the standard diffusion model geared toward selective region alteration, allowing the accurate identification of anomalies. By modeling anomalies as noise in the latent space, our proposed \textbf{Deviation correction diffusion} (\Ours) model preserves the normal regions and encourages transformations exclusively on anomalous areas. This selective approach enhances the reconstruction quality, facilitating effective unsupervised detection and localization of anomaly regions. Comprehensive evaluations demonstrate the superiority of our method in accurately identifying and localizing anomalies in complex images, with pixel-level AUPRC improvements of 11-14\% over state-of-the-art models on well known anomaly detection datasets. The code is available at this https URL

Title: From Sparse to Dense: Camera Relocalization with Scene-Specific Detector from Feature Gaussian Splatting

Authors: Zhiwei Huang, Hailin Yu, Yichun Shentu, Jin Yuan, Guofeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19358
Pdf URL: https://arxiv.org/pdf/2503.19358
Copy Paste: [[2503.19358]] From Sparse to Dense: Camera Relocalization with Scene-Specific Detector from Feature Gaussian Splatting(https://arxiv.org/abs/2503.19358)
Keywords: robust
Abstract: This paper presents a novel camera relocalization method, STDLoc, which leverages Feature Gaussian as scene representation. STDLoc is a full relocalization pipeline that can achieve accurate relocalization without relying on any pose prior. Unlike previous coarse-to-fine localization methods that require image retrieval first and then feature matching, we propose a novel sparse-to-dense localization paradigm. Based on this scene representation, we introduce a novel matching-oriented Gaussian sampling strategy and a scene-specific detector to achieve efficient and robust initial pose estimation. Furthermore, based on the initial localization results, we align the query feature map to the Gaussian feature field by dense feature matching to enable accurate localization. The experiments on indoor and outdoor datasets show that STDLoc outperforms current state-of-the-art localization methods in terms of localization accuracy and recall.

Title: Show and Segment: Universal Medical Image Segmentation via In-Context Learning

Authors: Yunhe Gao, Di Liu, Zhuowei Li, Yunsheng Li, Dongdong Chen, Mu Zhou, Dimitris N. Metaxas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19359
Pdf URL: https://arxiv.org/pdf/2503.19359
Copy Paste: [[2503.19359]] Show and Segment: Universal Medical Image Segmentation via In-Context Learning(https://arxiv.org/abs/2503.19359)
Keywords: segmentation
Abstract: Medical image segmentation remains challenging due to the vast diversity of anatomical structures, imaging modalities, and segmentation tasks. While deep learning has made significant advances, current approaches struggle to generalize as they require task-specific training or fine-tuning on unseen classes. We present Iris, a novel In-context Reference Image guided Segmentation framework that enables flexible adaptation to novel tasks through the use of reference examples without fine-tuning. At its core, Iris features a lightweight context task encoding module that distills task-specific information from reference context image-label pairs. This rich context embedding information is used to guide the segmentation of target objects. By decoupling task encoding from inference, Iris supports diverse strategies from one-shot inference and context example ensemble to object-level context example retrieval and in-context tuning. Through comprehensive evaluation across twelve datasets, we demonstrate that Iris performs strongly compared to task-specific models on in-distribution tasks. On seven held-out datasets, Iris shows superior generalization to out-of-distribution data and unseen classes. Further, Iris's task encoding module can automatically discover anatomical relationships across datasets and modalities, offering insights into medical objects without explicit anatomical supervision.

Title: ImageSet2Text: Describing Sets of Images through Text

Authors: Piera Riccio, Francesco Galati, Kajetan Schweighofer, Noa Garcia, Nuria Oliver
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19361
Pdf URL: https://arxiv.org/pdf/2503.19361
Copy Paste: [[2503.19361]] ImageSet2Text: Describing Sets of Images through Text(https://arxiv.org/abs/2503.19361)
Keywords: interpretability
Abstract: We introduce ImageSet2Text, a novel approach that leverages vision-language foundation models to automatically create natural language descriptions of image sets. Inspired by concept bottleneck models (CBMs) and based on visual-question answering (VQA) chains, ImageSet2Text iteratively extracts key concepts from image subsets, encodes them into a structured graph, and refines insights using an external knowledge graph and CLIP-based validation. This iterative process enhances interpretability and enables accurate and detailed set-level summarization. Through extensive experiments, we evaluate ImageSet2Text's descriptions on accuracy, completeness, readability and overall quality, benchmarking it against existing vision-language models and introducing new datasets for large-scale group image captioning.

Title: VGAT: A Cancer Survival Analysis Framework Transitioning from Generative Visual Question Answering to Genomic Reconstruction

Authors: Zizhi Chen, Minghao Han, Xukun Zhang, Shuwei Ma, Tao Liu, Xing Wei, Lihua Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19367
Pdf URL: https://arxiv.org/pdf/2503.19367
Copy Paste: [[2503.19367]] VGAT: A Cancer Survival Analysis Framework Transitioning from Generative Visual Question Answering to Genomic Reconstruction(https://arxiv.org/abs/2503.19367)
Keywords: extraction, transformer, generative
Abstract: Multimodal learning combining pathology images and genomic sequences enhances cancer survival analysis but faces clinical implementation barriers due to limited access to genomic sequencing in under-resourced regions. To enable survival prediction using only whole-slide images (WSI), we propose the Visual-Genomic Answering-Guided Transformer (VGAT), a framework integrating Visual Question Answering (VQA) techniques for genomic modality reconstruction. By adapting VQA's text feature extraction approach, we derive stable genomic representations that circumvent dimensionality challenges in raw genomic data. Simultaneously, a cluster-based visual prompt module selectively enhances discriminative WSI patches, addressing noise from unfiltered image regions. Evaluated across five TCGA datasets, VGAT outperforms existing WSI-only methods, demonstrating the viability of genomic-informed inference without sequencing. This approach bridges multimodal research and clinical feasibility in resource-constrained settings. The code link is this https URL.

Title: EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models

Authors: Yufei Cai, Hu Han, Yuxiang Wei, Shiguang Shan, Xilin Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19369
Pdf URL: https://arxiv.org/pdf/2503.19369
Copy Paste: [[2503.19369]] EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models(https://arxiv.org/abs/2503.19369)
Keywords: diffusion, generative
Abstract: The progress on generative models has led to significant advances on text-to-video (T2V) generation, yet the motion controllability of generated videos remains limited. Existing motion transfer methods explored the motion representations of reference videos to guide generation. Nevertheless, these methods typically rely on sample-specific optimization strategy, resulting in high computational burdens. In this paper, we propose \textbf{EfficientMT}, a novel and efficient end-to-end framework for video motion transfer. By leveraging a small set of synthetic paired motion transfer samples, EfficientMT effectively adapts a pretrained T2V model into a general motion transfer framework that can accurately capture and reproduce diverse motion patterns. Specifically, we repurpose the backbone of the T2V model to extract temporal information from reference videos, and further propose a scaler module to distill motion-related information. Subsequently, we introduce a temporal integration mechanism that seamlessly incorporates reference motion features into the video generation process. After training on our self-collected synthetic paired samples, EfficientMT enables general video motion transfer without requiring test-time optimization. Extensive experiments demonstrate that our EfficientMT outperforms existing methods in efficiency while maintaining flexible motion controllability. Our code will be available this https URL.

Title: A Benign Activity Extraction Method for Malignant Activity Identification using Data Provenance

Authors: Taishin Saito
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.19370
Pdf URL: https://arxiv.org/pdf/2503.19370
Copy Paste: [[2503.19370]] A Benign Activity Extraction Method for Malignant Activity Identification using Data Provenance(https://arxiv.org/abs/2503.19370)
Keywords: attack, extraction
Abstract: In order to understand the overall picture of cyber attacks and to identify the source of cyber attacks, a method to identify malicious activities by automatically creating a graph that ties together the dependencies of a series of related events by tracking Data Provenance has been developed. However, the problem of dependency explosion, in which a large number of normal computer system operations such as operations by authorized users are included in the dependencies, results in a huge generated graph, making it difficult to identify malicious activities. In this paper, we propose a method to reduce the search space for malicious activities by extracting and removing frequently occurring benign activities through natural language processing of log data and analysis of activities in the computer system using similarity judgments. In the evaluation experiment, we used the DARPA TC Dateset, a large-scale public dataset, to evaluate the effectiveness of the proposed method on the dependency explosion problem. In addition, we showed that about 6.8 to 39% of the activities in a computer system could be defined as patterns of benign activities. In addition, we showed that removing benign activities extracted from a portion of the log data (approximately 1.4% to 3.2% in size) can significantly reduce the search space (up to approximately 52%) in large data sets.

Title: DeClotH: Decomposable 3D Cloth and Human Body Reconstruction from a Single Image

Authors: Hyeongjin Nam, Donghwan Kim, Jeongtaek Oh, Kyoung Mu Lee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19373
Pdf URL: https://arxiv.org/pdf/2503.19373
Copy Paste: [[2503.19373]] DeClotH: Decomposable 3D Cloth and Human Body Reconstruction from a Single Image(https://arxiv.org/abs/2503.19373)
Keywords: diffusion
Abstract: Most existing methods of 3D clothed human reconstruction from a single image treat the clothed human as a single object without distinguishing between cloth and human body. In this regard, we present DeClotH, which separately reconstructs 3D cloth and human body from a single image. This task remains largely unexplored due to the extreme occlusion between cloth and the human body, making it challenging to infer accurate geometries and textures. Moreover, while recent 3D human reconstruction methods have achieved impressive results using text-to-image diffusion models, directly applying such an approach to this problem often leads to incorrect guidance, particularly in reconstructing 3D cloth. To address these challenges, we propose two core designs in our framework. First, to alleviate the occlusion issue, we leverage 3D template models of cloth and human body as regularizations, which provide strong geometric priors to prevent erroneous reconstruction by the occlusion. Second, we introduce a cloth diffusion model specifically designed to provide contextual information about cloth appearance, thereby enhancing the reconstruction of 3D cloth. Qualitative and quantitative experiments demonstrate that our proposed approach is highly effective in reconstructing both 3D cloth and the human body. More qualitative results are provided at this https URL.

Title: Interpretable Generative Models through Post-hoc Concept Bottlenecks

Authors: Akshay Kulkarni, Ge Yan, Chung-En Sun, Tuomas Oikarinen, Tsui-Wei Weng
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.19377
Pdf URL: https://arxiv.org/pdf/2503.19377
Copy Paste: [[2503.19377]] Interpretable Generative Models through Post-hoc Concept Bottlenecks(https://arxiv.org/abs/2503.19377)
Keywords: interpretability, diffusion, generative
Abstract: Concept bottleneck models (CBM) aim to produce inherently interpretable models that rely on human-understandable concepts for their predictions. However, existing approaches to design interpretable generative models based on CBMs are not yet efficient and scalable, as they require expensive generative model training from scratch as well as real images with labor-intensive concept supervision. To address these challenges, we present two novel and low-cost methods to build interpretable generative models through post-hoc techniques and we name our approaches: concept-bottleneck autoencoder (CB-AE) and concept controller (CC). Our proposed approaches enable efficient and scalable training without the need of real data and require only minimal to no concept supervision. Additionally, our methods generalize across modern generative model families including generative adversarial networks and diffusion models. We demonstrate the superior interpretability and steerability of our methods on numerous standard datasets like CelebA, CelebA-HQ, and CUB with large improvements (average ~25%) over the prior work, while being 4-15x faster to train. Finally, a large-scale user study is performed to validate the interpretability and steerability of our methods.

Title: Social Network User Profiling for Anomaly Detection Based on Graph Neural Networks

Authors: Yiwei Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.19380
Pdf URL: https://arxiv.org/pdf/2503.19380
Copy Paste: [[2503.19380]] Social Network User Profiling for Anomaly Detection Based on Graph Neural Networks(https://arxiv.org/abs/2503.19380)
Keywords: security, privacy, protect, robust, federate, transformer
Abstract: This study proposes a risk pricing anomaly detection method for social network user portraits based on graph neural networks (GNNs), aiming to improve the ability to identify abnormal users in social network environments. In view of the limitations of traditional methods in social network data modeling, this paper combines graph autoencoders (GAEs) and graph attention networks (GATs) to achieve accurate detection of abnormal users through dynamic aggregation of neighbor features and reconstruction error evaluation. The Facebook Page-Page Network dataset is used in the experiment and compared with VAE, GNN, Transformer and GAE. The results show that the proposed method achieves the best performance in AUC, F1-score, Precision and Recall, verifying its effectiveness. In addition, this paper explores the computational efficiency of the model in large-scale data and looks forward to combining self-supervised learning, federated learning, and other technologies in the future to improve the robustness and privacy protection of risk assessment. The research results can provide efficient anomaly detection solutions for financial risk control, social security management, and other fields.

Title: MVPortrait: Text-Guided Motion and Emotion Control for Multi-view Vivid Portrait Animation

Authors: Yukang Lin, Hokit Fung, Jianjin Xu, Zeping Ren, Adela S.M. Lau, Guosheng Yin, Xiu Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19383
Pdf URL: https://arxiv.org/pdf/2503.19383
Copy Paste: [[2503.19383]] MVPortrait: Text-Guided Motion and Emotion Control for Multi-view Vivid Portrait Animation(https://arxiv.org/abs/2503.19383)
Keywords: diffusion
Abstract: Recent portrait animation methods have made significant strides in generating realistic lip synchronization. However, they often lack explicit control over head movements and facial expressions, and cannot produce videos from multiple viewpoints, resulting in less controllable and expressive animations. Moreover, text-guided portrait animation remains underexplored, despite its user-friendly nature. We present a novel two-stage text-guided framework, MVPortrait (Multi-view Vivid Portrait), to generate expressive multi-view portrait animations that faithfully capture the described motion and emotion. MVPortrait is the first to introduce FLAME as an intermediate representation, effectively embedding facial movements, expressions, and view transformations within its parameter space. In the first stage, we separately train the FLAME motion and emotion diffusion models based on text input. In the second stage, we train a multi-view video generation model conditioned on a reference portrait image and multi-view FLAME rendering sequences from the first stage. Experimental results exhibit that MVPortrait outperforms existing methods in terms of motion and emotion control, as well as view consistency. Furthermore, by leveraging FLAME as a bridge, MVPortrait becomes the first controllable portrait animation framework that is compatible with text, speech, and video as driving signals.

Title: Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing

Authors: Jaihoon Kim, Taehoon Yoon, Jisung Hwang, Minhyuk Sung
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.19385
Pdf URL: https://arxiv.org/pdf/2503.19385
Copy Paste: [[2503.19385]] Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing(https://arxiv.org/abs/2503.19385)
Keywords: diffusion, generative
Abstract: We propose an inference-time scaling approach for pretrained flow models. Recently, inference-time scaling has gained significant attention in LLMs and diffusion models, improving sample quality or better aligning outputs with user preferences by leveraging additional computation. For diffusion models, particle sampling has allowed more efficient scaling due to the stochasticity at intermediate denoising steps. On the contrary, while flow models have gained popularity as an alternative to diffusion models--offering faster generation and high-quality outputs in state-of-the-art image and video generative models--efficient inference-time scaling methods used for diffusion models cannot be directly applied due to their deterministic generative process. To enable efficient inference-time scaling for flow models, we propose three key ideas: 1) SDE-based generation, enabling particle sampling in flow models, 2) Interpolant conversion, broadening the search space and enhancing sample diversity, and 3) Rollover Budget Forcing (RBF), an adaptive allocation of computational resources across timesteps to maximize budget utilization. Our experiments show that SDE-based generation, particularly variance-preserving (VP) interpolant-based generation, improves the performance of particle sampling methods for inference-time scaling in flow models. Additionally, we demonstrate that RBF with VP-SDE achieves the best performance, outperforming all previous inference-time scaling approaches.

Title: Exploring Textual Semantics Diversity for Image Transmission in Semantic Communication Systems using Visual Language Model

Authors: Peishan Huang, Dong Li
Subjects: cs.CV, eess.SP
Abstract URL: https://arxiv.org/abs/2503.19386
Pdf URL: https://arxiv.org/pdf/2503.19386
Copy Paste: [[2503.19386]] Exploring Textual Semantics Diversity for Image Transmission in Semantic Communication Systems using Visual Language Model(https://arxiv.org/abs/2503.19386)
Keywords: segmentation
Abstract: In recent years, the rapid development of machine learning has brought reforms and challenges to traditional communication systems. Semantic communication has appeared as an effective strategy to effectively extract relevant semantic signals semantic segmentation labels and image features for image transmission. However, the insufficient number of extracted semantic features of images will potentially result in a low reconstruction accuracy, which hinders the practical applications and still remains challenging for solving. In order to fill this gap, this letter proposes a multi-text transmission semantic communication (Multi-SC) system, which uses the visual language model (VLM) to assist in the transmission of image semantic signals. Unlike previous image transmission semantic communication systems, the proposed system divides the image into multiple blocks and extracts multiple text information from the image using a modified large language and visual assistant (LLaVA), and combines semantic segmentation tags with semantic text for image recovery. Simulation results show that the proposed text semantics diversity scheme can significantly improve the reconstruction accuracy compared with related works.

Title: QUIC-Fuzz: An Effective Greybox Fuzzer For The QUIC Protocol

Authors: Kian Kai Ang, Damith C. Ranasinghe
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2503.19402
Pdf URL: https://arxiv.org/pdf/2503.19402
Copy Paste: [[2503.19402]] QUIC-Fuzz: An Effective Greybox Fuzzer For The QUIC Protocol(https://arxiv.org/abs/2503.19402)
Keywords: secure, security, protect, attack
Abstract: Network applications are routinely under attack. We consider the problem of developing an effective and efficient fuzzer for the recently ratified QUIC network protocol to uncover security vulnerabilities. QUIC offers a unified transport layer for low latency, reliable transport streams that is inherently secure, ultimately representing a complex protocol design characterised by new features and capabilities for the Internet. Fuzzing a secure transport layer protocol is not trivial. The interactive, strict, rule-based, asynchronous nature of communications with a target, the stateful nature of interactions, security mechanisms to protect communications (such as integrity checks and encryption), and inherent overheads (such as target initialisation) challenge generic network protocol fuzzers. We discuss and address the challenges pertinent to fuzzing transport layer protocols (like QUIC), developing mechanisms that enable fast, effective fuzz testing of QUIC implementations to build a prototype grey-box mutation-based fuzzer; QUIC-Fuzz. We test 6, well-maintained server-side implementations, including from Google and Alibaba with QUIC-Fuzz. The results demonstrate the fuzzer is both highly effective and generalisable. Our testing uncovered 10 new security vulnerabilities, precipitating 2 CVE assignments thus far. In code coverage, QUIC-Fuzz outperforms other existing state-of-the-art network protocol fuzzers (Fuzztruction-Net, ChatAFL, and ALFNet) with up to an 84% increase in code coverage where QUIC-Fuzz outperformed statistically significantly across all targets and with a majority of bugs only discoverable by QUIC-Fuzz. We open-source QUIC-Fuzz on GitHub.

Title: Multi-modal 3D Pose and Shape Estimation with Computed Tomography

Authors: Mingxiao Tu, Hoijoon Jung, Alireza Moghadam, Jineel Raythatha, Lachlan Allan, Jeremy Hsu, Andre Kyme, Jinman Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19405
Pdf URL: https://arxiv.org/pdf/2503.19405
Copy Paste: [[2503.19405]] Multi-modal 3D Pose and Shape Estimation with Computed Tomography(https://arxiv.org/abs/2503.19405)
Keywords: robust
Abstract: In perioperative care, precise in-bed 3D patient pose and shape estimation (PSE) can be vital in optimizing patient positioning in preoperative planning, enabling accurate overlay of medical images for augmented reality-based surgical navigation, and mitigating risks of prolonged immobility during recovery. Conventional PSE methods relying on modalities such as RGB-D, infrared, or pressure maps often struggle with occlusions caused by bedding and complex patient positioning, leading to inaccurate estimation that can affect clinical outcomes. To address these challenges, we present the first multi-modal in-bed patient 3D PSE network that fuses detailed geometric features extracted from routinely acquired computed tomography (CT) scans with depth maps (mPSE-CT). mPSE-CT incorporates a shape estimation module that utilizes probabilistic correspondence alignment, a pose estimation module with a refined neural network, and a final parameters mixing module. This multi-modal network robustly reconstructs occluded body regions and enhances the accuracy of the estimated 3D human mesh model. We validated mPSE-CT using proprietary whole-body rigid phantom and volunteer datasets in clinical scenarios. mPSE-CT outperformed the best-performing prior method by 23% and 49.16% in pose and shape estimation respectively, demonstrating its potential for improving clinical outcomes in challenging perioperative environments.

Title: M$^2$CD: A Unified MultiModal Framework for Optical-SAR Change Detection with Mixture of Experts and Self-Distillation

Authors: Ziyuan Liu, Jiawei Zhang, Wenyu Wang, Yuantao Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19406
Pdf URL: https://arxiv.org/pdf/2503.19406
Copy Paste: [[2503.19406]] M$^2$CD: A Unified MultiModal Framework for Optical-SAR Change Detection with Mixture of Experts and Self-Distillation(https://arxiv.org/abs/2503.19406)
Keywords: transformer
Abstract: Most existing change detection (CD) methods focus on optical images captured at different times, and deep learning (DL) has achieved remarkable success in this domain. However, in extreme scenarios such as disaster response, synthetic aperture radar (SAR), with its active imaging capability, is more suitable for providing post-event data. This introduces new challenges for CD methods, as existing weight-sharing Siamese networks struggle to effectively learn the cross-modal data distribution between optical and SAR images. To address this challenge, we propose a unified MultiModal CD framework, M$^2$CD. We integrate Mixture of Experts (MoE) modules into the backbone to explicitly handle diverse modalities, thereby enhancing the model's ability to learn multimodal data distributions. Additionally, we innovatively propose an Optical-to-SAR guided path (O2SP) and implement self-distillation during training to reduce the feature space discrepancy between different modalities, further alleviating the model's learning burden. We design multiple variants of M$^2$CD based on both CNN and Transformer backbones. Extensive experiments validate the effectiveness of the proposed framework, with the MiT-b1 version of M$^2$CD outperforming all state-of-the-art (SOTA) methods in optical-SAR CD tasks.

Title: DeCAP: Context-Adaptive Prompt Generation for Debiasing Zero-shot Question Answering in Large Language Models

Authors: Suyoung Bae, YunSeok Choi, Jee-Hyong Lee
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2503.19426
Pdf URL: https://arxiv.org/pdf/2503.19426
Copy Paste: [[2503.19426]] DeCAP: Context-Adaptive Prompt Generation for Debiasing Zero-shot Question Answering in Large Language Models(https://arxiv.org/abs/2503.19426)
Keywords: fair, large language model
Abstract: While Large Language Models (LLMs) excel in zero-shot Question Answering (QA), they tend to expose biases in their internal knowledge when faced with socially sensitive questions, leading to a degradation in performance. Existing zero-shot methods are efficient but fail to consider context and prevent bias propagation in the answers. To address this, we propose DeCAP, a method for debiasing LLMs using Context-Adaptive Prompt Generation. DeCAP leverages a Question Ambiguity Detection to take appropriate debiasing actions based on the context and a Neutral Answer Guidance Generation to suppress the LLMs make objective judgments about the context, minimizing the propagation of bias from their internal knowledge. Our various experiments across eight LLMs show that DeCAP achieves state-of-the-art zero-shot debiased QA performance. This demonstrates DeCAP's efficacy in enhancing the fairness and accuracy of LLMs in diverse QA settings.

Title: Quantifying the Ease of Reproducing Training Data in Unconditional Diffusion Models

Authors: Masaya Hasegawa, Koji Yasuda
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.19429
Pdf URL: https://arxiv.org/pdf/2503.19429
Copy Paste: [[2503.19429]] Quantifying the Ease of Reproducing Training Data in Unconditional Diffusion Models(https://arxiv.org/abs/2503.19429)
Keywords: diffusion
Abstract: Diffusion models, which have been advancing rapidly in recent years, may generate samples that closely resemble the training data. This phenomenon, known as memorization, may lead to copyright issues. In this study, we propose a method to quantify the ease of reproducing training data in unconditional diffusion models. The average of a sample population following the Langevin equation in the reverse diffusion process moves according to a first-order ordinary differential equation (ODE). This ODE establishes a 1-to-1 correspondence between images and their noisy counterparts in the latent space. Since the ODE is reversible and the initial noisy images are sampled randomly, the volume of an image's projected area represents the probability of generating those images. We examined the ODE, which projects images to latent space, and succeeded in quantifying the ease of reproducing training data by measuring the volume growth rate in this process. Given the relatively low computational complexity of this method, it allows us to enhance the quality of training data by detecting and modifying the easily memorized training samples.

Title: COB-GS: Clear Object Boundaries in 3DGS Segmentation Based on Boundary-Adaptive Gaussian Splitting

Authors: Jiaxin Zhang, Junjun Jiang, Youyu Chen, Kui Jiang, Xianming Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19443
Pdf URL: https://arxiv.org/pdf/2503.19443
Copy Paste: [[2503.19443]] COB-GS: Clear Object Boundaries in 3DGS Segmentation Based on Boundary-Adaptive Gaussian Splitting(https://arxiv.org/abs/2503.19443)
Keywords: robust, segmentation
Abstract: Accurate object segmentation is crucial for high-quality scene understanding in the 3D vision domain. However, 3D segmentation based on 3D Gaussian Splatting (3DGS) struggles with accurately delineating object boundaries, as Gaussian primitives often span across object edges due to their inherent volume and the lack of semantic guidance during training. In order to tackle these challenges, we introduce Clear Object Boundaries for 3DGS Segmentation (COB-GS), which aims to improve segmentation accuracy by clearly delineating blurry boundaries of interwoven Gaussian primitives within the scene. Unlike existing approaches that remove ambiguous Gaussians and sacrifice visual quality, COB-GS, as a 3DGS refinement method, jointly optimizes semantic and visual information, allowing the two different levels to cooperate with each other effectively. Specifically, for the semantic guidance, we introduce a boundary-adaptive Gaussian splitting technique that leverages semantic gradient statistics to identify and split ambiguous Gaussians, aligning them closely with object boundaries. For the visual optimization, we rectify the degraded suboptimal texture of the 3DGS scene, particularly along the refined boundary structures. Experimental results show that COB-GS substantially improves segmentation accuracy and robustness against inaccurate masks from pre-trained model, yielding clear boundaries while preserving high visual quality. Code is available at this https URL.

Title: Towards Robust Time-of-Flight Depth Denoising with Confidence-Aware Diffusion Model

Authors: Changyong He, Jin Zeng, Jiawei Zhang, Jiajie Guo
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.19448
Pdf URL: https://arxiv.org/pdf/2503.19448
Copy Paste: [[2503.19448]] Towards Robust Time-of-Flight Depth Denoising with Confidence-Aware Diffusion Model(https://arxiv.org/abs/2503.19448)
Keywords: robust, diffusion
Abstract: Time-of-Flight (ToF) sensors efficiently capture scene depth, but the nonlinear depth construction procedure often results in extremely large noise variance or even invalid areas. Recent methods based on deep neural networks (DNNs) achieve enhanced ToF denoising accuracy but tend to struggle when presented with severe noise corruption due to limited prior knowledge of ToF data distribution. In this paper, we propose DepthCAD, a novel ToF denoising approach that ensures global structural smoothness by leveraging the rich prior knowledge in Stable Diffusion and maintains local metric accuracy by steering the diffusion process with confidence guidance. To adopt the pretrained image diffusion model to ToF depth denoising, we apply the diffusion on raw ToF correlation measurements with dynamic range normalization before converting to depth maps. Experimental results validate the state-of-the-art performance of the proposed scheme, and the evaluation on real data further verifies its robustness against real-world ToF noise.

Title: SparseGS-W: Sparse-View 3D Gaussian Splatting in the Wild with Generative Priors

Authors: Yiqing Li, Xuan Wang, Jiawei Wu, Yikun Ma, Zhi Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19452
Pdf URL: https://arxiv.org/pdf/2503.19452
Copy Paste: [[2503.19452]] SparseGS-W: Sparse-View 3D Gaussian Splatting in the Wild with Generative Priors(https://arxiv.org/abs/2503.19452)
Keywords: diffusion, generative
Abstract: Synthesizing novel views of large-scale scenes from unconstrained in-the-wild images is an important but challenging task in computer vision. Existing methods, which optimize per-image appearance and transient occlusion through implicit neural networks from dense training views (approximately 1000 images), struggle to perform effectively under sparse input conditions, resulting in noticeable artifacts. To this end, we propose SparseGS-W, a novel framework based on 3D Gaussian Splatting that enables the reconstruction of complex outdoor scenes and handles occlusions and appearance changes with as few as five training images. We leverage geometric priors and constrained diffusion priors to compensate for the lack of multi-view information from extremely sparse input. Specifically, we propose a plug-and-play Constrained Novel-View Enhancement module to iteratively improve the quality of rendered novel views during the Gaussian optimization process. Furthermore, we propose an Occlusion Handling module, which flexibly removes occlusions utilizing the inherent high-quality inpainting capability of constrained diffusion priors. Both modules are capable of extracting appearance features from any user-provided reference image, enabling flexible modeling of illumination-consistent scenes. Extensive experiments on the PhotoTourism and Tanks and Temples datasets demonstrate that SparseGS-W achieves state-of-the-art performance not only in full-reference metrics, but also in commonly used non-reference metrics such as FID, ClipIQA, and MUSIQ.

Title: Data-centric Federated Graph Learning with Large Language Models

Authors: Bo Yan, Zhongjian Zhang, Huabin Sun, Mengmei Zhang, Yang Cao, Chuan Shi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19455
Pdf URL: https://arxiv.org/pdf/2503.19455
Copy Paste: [[2503.19455]] Data-centric Federated Graph Learning with Large Language Models(https://arxiv.org/abs/2503.19455)
Keywords: privacy, federate, large language model
Abstract: In federated graph learning (FGL), a complete graph is divided into multiple subgraphs stored in each client due to privacy concerns, and all clients jointly train a global graph model by only transmitting model parameters. A pain point of FGL is the heterogeneity problem, where nodes or structures present non-IID properties among clients (e.g., different node label distributions), dramatically undermining the convergence and performance of FGL. To address this, existing efforts focus on design strategies at the model level, i.e., they design models to extract common knowledge to mitigate heterogeneity. However, these model-level strategies fail to fundamentally address the heterogeneity problem as the model needs to be designed from scratch when transferring to other tasks. Motivated by large language models (LLMs) having achieved remarkable success, we aim to utilize LLMs to fully understand and augment local text-attributed graphs, to address data heterogeneity at the data level. In this paper, we propose a general framework LLM4FGL that innovatively decomposes the task of LLM for FGL into two sub-tasks theoretically. Specifically, for each client, it first utilizes the LLM to generate missing neighbors and then infers connections between generated nodes and raw nodes. To improve the quality of generated nodes, we design a novel federated generation-and-reflection mechanism for LLMs, without the need to modify the parameters of the LLM but relying solely on the collective feedback from all clients. After neighbor generation, all the clients utilize a pre-trained edge predictor to infer the missing edges. Furthermore, our framework can seamlessly integrate as a plug-in with existing FGL methods. Experiments on three real-world datasets demonstrate the superiority of our method compared to advanced baselines.

Title: G-DexGrasp: Generalizable Dexterous Grasping Synthesis Via Part-Aware Prior Retrieval and Prior-Assisted Generation

Authors: Juntao Jian, Xiuping Liu, Zixuan Chen, Manyi Li, Jian Liu, Ruizhen Hu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.19457
Pdf URL: https://arxiv.org/pdf/2503.19457
Copy Paste: [[2503.19457]] G-DexGrasp: Generalizable Dexterous Grasping Synthesis Via Part-Aware Prior Retrieval and Prior-Assisted Generation(https://arxiv.org/abs/2503.19457)
Keywords: generative
Abstract: Recent advances in dexterous grasping synthesis have demonstrated significant progress in producing reasonable and plausible grasps for many task purposes. But it remains challenging to generalize to unseen object categories and diverse task instructions. In this paper, we propose G-DexGrasp, a retrieval-augmented generation approach that can produce high-quality dexterous hand configurations for unseen object categories and language-based task instructions. The key is to retrieve generalizable grasping priors, including the fine-grained contact part and the affordance-related distribution of relevant grasping instances, for the following synthesis pipeline. Specifically, the fine-grained contact part and affordance act as generalizable guidance to infer reasonable grasping configurations for unseen objects with a generative model, while the relevant grasping distribution plays as regularization to guarantee the plausibility of synthesized grasps during the subsequent refinement optimization. Our comparison experiments validate the effectiveness of our key designs for generalization and demonstrate the remarkable performance against the existing approaches. Project page: this https URL

Title: AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset

Authors: Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Xihui Liu, Yunhong Wang, Yu Qiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19462
Pdf URL: https://arxiv.org/pdf/2503.19462
Copy Paste: [[2503.19462]] AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset(https://arxiv.org/abs/2503.19462)
Keywords: diffusion
Abstract: Diffusion models have achieved remarkable progress in the field of video generation. However, their iterative denoising nature requires a large number of inference steps to generate a video, which is slow and computationally expensive. In this paper, we begin with a detailed analysis of the challenges present in existing diffusion distillation methods and propose a novel efficient method, namely AccVideo, to reduce the inference steps for accelerating video diffusion models with synthetic dataset. We leverage the pretrained video diffusion model to generate multiple valid denoising trajectories as our synthetic dataset, which eliminates the use of useless data points during distillation. Based on the synthetic dataset, we design a trajectory-based few-step guidance that utilizes key data points from the denoising trajectories to learn the noise-to-video mapping, enabling video generation in fewer steps. Furthermore, since the synthetic dataset captures the data distribution at each diffusion timestep, we introduce an adversarial training strategy to align the output distribution of the student model with that of our synthetic dataset, thereby enhancing the video quality. Extensive experiments demonstrate that our model achieves 8.5x improvements in generation speed compared to the teacher model while maintaining comparable performance. Compared to previous accelerating methods, our approach is capable of generating videos with higher quality and resolution, i.e., 5-seconds, 720x1280, 24fps.

Title: Enhancing Small Language Models for Cross-Lingual Generalized Zero-Shot Classification with Soft Prompt Tuning

Authors: Fred Philippy, Siwen Guo, Cedric Lothritz, Jacques Klein, Tegawendé F. Bissyandé
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19469
Pdf URL: https://arxiv.org/pdf/2503.19469
Copy Paste: [[2503.19469]] Enhancing Small Language Models for Cross-Lingual Generalized Zero-Shot Classification with Soft Prompt Tuning(https://arxiv.org/abs/2503.19469)
Keywords: robust
Abstract: In NLP, Zero-Shot Classification (ZSC) has become essential for enabling models to classify text into categories unseen during training, particularly in low-resource languages and domains where labeled data is scarce. While pretrained language models (PLMs) have shown promise in ZSC, they often rely on large training datasets or external knowledge, limiting their applicability in multilingual and low-resource scenarios. Recent approaches leveraging natural language prompts reduce the dependence on large training datasets but struggle to effectively incorporate available labeled data from related classification tasks, especially when these datasets originate from different languages or distributions. Moreover, existing prompt-based methods typically rely on manually crafted prompts in a specific language, limiting their adaptability and effectiveness in cross-lingual settings. To address these challenges, we introduce RoSPrompt, a lightweight and data-efficient approach for training soft prompts that enhance cross-lingual ZSC while ensuring robust generalization across data distribution shifts. RoSPrompt is designed for small multilingual PLMs, enabling them to leverage high-resource languages to improve performance in low-resource settings without requiring extensive fine-tuning or high computational costs. We evaluate our approach on multiple multilingual PLMs across datasets covering 106 languages, demonstrating strong cross-lingual transfer performance and robust generalization capabilities over unseen classes.

Title: A-MESS: Anchor based Multimodal Embedding with Semantic Synchronization for Multimodal Intent Recognition

Authors: Yaomin Shen, Xiaojian Lin, Wei Fan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19474
Pdf URL: https://arxiv.org/pdf/2503.19474
Copy Paste: [[2503.19474]] A-MESS: Anchor based Multimodal Embedding with Semantic Synchronization for Multimodal Intent Recognition(https://arxiv.org/abs/2503.19474)
Keywords: large language model
Abstract: In the domain of multimodal intent recognition (MIR), the objective is to recognize human intent by integrating a variety of modalities, such as language text, body gestures, and tones. However, existing approaches face difficulties adequately capturing the intrinsic connections between the modalities and overlooking the corresponding semantic representations of intent. To address these limitations, we present the Anchor-based Mul- timodal Embedding with Semantic Synchronization (A-MESS) framework. We first design an Anchor-based Multimodal Embed- ding (A-ME) module that employs an anchor-based embedding fusion mechanism to integrate multimodal inputs. Furthermore, we develop a Semantic Synchronization (SS) strategy with the Triplet Contrastive Learning pipeline, which optimizes the pro- cess by synchronizing multimodal representation with label de- scriptions produced by the large language model. Comprehensive experiments indicate that our A-MESS achieves state-of-the-art and provides substantial insight into multimodal representation and downstream tasks.

Title: Extracting Interpretable Logic Rules from Graph Neural Networks

Authors: Chuqin Geng, Zhaoyue Wang, Ziyu Zhao, Haolin Ye, Xujie Si
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.19476
Pdf URL: https://arxiv.org/pdf/2503.19476
Copy Paste: [[2503.19476]] Extracting Interpretable Logic Rules from Graph Neural Networks(https://arxiv.org/abs/2503.19476)
Keywords: interpretability, explainability
Abstract: Graph neural networks (GNNs) operate over both input feature spaces and combinatorial graph structures, making it challenging to understand the rationale behind their predictions. As GNNs gain widespread popularity and demonstrate success across various domains, such as drug discovery, studying their interpretability has become a critical task. To address this, many explainability methods have been proposed, with recent efforts shifting from instance-specific explanations to global concept-based explainability. However, these approaches face several limitations, such as relying on predefined concepts and explaining only a limited set of patterns. To address this, we propose a novel framework, LOGICXGNN, for extracting interpretable logic rules from GNNs. LOGICXGNN is model-agnostic, efficient, and data-driven, eliminating the need for predefined concepts. More importantly, it can serve as a rule-based classifier and even outperform the original neural models. Its interpretability facilitates knowledge discovery, as demonstrated by its ability to extract detailed and accurate chemistry knowledge that is often overlooked by existing methods. Another key advantage of LOGICXGNN is its ability to generate new graph instances in a controlled and transparent manner, offering significant potential for applications such as drug design. We empirically demonstrate these merits through experiments on real-world datasets such as MUTAG and BBBP.

Title: TeLL Me what you cant see

Authors: Saverio Cavasin, Pietro Biasetton, Mattia Tamiazzo, Mauro Conti, Simone Milani
Subjects: cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2503.19478
Pdf URL: https://arxiv.org/pdf/2503.19478
Copy Paste: [[2503.19478]] TeLL Me what you cant see(https://arxiv.org/abs/2503.19478)
Keywords: robust, biometric, extraction
Abstract: During criminal investigations, images of persons of interest directly influence the success of identification procedures. However, law enforcement agencies often face challenges related to the scarcity of high-quality images or their obsolescence, which can affect the accuracy and success of people searching processes. This paper introduces a novel forensic mugshot augmentation framework aimed at addressing these limitations. Our approach enhances the identification probability of individuals by generating additional, high-quality images through customizable data augmentation techniques, while maintaining the biometric integrity and consistency of the original data. Several experimental results show that our method significantly improves identification accuracy and robustness across various forensic scenarios, demonstrating its effectiveness as a trustworthy tool law enforcement applications. Index Terms: Digital Forensics, Person re-identification, Feature extraction, Data augmentation, Visual-Language models.

Title: GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers

Authors: Shijie Ma, Yuying Ge, Teng Wang, Yuxin Guo, Yixiao Ge, Ying Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19480
Pdf URL: https://arxiv.org/pdf/2503.19480
Copy Paste: [[2503.19480]] GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers(https://arxiv.org/abs/2503.19480)
Keywords: generative, large language model
Abstract: The synergy between generative and discriminative models receives growing attention. While discriminative Contrastive Language-Image Pre-Training (CLIP) excels in high-level semantics, it struggles with perceiving fine-grained visual details. Generally, to enhance representations, generative models take CLIP's visual features as conditions for reconstruction. However, the underlying principle remains underexplored. In this work, we empirically found that visually perfect generations are not always optimal for representation enhancement. The essence lies in effectively extracting fine-grained knowledge from generative models while mitigating irrelevant information. To explore critical factors, we delve into three aspects: (1) Conditioning mechanisms: We found that even a small number of local tokens can drastically reduce the difficulty of reconstruction, leading to collapsed training. We thus conclude that utilizing only global visual tokens as conditions is the most effective strategy. (2) Denoising configurations: We observed that end-to-end training introduces extraneous information. To address this, we propose a two-stage training strategy to prioritize learning useful visual knowledge. Additionally, we demonstrate that lightweight denoisers can yield remarkable improvements. (3) Generation paradigms: We explore both continuous and discrete denoisers with desirable outcomes, validating the versatility of our method. Through our in-depth explorations, we have finally arrived at an effective method, namely GenHancer, which consistently outperforms prior arts on the MMVP-VLM benchmark, e.g., 6.0% on OpenAICLIP. The enhanced CLIP can be further plugged into multimodal large language models for better vision-centric performance. All the models and codes are made publicly available.

Title: KSHSeek: Data-Driven Approaches to Mitigating and Detecting Knowledge-Shortcut Hallucinations in Generative Models

Authors: Zhiwei Wang, Zhongxin Liu, Ying Li, Hongyu Sun, Meng Xu, Yuqing Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19482
Pdf URL: https://arxiv.org/pdf/2503.19482
Copy Paste: [[2503.19482]] KSHSeek: Data-Driven Approaches to Mitigating and Detecting Knowledge-Shortcut Hallucinations in Generative Models(https://arxiv.org/abs/2503.19482)
Keywords: robust, generative, large language model
Abstract: The emergence of large language models (LLMs) has significantly advanced the development of natural language processing (NLP), especially in text generation tasks like question answering. However, model hallucinations remain a major challenge in natural language generation (NLG) tasks due to their complex causes. We systematically expand on the causes of factual hallucinations from the perspective of knowledge shortcuts, analyzing hallucinations arising from correct and defect-free data and demonstrating that knowledge-shortcut hallucinations are prevalent in generative models. To mitigate this issue, we propose a high similarity pruning algorithm at the data preprocessing level to reduce spurious correlations in the data. Additionally, we design a specific detection method for knowledge-shortcut hallucinations to evaluate the effectiveness of our mitigation strategy. Experimental results show that our approach effectively reduces knowledge-shortcut hallucinations, particularly in fine-tuning tasks, without negatively impacting model performance in question answering. This work introduces a new paradigm for mitigating specific hallucination issues in generative models, enhancing their robustness and reliability in real-world applications.

Title: Exploring Disentangled and Controllable Human Image Synthesis: From End-to-End to Stage-by-Stage

Authors: Zhengwentai Sun, Heyuan Li, Xihe Yang, Keru Zheng, Shuliang Ning, Yihao Zhi, Hongjie Liao, Chenghong Li, Shuguang Cui, Xiaoguang Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19486
Pdf URL: https://arxiv.org/pdf/2503.19486
Copy Paste: [[2503.19486]] Exploring Disentangled and Controllable Human Image Synthesis: From End-to-End to Stage-by-Stage(https://arxiv.org/abs/2503.19486)
Keywords: generative
Abstract: Achieving fine-grained controllability in human image synthesis is a long-standing challenge in computer vision. Existing methods primarily focus on either facial synthesis or near-frontal body generation, with limited ability to simultaneously control key factors such as viewpoint, pose, clothing, and identity in a disentangled manner. In this paper, we introduce a new disentangled and controllable human synthesis task, which explicitly separates and manipulates these four factors within a unified framework. We first develop an end-to-end generative model trained on MVHumanNet for factor disentanglement. However, the domain gap between MVHumanNet and in-the-wild data produce unsatisfacotry results, motivating the exploration of virtual try-on (VTON) dataset as a potential solution. Through experiments, we observe that simply incorporating the VTON dataset as additional data to train the end-to-end model degrades performance, primarily due to the inconsistency in data forms between the two datasets, which disrupts the disentanglement process. To better leverage both datasets, we propose a stage-by-stage framework that decomposes human image generation into three sequential steps: clothed A-pose generation, back-view synthesis, and pose and view control. This structured pipeline enables better dataset utilization at different stages, significantly improving controllability and generalization, especially for in-the-wild scenarios. Extensive experiments demonstrate that our stage-by-stage approach outperforms end-to-end models in both visual fidelity and disentanglement quality, offering a scalable solution for real-world tasks. Additional demos are available on the project page: this https URL.

Title: SMT-EX: An Explainable Surrogate Modeling Toolbox for Mixed-Variables Design Exploration

Authors: Mohammad Daffa Robani, Paul Saves, Pramudita Satria Palar, Lavi Rizki Zuhal, oseph Morlier
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2503.19496
Pdf URL: https://arxiv.org/pdf/2503.19496
Copy Paste: [[2503.19496]] SMT-EX: An Explainable Surrogate Modeling Toolbox for Mixed-Variables Design Exploration(https://arxiv.org/abs/2503.19496)
Keywords: extraction, explainability
Abstract: Surrogate models are of high interest for many engineering applications, serving as cheap-to-evaluate time-efficient approximations of black-box functions to help engineers and practitioners make decisions and understand complex systems. As such, the need for explainability methods is rising and many studies have been performed to facilitate knowledge discovery from surrogate models. To respond to these enquiries, this paper introduces SMT-EX, an enhancement of the open-source Python Surrogate Modeling Toolbox (SMT) that integrates explainability techniques into a state-of-the-art surrogate modelling framework. More precisely, SMT-EX includes three key explainability methods: Shapley Additive Explanations, Partial Dependence Plot, and Individual Conditional Expectations. A peculiar explainability dependency of SMT has been developed for such purpose that can be easily activated once the surrogate model is built, offering a user-friendly and efficient tool for swift insight extraction. The effectiveness of SMT-EX is showcased through two test cases. The first case is a 10-variable wing weight problem with purely continuous variables and the second one is a 3-variable mixed-categorical cantilever beam bending problem. Relying on SMT-EX analyses for these problems, we demonstrate its versatility in addressing a diverse range of problem characteristics. SMT-Explainability is freely available on Github: this https URL .

Title: DomainCQA: Crafting Expert-Level QA from Domain-Specific Charts

Authors: Ling Zhong, Yujing Lu, Jing Yang, Weiming Li, Peng Wei, Yongheng Wang, Manni Duan, Qing Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19498
Pdf URL: https://arxiv.org/pdf/2503.19498
Copy Paste: [[2503.19498]] DomainCQA: Crafting Expert-Level QA from Domain-Specific Charts(https://arxiv.org/abs/2503.19498)
Keywords: large language model
Abstract: Chart Question Answering (CQA) benchmarks are essential for evaluating the capability of Multimodal Large Language Models (MLLMs) to interpret visual data. However, current benchmarks focus primarily on the evaluation of general-purpose CQA but fail to adequately capture domain-specific challenges. We introduce DomainCQA, a systematic methodology for constructing domain-specific CQA benchmarks, and demonstrate its effectiveness by developing AstroChart, a CQA benchmark in the field of astronomy. Our evaluation shows that chart reasoning and combining chart information with domain knowledge for deeper analysis and summarization, rather than domain-specific knowledge, pose the primary challenge for existing MLLMs, highlighting a critical gap in current benchmarks. By providing a scalable and rigorous framework, DomainCQA enables more precise assessment and improvement of MLLMs for domain-specific applications.

Title: SparSamp: Efficient Provably Secure Steganography Based on Sparse Sampling

Authors: Yaofei Wang, Gang Pei, Kejiang Chen, Jinyang Ding, Chao Pan, Weilong Pang, Donghui Hu, Weiming Zhang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.19499
Pdf URL: https://arxiv.org/pdf/2503.19499
Copy Paste: [[2503.19499]] SparSamp: Efficient Provably Secure Steganography Based on Sparse Sampling(https://arxiv.org/abs/2503.19499)
Keywords: secure, security, extraction, generative
Abstract: Steganography embeds confidential data within seemingly innocuous communications. Provable security in steganography, a long-sought goal, has become feasible with deep generative models. However, existing methods face a critical trade-off between security and efficiency. This paper introduces SparSamp, an efficient provably secure steganography method based on sparse sampling. SparSamp embeds messages by combining them with pseudo-random numbers to obtain message-derived random numbers for sampling. It enhances extraction accuracy and embedding capacity by increasing the sampling intervals and making the sampling process sparse. SparSamp preserves the original probability distribution of the generative model, thus ensuring security. It introduces only $O(1)$ additional complexity per sampling step, enabling the fastest embedding speed without compromising generation speed. SparSamp is designed to be plug-and-play; message embedding can be achieved by simply replacing the sampling component of an existing generative model with SparSamp. We implemented SparSamp in text, image, and audio generation models. It can achieve embedding speeds of up to 755 bits/second with GPT-2, 5046 bits/second with DDPM, and 9,223 bits/second with WaveRNN.

Title: Pose-Based Fall Detection System: Efficient Monitoring on Standard CPUs

Authors: Vinayak Mali, Saurabh Jaiswal
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19501
Pdf URL: https://arxiv.org/pdf/2503.19501
Copy Paste: [[2503.19501]] Pose-Based Fall Detection System: Efficient Monitoring on Standard CPUs(https://arxiv.org/abs/2503.19501)
Keywords: robust
Abstract: Falls among elderly residents in assisted living homes pose significant health risks, often leading to injuries and a decreased quality of life. Current fall detection solutions typically rely on sensor-based systems that require dedicated hardware, or on video-based models that demand high computational resources and GPUs for real-time processing. In contrast, this paper presents a robust fall detection system that does not require any additional sensors or high-powered hardware. The system uses pose estimation techniques, combined with threshold-based analysis and a voting mechanism, to effectively distinguish between fall and non-fall activities. For pose detection, we leverage MediaPipe, a lightweight and efficient framework that enables real-time processing on standard CPUs with minimal computational overhead. By analyzing motion, body position, and key pose points, the system processes pose features with a 20-frame buffer, minimizing false positives and maintaining high accuracy even in real-world settings. This unobtrusive, resource-efficient approach provides a practical solution for enhancing resident safety in old age homes, without the need for expensive sensors or high-end computational resources.

Title: Improved Alignment of Modalities in Large Vision Language Models

Authors: Kartik Jangra, Aman Kumar Singh, Yashwani Mann, Geetanjali Rathee
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.19508
Pdf URL: https://arxiv.org/pdf/2503.19508
Copy Paste: [[2503.19508]] Improved Alignment of Modalities in Large Vision Language Models(https://arxiv.org/abs/2503.19508)
Keywords: transformer
Abstract: Recent advancements in vision-language models have achieved remarkable results in making language models understand vision inputs. However, a unified approach to align these models across diverse tasks such as image captioning and visual question answering remains a challenge. Existing methods either require very big language models or very big datasets which is not efficient in utilizing existing models. This paper addresses this gap and devises a training strategy of auto-regressive vision-language models, to unify vision-language tasks like image-captioning and visual question answering. We propose four training stages for aligning the vision model with the language model, in other words, the language model is given an ability to process visual inputs. We also devise different attention masks for training transformer-based language models that improve the quality of visual features. Further, we introduce some findings, 1) the attention mask should not be applied on visual inputs, 2) the Language model converges faster on AI- generated data, 3) More work should be done in the alignment stage during the pre-training of the model, 4) the model can easily adapt to any downstream tasks like visual question answering on healthcare datasets like PathVQA. After training the model for one epoch for all the stages, it outperforms large models like VILA-13 billion models on common benchmarks like CIDEr scores on COCO and Flickr30k datasets and achieves very close scores to GIT-2 on the same dataset despite being a much smaller model trained on a much smaller dataset. All of the training is done using best practices available like multi- GPU parallel training, lower-precision training with 16-bit float numbers, faster attention (SDPA), and gradient accumulation, and completed the training within 12 hours.

Title: Towards Imperceptible Adversarial Attacks for Time Series Classification with Local Perturbations and Frequency Analysis

Authors: Wenwei Gu, Renyi Zhong, Jianping Zhang, Michael R. Lyu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.19519
Pdf URL: https://arxiv.org/pdf/2503.19519
Copy Paste: [[2503.19519]] Towards Imperceptible Adversarial Attacks for Time Series Classification with Local Perturbations and Frequency Analysis(https://arxiv.org/abs/2503.19519)
Keywords: attack, robust, steal
Abstract: Adversarial attacks in time series classification (TSC) models have recently gained attention due to their potential to compromise model robustness. Imperceptibility is crucial, as adversarial examples detected by the human vision system (HVS) can render attacks ineffective. Many existing methods fail to produce high-quality imperceptible examples, often generating perturbations with more perceptible low-frequency components, like square waves, and global perturbations that reduce stealthiness. This paper aims to improve the imperceptibility of adversarial attacks on TSC models by addressing frequency components and time series locality. We propose the Shapelet-based Frequency-domain Attack (SFAttack), which uses local perturbations focused on time series shapelets to enhance discriminative information and stealthiness. Additionally, we introduce a low-frequency constraint to confine perturbations to high-frequency components, enhancing imperceptibility.

Title: FLEX: A Benchmark for Evaluating Robustness of Fairness in Large Language Models

Authors: Dahyun Jung, Seungyoon Lee, Hyeonseok Moon, Chanjun Park, Heuiseok Lim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19540
Pdf URL: https://arxiv.org/pdf/2503.19540
Copy Paste: [[2503.19540]] FLEX: A Benchmark for Evaluating Robustness of Fairness in Large Language Models(https://arxiv.org/abs/2503.19540)
Keywords: robust, fair, large language model
Abstract: Recent advancements in Large Language Models (LLMs) have significantly enhanced interactions between users and models. These advancements concurrently underscore the need for rigorous safety evaluations due to the manifestation of social biases, which can lead to harmful societal impacts. Despite these concerns, existing benchmarks may overlook the intrinsic weaknesses of LLMs, which can generate biased responses even with simple adversarial instructions. To address this critical gap, we introduce a new benchmark, Fairness Benchmark in LLM under Extreme Scenarios (FLEX), designed to test whether LLMs can sustain fairness even when exposed to prompts constructed to induce bias. To thoroughly evaluate the robustness of LLMs, we integrate prompts that amplify potential biases into the fairness assessment. Comparative experiments between FLEX and existing benchmarks demonstrate that traditional evaluations may underestimate the inherent risks in models. This highlights the need for more stringent LLM evaluation benchmarks to guarantee safety and fairness.

Title: Tiling artifacts and trade-offs of feature normalization in the segmentation of large biological images

Authors: Elena Buglakova, Anwai Archit, Edoardo D'Imprima, Julia Mahamid, Constantin Pape, Anna Kreshuk
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19545
Pdf URL: https://arxiv.org/pdf/2503.19545
Copy Paste: [[2503.19545]] Tiling artifacts and trade-offs of feature normalization in the segmentation of large biological images(https://arxiv.org/abs/2503.19545)
Keywords: segmentation
Abstract: Segmentation of very large images is a common problem in microscopy, medical imaging or remote sensing. The problem is usually addressed by sliding window inference, which can theoretically lead to seamlessly stitched predictions. However, in practice many of the popular pipelines still suffer from tiling artifacts. We investigate the root cause of these issues and show that they stem from the normalization layers within the neural networks. We propose indicators to detect normalization issues and further explore the trade-offs between artifact-free and high-quality predictions, using three diverse microscopy datasets as examples. Finally, we propose to use BatchRenorm as the most suitable normalization strategy, which effectively removes tiling artifacts and enhances transfer performance, thereby improving the reusability of trained networks for new datasets.

Title: Practical Fine-Tuning of Autoregressive Models on Limited Handwritten Texts

Authors: Jan Kohút, Michal Hradiš
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19546
Pdf URL: https://arxiv.org/pdf/2503.19546
Copy Paste: [[2503.19546]] Practical Fine-Tuning of Autoregressive Models on Limited Handwritten Texts(https://arxiv.org/abs/2503.19546)
Keywords: transformer
Abstract: A common use case for OCR applications involves users uploading documents and progressively correcting automatic recognition to obtain the final transcript. This correction phase presents an opportunity for progressive adaptation of the OCR model, making it crucial to adapt early, while ensuring stability and reliability. We demonstrate that state-of-the-art transformer-based models can effectively support this adaptation, gradually reducing the annotator's workload. Our results show that fine-tuning can reliably start with just 16 lines, yielding a 10% relative improvement in CER, and scale up to 40% with 256 lines. We further investigate the impact of model components, clarifying the roles of the encoder and decoder in the fine-tuning process. To guide adaptation, we propose reliable stopping criteria, considering both direct approaches and global trend analysis. Additionally, we show that OCR models can be leveraged to cut annotation costs by half through confidence-based selection of informative lines, achieving the same performance with fewer annotations.

Title: Noise Resilient Over-The-Air Federated Learning In Heterogeneous Wireless Networks

Authors: Zubair Shaban, Nazreen Shah, Ranjitha Prasad
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2503.19549
Pdf URL: https://arxiv.org/pdf/2503.19549
Copy Paste: [[2503.19549]] Noise Resilient Over-The-Air Federated Learning In Heterogeneous Wireless Networks(https://arxiv.org/abs/2503.19549)
Keywords: privacy, robust, federate
Abstract: In 6G wireless networks, Artificial Intelligence (AI)-driven applications demand the adoption of Federated Learning (FL) to enable efficient and privacy-preserving model training across distributed devices. Over-The-Air Federated Learning (OTA-FL) exploits the superposition property of multiple access channels, allowing edge users in 6G networks to efficiently share spectral resources and perform low-latency global model aggregation. However, these advantages come with challenges, as traditional OTA-FL techniques suffer due to the joint effects of Additive White Gaussian Noise (AWGN) at the server, fading, and both data and system heterogeneity at the participating edge devices. In this work, we propose the novel Noise Resilient Over-the-Air Federated Learning (NoROTA-FL) framework to jointly tackle these challenges in federated wireless networks. In NoROTA-FL, the local optimization problems find controlled inexact solutions, which manifests as an additional proximal constraint at the clients. This approach provides robustness against straggler-induced partial work, heterogeneity, noise, and fading. From a theoretical perspective, we leverage the zeroth- and first-order inexactness and establish convergence guarantees for non-convex optimization problems in the presence of heterogeneous data and varying system capabilities. Experimentally, we validate NoROTA-FL on real-world datasets, including FEMNIST, CIFAR10, and CIFAR100, demonstrating its robustness in noisy and heterogeneous environments. Compared to state-of-the-art baselines such as COTAF and FedProx, NoROTA-FL achieves significantly more stable convergence and higher accuracy, particularly in the presence of stragglers.

Title: Scaling Laws of Synthetic Data for Language Models

Authors: Zeyu Qin, Qingxiu Dong, Xingxing Zhang, Li Dong, Xiaolong Huang, Ziyi Yang, Mahmoud Khademi, Dongdong Zhang, Hany Hassan Awadalla, Yi R. Fung, Weizhu Chen, Minhao Cheng, Furu Wei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19551
Pdf URL: https://arxiv.org/pdf/2503.19551
Copy Paste: [[2503.19551]] Scaling Laws of Synthetic Data for Language Models(https://arxiv.org/abs/2503.19551)
Keywords: large language model
Abstract: Large language models (LLMs) achieve strong performance across diverse tasks, largely driven by high-quality web data used in pre-training. However, recent studies indicate this data source is rapidly depleting. Synthetic data emerges as a promising alternative, but it remains unclear whether synthetic datasets exhibit predictable scalability comparable to raw pre-training data. In this work, we systematically investigate the scaling laws of synthetic data by introducing SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm. Key findings from our extensive mathematical experiments on SynthLLM include: (1) SynthLLM generates synthetic data that reliably adheres to the \emph{rectified scaling law} across various model sizes; (2) Performance improvements plateau near 300B tokens; and (3) Larger models approach optimal performance with fewer training tokens. For instance, an 8B model peaks at 1T tokens, while a 3B model requires 4T. Moreover, comparisons with existing synthetic data generation and augmentation methods demonstrate that SynthLLM achieves superior performance and scalability. Our findings highlight synthetic data as a scalable and reliable alternative to organic pre-training corpora, offering a viable path toward continued improvement in model performance.

Title: Dance Like a Chicken: Low-Rank Stylization for Human Motion Diffusion

Authors: Haim Sawdayee, Chuan Guo, Guy Tevet, Bing Zhou, Jian Wang, Amit H. Bermano
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19557
Pdf URL: https://arxiv.org/pdf/2503.19557
Copy Paste: [[2503.19557]] Dance Like a Chicken: Low-Rank Stylization for Human Motion Diffusion(https://arxiv.org/abs/2503.19557)
Keywords: diffusion, generative
Abstract: Text-to-motion generative models span a wide range of 3D human actions but struggle with nuanced stylistic attributes such as a "Chicken" style. Due to the scarcity of style-specific data, existing approaches pull the generative prior towards a reference style, which often results in out-of-distribution low quality generations. In this work, we introduce LoRA-MDM, a lightweight framework for motion stylization that generalizes to complex actions while maintaining editability. Our key insight is that adapting the generative prior to include the style, while preserving its overall distribution, is more effective than modifying each individual motion during generation. Building on this idea, LoRA-MDM learns to adapt the prior to include the reference style using only a few samples. The style can then be used in the context of different textual prompts for generation. The low-rank adaptation shifts the motion manifold in a semantically meaningful way, enabling realistic style infusion even for actions not present in the reference samples. Moreover, preserving the distribution structure enables advanced operations such as style blending and motion editing. We compare LoRA-MDM to state-of-the-art stylized motion generation methods and demonstrate a favorable balance between text fidelity and style consistency.

Title: FedMM-X: A Trustworthy and Interpretable Framework for Federated Multi-Modal Learning in Dynamic Environments

Authors: Sree Bhargavi Balija
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19564
Pdf URL: https://arxiv.org/pdf/2503.19564
Copy Paste: [[2503.19564]] FedMM-X: A Trustworthy and Interpretable Framework for Federated Multi-Modal Learning in Dynamic Environments(https://arxiv.org/abs/2503.19564)
Keywords: robust, federate, interpretability
Abstract: As artificial intelligence systems increasingly operate in Real-world environments, the integration of multi-modal data sources such as vision, language, and audio presents both unprecedented opportunities and critical challenges for achieving trustworthy intelligence. In this paper, we propose a novel framework that unifies federated learning with explainable multi-modal reasoning to ensure trustworthiness in decentralized, dynamic settings. Our approach, called FedMM-X (Federated Multi-Modal Explainable Intelligence), leverages cross-modal consistency checks, client-level interpretability mechanisms, and dynamic trust calibration to address challenges posed by data heterogeneity, modality imbalance, and out-of-distribution generalization. Through rigorous evaluation across federated multi-modal benchmarks involving vision-language tasks, we demonstrate improved performance in both accuracy and interpretability while reducing vulnerabilities to adversarial and spurious correlations. Further, we introduce a novel trust score aggregation method to quantify global model reliability under dynamic client participation. Our findings pave the way toward developing robust, interpretable, and socially responsible AI systems in Real-world environments.

Title: Improved tissue sodium concentration quantification in breast cancer by reducing partial volume effects: a preliminary study

Authors: Olgica Zaric, Carmen Leser, Vladimir Juras, Alex Farr, Malina Gologan, Stanislas Rapacchi, Laura Villazan Garcia, Christian Singer, Siegfried Trattnig, Christian Licht, Ramona Woitek
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19570
Pdf URL: https://arxiv.org/pdf/2503.19570
Copy Paste: [[2503.19570]] Improved tissue sodium concentration quantification in breast cancer by reducing partial volume effects: a preliminary study(https://arxiv.org/abs/2503.19570)
Keywords: robust
Abstract: Introduction: In sodium (23Na) MRI, partial volume effects (PVE) are one of the most common causes of errors in the quantification of tissue sodium concentration (TSC) in vivo. Advanced image reconstruction algorithms, such as compressed sensing (CS), have been shown to potentially reduce PVE. Therefore, we investigated the feasibility of CS-based methods for image quality and TSC quantification accuracy improvement in patients with breast cancer (BC). Subjects and Methods: Three healthy participants and 12 female participants with BC were examined on a 7T MRI scanner in this study. We reconstructed 23Na-MRI images using the weighted total variation (wTV) and directional total variation (dTV), anatomically guided total variation (AG-TV), and adaptive combine (ADC) reconstruction and performed image quality assessment. We evaluated agreement in tumor volumes delineated on sodium data using the Dice score and performed TSC quantification for different image reconstruction approaches. Results: All methods provided sodium images of the breast with good quality. The mean Dice scores for wTV, dTV, and AG-TV were 65%, 72%, and 75%, respectively. In the breast tumors, average TSC values were 83.0, 72.0, 80.0, and 84.0 mmol/L, respectively. There was a significant difference between dTV and wTV (p<0.001), as well as between dTV and AG-TV (p<0.001) and dTV and ADC algorithm (p<0.001). Conclusion: The results of this study showed that there are differences in tumor appearance and TSC estimations that might be depending on the type of image reconstruction and parameters used, most likely due to differences in their robustness in reducing PVE.

Title: Context-Efficient Retrieval with Factual Decomposition

Authors: Yanhong Li, David Yunis, David McAllester, Jiawei Zhou
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2503.19574
Pdf URL: https://arxiv.org/pdf/2503.19574
Copy Paste: [[2503.19574]] Context-Efficient Retrieval with Factual Decomposition(https://arxiv.org/abs/2503.19574)
Keywords: large language model
Abstract: There has recently been considerable interest in incorporating information retrieval into large language models (LLMs). Retrieval from a dynamically expanding external corpus of text allows a model to incorporate current events and can be viewed as a form of episodic memory. Here we demonstrate that pre-processing the external corpus into semi-structured ''atomic facts'' makes retrieval more efficient. More specifically, we demonstrate that our particular form of atomic facts improves performance on various question answering tasks when the amount of retrieved text is limited. Limiting the amount of retrieval reduces the size of the context and improves inference efficiency.

Title: Optimization through In-Context Learning and Iterative LLM Prompting for Nuclear Engineering Design Problems

Authors: M. Rizki Oktavian, Anirudh Tunga, Amandeep Bakshi, Michael J. Mueterthies, J. Thomas Gruenwald, Jonathan Nistor
Subjects: cs.LG, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2503.19620
Pdf URL: https://arxiv.org/pdf/2503.19620
Copy Paste: [[2503.19620]] Optimization through In-Context Learning and Iterative LLM Prompting for Nuclear Engineering Design Problems(https://arxiv.org/abs/2503.19620)
Keywords: large language model
Abstract: The optimization of nuclear engineering designs, such as nuclear fuel assembly configurations, involves managing competing objectives like reactivity control and power distribution. This study explores the use of Optimization by Prompting, an iterative approach utilizing large language models (LLMs), to address these challenges. The method is straightforward to implement, requiring no hyperparameter tuning or complex mathematical formulations. Optimization problems can be described in plain English, with only an evaluator and a parsing script needed for execution. The in-context learning capabilities of LLMs enable them to understand problem nuances, therefore, they have the potential to surpass traditional metaheuristic optimization methods. This study demonstrates the application of LLMs as optimizers to Boiling Water Reactor (BWR) fuel lattice design, showing the capability of commercial LLMs to achieve superior optimization results compared to traditional methods.

Title: DynOPETs: A Versatile Benchmark for Dynamic Object Pose Estimation and Tracking in Moving Camera Scenarios

Authors: Xiangting Meng, Jiaqi Yang, Mingshu Chen, Chenxin Yan, Yujiao Shi, Wenchao Ding, Laurent Kneip
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19625
Pdf URL: https://arxiv.org/pdf/2503.19625
Copy Paste: [[2503.19625]] DynOPETs: A Versatile Benchmark for Dynamic Object Pose Estimation and Tracking in Moving Camera Scenarios(https://arxiv.org/abs/2503.19625)
Keywords: robust
Abstract: In the realm of object pose estimation, scenarios involving both dynamic objects and moving cameras are prevalent. However, the scarcity of corresponding real-world datasets significantly hinders the development and evaluation of robust pose estimation models. This is largely attributed to the inherent challenges in accurately annotating object poses in dynamic scenes captured by moving cameras. To bridge this gap, this paper presents a novel dataset DynOPETs and a dedicated data acquisition and annotation pipeline tailored for object pose estimation and tracking in such unconstrained environments. Our efficient annotation method innovatively integrates pose estimation and pose tracking techniques to generate pseudo-labels, which are subsequently refined through pose graph optimization. The resulting dataset offers accurate pose annotations for dynamic objects observed from moving cameras. To validate the effectiveness and value of our dataset, we perform comprehensive evaluations using 18 state-of-the-art methods, demonstrating its potential to accelerate research in this challenging domain. The dataset will be made publicly available to facilitate further exploration and advancement in the field.

Title: Red Teaming with Artificial Intelligence-Driven Cyberattacks: A Scoping Review

Authors: Mays Al-Azzawi, Dung Doan, Tuomo Sipola, Jari Hautamäki, Tero Kokkonen
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.19626
Pdf URL: https://arxiv.org/pdf/2503.19626
Copy Paste: [[2503.19626]] Red Teaming with Artificial Intelligence-Driven Cyberattacks: A Scoping Review(https://arxiv.org/abs/2503.19626)
Keywords: security, attack
Abstract: The progress of artificial intelligence (AI) has made sophisticated methods available for cyberattacks and red team activities. These AI attacks can automate the process of penetrating a target or collecting sensitive data. The new methods can also accelerate the execution of the attacks. This review article examines the use of AI technologies in cybersecurity attacks. It also tries to describe typical targets for such attacks. We employed a scoping review methodology to analyze articles and identify AI methods, targets, and models that red teams can utilize to simulate cybercrime. From the 470 records screened, 11 were included in the review. Various cyberattack methods were identified, targeting sensitive data, systems, social media profiles, passwords, and URLs. The application of AI in cybercrime to develop versatile attack models presents an increasing threat. Furthermore, AI-based techniques in red team use can provide new ways to address these issues.

Title: 1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training

Authors: Han Zhao, Haotian Wang, Yiping Peng, Sitong Zhao, Xiaoyu Tian, Shuaiting Chen, Yunjie Ji, Xiangang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19633
Pdf URL: https://arxiv.org/pdf/2503.19633
Copy Paste: [[2503.19633]] 1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training(https://arxiv.org/abs/2503.19633)
Keywords: large language model
Abstract: The AM-DeepSeek-R1-Distilled is a large-scale dataset with thinking traces for general reasoning tasks, composed of high-quality and challenging reasoning problems. These problems are collected from a multitude of open-source datasets, subjected to semantic deduplication and meticulous cleaning to eliminate test set contamination. All responses within the dataset are distilled from reasoning models (predominantly DeepSeek-R1) and have undergone rigorous verification procedures. Mathematical problems are validated by checking against reference answers, code problems are verified using test cases, and other tasks are evaluated with the aid of a reward model. The AM-Distill-Qwen-32B model, which was trained through only simple Supervised Fine-Tuning (SFT) using this batch of data, outperformed the DeepSeek-R1-Distill-Qwen-32B model on four benchmarks: AIME2024, MATH-500, GPQA-Diamond, and LiveCodeBench. Additionally, the AM-Distill-Qwen-72B model surpassed the DeepSeek-R1-Distill-Llama-70B model on all benchmarks as well. We are releasing these 1.4 million problems and their corresponding responses to the research community with the objective of fostering the development of powerful reasoning-oriented Large Language Models (LLMs). The dataset was published in \href{this https URL}{this https URL}.

Title: Burst Image Super-Resolution with Mamba

Authors: Ozan Unal, Steven Marty, Dengxin Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19634
Pdf URL: https://arxiv.org/pdf/2503.19634
Copy Paste: [[2503.19634]] Burst Image Super-Resolution with Mamba(https://arxiv.org/abs/2503.19634)
Keywords: extraction, transformer
Abstract: Burst image super-resolution (BISR) aims to enhance the resolution of a keyframe by leveraging information from multiple low-resolution images captured in quick succession. In the deep learning era, BISR methods have evolved from fully convolutional networks to transformer-based architectures, which, despite their effectiveness, suffer from the quadratic complexity of self-attention. We see Mamba as the next natural step in the evolution of this field, offering a comparable global receptive field and selective information routing with only linear time complexity. In this work, we introduce BurstMamba, a Mamba-based architecture for BISR. Our approach decouples the task into two specialized branches: a spatial module for keyframe super-resolution and a temporal module for subpixel prior extraction, striking a balance between computational efficiency and burst information integration. To further enhance burst processing with Mamba, we propose two novel strategies: (i) optical flow-based serialization, which aligns burst sequences only during state updates to preserve subpixel details, and (ii) a wavelet-based reparameterization of the state-space update rules, prioritizing high-frequency features for improved burst-to-keyframe information passing. Our framework achieves SOTA performance on public benchmarks of SyntheticSR, RealBSR-RGB, and RealBSR-RAW.

Title: Substation Bill of Materials: A Novel Approach to Managing Supply Chain Cyber-risks on IEC 61850 Digital Substations

Authors: Xabier Yurrebaso, Fernando Ibañez, Ángel Longueira-Romero
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.19638
Pdf URL: https://arxiv.org/pdf/2503.19638
Copy Paste: [[2503.19638]] Substation Bill of Materials: A Novel Approach to Managing Supply Chain Cyber-risks on IEC 61850 Digital Substations(https://arxiv.org/abs/2503.19638)
Keywords: security, attack
Abstract: Smart grids have undergone a profound digitization process, integrating new data-driven control and supervision techniques, resulting in modern digital substations (DS). Attackers are more focused on attacking the supply chain of the DS, as they a comprise a multivendor environment. In this research work, we present the Substation Bill of Materials (Subs-BOM) schema, based on the CycloneDX specification, that is capable of modeling all the IEDs in a DS and their relationships from a cybersecurity perspective. The proposed Subs-BOM allows one to make informed decisions about cyber risks related to the supply chain, and enables managing multiple DS at the same time. This provides energy utilities with an accurate and complete inventory of the devices, the firmware they are running, and the services that are deployed into the DS. The Subs-BOM is generated using the Substation Configuration Description (SCD) file specified in the IEC 61850 standard as its main source of information. We validated the Subs-BOM schema against the Dependency-Track software by OWASP. This validation proved that the schema is correctly recognized by CycloneDX-compatible tools. Moreover, the Dependency-Track software could track existing vulnerabilities in the IEDs represented by the Subs-BOM.

Title: An Efficient Data Reuse with Tile-Based Adaptive Stationary for Transformer Accelerators

Authors: Tseng-Jen Li, Tian-Sheuan Chang
Subjects: cs.LG, cs.AR
Abstract URL: https://arxiv.org/abs/2503.19640
Pdf URL: https://arxiv.org/pdf/2503.19640
Copy Paste: [[2503.19640]] An Efficient Data Reuse with Tile-Based Adaptive Stationary for Transformer Accelerators(https://arxiv.org/abs/2503.19640)
Keywords: transformer
Abstract: Transformer-based models have become the \textit{de facto} backbone across many fields, such as computer vision and natural language processing. However, as these models scale in size, external memory access (EMA) for weight and activations becomes a critical bottleneck due to its significantly higher energy consumption compared to internal computations. While most prior work has focused on optimizing the self-attention mechanism, little attention has been given to optimizing data transfer during linear projections, where EMA costs are equally important. In this paper, we propose the Tile-based Adaptive Stationary (TAS) scheme that selects the input or weight stationary in a tile granularity, based on the input sequence length. Our experimental results demonstrate that TAS can significantly reduce EMA by more than 97\% compared to traditional stationary schemes, while being compatible with various attention optimization techniques and hardware accelerators.

Title: Show or Tell? Effectively prompting Vision-Language Models for semantic segmentation

Authors: Niccolo Avogaro, Thomas Frick, Mattia Rigotti, Andrea Bartezzaghi, Filip Janicki, Cristiano Malossi, Konrad Schindler, Roy Assaf
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19647
Pdf URL: https://arxiv.org/pdf/2503.19647
Copy Paste: [[2503.19647]] Show or Tell? Effectively prompting Vision-Language Models for semantic segmentation(https://arxiv.org/abs/2503.19647)
Keywords: segmentation
Abstract: Large Vision-Language Models (VLMs) are increasingly being regarded as foundation models that can be instructed to solve diverse tasks by prompting, without task-specific training. We examine the seemingly obvious question: how to effectively prompt VLMs for semantic segmentation. To that end, we systematically evaluate the segmentation performance of several recent models guided by either text or visual prompts on the out-of-distribution MESS dataset collection. We introduce a scalable prompting scheme, few-shot prompted semantic segmentation, inspired by open-vocabulary segmentation and few-shot learning. It turns out that VLMs lag far behind specialist models trained for a specific segmentation task, by about 30% on average on the Intersection-over-Union metric. Moreover, we find that text prompts and visual prompts are complementary: each one of the two modes fails on many examples that the other one can solve. Our analysis suggests that being able to anticipate the most effective prompt modality can lead to a 11% improvement in performance. Motivated by our findings, we propose PromptMatcher, a remarkably simple training-free baseline that combines both text and visual prompts, achieving state-of-the-art results outperforming the best text-prompted VLM by 2.5%, and the top visual-prompted VLM by 3.5% on few-shot prompted semantic segmentation.

Title: HausaNLP at SemEval-2025 Task 3: Towards a Fine-Grained Model-Aware Hallucination Detection

Authors: Maryam Bala, Amina Imam Abubakar, Abdulhamid Abubakar, Abdulkadir Shehu Bichi, Hafsa Kabir Ahmad, Sani Abdullahi Sani, Idris Abdulmumin, Shamsuddeen Hassan Muhamad, Ibrahim Said Ahmad
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19650
Pdf URL: https://arxiv.org/pdf/2503.19650
Copy Paste: [[2503.19650]] HausaNLP at SemEval-2025 Task 3: Towards a Fine-Grained Model-Aware Hallucination Detection(https://arxiv.org/abs/2503.19650)
Keywords: large language model
Abstract: This paper presents our findings of the Multilingual Shared Task on Hallucinations and Related Observable Overgeneration Mistakes, MU-SHROOM, which focuses on identifying hallucinations and related overgeneration errors in large language models (LLMs). The shared task involves detecting specific text spans that constitute hallucinations in the outputs generated by LLMs in 14 languages. To address this task, we aim to provide a nuanced, model-aware understanding of hallucination occurrences and severity in English. We used natural language inference and fine-tuned a ModernBERT model using a synthetic dataset of 400 samples, achieving an Intersection over Union (IoU) score of 0.032 and a correlation score of 0.422. These results indicate a moderately positive correlation between the model's confidence scores and the actual presence of hallucinations. The IoU score indicates that our model has a relatively low overlap between the predicted hallucination span and the truth annotation. The performance is unsurprising, given the intricate nature of hallucination detection. Hallucinations often manifest subtly, relying on context, making pinpointing their exact boundaries formidable.

Title: Enhancing Graphical Lasso: A Robust Scheme for Non-Stationary Mean Data

Authors: Samuel Rey, Ernesto Curbelo, Luca Martino, Fernando Llorente, Antonio G. Marques
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.19651
Pdf URL: https://arxiv.org/pdf/2503.19651
Copy Paste: [[2503.19651]] Enhancing Graphical Lasso: A Robust Scheme for Non-Stationary Mean Data(https://arxiv.org/abs/2503.19651)
Keywords: robust
Abstract: This work addresses the problem of graph learning from data following a Gaussian Graphical Model (GGM) with a time-varying mean. Graphical Lasso (GL), the standard method for estimating sparse precision matrices, assumes that the observed data follows a zero-mean Gaussian distribution. However, this assumption is often violated in real-world scenarios where the mean evolves over time due to external influences, trends, or regime shifts. When the mean is not properly accounted for, applying GL directly can lead to estimating a biased precision matrix, hence hindering the graph learning task. To overcome this limitation, we propose Graphical Lasso with Adaptive Targeted Adaptive Importance Sampling (GL-ATAIS), an iterative method that jointly estimates the time-varying mean and the precision matrix. Our approach integrates Bayesian inference with frequentist estimation, leveraging importance sampling to obtain an estimate of the mean while using a regularized maximum likelihood estimator to infer the precision matrix. By iteratively refining both estimates, GL-ATAIS mitigates the bias introduced by time-varying means, leading to more accurate graph recovery. Our numerical evaluation demonstrates the impact of properly accounting for time-dependent means and highlights the advantages of GL-ATAIS over standard GL in recovering the true graph structure.

Title: OpenSDI: Spotting Diffusion-Generated Images in the Open World

Authors: Yabin Wang, Zhiwu Huang, Xiaopeng Hong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19653
Pdf URL: https://arxiv.org/pdf/2503.19653
Copy Paste: [[2503.19653]] OpenSDI: Spotting Diffusion-Generated Images in the Open World(https://arxiv.org/abs/2503.19653)
Keywords: diffusion
Abstract: This paper identifies OpenSDI, a challenge for spotting diffusion-generated images in open-world settings. In response to this challenge, we define a new benchmark, the OpenSDI dataset (OpenSDID), which stands out from existing datasets due to its diverse use of large vision-language models that simulate open-world diffusion-based manipulations. Another outstanding feature of OpenSDID is its inclusion of both detection and localization tasks for images manipulated globally and locally by diffusion models. To address the OpenSDI challenge, we propose a Synergizing Pretrained Models (SPM) scheme to build up a mixture of foundation models. This approach exploits a collaboration mechanism with multiple pretrained foundation models to enhance generalization in the OpenSDI context, moving beyond traditional training by synergizing multiple pretrained models through prompting and attending strategies. Building on this scheme, we introduce MaskCLIP, an SPM-based model that aligns Contrastive Language-Image Pre-Training (CLIP) with Masked Autoencoder (MAE). Extensive evaluations on OpenSDID show that MaskCLIP significantly outperforms current state-of-the-art methods for the OpenSDI challenge, achieving remarkable relative improvements of 14.23% in IoU (14.11% in F1) and 2.05% in accuracy (2.38% in F1) compared to the second-best model in localization and detection tasks, respectively. Our dataset and code are available at this https URL.

Title: RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models

Authors: Mehdi Moshtaghi, Siavash H. Khajavi, Joni Pajarinen
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.19654
Pdf URL: https://arxiv.org/pdf/2503.19654
Copy Paste: [[2503.19654]] RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models(https://arxiv.org/abs/2503.19654)
Keywords: robust
Abstract: We introduce RGB-Th-Bench, the first benchmark designed to evaluate the ability of Vision-Language Models (VLMs) to comprehend RGB-Thermal image pairs. While VLMs have demonstrated remarkable progress in visual reasoning and multimodal understanding, their evaluation has been predominantly limited to RGB-based benchmarks, leaving a critical gap in assessing their capabilities in infrared vision tasks. Existing visible-infrared datasets are either task-specific or lack high-quality annotations necessary for rigorous model evaluation. To address these limitations, RGB-Th-Bench provides a comprehensive evaluation framework covering 14 distinct skill dimensions, with a total of 1,600+ expert-annotated Yes/No questions. The benchmark employs two accuracy metrics: a standard question-level accuracy and a stricter skill-level accuracy, which evaluates model robustness across multiple questions within each skill dimension. This design ensures a thorough assessment of model performance, including resilience to adversarial and hallucinated responses. We conduct extensive evaluations on 19 state-of-the-art VLMs, revealing significant performance gaps in RGB-Thermal understanding. Our results show that even the strongest models struggle with thermal image comprehension, with performance heavily constrained by their RGB-based capabilities. Additionally, the lack of large-scale application-specific and expert-annotated thermal-caption-pair datasets in pre-training is an important reason of the observed performance gap. RGB-Th-Bench highlights the urgent need for further advancements in multimodal learning to bridge the gap between visible and thermal image understanding. The dataset is available through this link, and the evaluation code will also be made publicly available.

Title: BiblioPage: A Dataset of Scanned Title Pages for Bibliographic Metadata Extraction

Authors: Jan Kohút, Martin Dočekal, Michal Hradiš, Marek Vaško
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.19658
Pdf URL: https://arxiv.org/pdf/2503.19658
Copy Paste: [[2503.19658]] BiblioPage: A Dataset of Scanned Title Pages for Bibliographic Metadata Extraction(https://arxiv.org/abs/2503.19658)
Keywords: extraction, transformer, large language model
Abstract: Manual digitization of bibliographic metadata is time consuming and labor intensive, especially for historical and real-world archives with highly variable formatting across documents. Despite advances in machine learning, the absence of dedicated datasets for metadata extraction hinders automation. To address this gap, we introduce BiblioPage, a dataset of scanned title pages annotated with structured bibliographic metadata. The dataset consists of approximately 2,000 monograph title pages collected from 14 Czech libraries, spanning a wide range of publication periods, typographic styles, and layout structures. Each title page is annotated with 16 bibliographic attributes, including title, contributors, and publication metadata, along with precise positional information in the form of bounding boxes. To extract structured information from this dataset, we valuated object detection models such as YOLO and DETR combined with transformer-based OCR, achieving a maximum mAP of 52 and an F1 score of 59. Additionally, we assess the performance of various visual large language models, including LlamA 3.2-Vision and GPT-4o, with the best model reaching an F1 score of 67. BiblioPage serves as a real-world benchmark for bibliographic metadata extraction, contributing to document understanding, document question answering, and document information extraction. Dataset and evaluation scripts are availible at: this https URL

Title: CoSimGen: Controllable Diffusion Model for Simultaneous Image and Mask Generation

Authors: Rupak Bose, Chinedu Innocent Nwoye, Aditya Bhat, Nicolas Padoy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19661
Pdf URL: https://arxiv.org/pdf/2503.19661
Copy Paste: [[2503.19661]] CoSimGen: Controllable Diffusion Model for Simultaneous Image and Mask Generation(https://arxiv.org/abs/2503.19661)
Keywords: diffusion, generative, segmentation
Abstract: The acquisition of annotated datasets with paired images and segmentation masks is a critical challenge in domains such as medical imaging, remote sensing, and computer vision. Manual annotation demands significant resources, faces ethical constraints, and depends heavily on domain expertise. Existing generative models often target single-modality outputs, either images or segmentation masks, failing to address the need for high-quality, simultaneous image-mask generation. Additionally, these models frequently lack adaptable conditioning mechanisms, restricting control over the generated outputs and limiting their applicability for dataset augmentation and rare scenario simulation. We propose CoSimGen, a diffusion-based framework for controllable simultaneous image and mask generation. Conditioning is intuitively achieved through (1) text prompts grounded in class semantics, (2) spatial embedding of context prompts to provide spatial coherence, and (3) spectral embedding of timestep information to model noise levels during diffusion. To enhance controllability and training efficiency, the framework incorporates contrastive triplet loss between text and class embeddings, alongside diffusion and adversarial losses. Initial low-resolution outputs 128 x 128 are super-resolved to 512 x 512, producing high-fidelity images and masks with strict adherence to conditions. We evaluate CoSimGen on metrics such as FID, KID, LPIPS, Class FID, Positive predicted value for image fidelity and semantic alignment of generated samples over 4 diverse datasets. CoSimGen achieves state-of-the-art performance across all datasets, achieving the lowest KID of 0.11 and LPIPS of 0.53 across datasets.

Title: A multitask transformer to sign language translation using motion gesture primitives

Authors: Fredy Alejandro Mendoza López, Jefferson Rodriguez, Fabio Martínez
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19668
Pdf URL: https://arxiv.org/pdf/2503.19668
Copy Paste: [[2503.19668]] A multitask transformer to sign language translation using motion gesture primitives(https://arxiv.org/abs/2503.19668)
Keywords: transformer
Abstract: The absence of effective communication the deaf population represents the main social gap in this community. Furthermore, the sign language, main deaf communication tool, is unlettered, i.e., there is no formal written representation. In consequence, main challenge today is the automatic translation among spatiotemporal sign representation and natural text language. Recent approaches are based on encoder-decoder architectures, where the most relevant strategies integrate attention modules to enhance non-linear correspondences, besides, many of these approximations require complex training and architectural schemes to achieve reasonable predictions, because of the absence of intermediate text projections. However, they are still limited by the redundant background information of the video sequences. This work introduces a multitask transformer architecture that includes a gloss learning representation to achieve a more suitable translation. The proposed approach also includes a dense motion representation that enhances gestures and includes kinematic information, a key component in sign language. From this representation it is possible to avoid background information and exploit the geometry of the signs, in addition, it includes spatiotemporal representations that facilitate the alignment between gestures and glosses as an intermediate textual representation. The proposed approach outperforms the state-of-the-art evaluated on the CoL-SLTD dataset, achieving a BLEU-4 of 72,64% in split 1, and a BLEU-4 of 14,64% in split 2. Additionally, the strategy was validated on the RWTH-PHOENIX-Weather 2014 T dataset, achieving a competitive BLEU-4 of 11,58%.

Title: fine-CLIP: Enhancing Zero-Shot Fine-Grained Surgical Action Recognition with Vision-Language Models

Authors: Saurav Sharma, Didier Mutter, Nicolas Padoy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19670
Pdf URL: https://arxiv.org/pdf/2503.19670
Copy Paste: [[2503.19670]] fine-CLIP: Enhancing Zero-Shot Fine-Grained Surgical Action Recognition with Vision-Language Models(https://arxiv.org/abs/2503.19670)
Keywords: extraction
Abstract: While vision-language models like CLIP have advanced zero-shot surgical phase recognition, they struggle with fine-grained surgical activities, especially action triplets. This limitation arises because current CLIP formulations rely on global image features, which overlook the fine-grained semantics and contextual details crucial for complex tasks like zero-shot triplet recognition. Furthermore, these models do not explore the hierarchical structure inherent in triplets, reducing their ability to generalize to novel triplets. To address these challenges, we propose fine-CLIP, which learns object-centric features and lever- ages the hierarchy in triplet formulation. Our approach integrates three components: hierarchical prompt modeling to capture shared semantics, LoRA-based vision backbone adaptation for enhanced feature extraction, and a graph-based condensation strategy that groups similar patch features into meaningful object clusters. Since triplet classification is a challenging task, we introduce an alternative yet meaningful base-to-novel generalization benchmark with two settings on the CholecT50 dataset: Unseen-Target, assessing adaptability to triplets with novel anatomical structures, and Unseen-Instrument-Verb, where models need to generalize to novel instrument-verb interactions. fine-CLIP shows significant improvements in F1 and mAP, enhancing zero-shot recognition of novel surgical triplets.

Title: Unlocking the Hidden Potential of CLIP in Generalizable Deepfake Detection

Authors: Andrii Yermakov, Jan Cech, Jiri Matas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19683
Pdf URL: https://arxiv.org/pdf/2503.19683
Copy Paste: [[2503.19683]] Unlocking the Hidden Potential of CLIP in Generalizable Deepfake Detection(https://arxiv.org/abs/2503.19683)
Keywords: robust
Abstract: This paper tackles the challenge of detecting partially manipulated facial deepfakes, which involve subtle alterations to specific facial features while retaining the overall context, posing a greater detection difficulty than fully synthetic faces. We leverage the Contrastive Language-Image Pre-training (CLIP) model, specifically its ViT-L/14 visual encoder, to develop a generalizable detection method that performs robustly across diverse datasets and unknown forgery techniques with minimal modifications to the original model. The proposed approach utilizes parameter-efficient fine-tuning (PEFT) techniques, such as LN-tuning, to adjust a small subset of the model's parameters, preserving CLIP's pre-trained knowledge and reducing overfitting. A tailored preprocessing pipeline optimizes the method for facial images, while regularization strategies, including L2 normalization and metric learning on a hyperspherical manifold, enhance generalization. Trained on the FaceForensics++ dataset and evaluated in a cross-dataset fashion on Celeb-DF-v2, DFDC, FFIW, and others, the proposed method achieves competitive detection accuracy comparable to or outperforming much more complex state-of-the-art techniques. This work highlights the efficacy of CLIP's visual encoder in facial deepfake detection and establishes a simple, powerful baseline for future research, advancing the field of generalizable deepfake detection. The code is available at: this https URL

Title: AdaptiVocab: Enhancing LLM Efficiency in Focused Domains through Lightweight Vocabulary Adaptation

Authors: Itay Nakash, Nitay Calderon, Eyal Ben David, Elad Hoffer, Roi Reichart
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19693
Pdf URL: https://arxiv.org/pdf/2503.19693
Copy Paste: [[2503.19693]] AdaptiVocab: Enhancing LLM Efficiency in Focused Domains through Lightweight Vocabulary Adaptation(https://arxiv.org/abs/2503.19693)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown impressive versatility as general purpose models. However, their broad applicability comes at a high-cost computational overhead, particularly in auto-regressive decoding where each step requires a forward pass. In domain-specific settings, general-purpose capabilities are unnecessary and can be exchanged for efficiency. In this work, we take a novel perspective on domain adaptation, reducing latency and computational costs by adapting the vocabulary to focused domains of interest. We introduce AdaptiVocab, an end-to-end approach for vocabulary adaptation, designed to enhance LLM efficiency in low-resource domains. AdaptiVocab can be applied to any tokenizer and architecture, modifying the vocabulary by replacing tokens with domain-specific n-gram-based tokens, thereby reducing the number of tokens required for both input processing and output generation. AdaptiVocab initializes new n-token embeddings using an exponentially weighted combination of existing embeddings and employs a lightweight fine-tuning phase that can be efficiently performed on a single GPU. We evaluate two 7B LLMs across three niche domains, assessing efficiency, generation quality, and end-task performance. Our results show that AdaptiVocab reduces token usage by over 25% without compromising performance

Title: Optimization of MedSAM model based on bounding box adaptive perturbation algorithm

Authors: Boyi Li, Ye Yuan, Wenjun Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19700
Pdf URL: https://arxiv.org/pdf/2503.19700
Copy Paste: [[2503.19700]] Optimization of MedSAM model based on bounding box adaptive perturbation algorithm(https://arxiv.org/abs/2503.19700)
Keywords: robust, segmentation
Abstract: The MedSAM model, built upon the SAM framework, enhances medical image segmentation through generalizable training but still exhibits notable limitations. First, constraints in the perturbation window settings during training can cause MedSAM to incorrectly segment small tissues or organs together with adjacent structures, leading to segmentation errors. Second, when dealing with medical image targets characterized by irregular shapes and complex structures, segmentation often relies on narrowing the bounding box to refine segmentation intent. However, MedSAM's performance under reduced bounding box prompts remains suboptimal. To address these challenges, this study proposes a bounding box adaptive perturbation algorithm to optimize the training process. The proposed approach aims to reduce segmentation errors for small targets and enhance the model's accuracy when processing reduced bounding box prompts, ultimately improving the robustness and reliability of the MedSAM model for complex medical imaging tasks.

Title: Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations

Authors: Jungin Park, Jiyoung Lee, Kwanghoon Sohn
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19706
Pdf URL: https://arxiv.org/pdf/2503.19706
Copy Paste: [[2503.19706]] Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations(https://arxiv.org/abs/2503.19706)
Keywords: robust
Abstract: View-invariant representation learning from egocentric (first-person, ego) and exocentric (third-person, exo) videos is a promising approach toward generalizing video understanding systems across multiple viewpoints. However, this area has been underexplored due to the substantial differences in perspective, motion patterns, and context between ego and exo views. In this paper, we propose a novel masked ego-exo modeling that promotes both causal temporal dynamics and cross-view alignment, called Bootstrap Your Own Views (BYOV), for fine-grained view-invariant video representation learning from unpaired ego-exo videos. We highlight the importance of capturing the compositional nature of human actions as a basis for robust cross-view understanding. Specifically, self-view masking and cross-view masking predictions are designed to learn view-invariant and powerful representations concurrently. Experimental results demonstrate that our BYOV significantly surpasses existing approaches with notable gains across all metrics in four downstream ego-exo video tasks. The code is available at this https URL.

Title: On What Depends the Robustness of Multi-source Models to Missing Data in Earth Observation?

Authors: Francisco Mena, Diego Arenas, Miro Miranda, Andreas Dengel
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.19719
Pdf URL: https://arxiv.org/pdf/2503.19719
Copy Paste: [[2503.19719]] On What Depends the Robustness of Multi-source Models to Missing Data in Earth Observation?(https://arxiv.org/abs/2503.19719)
Keywords: robust
Abstract: In recent years, the development of robust multi-source models has emerged in the Earth Observation (EO) field. These are models that leverage data from diverse sources to improve predictive accuracy when there is missing data. Despite these advancements, the factors influencing the varying effectiveness of such models remain poorly understood. In this study, we evaluate the predictive performance of six state-of-the-art multi-source models in predicting scenarios where either a single data source is missing or only a single source is available. Our analysis reveals that the efficacy of these models is intricately tied to the nature of the task, the complementarity among data sources, and the model design. Surprisingly, we observe instances where the removal of certain data sources leads to improved predictive performance, challenging the assumption that incorporating all available data is always beneficial. These findings prompt critical reflections on model complexity and the necessity of all collected data sources, potentially shaping the way for more streamlined approaches in EO applications.

Title: EventMamba: Enhancing Spatio-Temporal Locality with State Space Models for Event-Based Video Reconstruction

Authors: Chengjie Ge, Xueyang Fu, Peng He, Kunyu Wang, Chengzhi Cao, Zheng-Jun Zha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19721
Pdf URL: https://arxiv.org/pdf/2503.19721
Copy Paste: [[2503.19721]] EventMamba: Enhancing Spatio-Temporal Locality with State Space Models for Event-Based Video Reconstruction(https://arxiv.org/abs/2503.19721)
Keywords: robust, transformer
Abstract: Leveraging its robust linear global modeling capability, Mamba has notably excelled in computer vision. Despite its success, existing Mamba-based vision models have overlooked the nuances of event-driven tasks, especially in video reconstruction. Event-based video reconstruction (EBVR) demands spatial translation invariance and close attention to local event relationships in the spatio-temporal domain. Unfortunately, conventional Mamba algorithms apply static window partitions and standard reshape scanning methods, leading to significant losses in local connectivity. To overcome these limitations, we introduce EventMamba--a specialized model designed for EBVR tasks. EventMamba innovates by incorporating random window offset (RWO) in the spatial domain, moving away from the restrictive fixed partitioning. Additionally, it features a new consistent traversal serialization approach in the spatio-temporal domain, which maintains the proximity of adjacent events both spatially and temporally. These enhancements enable EventMamba to retain Mamba's robust modeling capabilities while significantly preserving the spatio-temporal locality of event data. Comprehensive testing on multiple datasets shows that EventMamba markedly enhances video reconstruction, drastically improving computation speed while delivering superior visual quality compared to Transformer-based methods.

Title: CamSAM2: Segment Anything Accurately in Camouflaged Videos

Authors: Yuli Zhou, Guolei Sun, Yawei Li, Yuqian Fu, Luca Benini, Ender Konukoglu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19730
Pdf URL: https://arxiv.org/pdf/2503.19730
Copy Paste: [[2503.19730]] CamSAM2: Segment Anything Accurately in Camouflaged Videos(https://arxiv.org/abs/2503.19730)
Keywords: segmentation
Abstract: Video camouflaged object segmentation (VCOS), aiming at segmenting camouflaged objects that seamlessly blend into their environment, is a fundamental vision task with various real-world applications. With the release of SAM2, video segmentation has witnessed significant progress. However, SAM2's capability of segmenting camouflaged videos is suboptimal, especially when given simple prompts such as point and box. To address the problem, we propose Camouflaged SAM2 (CamSAM2), which enhances SAM2's ability to handle camouflaged scenes without modifying SAM2's parameters. Specifically, we introduce a decamouflaged token to provide the flexibility of feature adjustment for VCOS. To make full use of fine-grained and high-resolution features from the current frame and previous frames, we propose implicit object-aware fusion (IOF) and explicit object-aware fusion (EOF) modules, respectively. Object prototype generation (OPG) is introduced to abstract and memorize object prototypes with informative details using high-quality features from previous frames. Extensive experiments are conducted to validate the effectiveness of our approach. While CamSAM2 only adds negligible learnable parameters to SAM2, it substantially outperforms SAM2 on three VCOS datasets, especially achieving 12.2 mDice gains with click prompt on MoCA-Mask and 19.6 mDice gains with mask prompt on SUN-SEG-Hard, with Hiera-T as the backbone. The code will be available at \href{this https URL}{this http URL}.

Title: PCM : Picard Consistency Model for Fast Parallel Sampling of Diffusion Models

Authors: Junhyuk So, Jiwoong Shin, Chaeyeon Jang, Eunhyeok Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19731
Pdf URL: https://arxiv.org/pdf/2503.19731
Copy Paste: [[2503.19731]] PCM : Picard Consistency Model for Fast Parallel Sampling of Diffusion Models(https://arxiv.org/abs/2503.19731)
Keywords: diffusion
Abstract: Recently, diffusion models have achieved significant advances in vision, text, and robotics. However, they still face slow generation speeds due to sequential denoising processes. To address this, a parallel sampling method based on Picard iteration was introduced, effectively reducing sequential steps while ensuring exact convergence to the original output. Nonetheless, Picard iteration does not guarantee faster convergence, which can still result in slow generation in practice. In this work, we propose a new parallelization scheme, the Picard Consistency Model (PCM), which significantly reduces the number of generation steps in Picard iteration. Inspired by the consistency model, PCM is directly trained to predict the fixed-point solution, or the final output, at any stage of the convergence trajectory. Additionally, we introduce a new concept called model switching, which addresses PCM's limitations and ensures exact convergence. Extensive experiments demonstrate that PCM achieves up to a 2.71x speedup over sequential sampling and a 1.77x speedup over Picard iteration across various tasks, including image generation and robotic control.

Title: How to RETIRE Tabular Data in Favor of Discrete Digital Signal Representation

Authors: Paweł Zyblewski, Szymon Wojciechowski
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.19733
Pdf URL: https://arxiv.org/pdf/2503.19733
Copy Paste: [[2503.19733]] How to RETIRE Tabular Data in Favor of Discrete Digital Signal Representation(https://arxiv.org/abs/2503.19733)
Keywords: explainability
Abstract: The successes achieved by deep neural networks in computer vision tasks have led in recent years to the emergence of a new research area dubbed Multi-Dimensional Encoding (MDE). Methods belonging to this family aim to transform tabular data into a homogeneous form of discrete digital signals (images) to apply convolutional networks to initially unsuitable problems. Despite the successive emerging works, the pool of multi-dimensional encoding methods is still low, and the scope of research on existing modality encoding techniques is quite limited. To contribute to this area of research, we propose the Radar-based Encoding from Tabular to Image REpresentation (RETIRE), which allows tabular data to be represented as radar graphs, capturing the feature characteristics of each problem instance. RETIRE was compared with a pool of state-of-the-art MDE algorithms as well as with XGBoost in terms of classification accuracy and computational complexity. In addition, an analysis was carried out regarding transferability and explainability to provide more insight into both RETIRE and existing MDE techniques. The results obtained, supported by statistical analysis, confirm the superiority of RETIRE over other established MDE methods.

Title: FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion

Authors: Pihai Sun (1), Junjun Jiang (1), Yuanqi Yao (1), Youyu Chen (1), Wenbo Zhao (1), Kui Jiang (1), Xianming Liu (1) ((1) Faculty of Computing, Harbin Institute of Technology)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19739
Pdf URL: https://arxiv.org/pdf/2503.19739
Copy Paste: [[2503.19739]] FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion(https://arxiv.org/abs/2503.19739)
Keywords: robust
Abstract: Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability stemming from two factors: 1) limited annotated image-event-depth datasets causing insufficient cross-modal supervision, and 2) inherent frequency mismatches between static images and dynamic event streams with distinct spatiotemporal patterns, leading to ineffective feature fusion. To address this dual challenge, we propose Frequency-decoupled Unified Self-supervised Encoder (FUSE) with two synergistic components: The Parameter-efficient Self-supervised Transfer (PST) establishes cross-modal knowledge transfer through latent space alignment with image foundation models, effectively mitigating data scarcity by enabling joint encoding without depth ground this http URL this, we propose the Frequency-Decoupled Fusion module (FreDFuse) to explicitly decouple high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches through physics-aware fusion. This combined approach enables FUSE to construct a universal image-event encoder that only requires lightweight decoder adaptation for target datasets. Extensive experiments demonstrate state-of-the-art performance with 14% and 24.9% improvements in this http URL on MVSEC and DENSE datasets. The framework exhibits remarkable zero-shot adaptability to challenging scenarios including extreme lighting and motion blur, significantly advancing real-world deployment capabilities. The source code for our method is publicly available at: this https URL

Title: ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation

Authors: Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, Xiang Bai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19755
Pdf URL: https://arxiv.org/pdf/2503.19755
Copy Paste: [[2503.19755]] ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation(https://arxiv.org/abs/2503.19755)
Keywords: generative, large language model
Abstract: End-to-end (E2E) autonomous driving methods still struggle to make correct decisions in interactive closed-loop evaluation due to limited causal reasoning capability. Current methods attempt to leverage the powerful understanding and reasoning abilities of Vision-Language Models (VLMs) to resolve this dilemma. However, the problem is still open that few VLMs for E2E methods perform well in the closed-loop evaluation due to the gap between the semantic reasoning space and the purely numerical trajectory output in the action space. To tackle this issue, we propose ORION, a holistic E2E autonomous driving framework by vision-language instructed action generation. ORION uniquely combines a QT-Former to aggregate long-term history context, a Large Language Model (LLM) for driving scenario reasoning, and a generative planner for precision trajectory prediction. ORION further aligns the reasoning space and the action space to implement a unified E2E optimization for both visual question-answering (VQA) and planning tasks. Our method achieves an impressive closed-loop performance of 77.74 Driving Score (DS) and 54.62% Success Rate (SR) on the challenge Bench2Drive datasets, which outperforms state-of-the-art (SOTA) methods by a large margin of 14.28 DS and 19.61% SR.

Title: OpenLex3D: A New Evaluation Benchmark for Open-Vocabulary 3D Scene Representations

Authors: Christina Kassab, Sacha Morin, Martin Büchner, Matías Mattamala, Kumaraditya Gupta, Abhinav Valada, Liam Paull, Maurice Fallon
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.19764
Pdf URL: https://arxiv.org/pdf/2503.19764
Copy Paste: [[2503.19764]] OpenLex3D: A New Evaluation Benchmark for Open-Vocabulary 3D Scene Representations(https://arxiv.org/abs/2503.19764)
Keywords: segmentation
Abstract: 3D scene understanding has been transformed by open-vocabulary language models that enable interaction via natural language. However, the evaluation of these representations is limited to closed-set semantics that do not capture the richness of language. This work presents OpenLex3D, a dedicated benchmark to evaluate 3D open-vocabulary scene representations. OpenLex3D provides entirely new label annotations for 23 scenes from Replica, ScanNet++, and HM3D, which capture real-world linguistic variability by introducing synonymical object categories and additional nuanced descriptions. By introducing an open-set 3D semantic segmentation task and an object retrieval task, we provide insights on feature precision, segmentation, and downstream capabilities. We evaluate various existing 3D open-vocabulary methods on OpenLex3D, showcasing failure cases, and avenues for improvement. The benchmark is publicly available at: this https URL.

Title: A Managed Tokens Service for Securely Keeping and Distributing Grid Tokens

Authors: Shreyas Bhat (1), Dave Dykstra (1) ((1) Fermi National Accelerator Laboratory)
Subjects: cs.CR, physics.ins-det
Abstract URL: https://arxiv.org/abs/2503.19768
Pdf URL: https://arxiv.org/pdf/2503.19768
Copy Paste: [[2503.19768]] A Managed Tokens Service for Securely Keeping and Distributing Grid Tokens(https://arxiv.org/abs/2503.19768)
Keywords: secure, protect
Abstract: Fermilab is transitioning authentication and authorization for grid operations to using bearer tokens based on the WLCG Common JWT (JSON Web Token) Profile. One of the functionalities that Fermilab experimenters rely on is the ability to automate batch job submission, which in turn depends on the ability to securely refresh and distribute the necessary credentials to experiment job submit points. Thus, with the transition to using tokens for grid operations, we needed to create a service that would obtain, refresh, and distribute tokens for experimenters' use. This service would avoid the need for experimenters to be experts in obtaining their own tokens and would better protect the most sensitive long-lived credentials. Further, the service needed to be widely scalable, as Fermilab hosts many experiments, each of which would need their own credentials. To address these issues, we created and deployed a Managed Tokens Service. The service is written in Go, taking advantage of that language's native concurrency primitives to easily be able to scale operations as we onboard experiments. The service uses as its first credentials a set of kerberos keytabs, stored on the same secure machine that the Managed Tokens service runs on. These kerberos credentials allow the service to use htgettoken via condor_vault_storer to store vault tokens in the HTCondor credential managers (credds) that run on the batch system scheduler machines (HTCondor schedds); as well as downloading a local, shorter-lived copy of the vault token. The kerberos credentials are then also used to distribute copies of the locally-stored vault tokens to experiment submit points.

Title: BiPrompt-SAM: Enhancing Image Segmentation via Explicit Selection between Point and Text Prompts

Authors: Suzhe Xu, Jialin Peng, Chengyuan Zhang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.19769
Pdf URL: https://arxiv.org/pdf/2503.19769
Copy Paste: [[2503.19769]] BiPrompt-SAM: Enhancing Image Segmentation via Explicit Selection between Point and Text Prompts(https://arxiv.org/abs/2503.19769)
Keywords: segmentation
Abstract: Segmentation is a fundamental task in computer vision, with prompt-driven methods gaining prominence due to their flexibility. The recent Segment Anything Model (SAM) has demonstrated powerful point-prompt segmentation capabilities, while text-based segmentation models offer rich semantic understanding. However, existing approaches rarely explore how to effectively combine these complementary modalities for optimal segmentation performance. This paper presents BiPrompt-SAM, a novel dual-modal prompt segmentation framework that fuses the advantages of point and text prompts through an explicit selection mechanism. Specifically, we leverage SAM's inherent ability to generate multiple mask candidates, combined with a semantic guidance mask from text prompts, and explicitly select the most suitable candidate based on similarity metrics. This approach can be viewed as a simplified Mixture of Experts (MoE) system, where the point and text modules act as distinct "experts," and the similarity scoring serves as a rudimentary "gating network." We conducted extensive evaluations on both the Endovis17 medical dataset and RefCOCO series natural image datasets. On Endovis17, BiPrompt-SAM achieved 89.55\% mDice and 81.46\% mIoU, comparable to state-of-the-art specialized medical segmentation models. On the RefCOCO series datasets, our method attained 87.1\%, 86.5\%, and 85.8\% IoU, significantly outperforming existing approaches. Experiments demonstrate that our explicit dual-selection method effectively combines the spatial precision of point prompts with the semantic richness of text prompts, particularly excelling in scenarios involving semantically complex objects, multiple similar objects, and partial occlusions. BiPrompt-SAM not only provides a simple yet effective implementation but also offers a new perspective on multi-modal prompt fusion.

Title: Resilient Sensor Fusion under Adverse Sensor Failures via Multi-Modal Expert Fusion

Authors: Konyul Park, Yecheol Kim, Daehun Kim, Jun Won Choi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19776
Pdf URL: https://arxiv.org/pdf/2503.19776
Copy Paste: [[2503.19776]] Resilient Sensor Fusion under Adverse Sensor Failures via Multi-Modal Expert Fusion(https://arxiv.org/abs/2503.19776)
Keywords: robust
Abstract: Modern autonomous driving perception systems utilize complementary multi-modal sensors, such as LiDAR and cameras. Although sensor fusion architectures enhance performance in challenging environments, they still suffer significant performance drops under severe sensor failures, such as LiDAR beam reduction, LiDAR drop, limited field of view, camera drop, and occlusion. This limitation stems from inter-modality dependencies in current sensor fusion frameworks. In this study, we introduce an efficient and robust LiDAR-camera 3D object detector, referred to as MoME, which can achieve robust performance through a mixture of experts approach. Our MoME fully decouples modality dependencies using three parallel expert decoders, which use camera features, LiDAR features, or a combination of both to decode object queries, respectively. We propose Multi-Expert Decoding (MED) framework, where each query is decoded selectively using one of three expert decoders. MoME utilizes an Adaptive Query Router (AQR) to select the most appropriate expert decoder for each query based on the quality of camera and LiDAR features. This ensures that each query is processed by the best-suited expert, resulting in robust performance across diverse sensor failure scenarios. We evaluated the performance of MoME on the nuScenes-R benchmark. Our MoME achieved state-of-the-art performance in extreme weather and sensor failure conditions, significantly outperforming the existing models across various sensor failure scenarios.

Title: LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation

Authors: Vladan Stojnić, Yannis Kalantidis, Jiří Matas, Giorgos Tolias
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.19777
Pdf URL: https://arxiv.org/pdf/2503.19777
Copy Paste: [[2503.19777]] LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation(https://arxiv.org/abs/2503.19777)
Keywords: segmentation
Abstract: We propose a training-free method for open-vocabulary semantic segmentation using Vision-and-Language Models (VLMs). Our approach enhances the initial per-patch predictions of VLMs through label propagation, which jointly optimizes predictions by incorporating patch-to-patch relationships. Since VLMs are primarily optimized for cross-modal alignment and not for intra-modal similarity, we use a Vision Model (VM) that is observed to better capture these relationships. We address resolution limitations inherent to patch-based encoders by applying label propagation at the pixel level as a refinement step, significantly improving segmentation accuracy near class boundaries. Our method, called LPOSS+, performs inference over the entire image, avoiding window-based processing and thereby capturing contextual interactions across the full image. LPOSS+ achieves state-of-the-art performance among training-free methods, across a diverse set of datasets. Code: this https URL

Title: PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch

Authors: Abhishek Ghosh, Ajay Nayak, Ashish Panwar, Arkaprava Basu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.19779
Pdf URL: https://arxiv.org/pdf/2503.19779
Copy Paste: [[2503.19779]] PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch(https://arxiv.org/abs/2503.19779)
Keywords: robust
Abstract: CUDA Graphs -- a recent hardware feature introduced for NVIDIA GPUs -- aim to reduce CPU launch overhead by capturing and launching a series of GPU tasks (kernels) as a DAG. However, deploying CUDA Graphs faces several challenges today due to the static structure of a graph. It also incurs performance overhead due to data copy. In fact, we show a counter-intuitive result -- deploying CUDA Graphs hurts performance in many cases. We introduce PyGraph, a novel approach to automatically harness the power of CUDA Graphs within PyTorch2. Driven by three key observations, PyGraph embodies three novel optimizations: it enables wider deployment of CUDA Graphs, reduces GPU kernel parameter copy overheads, and selectively deploys CUDA Graphs based on a cost-benefit analysis. PyGraph seamlessly integrates with PyTorch2's compilation toolchain, enabling efficient use of CUDA Graphs without manual modifications to the code. We evaluate PyGraph across various machine learning benchmarks, demonstrating substantial performance improvements over PyTorch2.

Title: Fine-Grained Erasure in Text-to-Image Diffusion-based Foundation Models

Authors: Kartik Thakral, Tamar Glaser, Tal Hassner, Mayank Vatsa, Richa Singh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19783
Pdf URL: https://arxiv.org/pdf/2503.19783
Copy Paste: [[2503.19783]] Fine-Grained Erasure in Text-to-Image Diffusion-based Foundation Models(https://arxiv.org/abs/2503.19783)
Keywords: diffusion, generative
Abstract: Existing unlearning algorithms in text-to-image generative models often fail to preserve the knowledge of semantically related concepts when removing specific target concepts: a challenge known as adjacency. To address this, we propose FADE (Fine grained Attenuation for Diffusion Erasure), introducing adjacency aware unlearning in diffusion models. FADE comprises two components: (1) the Concept Neighborhood, which identifies an adjacency set of related concepts, and (2) Mesh Modules, employing a structured combination of Expungement, Adjacency, and Guidance loss components. These enable precise erasure of target concepts while preserving fidelity across related and unrelated concepts. Evaluated on datasets like Stanford Dogs, Oxford Flowers, CUB, I2P, Imagenette, and ImageNet1k, FADE effectively removes target concepts with minimal impact on correlated concepts, achieving atleast a 12% improvement in retention performance over state-of-the-art methods.

Title: SITA: Structurally Imperceptible and Transferable Adversarial Attacks for Stylized Image Generation

Authors: Jingdan Kang, Haoxin Yang, Yan Cai, Huaidong Zhang, Xuemiao Xu, Yong Du, Shengfeng He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19791
Pdf URL: https://arxiv.org/pdf/2503.19791
Copy Paste: [[2503.19791]] SITA: Structurally Imperceptible and Transferable Adversarial Attacks for Stylized Image Generation(https://arxiv.org/abs/2503.19791)
Keywords: protect, attack, robust, extraction, diffusion
Abstract: Image generation technology has brought significant advancements across various fields but has also raised concerns about data misuse and potential rights infringements, particularly with respect to creating visual artworks. Current methods aimed at safeguarding artworks often employ adversarial attacks. However, these methods face challenges such as poor transferability, high computational costs, and the introduction of noticeable noise, which compromises the aesthetic quality of the original artwork. To address these limitations, we propose a Structurally Imperceptible and Transferable Adversarial (SITA) attacks. SITA leverages a CLIP-based destylization loss, which decouples and disrupts the robust style representation of the image. This disruption hinders style extraction during stylized image generation, thereby impairing the overall stylization process. Importantly, SITA eliminates the need for a surrogate diffusion model, leading to significantly reduced computational overhead. The method's robust style feature disruption ensures high transferability across diverse models. Moreover, SITA introduces perturbations by embedding noise within the imperceptible structural details of the image. This approach effectively protects against style extraction without compromising the visual quality of the artwork. Extensive experiments demonstrate that SITA offers superior protection for artworks against unauthorized use in stylized generation. It significantly outperforms existing methods in terms of transferability, computational efficiency, and noise imperceptibility. Code is available at this https URL.

Title: In the Blink of an Eye: Instant Game Map Editing using a Generative-AI Smart Brush

Authors: Vitaly Gnatyuk, Valeriia Koriukina Ilya Levoshevich, Pavel Nurminskiy, Guenter Wallner
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19793
Pdf URL: https://arxiv.org/pdf/2503.19793
Copy Paste: [[2503.19793]] In the Blink of an Eye: Instant Game Map Editing using a Generative-AI Smart Brush(https://arxiv.org/abs/2503.19793)
Keywords: diffusion, generative
Abstract: With video games steadily increasing in complexity, automated generation of game content has found widespread interest. However, the task of 3D gaming map art creation remains underexplored to date due to its unique complexity and domain-specific challenges. While recent works have addressed related topics such as retro-style level generation and procedural terrain creation, these works primarily focus on simpler data distributions. To the best of our knowledge, we are the first to demonstrate the application of modern AI techniques for high-resolution texture manipulation in complex, highly detailed AAA 3D game environments. We introduce a novel Smart Brush for map editing, designed to assist artists in seamlessly modifying selected areas of a game map with minimal effort. By leveraging generative adversarial networks and diffusion models we propose two variants of the brush that enable efficient and context-aware generation. Our hybrid workflow aims to enhance both artistic flexibility and production efficiency, enabling the refinement of environments without manually reworking every detail, thus helping to bridge the gap between automation and creative control in game development. A comparative evaluation of our two methods with adapted versions of several state-of-the art models shows that our GAN-based brush produces the sharpest and most detailed outputs while preserving image context while the evaluated state-of-the-art models tend towards blurrier results and exhibit difficulties in maintaining contextual consistency.

Title: PAVE: Patching and Adapting Video Large Language Models

Authors: Zhuoming Liu, Yiquan Li, Khoi Duc Nguyen, Yiwu Zhong, Yin Li
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.19794
Pdf URL: https://arxiv.org/pdf/2503.19794
Copy Paste: [[2503.19794]] PAVE: Patching and Adapting Video Large Language Models(https://arxiv.org/abs/2503.19794)
Keywords: large language model
Abstract: Pre-trained video large language models (Video LLMs) exhibit remarkable reasoning capabilities, yet adapting these models to new tasks involving additional modalities or data types (e.g., audio or 3D information) remains challenging. In this paper, we present PAVE, a flexible framework for adapting pre-trained Video LLMs to downstream tasks with side-channel signals, such as audio, 3D cues, or multi-view videos. PAVE introduces lightweight adapters, referred to as "patches," which add a small number of parameters and operations to a base model without modifying its architecture or pre-trained weights. In doing so, PAVE can effectively adapt the pre-trained base model to support diverse downstream tasks, including audio-visual question answering, 3D reasoning, multi-view video recognition, and high frame rate video understanding. Across these tasks, PAVE significantly enhances the performance of the base model, surpassing state-of-the-art task-specific models while incurring a minor cost of ~0.1% additional FLOPs and parameters. Further, PAVE supports multi-task learning and generalizes well across different Video LLMs. Our code is available at this https URL.

Title: Unpaired Object-Level SAR-to-Optical Image Translation for Aircraft with Keypoints-Guided Diffusion Models

Authors: Ruixi You, Hecheng Jia, Feng Xu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.19798
Pdf URL: https://arxiv.org/pdf/2503.19798
Copy Paste: [[2503.19798]] Unpaired Object-Level SAR-to-Optical Image Translation for Aircraft with Keypoints-Guided Diffusion Models(https://arxiv.org/abs/2503.19798)
Keywords: interpretability, diffusion
Abstract: Synthetic Aperture Radar (SAR) imagery provides all-weather, all-day, and high-resolution imaging capabilities but its unique imaging mechanism makes interpretation heavily reliant on expert knowledge, limiting interpretability, especially in complex target tasks. Translating SAR images into optical images is a promising solution to enhance interpretation and support downstream tasks. Most existing research focuses on scene-level translation, with limited work on object-level translation due to the scarcity of paired data and the challenge of accurately preserving contour and texture details. To address these issues, this study proposes a keypoint-guided diffusion model (KeypointDiff) for SAR-to-optical image translation of unpaired aircraft targets. This framework introduces supervision on target class and azimuth angle via keypoints, along with a training strategy for unpaired data. Based on the classifier-free guidance diffusion architecture, a class-angle guidance module (CAGM) is designed to integrate class and angle information into the diffusion generation process. Furthermore, adversarial loss and consistency loss are employed to improve image fidelity and detail quality, tailored for aircraft targets. During sampling, aided by a pre-trained keypoint detector, the model eliminates the requirement for manually labeled class and azimuth information, enabling automated SAR-to-optical translation. Experimental results demonstrate that the proposed method outperforms existing approaches across multiple metrics, providing an efficient and effective solution for object-level SAR-to-optical translation and downstream tasks. Moreover, the method exhibits strong zero-shot generalization to untrained aircraft types with the assistance of the keypoint detector.

Title: SemEval-2025 Task 9: The Food Hazard Detection Challenge

Authors: Korbinian Randl, John Pavlopoulos, Aron Henriksson, Tony Lindgren, Juli Bakagianni
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19800
Pdf URL: https://arxiv.org/pdf/2503.19800
Copy Paste: [[2503.19800]] SemEval-2025 Task 9: The Food Hazard Detection Challenge(https://arxiv.org/abs/2503.19800)
Keywords: large language model
Abstract: In this challenge, we explored text-based food hazard prediction with long tail distributed classes. The task was divided into two subtasks: (1) predicting whether a web text implies one of ten food-hazard categories and identifying the associated food category, and (2) providing a more fine-grained classification by assigning a specific label to both the hazard and the product. Our findings highlight that large language model-generated synthetic data can be highly effective for oversampling long-tail distributions. Furthermore, we find that fine-tuned encoder-only, encoder-decoder, and decoder-only systems achieve comparable maximum performance across both subtasks. During this challenge, we gradually released (under CC BY-NC-SA 4.0) a novel set of 6,644 manually labeled food-incident reports.

Title: SeLIP: Similarity Enhanced Contrastive Language Image Pretraining for Multi-modal Head MRI

Authors: Zhiyang Liu, Dong Yang, Minghao Zhang, Hanyu Sun, Hong Wu, Huiying Wang, Wen Shen, Chao Chai, Shuang Xia
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19801
Pdf URL: https://arxiv.org/pdf/2503.19801
Copy Paste: [[2503.19801]] SeLIP: Similarity Enhanced Contrastive Language Image Pretraining for Multi-modal Head MRI(https://arxiv.org/abs/2503.19801)
Keywords: segmentation
Abstract: Despite that deep learning (DL) methods have presented tremendous potential in many medical image analysis tasks, the practical applications of medical DL models are limited due to the lack of enough data samples with manual annotations. By noting that the clinical radiology examinations are associated with radiology reports that describe the images, we propose to develop a foundation model for multi-model head MRI by using contrastive learning on the images and the corresponding radiology findings. In particular, a contrastive learning framework is proposed, where a mixed syntax and semantic similarity matching metric is integrated to reduce the thirst of extreme large dataset in conventional contrastive learning framework. Our proposed similarity enhanced contrastive language image pretraining (SeLIP) is able to effectively extract more useful features. Experiments revealed that our proposed SeLIP performs well in many downstream tasks including image-text retrieval task, classification task, and image segmentation, which highlights the importance of considering the similarities among texts describing different images in developing medical image foundation models.

Title: Bitstream Collisions in Neural Image Compression via Adversarial Perturbations

Authors: Jordan Madden, Lhamo Dorje, Xiaohua Li
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19817
Pdf URL: https://arxiv.org/pdf/2503.19817
Copy Paste: [[2503.19817]] Bitstream Collisions in Neural Image Compression via Adversarial Perturbations(https://arxiv.org/abs/2503.19817)
Keywords: security, attack, robust
Abstract: Neural image compression (NIC) has emerged as a promising alternative to classical compression techniques, offering improved compression ratios. Despite its progress towards standardization and practical deployment, there has been minimal exploration into it's robustness and security. This study reveals an unexpected vulnerability in NIC - bitstream collisions - where semantically different images produce identical compressed bitstreams. Utilizing a novel whitebox adversarial attack algorithm, this paper demonstrates that adding carefully crafted perturbations to semantically different images can cause their compressed bitstreams to collide exactly. The collision vulnerability poses a threat to the practical usability of NIC, particularly in security-critical applications. The cause of the collision is analyzed, and a simple yet effective mitigation method is presented.

Title: Domain-incremental White Blood Cell Classification with Privacy-aware Continual Learning

Authors: Pratibha Kumari, Afshin Bozorgpour, Daniel Reisenbüchler, Edgar Jost, Martina Crysandt, Christian Matek, Dorit Merhof
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.19819
Pdf URL: https://arxiv.org/pdf/2503.19819
Copy Paste: [[2503.19819]] Domain-incremental White Blood Cell Classification with Privacy-aware Continual Learning(https://arxiv.org/abs/2503.19819)
Keywords: privacy, robust, generative
Abstract: White blood cell (WBC) classification plays a vital role in hematology for diagnosing various medical conditions. However, it faces significant challenges due to domain shifts caused by variations in sample sources (e.g., blood or bone marrow) and differing imaging conditions across hospitals. Traditional deep learning models often suffer from catastrophic forgetting in such dynamic environments, while foundation models, though generally robust, experience performance degradation when the distribution of inference data differs from that of the training data. To address these challenges, we propose a generative replay-based Continual Learning (CL) strategy designed to prevent forgetting in foundation models for WBC classification. Our method employs lightweight generators to mimic past data with a synthetic latent representation to enable privacy-preserving replay. To showcase the effectiveness, we carry out extensive experiments with a total of four datasets with different task ordering and four backbone models including ResNet50, RetCCL, CTransPath, and UNI. Experimental results demonstrate that conventional fine-tuning methods degrade performance on previously learned tasks and struggle with domain shifts. In contrast, our continual learning strategy effectively mitigates catastrophic forgetting, preserving model performance across varying domains. This work presents a practical solution for maintaining reliable WBC classification in real-world clinical settings, where data distributions frequently evolve.

Title: FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model

Authors: Jun Zhou, Jiahao Li, Zunnan Xu, Hanhui Li, Yiji Cheng, Fa-Ting Hong, Qin Lin, Qinglin Lu, Xiaodan Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19839
Pdf URL: https://arxiv.org/pdf/2503.19839
Copy Paste: [[2503.19839]] FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model(https://arxiv.org/abs/2503.19839)
Keywords: diffusion
Abstract: Currently, instruction-based image editing methods have made significant progress by leveraging the powerful cross-modal understanding capabilities of vision language models (VLMs). However, they still face challenges in three key areas: 1) complex scenarios; 2) semantic consistency; and 3) fine-grained editing. To address these issues, we propose FireEdit, an innovative Fine-grained Instruction-based image editing framework that exploits a REgion-aware VLM. FireEdit is designed to accurately comprehend user instructions and ensure effective control over the editing process. Specifically, we enhance the fine-grained visual perception capabilities of the VLM by introducing additional region tokens. Relying solely on the output of the LLM to guide the diffusion model may lead to suboptimal editing results. Therefore, we propose a Time-Aware Target Injection module and a Hybrid Visual Cross Attention module. The former dynamically adjusts the guidance strength at various denoising stages by integrating timestep embeddings with the text embeddings. The latter enhances visual details for image editing, thereby preserving semantic consistency between the edited result and the source image. By combining the VLM enhanced with fine-grained region tokens and the time-dependent diffusion model, FireEdit demonstrates significant advantages in comprehending editing instructions and maintaining high semantic consistency. Extensive experiments indicate that our approach surpasses the state-of-the-art instruction-based image editing methods. Our project is available at this https URL.

Title: A Comparative Analysis of Word Segmentation, Part-of-Speech Tagging, and Named Entity Recognition for Historical Chinese Sources, 1900-1950

Authors: Zhao Fang, Liang-Chun Wu, Xuening Kong, Spencer Dean Stewart
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19844
Pdf URL: https://arxiv.org/pdf/2503.19844
Copy Paste: [[2503.19844]] A Comparative Analysis of Word Segmentation, Part-of-Speech Tagging, and Named Entity Recognition for Historical Chinese Sources, 1900-1950(https://arxiv.org/abs/2503.19844)
Keywords: large language model, segmentation
Abstract: This paper compares large language models (LLMs) and traditional natural language processing (NLP) tools for performing word segmentation, part-of-speech (POS) tagging, and named entity recognition (NER) on Chinese texts from 1900 to 1950. Historical Chinese documents pose challenges for text analysis due to their logographic script, the absence of natural word boundaries, and significant linguistic changes. Using a sample dataset from the Shanghai Library Republican Journal corpus, traditional tools such as Jieba and spaCy are compared to LLMs, including GPT-4o, Claude 3.5, and the GLM series. The results show that LLMs outperform traditional methods in all metrics, albeit at considerably higher computational costs, highlighting a trade-off between accuracy and efficiency. Additionally, LLMs better handle genre-specific challenges such as poetry and temporal variations (i.e., pre-1920 versus post-1920 texts), demonstrating that their contextual learning capabilities can advance NLP approaches to historical texts by reducing the need for domain-specific training data.

Title: Attention IoU: Examining Biases in CelebA using Attention Maps

Authors: Aaron Serianni, Tyler Zhu, Vikram V. Ramaswamy, Olga Russakovsky
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.19846
Pdf URL: https://arxiv.org/pdf/2503.19846
Copy Paste: [[2503.19846]] Attention IoU: Examining Biases in CelebA using Attention Maps(https://arxiv.org/abs/2503.19846)
Keywords: protect
Abstract: Computer vision models have been shown to exhibit and amplify biases across a wide array of datasets and tasks. Existing methods for quantifying bias in classification models primarily focus on dataset distribution and model performance on subgroups, overlooking the internal workings of a model. We introduce the Attention-IoU (Attention Intersection over Union) metric and related scores, which use attention maps to reveal biases within a model's internal representations and identify image features potentially causing the biases. First, we validate Attention-IoU on the synthetic Waterbirds dataset, showing that the metric accurately measures model bias. We then analyze the CelebA dataset, finding that Attention-IoU uncovers correlations beyond accuracy disparities. Through an investigation of individual attributes through the protected attribute of Male, we examine the distinct ways biases are represented in CelebA. Lastly, by subsampling the training set to change attribute correlations, we demonstrate that Attention-IoU reveals potential confounding variables not present in dataset labels.

Title: FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs

Authors: Carlos Plou, Cesar Borja, Ruben Martinez-Cantin, Ana C. Murillo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19850
Pdf URL: https://arxiv.org/pdf/2503.19850
Copy Paste: [[2503.19850]] FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs(https://arxiv.org/abs/2503.19850)
Keywords: large language model
Abstract: Information retrieval in hour-long videos presents a significant challenge, even for state-of-the-art Vision-Language Models (VLMs), particularly when the desired information is localized within a small subset of frames. Long video data presents challenges for VLMs due to context window limitations and the difficulty of pinpointing frames containing the answer. Our novel video agent, FALCONEye, combines a VLM and a Large Language Model (LLM) to search relevant information along the video, and locate the frames with the answer. FALCONEye novelty relies on 1) the proposed meta-architecture, which is better suited to tackle hour-long videos compared to short video approaches in the state-of-the-art; 2) a new efficient exploration algorithm to locate the information using short clips, captions and answer confidence; and 3) our state-of-the-art VLMs calibration analysis for the answer confidence. Our agent is built over a small-size VLM and a medium-size LLM being accessible to run on standard computational resources. We also release FALCON-Bench, a benchmark to evaluate long (average > 1 hour) Video Answer Search challenges, highlighting the need for open-ended question evaluation. Our experiments show FALCONEye's superior performance than the state-of-the-art in FALCON-Bench, and similar or better performance in related benchmarks.

Title: Towards Online Multi-Modal Social Interaction Understanding

Authors: Xinpeng Li, Shijian Deng, Bolin Lai, Weiguo Pian, James M. Rehg, Yapeng Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19851
Pdf URL: https://arxiv.org/pdf/2503.19851
Copy Paste: [[2503.19851]] Towards Online Multi-Modal Social Interaction Understanding(https://arxiv.org/abs/2503.19851)
Keywords: large language model
Abstract: Multimodal social interaction understanding (MMSI) is critical in human-robot interaction systems. In real-world scenarios, AI agents are required to provide real-time feedback. However, existing models often depend on both past and future contexts, which hinders them from applying to real-world problems. To bridge this gap, we propose an online MMSI setting, where the model must resolve MMSI tasks using only historical information, such as recorded dialogues and video streams. To address the challenges of missing the useful future context, we develop a novel framework, named Online-MMSI-VLM, that leverages two complementary strategies: multi-party conversation forecasting and social-aware visual prompting with multi-modal large language models. First, to enrich linguistic context, the multi-party conversation forecasting simulates potential future utterances in a coarse-to-fine manner, anticipating upcoming speaker turns and then generating fine-grained conversational details. Second, to effectively incorporate visual social cues like gaze and gesture, social-aware visual prompting highlights the social dynamics in video with bounding boxes and body keypoints for each person and frame. Extensive experiments on three tasks and two datasets demonstrate that our method achieves state-of-the-art performance and significantly outperforms baseline models, indicating its effectiveness on Online-MMSI. The code and pre-trained models will be publicly released at: this https URL.

Title: Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking

Authors: Xiaoyu Tian, Sitong Zhao, Haotian Wang, Shuaiting Chen, Yunjie Ji, Yiping Peng, Han Zhao, Xiangang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19855
Pdf URL: https://arxiv.org/pdf/2503.19855
Copy Paste: [[2503.19855]] Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking(https://arxiv.org/abs/2503.19855)
Keywords: large language model
Abstract: Recent advances in large language models (LLMs), such as OpenAI-o1 and DeepSeek-R1, have demonstrated the effectiveness of test-time scaling, where extended reasoning processes substantially enhance model performance. Despite this, current models are constrained by limitations in handling long texts and reinforcement learning (RL) training efficiency. To address these issues, we propose a simple yet effective test-time scaling approach Multi-round Thinking. This method iteratively refines model reasoning by leveraging previous answers as prompts for subsequent rounds. Extensive experiments across multiple models, including QwQ-32B and DeepSeek-R1, consistently show performance improvements on various benchmarks such as AIME 2024, MATH-500, GPQA-diamond, and LiveCodeBench. For instance, the accuracy of QwQ-32B improved from 80.3% (Round 1) to 82.1% (Round 2) on the AIME 2024 dataset, while DeepSeek-R1 showed a similar increase from 79.7% to 82.0%. These results confirm that Multi-round Thinking is a broadly applicable, straightforward approach to achieving stable enhancements in model performance, underscoring its potential for future developments in test-time scaling techniques. The key prompt: {Original question prompt} The assistant's previous answer is: {last round answer} , and please re-answer.

Title: Capacity-Constrained Online Learning with Delays: Scheduling Frameworks and Regret Trade-offs

Authors: Alexander Ryabchenko, Idan Attias, Daniel M. Roy
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.19856
Pdf URL: https://arxiv.org/pdf/2503.19856
Copy Paste: [[2503.19856]] Capacity-Constrained Online Learning with Delays: Scheduling Frameworks and Regret Trade-offs(https://arxiv.org/abs/2503.19856)
Keywords: robust
Abstract: We study online learning with oblivious losses and delays under a novel ``capacity constraint'' that limits how many past rounds can be tracked simultaneously for delayed feedback. Under ``clairvoyance'' (i.e., delay durations are revealed upfront each round) and/or ``preemptibility'' (i.e., we have ability to stop tracking previously chosen round feedback), we establish matching upper and lower bounds (up to logarithmic terms) on achievable regret, characterizing the ``optimal capacity'' needed to match the minimax rates of classical delayed online learning, which implicitly assume unlimited capacity. Our algorithms achieve minimax-optimal regret across all capacity levels, with performance gracefully degrading under suboptimal capacity. For $K$ actions and total delay $D$ over $T$ rounds, under clairvoyance and assuming capacity $C = \Omega(\log(T))$, we achieve regret $\widetilde{\Theta}(\sqrt{TK + DK/C + D\log(K)})$ for bandits and $\widetilde{\Theta}(\sqrt{(D+T)\log(K)})$ for full-information feedback. When replacing clairvoyance with preemptibility, we require a known maximum delay bound $d_{\max}$, adding $\smash{\widetilde{O}(d_{\max})}$ to the regret. For fixed delays $d$ (i.e., $D=Td$), the minimax regret is $\Theta\bigl(\sqrt{TK(1+d/C)+Td\log(K)}\bigr)$ and the optimal capacity is $\Theta(\min\{K/\log(K),d\}\bigr)$ in the bandit setting, while in the full-information setting, the minimax regret is $\Theta\bigl(\sqrt{T(d+1)\log(K)}\bigr)$ and the optimal capacity is $\Theta(1)$. For round-dependent and fixed delays, our upper bounds are achieved using novel scheduling policies, based on Pareto-distributed proxy delays and batching techniques. Crucially, our work unifies delayed bandits, label-efficient learning, and online scheduling frameworks, demonstrating that robust online learning under delayed feedback is possible with surprisingly modest tracking capacity.

Title: NickPay, an Auditable, Privacy-Preserving, Nickname-Based Payment System

Authors: Guillaume Quispe, Pierre Jouvelot, Gerard Memmi
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.19872
Pdf URL: https://arxiv.org/pdf/2503.19872
Copy Paste: [[2503.19872]] NickPay, an Auditable, Privacy-Preserving, Nickname-Based Payment System(https://arxiv.org/abs/2503.19872)
Keywords: security, privacy
Abstract: In this paper, we describe the motivation, design, security properties, and a prototype implementation of NickPay, a new privacy-preserving yet auditable payment system built on top of the Ethereum blockchain platform. NickPay offers a strong level of privacy to participants and prevents successive payment transfers from being linked to their actual owners. It is providing the transparency that blockchains ensure and at the same time, preserving the possibility for a trusted authority to access sensitive information, e.g., for audit purposes or compliance with financial regulations. NickPay builds upon the Nicknames for Group Signatures (NGS) scheme, a new signing system based on dynamic ``nicknames'' for signers that extends the schemes of group signatures and signatures with flexible public keys. NGS enables identified group members to expose their flexible public keys, thus allowing direct and natural applications such as auditable private payment systems, NickPay being a blockchain-based prototype of these.

Title: CausalRAG: Integrating Causal Graphs into Retrieval-Augmented Generation

Authors: Nengbo Wang, Xiaotian Han, Jagdip Singh, Jing Ma, Vipin Chaudhary
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2503.19878
Pdf URL: https://arxiv.org/pdf/2503.19878
Copy Paste: [[2503.19878]] CausalRAG: Integrating Causal Graphs into Retrieval-Augmented Generation(https://arxiv.org/abs/2503.19878)
Keywords: large language model
Abstract: Large language models (LLMs) have revolutionized natural language processing (NLP), particularly through Retrieval-Augmented Generation (RAG), which enhances LLM capabilities by integrating external knowledge. However, traditional RAG systems face critical limitations, including disrupted contextual integrity due to text chunking, and over-reliance on semantic similarity for retrieval. To address these issues, we propose CausalRAG, a novel framework that incorporates causal graphs into the retrieval process. By constructing and tracing causal relationships, CausalRAG preserves contextual continuity and improves retrieval precision, leading to more accurate and interpretable responses. We evaluate CausalRAG against regular RAG and graph-based RAG approaches, demonstrating its superiority across several metrics. Our findings suggest that grounding retrieval in causal reasoning provides a promising approach to knowledge-intensive tasks.

Title: Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation

Authors: Tianhao Qi, Jianlong Yuan, Wanquan Feng, Shancheng Fang, Jiawei Liu, SiYu Zhou, Qian He, Hongtao Xie, Yongdong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19881
Pdf URL: https://arxiv.org/pdf/2503.19881
Copy Paste: [[2503.19881]] Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation(https://arxiv.org/abs/2503.19881)
Keywords: diffusion, transformer
Abstract: Sora has unveiled the immense potential of the Diffusion Transformer (DiT) architecture in single-scene video generation. However, the more challenging task of multi-scene video generation, which offers broader applications, remains relatively underexplored. To bridge this gap, we propose Mask$^2$DiT, a novel approach that establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations. Specifically, we introduce a symmetric binary mask at each attention layer within the DiT architecture, ensuring that each text annotation applies exclusively to its respective video segment while preserving temporal coherence across visual tokens. This attention mechanism enables precise segment-level textual-to-visual alignment, allowing the DiT architecture to effectively handle video generation tasks with a fixed number of scenes. To further equip the DiT architecture with the ability to generate additional scenes based on existing ones, we incorporate a segment-level conditional mask, which conditions each newly generated segment on the preceding video segments, thereby enabling auto-regressive scene extension. Both qualitative and quantitative experiments confirm that Mask$^2$DiT excels in maintaining visual consistency across segments while ensuring semantic alignment between each segment and its corresponding text description. Our project page is this https URL.

Title: RCC-PFL: Robust Client Clustering under Noisy Labels in Personalized Federated Learning

Authors: Abdulmoneam Ali, Ahmed Arafa
Subjects: cs.LG, cs.DC, cs.IT, cs.NI, eess.SP
Abstract URL: https://arxiv.org/abs/2503.19886
Pdf URL: https://arxiv.org/pdf/2503.19886
Copy Paste: [[2503.19886]] RCC-PFL: Robust Client Clustering under Noisy Labels in Personalized Federated Learning(https://arxiv.org/abs/2503.19886)
Keywords: robust, federate
Abstract: We address the problem of cluster identity estimation in a personalized federated learning (PFL) setting in which users aim to learn different personal models. The backbone of effective learning in such a setting is to cluster users into groups whose objectives are similar. A typical approach in the literature is to achieve this by training users' data on different proposed personal models and assign them to groups based on which model achieves the lowest value of the users' loss functions. This process is to be done iteratively until group identities converge. A key challenge in such a setting arises when users have noisy labeled data, which may produce misleading values of their loss functions, and hence lead to ineffective clustering. To overcome this challenge, we propose a label-agnostic data similarity-based clustering algorithm, coined RCC-PFL, with three main advantages: the cluster identity estimation procedure is independent from the training labels; it is a one-shot clustering algorithm performed prior to the training; and it requires fewer communication rounds and less computation compared to iterative-based clustering methods. We validate our proposed algorithm using various models and datasets and show that it outperforms multiple baselines in terms of average accuracy and variance reduction.

Title: Scaling Down Text Encoders of Text-to-Image Diffusion Models

Authors: Lifu Wang, Daqing Liu, Xinchen Liu, Xiaodong He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19897
Pdf URL: https://arxiv.org/pdf/2503.19897
Copy Paste: [[2503.19897]] Scaling Down Text Encoders of Text-to-Image Diffusion Models(https://arxiv.org/abs/2503.19897)
Keywords: diffusion
Abstract: Text encoders in diffusion models have rapidly evolved, transitioning from CLIP to T5-XXL. Although this evolution has significantly enhanced the models' ability to understand complex prompts and generate text, it also leads to a substantial increase in the number of parameters. Despite T5 series encoders being trained on the C4 natural language corpus, which includes a significant amount of non-visual data, diffusion models with T5 encoder do not respond to those non-visual prompts, indicating redundancy in representational power. Therefore, it raises an important question: "Do we really need such a large text encoder?" In pursuit of an answer, we employ vision-based knowledge distillation to train a series of T5 encoder models. To fully inherit its capabilities, we constructed our dataset based on three criteria: image quality, semantic understanding, and text-rendering. Our results demonstrate the scaling down pattern that the distilled T5-base model can generate images of comparable quality to those produced by T5-XXL, while being 50 times smaller in size. This reduction in model size significantly lowers the GPU requirements for running state-of-the-art models such as FLUX and SD3, making high-quality text-to-image generation more accessible.

Title: CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning

Authors: Hao Yu, Zhuokai Zhao, Shen Yan, Lukasz Korycki, Jianyu Wang, Baosheng He, Jiayi Liu, Lizhu Zhang, Xiangjun Fan, Hanchao Yu
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.19900
Pdf URL: https://arxiv.org/pdf/2503.19900
Copy Paste: [[2503.19900]] CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning(https://arxiv.org/abs/2503.19900)
Keywords: generative
Abstract: The rapid advancement of large vision-language models (LVLMs) has driven significant progress in multimodal tasks, enabling models to interpret, reason, and generate outputs across both visual and textual domains. While excelling in generative tasks, existing LVLMs often face limitations in tasks requiring high-fidelity representation learning, such as generating image or text embeddings for retrieval. Recent work has proposed finetuning LVLMs for representational learning, but the fine-tuned model often loses its generative capabilities due to the representational learning training paradigm. To address this trade-off, we introduce CAFe, a contrastive-autoregressive fine-tuning framework that enhances LVLMs for both representation and generative tasks. By integrating a contrastive objective with autoregressive language modeling, our approach unifies these traditionally separate tasks, achieving state-of-the-art results in both multimodal retrieval and multimodal generative benchmarks, including object hallucination (OH) mitigation. CAFe establishes a novel framework that synergizes embedding and generative functionalities in a single model, setting a foundation for future multimodal models that excel in both retrieval precision and coherent output generation.

Title: TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization

Authors: Liang Pan, Zeshi Yang, Zhiyang Dou, Wenjia Wang, Buzhen Huang, Bo Dai, Taku Komura, Jingbo Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19901
Pdf URL: https://arxiv.org/pdf/2503.19901
Copy Paste: [[2503.19901]] TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization(https://arxiv.org/abs/2503.19901)
Keywords: transformer
Abstract: Synthesizing diverse and physically plausible Human-Scene Interactions (HSI) is pivotal for both computer animation and embodied AI. Despite encouraging progress, current methods mainly focus on developing separate controllers, each specialized for a specific interaction task. This significantly hinders the ability to tackle a wide variety of challenging HSI tasks that require the integration of multiple skills, e.g., sitting down while carrying an object. To address this issue, we present TokenHSI, a single, unified transformer-based policy capable of multi-skill unification and flexible adaptation. The key insight is to model the humanoid proprioception as a separate shared token and combine it with distinct task tokens via a masking mechanism. Such a unified policy enables effective knowledge sharing across skills, thereby facilitating the multi-task training. Moreover, our policy architecture supports variable length inputs, enabling flexible adaptation of learned skills to new scenarios. By training additional task tokenizers, we can not only modify the geometries of interaction targets but also coordinate multiple skills to address complex tasks. The experiments demonstrate that our approach can significantly improve versatility, adaptability, and extensibility in various HSI tasks. Website: this https URL

Title: ICE: Intrinsic Concept Extraction from a Single Image via Diffusion Models

Authors: Fernando Julio Cendra, Kai Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19902
Pdf URL: https://arxiv.org/pdf/2503.19902
Copy Paste: [[2503.19902]] ICE: Intrinsic Concept Extraction from a Single Image via Diffusion Models(https://arxiv.org/abs/2503.19902)
Keywords: extraction, diffusion, generative
Abstract: The inherent ambiguity in defining visual concepts poses significant challenges for modern generative models, such as the diffusion-based Text-to-Image (T2I) models, in accurately learning concepts from a single image. Existing methods lack a systematic way to reliably extract the interpretable underlying intrinsic concepts. To address this challenge, we present ICE, short for Intrinsic Concept Extraction, a novel framework that exclusively utilizes a T2I model to automatically and systematically extract intrinsic concepts from a single image. ICE consists of two pivotal stages. In the first stage, ICE devises an automatic concept localization module to pinpoint relevant text-based concepts and their corresponding masks within the image. This critical stage streamlines concept initialization and provides precise guidance for subsequent analysis. The second stage delves deeper into each identified mask, decomposing the object-level concepts into intrinsic concepts and general concepts. This decomposition allows for a more granular and interpretable breakdown of visual elements. Our framework demonstrates superior performance on intrinsic concept extraction from a single image in an unsupervised manner. Project page: this https URL

Title: Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better

Authors: Zihang Lai, Andrea Vedaldi
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.19904
Pdf URL: https://arxiv.org/pdf/2503.19904
Copy Paste: [[2503.19904]] Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better(https://arxiv.org/abs/2503.19904)
Keywords: transformer
Abstract: Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts. Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion and may not capture long-range temporal dependencies in dynamic scenes. To address this gap, we propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks, i.e., sequences of corresponding points across frames. By incorporating these motion cues, the Tracktention Layer enhances temporal alignment and effectively handles complex object motions, maintaining consistent feature representations over time. Our approach is computationally efficient and can be seamlessly integrated into existing models, such as Vision Transformers, with minimal modification. It can be used to upgrade image-only models to state-of-the-art video ones, sometimes outperforming models natively designed for video prediction. We demonstrate this on video depth prediction and video colorization, where models augmented with the Tracktention Layer exhibit significantly improved temporal consistency compared to baselines.

Title: AvatarArtist: Open-Domain 4D Avatarization

Authors: Hongyu Liu, Xuan Wang, Ziyu Wan, Yue Ma, Jingye Chen, Yanbo Fan, Yujun Shen, Yibing Song, Qifeng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19906
Pdf URL: https://arxiv.org/pdf/2503.19906
Copy Paste: [[2503.19906]] AvatarArtist: Open-Domain 4D Avatarization(https://arxiv.org/abs/2503.19906)
Keywords: robust, diffusion, generative
Abstract: This work focuses on open-domain 4D avatarization, with the purpose of creating a 4D avatar from a portrait image in an arbitrary style. We select parametric triplanes as the intermediate 4D representation and propose a practical training paradigm that takes advantage of both generative adversarial networks (GANs) and diffusion models. Our design stems from the observation that 4D GANs excel at bridging images and triplanes without supervision yet usually face challenges in handling diverse data distributions. A robust 2D diffusion prior emerges as the solution, assisting the GAN in transferring its expertise across various domains. The synergy between these experts permits the construction of a multi-domain image-triplane dataset, which drives the development of a general 4D avatar creator. Extensive experiments suggest that our model, AvatarArtist, is capable of producing high-quality 4D avatars with strong robustness to various source image domains. The code, the data, and the models will be made publicly available to facilitate future studies..

Title: FullDiT: Multi-Task Video Generative Foundation Model with Full Attention

Authors: Xuan Ju, Weicai Ye, Quande Liu, Qiulin Wang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Qiang Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19907
Pdf URL: https://arxiv.org/pdf/2503.19907
Copy Paste: [[2503.19907]] FullDiT: Multi-Task Video Generative Foundation Model with Full Attention(https://arxiv.org/abs/2503.19907)
Keywords: generative
Abstract: Current video generative foundation models primarily focus on text-to-video tasks, providing limited control for fine-grained video content creation. Although adapter-based approaches (e.g., ControlNet) enable additional controls with minimal fine-tuning, they encounter challenges when integrating multiple conditions, including: branch conflicts between independently trained adapters, parameter redundancy leading to increased computational cost, and suboptimal performance compared to full fine-tuning. To address these challenges, we introduce FullDiT, a unified foundation model for video generation that seamlessly integrates multiple conditions via unified full-attention mechanisms. By fusing multi-task conditions into a unified sequence representation and leveraging the long-context learning ability of full self-attention to capture condition dynamics, FullDiT reduces parameter overhead, avoids conditions conflict, and shows scalability and emergent ability. We further introduce FullBench for multi-task video generation evaluation. Experiments demonstrate that FullDiT achieves state-of-the-art results, highlighting the efficacy of full-attention in complex multi-task video generation.

Title: CoLLM: A Large Language Model for Composed Image Retrieval

Authors: Chuong Huynh, Jinyu Yang, Ashish Tawari, Mubarak Shah, Son Tran, Raffay Hamid, Trishul Chilimbi, Abhinav Shrivastava
Subjects: cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2503.19910
Pdf URL: https://arxiv.org/pdf/2503.19910
Copy Paste: [[2503.19910]] CoLLM: A Large Language Model for Composed Image Retrieval(https://arxiv.org/abs/2503.19910)
Keywords: large language model
Abstract: Composed Image Retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query. Typical training data consists of triplets containing a reference image, a textual description of desired modifications, and the target image, which are expensive and time-consuming to acquire. The scarcity of CIR datasets has led to zero-shot approaches utilizing synthetic triplets or leveraging vision-language models (VLMs) with ubiquitous web-crawled image-caption pairs. However, these methods have significant limitations: synthetic triplets suffer from limited scale, lack of diversity, and unnatural modification text, while image-caption pairs hinder joint embedding learning of the multimodal query due to the absence of triplet data. Moreover, existing approaches struggle with complex and nuanced modification texts that demand sophisticated fusion and understanding of vision and language modalities. We present CoLLM, a one-stop framework that effectively addresses these limitations. Our approach generates triplets on-the-fly from image-caption pairs, enabling supervised training without manual annotation. We leverage Large Language Models (LLMs) to generate joint embeddings of reference images and modification texts, facilitating deeper multimodal fusion. Additionally, we introduce Multi-Text CIR (MTCIR), a large-scale dataset comprising 3.4M samples, and refine existing CIR benchmarks (CIRR and Fashion-IQ) to enhance evaluation reliability. Experimental results demonstrate that CoLLM achieves state-of-the-art performance across multiple CIR benchmarks and settings. MTCIR yields competitive results, with up to 15% performance improvement. Our refined benchmarks provide more reliable evaluation metrics for CIR models, contributing to the advancement of this important field.

Title: SuperFlow++: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining

Authors: Xiang Xu, Lingdong Kong, Hui Shuai, Wenwei Zhang, Liang Pan, Kai Chen, Ziwei Liu, Qingshan Liu
Subjects: cs.CV, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2503.19912
Pdf URL: https://arxiv.org/pdf/2503.19912
Copy Paste: [[2503.19912]] SuperFlow++: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining(https://arxiv.org/abs/2503.19912)
Keywords: robust
Abstract: LiDAR representation learning has emerged as a promising approach to reducing reliance on costly and labor-intensive human annotations. While existing methods primarily focus on spatial alignment between LiDAR and camera sensors, they often overlook the temporal dynamics critical for capturing motion and scene continuity in driving scenarios. To address this limitation, we propose SuperFlow++, a novel framework that integrates spatiotemporal cues in both pretraining and downstream tasks using consecutive LiDAR-camera pairs. SuperFlow++ introduces four key components: (1) a view consistency alignment module to unify semantic information across camera views, (2) a dense-to-sparse consistency regularization mechanism to enhance feature robustness across varying point cloud densities, (3) a flow-based contrastive learning approach that models temporal relationships for improved scene understanding, and (4) a temporal voting strategy that propagates semantic information across LiDAR scans to improve prediction consistency. Extensive evaluations on 11 heterogeneous LiDAR datasets demonstrate that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions. Furthermore, by scaling both 2D and 3D backbones during pretraining, we uncover emergent properties that provide deeper insights into developing scalable 3D foundation models. With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving. The code is publicly available at this https URL

Title: PartRM: Modeling Part-Level Dynamics with Large Cross-State Reconstruction Model

Authors: Mingju Gao, Yike Pan, Huan-ang Gao, Zongzheng Zhang, Wenyi Li, Hao Dong, Hao Tang, Li Yi, Hao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19913
Pdf URL: https://arxiv.org/pdf/2503.19913
Copy Paste: [[2503.19913]] PartRM: Modeling Part-Level Dynamics with Large Cross-State Reconstruction Model(https://arxiv.org/abs/2503.19913)
Keywords: diffusion
Abstract: As interest grows in world models that predict future states from current observations and actions, accurately modeling part-level dynamics has become increasingly relevant for various applications. Existing approaches, such as Puppet-Master, rely on fine-tuning large-scale pre-trained video diffusion models, which are impractical for real-world use due to the limitations of 2D video representation and slow processing times. To overcome these challenges, we present PartRM, a novel 4D reconstruction framework that simultaneously models appearance, geometry, and part-level motion from multi-view images of a static object. PartRM builds upon large 3D Gaussian reconstruction models, leveraging their extensive knowledge of appearance and geometry in static objects. To address data scarcity in 4D, we introduce the PartDrag-4D dataset, providing multi-view observations of part-level dynamics across over 20,000 states. We enhance the model's understanding of interaction conditions with a multi-scale drag embedding module that captures dynamics at varying granularities. To prevent catastrophic forgetting during fine-tuning, we implement a two-stage training process that focuses sequentially on motion and appearance learning. Experimental results show that PartRM establishes a new state-of-the-art in part-level motion learning and can be applied in manipulation tasks in robotics. Our code, data, and models are publicly available to facilitate future research.

Title: Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models

Authors: Sangwon Beak, Hyeonwoo Kim, Hanbyul Joo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19914
Pdf URL: https://arxiv.org/pdf/2503.19914
Copy Paste: [[2503.19914]] Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models(https://arxiv.org/abs/2503.19914)
Keywords: robust, diffusion
Abstract: We present a method for learning 3D spatial relationships between object pairs, referred to as object-object spatial relationships (OOR), by leveraging synthetically generated 3D samples from pre-trained 2D diffusion models. We hypothesize that images synthesized by 2D diffusion models inherently capture plausible and realistic OOR cues, enabling efficient ways to collect a 3D dataset to learn OOR for various unbounded object categories. Our approach begins by synthesizing diverse images that capture plausible OOR cues, which we then uplift into 3D samples. Leveraging our diverse collection of plausible 3D samples for the object pairs, we train a score-based OOR diffusion model to learn the distribution of their relative spatial relationships. Additionally, we extend our pairwise OOR to multi-object OOR by enforcing consistency across pairwise relations and preventing object collisions. Extensive experiments demonstrate the robustness of our method across various object-object spatial relationships, along with its applicability to real-world 3D scene arrangement tasks using the OOR diffusion model.

Title: EventFly: Event Camera Perception from Ground to the Sky

Authors: Lingdong Kong, Dongyue Lu, Xiang Xu, Lai Xing Ng, Wei Tsang Ooi, Benoit R. Cottereau
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.19916
Pdf URL: https://arxiv.org/pdf/2503.19916
Copy Paste: [[2503.19916]] EventFly: Event Camera Perception from Ground to the Sky(https://arxiv.org/abs/2503.19916)
Keywords: robust
Abstract: Cross-platform adaptation in event-based dense perception is crucial for deploying event cameras across diverse settings, such as vehicles, drones, and quadrupeds, each with unique motion dynamics, viewpoints, and class distributions. In this work, we introduce EventFly, a framework for robust cross-platform adaptation in event camera perception. Our approach comprises three key components: i) Event Activation Prior (EAP), which identifies high-activation regions in the target domain to minimize prediction entropy, fostering confident, domain-adaptive predictions; ii) EventBlend, a data-mixing strategy that integrates source and target event voxel grids based on EAP-driven similarity and density maps, enhancing feature alignment; and iii) EventMatch, a dual-discriminator technique that aligns features from source, target, and blended domains for better domain-invariant learning. To holistically assess cross-platform adaptation abilities, we introduce EXPo, a large-scale benchmark with diverse samples across vehicle, drone, and quadruped platforms. Extensive experiments validate our effectiveness, demonstrating substantial gains over popular adaptation methods. We hope this work can pave the way for more adaptive, high-performing event perception across diverse and complex environments.