2025-03-14

Title: FedMSGL: A Self-Expressive Hypergraph Based Federated Multi-View Learning

Authors: Daoyuan Li, Zuyuan Yang, Shengli Xie
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2503.09643
Pdf URL: https://arxiv.org/pdf/2503.09643
Copy Paste: [[2503.09643]] FedMSGL: A Self-Expressive Hypergraph Based Federated Multi-View Learning(https://arxiv.org/abs/2503.09643)
Keywords: security, privacy, protect, federate
Abstract: Federated learning is essential for enabling collaborative model training across decentralized data sources while preserving data privacy and security. This approach mitigates the risks associated with centralized data collection and addresses concerns related to data ownership and compliance. Despite significant advancements in federated learning algorithms that address communication bottlenecks and enhance privacy protection, existing works overlook the impact of differences in data feature dimensions, resulting in global models that disproportionately depend on participants with large feature dimensions. Additionally, current single-view federated learning methods fail to account for the unique characteristics of multi-view data, leading to suboptimal performance in processing such data. To address these issues, we propose a Self-expressive Hypergraph Based Federated Multi-view Learning method (FedMSGL). The proposed method leverages self-expressive character in the local training to learn uniform dimension subspace with latent sample relation. At the central side, an adaptive fusion technique is employed to generate the global model, while constructing a hypergraph from the learned global and view-specific subspace to capture intricate interconnections across views. Experiments on multi-view datasets with different feature dimensions validated the effectiveness of the proposed method.

Title: Inductive Spatio-Temporal Kriging with Physics-Guided Increment Training Strategy for Air Quality Inference

Authors: Songlin Yang, Tao Yang, Bo Hu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.09646
Pdf URL: https://arxiv.org/pdf/2503.09646
Copy Paste: [[2503.09646]] Inductive Spatio-Temporal Kriging with Physics-Guided Increment Training Strategy for Air Quality Inference(https://arxiv.org/abs/2503.09646)
Keywords: diffusion
Abstract: The deployment of sensors for air quality monitoring is constrained by high costs, leading to inadequate network coverage and data deficits in some areas. Utilizing existing observations, spatio-temporal kriging is a method for estimating air quality at unobserved locations during a specific period. Inductive spatio-temporal kriging with increment training strategy has demonstrated its effectiveness using virtual nodes to simulate unobserved nodes. However, a disparity between virtual and real nodes persists, complicating the application of learning patterns derived from virtual nodes to actual unobserved ones. To address these limitations, this paper presents a Physics-Guided Increment Training Strategy (PGITS). Specifically, we design a dynamic graph generation module to incorporate the advection and diffusion processes of airborne particles as physical knowledge into the graph structure, dynamically adjusting the adjacency matrix to reflect physical interactions between nodes. By using physics principles as a bridge between virtual and real nodes, this strategy ensures the features of virtual nodes and their pseudo labels are closer to actual nodes. Consequently, the learned patterns of virtual nodes can be applied to actual unobserved nodes for effective kriging.

Title: LLM-PS: Empowering Large Language Models for Time Series Forecasting with Temporal Patterns and Semantics

Authors: Jialiang Tang, Shuo Chen, Chen Gong, Jing Zhang, Dacheng Tao
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.09656
Pdf URL: https://arxiv.org/pdf/2503.09656
Copy Paste: [[2503.09656]] LLM-PS: Empowering Large Language Models for Time Series Forecasting with Temporal Patterns and Semantics(https://arxiv.org/abs/2503.09656)
Keywords: large language model
Abstract: Time Series Forecasting (TSF) is critical in many real-world domains like financial planning and health monitoring. Recent studies have revealed that Large Language Models (LLMs), with their powerful in-contextual modeling capabilities, hold significant potential for TSF. However, existing LLM-based methods usually perform suboptimally because they neglect the inherent characteristics of time series data. Unlike the textual data used in LLM pre-training, the time series data is semantically sparse and comprises distinctive temporal patterns. To address this problem, we propose LLM-PS to empower the LLM for TSF by learning the fundamental \textit{Patterns} and meaningful \textit{Semantics} from time series data. Our LLM-PS incorporates a new multi-scale convolutional neural network adept at capturing both short-term fluctuations and long-term trends within the time series. Meanwhile, we introduce a time-to-text module for extracting valuable semantics across continuous time intervals rather than isolated time points. By integrating these patterns and semantics, LLM-PS effectively models temporal dependencies, enabling a deep comprehension of time series and delivering accurate forecasts. Intensive experimental results demonstrate that LLM-PS achieves state-of-the-art performance in both short- and long-term forecasting tasks, as well as in few- and zero-shot settings.

Title: Týr-the-Pruner: Unlocking Accurate 50% Structural Pruning for LLMs via Global Sparsity Distribution Optimization

Authors: Guanchen Li, Yixing Xu, Zeping Li, Ji Liu, Xuanwu Yin, Dong Li, Emad Barsoum
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.09657
Pdf URL: https://arxiv.org/pdf/2503.09657
Copy Paste: [[2503.09657]] Týr-the-Pruner: Unlocking Accurate 50% Structural Pruning for LLMs via Global Sparsity Distribution Optimization(https://arxiv.org/abs/2503.09657)
Keywords: large language model
Abstract: Structural pruning enhances hardware-agnostic inference efficiency for large language models (LLMs) but often struggles to maintain performance. Local pruning performs efficient layer-by-layer compression but ignores global topology. Global pruning has the potential to find the optimal solution although resource-intensive. However, existing methods tend to rank structural saliency uniformly, ignoring inter-structure dependencies and failing to achieve end-to-end optimization. To address these limitations, we propose Týr-the-Pruner, an efficient end-to-end search-based global structural pruning framework. This framework constructs a supernet by repeatedly applying local pruning across a range of sparsity ratios to each layer in an LLM, with the core goal of determining the optimal sparsity distribution under a target overall sparsity ratio. Concretely, we introduce an effective local pruning and an expectation error accumulation approach to improve supernet construction. Furthermore, we employ an iterative prune-and-search strategy with coarse-to-fine sparsity granularity to ensure efficient search convergence. Experimental results show that Týr-the-Pruner achieves state-of-the-art structural pruning, retaining 97% of the dense model's performance while removing a challenging 50% of Llama-3.1-70B's parameters.

Title: Towards Robust Model Evolution with Algorithmic Recourse

Authors: Hao-Tsung Yang, Jie Gao, Bo-Yi Liu, Zhi-Xuan Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.09658
Pdf URL: https://arxiv.org/pdf/2503.09658
Copy Paste: [[2503.09658]] Towards Robust Model Evolution with Algorithmic Recourse(https://arxiv.org/abs/2503.09658)
Keywords: robust
Abstract: Algorithmic Recourse is a way for users to modify their attributes to align with a model's expectations, thereby improving their outcomes after receiving unfavorable decisions. In real-world scenarios, users often need to strategically adjust their attributes to compete for limited resources. However, such strategic behavior induces users to "game" algorithms, causing model collapse due to distribution shifts. These shifts arise from user competition, resource constraints, and adaptive user responses. While prior research on Algorithmic Recourse has explored its effects on both systems and users, the impact of resource constraints and competition over time remains underexplored. In this work, we develop a general framework to model user strategic behaviors and their interactions with decision-making systems under resource constraints and competitive dynamics. Through theoretical analysis and empirical evaluation, we identify three key phenomena that arise consistently in both synthetic and real-world datasets: escalating decision boundaries, non-robust model predictions, and inequitable recourse actions. Finally, we discuss the broader social implications of these findings and present two algorithmic strategies aimed at mitigating these challenges.

Title: Towards Hardware Supported Domain Generalization in DNN-Based Edge Computing Devices for Health Monitoring

Authors: Johnson Loh, Lyubov Dudchenko, Justus Viga, Tobias Gemmeke
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.09661
Pdf URL: https://arxiv.org/pdf/2503.09661
Copy Paste: [[2503.09661]] Towards Hardware Supported Domain Generalization in DNN-Based Edge Computing Devices for Health Monitoring(https://arxiv.org/abs/2503.09661)
Keywords: robust
Abstract: Deep neural network (DNN) models have shown remarkable success in many real-world scenarios, such as object detection and classification. Unfortunately, these models are not yet widely adopted in health monitoring due to exceptionally high requirements for model robustness and deployment in highly resource-constrained devices. In particular, the acquisition of biosignals, such as electrocardiogram (ECG), is subject to large variations between training and deployment, necessitating domain generalization (DG) for robust classification quality across sensors and patients. The continuous monitoring of ECG also requires the execution of DNN models in convenient wearable devices, which is achieved by specialized ECG accelerators with small form factor and ultra-low power consumption. However, combining DG capabilities with ECG accelerators remains a challenge. This article provides a comprehensive overview of ECG accelerators and DG methods and discusses the implication of the combination of both domains, such that multi-domain ECG monitoring is enabled with emerging algorithm-hardware co-optimized systems. Within this context, an approach based on correction layers is proposed to deploy DG capabilities on the edge. Here, the DNN fine-tuning for unknown domains is limited to a single layer, while the remaining DNN model remains unmodified. Thus, computational complexity (CC) for DG is reduced with minimal memory overhead compared to conventional fine-tuning of the whole DNN model. The DNN model-dependent CC is reduced by more than 2.5x compared to DNN fine-tuning at an average increase of F1 score by more than 20% on the generalized target domain. In summary, this article provides a novel perspective on robust DNN classification on the edge for health monitoring applications.

Title: CoRe^2: Collect, Reflect and Refine to Generate Better and Faster

Authors: Shitong Shao, Zikai Zhou, Dian Xie, Yuetong Fang, Tian Ye, Lichen Bai, Zeke Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09662
Pdf URL: https://arxiv.org/pdf/2503.09662
Copy Paste: [[2503.09662]] CoRe^2: Collect, Reflect and Refine to Generate Better and Faster(https://arxiv.org/abs/2503.09662)
Keywords: diffusion, generative
Abstract: Making text-to-image (T2I) generative model sample both fast and well represents a promising research direction. Previous studies have typically focused on either enhancing the visual quality of synthesized images at the expense of sampling efficiency or dramatically accelerating sampling without improving the base model's generative capacity. Moreover, nearly all inference methods have not been able to ensure stable performance simultaneously on both diffusion models (DMs) and visual autoregressive models (ARMs). In this paper, we introduce a novel plug-and-play inference paradigm, CoRe^2, which comprises three subprocesses: Collect, Reflect, and Refine. CoRe^2 first collects classifier-free guidance (CFG) trajectories, and then use collected data to train a weak model that reflects the easy-to-learn contents while reducing number of function evaluations during inference by half. Subsequently, CoRe^2 employs weak-to-strong guidance to refine the conditional output, thereby improving the model's capacity to generate high-frequency and realistic content, which is difficult for the base model to capture. To the best of our knowledge, CoRe^2 is the first to demonstrate both efficiency and effectiveness across a wide range of DMs, including SDXL, SD3.5, and FLUX, as well as ARMs like LlamaGen. It has exhibited significant performance improvements on HPD v2, Pick-of-Pic, Drawbench, GenEval, and T2I-Compbench. Furthermore, CoRe^2 can be seamlessly integrated with the state-of-the-art Z-Sampling, outperforming it by 0.3 and 0.16 on PickScore and AES, while achieving 5.64s time saving using this http URL is released at this https URL.

Title: Blockchain-Enabled Management Framework for Federated Coalition Networks

Authors: Jorge Álvaro González, Ana María Saiz García, Victor Monzon Baeza
Subjects: cs.CR, eess.SY
Abstract URL: https://arxiv.org/abs/2503.09666
Pdf URL: https://arxiv.org/pdf/2503.09666
Copy Paste: [[2503.09666]] Blockchain-Enabled Management Framework for Federated Coalition Networks(https://arxiv.org/abs/2503.09666)
Keywords: secure, security, federate
Abstract: In a globalized and interconnected world, interoperability has become a key concept for advancing tactical scenarios. Federated Coalition Networks (FCN) enable cooperation between entities from multiple nations while allowing each to maintain control over their systems. However, this interoperability necessitates the sharing of increasing amounts of information between different tactical assets, raising the need for higher security measures. Emerging technologies like blockchain drive a revolution in secure communications, paving the way for new tactical scenarios. In this work, we propose a blockchain-based framework to enhance the resilience and security of the management of these networks. We offer a guide to FCN design to help a broad audience understand the military networks in international missions by a use case and key functions applied to a proposed architecture. We evaluate its effectiveness and performance in information encryption to validate this framework.

Title: Silent Branding Attack: Trigger-free Data Poisoning Attack on Text-to-Image Diffusion Models

Authors: Sangwon Jang, June Suk Choi, Jaehyeong Jo, Kimin Lee, Sung Ju Hwang
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2503.09669
Pdf URL: https://arxiv.org/pdf/2503.09669
Copy Paste: [[2503.09669]] Silent Branding Attack: Trigger-free Data Poisoning Attack on Text-to-Image Diffusion Models(https://arxiv.org/abs/2503.09669)
Keywords: attack, steal, diffusion
Abstract: Text-to-image diffusion models have achieved remarkable success in generating high-quality contents from text prompts. However, their reliance on publicly available data and the growing trend of data sharing for fine-tuning make these models particularly vulnerable to data poisoning attacks. In this work, we introduce the Silent Branding Attack, a novel data poisoning method that manipulates text-to-image diffusion models to generate images containing specific brand logos or symbols without any text triggers. We find that when certain visual patterns are repeatedly in the training data, the model learns to reproduce them naturally in its outputs, even without prompt mentions. Leveraging this, we develop an automated data poisoning algorithm that unobtrusively injects logos into original images, ensuring they blend naturally and remain undetected. Models trained on this poisoned dataset generate images containing logos without degrading image quality or text alignment. We experimentally validate our silent branding attack across two realistic settings on large-scale high-quality image datasets and style personalization datasets, achieving high success rates even without a specific text trigger. Human evaluation and quantitative metrics including logo detection show that our method can stealthily embed logos.

Title: Probabilistic Reasoning with LLMs for k-anonymity Estimation

Authors: Jonathan Zheng, Sauvik Das, Alan Ritter, Wei Xu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.09674
Pdf URL: https://arxiv.org/pdf/2503.09674
Copy Paste: [[2503.09674]] Probabilistic Reasoning with LLMs for k-anonymity Estimation(https://arxiv.org/abs/2503.09674)
Keywords: privacy
Abstract: Probabilistic reasoning is a key aspect of both human and artificial intelligence that allows for handling uncertainty and ambiguity in decision-making. In this paper, we introduce a novel numerical reasoning task under uncertainty, focusing on estimating the k-anonymity of user-generated documents containing privacy-sensitive information. We propose BRANCH, which uses LLMs to factorize a joint probability distribution to estimate the k-value-the size of the population matching the given information-by modeling individual pieces of textual information as random variables. The probability of each factor occurring within a population is estimated using standalone LLMs or retrieval-augmented generation systems, and these probabilities are combined into a final k-value. Our experiments show that this method successfully estimates the correct k-value 67% of the time, an 11% increase compared to GPT-4o chain-of-thought reasoning. Additionally, we leverage LLM uncertainty to develop prediction intervals for k-anonymity, which include the correct value in nearly 92% of cases.

Title: Accelerating Diffusion Sampling via Exploiting Local Transition Coherence

Authors: Shangwen Zhu, Han Zhang, Zhantao Yang, Qianyu Peng, Zhao Pu, Huangji Wang, Fan Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09675
Pdf URL: https://arxiv.org/pdf/2503.09675
Copy Paste: [[2503.09675]] Accelerating Diffusion Sampling via Exploiting Local Transition Coherence(https://arxiv.org/abs/2503.09675)
Keywords: diffusion
Abstract: Text-based diffusion models have made significant breakthroughs in generating high-quality images and videos from textual descriptions. However, the lengthy sampling time of the denoising process remains a significant bottleneck in practical applications. Previous methods either ignore the statistical relationships between adjacent steps or rely on attention or feature similarity between them, which often only works with specific network structures. To address this issue, we discover a new statistical relationship in the transition operator between adjacent steps, focusing on the relationship of the outputs from the network. This relationship does not impose any requirements on the network structure. Based on this observation, we propose a novel training-free acceleration method called LTC-Accel, which uses the identified relationship to estimate the current transition operator based on adjacent steps. Due to no specific assumptions regarding the network structure, LTC-Accel is applicable to almost all diffusion-based methods and orthogonal to almost all existing acceleration techniques, making it easy to combine with them. Experimental results demonstrate that LTC-Accel significantly speeds up sampling in text-to-image and text-to-video synthesis while maintaining competitive sample quality. Specifically, LTC-Accel achieves a speedup of 1.67-fold in Stable Diffusion v2 and a speedup of 1.55-fold in video generation models. When combined with distillation models, LTC-Accel achieves a remarkable 10-fold speedup in video generation, allowing real-time generation of more than 16FPS.

Title: Have LLMs Made Active Learning Obsolete? Surveying the NLP Community

Authors: Julia Romberg, Christopher Schröder, Julius Gonsior, Katrin Tomanek, Fredrik Olsson
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.09701
Pdf URL: https://arxiv.org/pdf/2503.09701
Copy Paste: [[2503.09701]] Have LLMs Made Active Learning Obsolete? Surveying the NLP Community(https://arxiv.org/abs/2503.09701)
Keywords: large language model
Abstract: Supervised learning relies on annotated data, which is expensive to obtain. A longstanding strategy to reduce annotation costs is active learning, an iterative process, in which a human annotates only data instances deemed informative by a model. Large language models (LLMs) have pushed the effectiveness of active learning, but have also improved methods such as few- or zero-shot learning, and text synthesis - thereby introducing potential alternatives. This raises the question: has active learning become obsolete? To answer this fully, we must look beyond literature to practical experiences. We conduct an online survey in the NLP community to collect previously intangible insights on the perceived relevance of data annotation, particularly focusing on active learning, including best practices, obstacles and expected future developments. Our findings show that annotated data remains a key factor, and active learning continues to be relevant. While the majority of active learning users find it effective, a comparison with a community survey from over a decade ago reveals persistent challenges: setup complexity, estimation of cost reduction, and tooling. We publish an anonymized version of the collected dataset

Title: Revisiting semi-supervised learning in the era of foundation models

Authors: Ping Zhang, Zheda Mai, Quang-Huy Nguyen, Wei-Lun Chao
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.09707
Pdf URL: https://arxiv.org/pdf/2503.09707
Copy Paste: [[2503.09707]] Revisiting semi-supervised learning in the era of foundation models(https://arxiv.org/abs/2503.09707)
Keywords: robust
Abstract: Semi-supervised learning (SSL) leverages abundant unlabeled data alongside limited labeled data to enhance learning. As vision foundation models (VFMs) increasingly serve as the backbone of vision applications, it remains unclear how SSL interacts with these pre-trained models. To address this gap, we develop new SSL benchmark datasets where frozen VFMs underperform and systematically evaluate representative SSL methods. We make a surprising observation: parameter-efficient fine-tuning (PEFT) using only labeled data often matches SSL performance, even without leveraging unlabeled data. This motivates us to revisit self-training, a conceptually simple SSL baseline, where we use the supervised PEFT model to pseudo-label unlabeled data for further training. To overcome the notorious issue of noisy pseudo-labels, we propose ensembling multiple PEFT approaches and VFM backbones to produce more robust pseudo-labels. Empirical results validate the effectiveness of this simple yet powerful approach, providing actionable insights into SSL with VFMs and paving the way for more scalable and practical semi-supervised learning in the era of foundation models.

Title: Revisiting Backdoor Attacks on Time Series Classification in the Frequency Domain

Authors: Yuanmin Huang, Mi Zhang, Zhaoxiang Wang, Wenxuan Li, Min Yang
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2503.09712
Pdf URL: https://arxiv.org/pdf/2503.09712
Copy Paste: [[2503.09712]] Revisiting Backdoor Attacks on Time Series Classification in the Frequency Domain(https://arxiv.org/abs/2503.09712)
Keywords: attack, generative
Abstract: Time series classification (TSC) is a cornerstone of modern web applications, powering tasks such as financial data analysis, network traffic monitoring, and user behavior analysis. In recent years, deep neural networks (DNNs) have greatly enhanced the performance of TSC models in these critical domains. However, DNNs are vulnerable to backdoor attacks, where attackers can covertly implant triggers into models to induce malicious outcomes. Existing backdoor attacks targeting DNN-based TSC models remain elementary. In particular, early methods borrow trigger designs from computer vision, which are ineffective for time series data. More recent approaches utilize generative models for trigger generation, but at the cost of significant computational complexity. In this work, we analyze the limitations of existing attacks and introduce an enhanced method, FreqBack. Drawing inspiration from the fact that DNN models inherently capture frequency domain features in time series data, we identify that improper perturbations in the frequency domain are the root cause of ineffective attacks. To address this, we propose to generate triggers both effectively and efficiently, guided by frequency analysis. FreqBack exhibits substantial performance across five models and eight datasets, achieving an impressive attack success rate of over 90%, while maintaining less than a 3% drop in model accuracy on clean data.

Title: Towards Causal Model-Based Policy Optimization

Authors: Alberto Caron, Vasilios Mavroudis, Chris Hicks
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.09719
Pdf URL: https://arxiv.org/pdf/2503.09719
Copy Paste: [[2503.09719]] Towards Causal Model-Based Policy Optimization(https://arxiv.org/abs/2503.09719)
Keywords: robust
Abstract: Real-world decision-making problems are often marked by complex, uncertain dynamics that can shift or break under changing conditions. Traditional Model-Based Reinforcement Learning (MBRL) approaches learn predictive models of environment dynamics from queried trajectories and then use these models to simulate rollouts for policy optimization. However, such methods do not account for the underlying causal mechanisms that govern the environment, and thus inadvertently capture spurious correlations, making them sensitive to distributional shifts and limiting their ability to generalize. The same naturally holds for model-free approaches. In this work, we introduce Causal Model-Based Policy Optimization (C-MBPO), a novel framework that integrates causal learning into the MBRL pipeline to achieve more robust, explainable, and generalizable policy learning algorithms. Our approach centers on first inferring a Causal Markov Decision Process (C-MDP) by learning a local Structural Causal Model (SCM) of both the state and reward transition dynamics from trajectories gathered online. C-MDPs differ from classic MDPs in that we can decompose causal dependencies in the environment dynamics via specifying an associated Causal Bayesian Network. C-MDPs allow for targeted interventions and counterfactual reasoning, enabling the agent to distinguish between mere statistical correlations and causal relationships. The learned SCM is then used to simulate counterfactual on-policy transitions and rewards under hypothetical actions (or ``interventions"), thereby guiding policy optimization more effectively. The resulting policy learned by C-MBPO can be shown to be robust to a class of distributional shifts that affect spurious, non-causal relationships in the dynamics. We demonstrate this through some simple experiments involving near and far OOD dynamics drifts.

Title: Finding the Muses: Identifying Coresets through Loss Trajectories

Authors: Manish Nagaraj, Deepak Ravikumar, Efstathia Soufleri, Kaushik Roy
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.09721
Pdf URL: https://arxiv.org/pdf/2503.09721
Copy Paste: [[2503.09721]] Finding the Muses: Identifying Coresets through Loss Trajectories(https://arxiv.org/abs/2503.09721)
Keywords: transformer
Abstract: Deep learning models achieve state-of-the-art performance across domains but face scalability challenges in real-time or resource-constrained scenarios. To address this, we propose Loss Trajectory Correlation (LTC), a novel metric for coreset selection that identifies critical training samples driving generalization. $LTC$ quantifies the alignment between training sample loss trajectories and validation set loss trajectories, enabling the construction of compact, representative subsets. Unlike traditional methods with computational and storage overheads that are infeasible to scale to large datasets, $LTC$ achieves superior efficiency as it can be computed as a byproduct of training. Our results on CIFAR-100 and ImageNet-1k show that $LTC$ consistently achieves accuracy on par with or surpassing state-of-the-art coreset selection methods, with any differences remaining under 1%. LTC also effectively transfers across various architectures, including ResNet, VGG, DenseNet, and Swin Transformer, with minimal performance degradation (<2%). Additionally, LTC offers insights into training dynamics, such as identifying aligned and conflicting sample behaviors, at a fraction of the computational cost of traditional methods. This framework paves the way for scalable coreset selection and efficient dataset optimization.

Title: The Pitfalls of Imitation Learning when Actions are Continuous

Authors: Max Simchowitz, Daniel Pfrommer, Ali Jadbabaie
Subjects: cs.LG, eess.SY, stat.ML
Abstract URL: https://arxiv.org/abs/2503.09722
Pdf URL: https://arxiv.org/pdf/2503.09722
Copy Paste: [[2503.09722]] The Pitfalls of Imitation Learning when Actions are Continuous(https://arxiv.org/abs/2503.09722)
Keywords: diffusion
Abstract: We study the problem of imitating an expert demonstrator in a discrete-time, continuous state-and-action control system. We show that, even if the dynamics are stable (i.e. contracting exponentially quickly), and the expert is smooth and deterministic, any smooth, deterministic imitator policy necessarily suffers error on execution that is exponentially larger, as a function of problem horizon, than the error under the distribution of expert training data. Our negative result applies to both behavior cloning and offline-RL algorithms, unless they produce highly "improper" imitator policies--those which are non-smooth, non-Markovian, or which exhibit highly state-dependent stochasticity--or unless the expert trajectory distribution is sufficiently "spread." We provide experimental evidence of the benefits of these more complex policy parameterizations, explicating the benefits of today's popular policy parameterizations in robot learning (e.g. action-chunking and Diffusion Policies). We also establish a host of complementary negative and positive results for imitation in control systems.

Title: How Feasible is Augmenting Fake Nodes with Learnable Features as a Counter-strategy against Link Stealing Attacks?

Authors: Mir Imtiaz Mostafiz, Imtiaz Karim, Elisa Bertino
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2503.09726
Pdf URL: https://arxiv.org/pdf/2503.09726
Copy Paste: [[2503.09726]] How Feasible is Augmenting Fake Nodes with Learnable Features as a Counter-strategy against Link Stealing Attacks?(https://arxiv.org/abs/2503.09726)
Keywords: privacy, protect, defense, attack, robust, steal
Abstract: Graph Neural Networks (GNNs) are widely used and deployed for graph-based prediction tasks. However, as good as GNNs are for learning graph data, they also come with the risk of privacy leakage. For instance, an attacker can run carefully crafted queries on the GNNs and, from the responses, can infer the existence of an edge between a pair of nodes. This attack, dubbed as a "link-stealing" attack, can jeopardize the user's privacy by leaking potentially sensitive information. To protect against this attack, we propose an approach called "$(N)$ode $(A)$ugmentation for $(R)$estricting $(G)$raphs from $(I)$nsinuating their $(S)$tructure" ($NARGIS$) and study its feasibility. $NARGIS$ is focused on reshaping the graph embedding space so that the posterior from the GNN model will still provide utility for the prediction task but will introduce ambiguity for the link-stealing attackers. To this end, $NARGIS$ applies spectral clustering on the given graph to facilitate it being augmented with new nodes -- that have learned features instead of fixed ones. It utilizes tri-level optimization for learning parameters for the GNN model, surrogate attacker model, and our defense model (i.e. learnable node features). We extensively evaluate $NARGIS$ on three benchmark citation datasets over eight knowledge availability settings for the attackers. We also evaluate the model fidelity and defense performance on influence-based link inference attacks. Through our studies, we have figured out the best feature of $NARGIS$ -- its superior fidelity-privacy performance trade-off in a significant number of cases. We also have discovered in which cases the model needs to be improved, and proposed ways to integrate different schemes to make the model more robust against link stealing attacks.

Title: All Your Knowledge Belongs to Us: Stealing Knowledge Graphs via Reasoning APIs

Authors: Zhaohan Xi
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.09727
Pdf URL: https://arxiv.org/pdf/2503.09727
Copy Paste: [[2503.09727]] All Your Knowledge Belongs to Us: Stealing Knowledge Graphs via Reasoning APIs(https://arxiv.org/abs/2503.09727)
Keywords: defense, attack, steal, extraction
Abstract: Knowledge graph reasoning (KGR), which answers complex, logical queries over large knowledge graphs (KGs), represents an important artificial intelligence task with a range of applications. Many KGs require extensive domain expertise and engineering effort to build and are hence considered proprietary within organizations and enterprises. Yet, spurred by their commercial and research potential, there is a growing trend to make KGR systems, (partially) built upon private KGs, publicly available through reasoning APIs. The inherent tension between maintaining the confidentiality of KGs while ensuring the accessibility to KGR systems motivates our study of KG extraction attacks: the adversary aims to "steal" the private segments of the backend KG, leveraging solely black-box access to the KGR API. Specifically, we present KGX, an attack that extracts confidential sub-KGs with high fidelity under limited query budgets. At a high level, KGX progressively and adaptively queries the KGR API and integrates the query responses to reconstruct the private sub-KG. This extraction remains viable even if any query responses related to the private sub-KG are filtered. We validate the efficacy of KGX against both experimental and real-world KGR APIs. Interestingly, we find that typical countermeasures (e.g., injecting noise into query responses) are often ineffective against KGX. Our findings suggest the need for a more principled approach to developing and deploying KGR systems, as well as devising new defenses against KG extraction attacks.

Title: I2V3D: Controllable image-to-video generation with 3D guidance

Authors: Zhiyuan Zhang, Dongdong Chen, Jing Liao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09733
Pdf URL: https://arxiv.org/pdf/2503.09733
Copy Paste: [[2503.09733]] I2V3D: Controllable image-to-video generation with 3D guidance(https://arxiv.org/abs/2503.09733)
Keywords: diffusion, generative
Abstract: We present I2V3D, a novel framework for animating static images into dynamic videos with precise 3D control, leveraging the strengths of both 3D geometry guidance and advanced generative models. Our approach combines the precision of a computer graphics pipeline, enabling accurate control over elements such as camera movement, object rotation, and character animation, with the visual fidelity of generative AI to produce high-quality videos from coarsely rendered inputs. To support animations with any initial start point and extended sequences, we adopt a two-stage generation process guided by 3D geometry: 1) 3D-Guided Keyframe Generation, where a customized image diffusion model refines rendered keyframes to ensure consistency and quality, and 2) 3D-Guided Video Interpolation, a training-free approach that generates smooth, high-quality video frames between keyframes using bidirectional guidance. Experimental results highlight the effectiveness of our framework in producing controllable, high-quality animations from single input images by harmonizing 3D geometry with generative models. The code for our framework will be publicly released.

Title: Enhancing Adversarial Example Detection Through Model Explanation

Authors: Qian Ma, Ziping Ye
Subjects: cs.CR, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.09735
Pdf URL: https://arxiv.org/pdf/2503.09735
Copy Paste: [[2503.09735]] Enhancing Adversarial Example Detection Through Model Explanation(https://arxiv.org/abs/2503.09735)
Keywords: defense, attack, robust
Abstract: Adversarial examples are a major problem for machine learning models, leading to a continuous search for effective defenses. One promising direction is to leverage model explanations to better understand and defend against these attacks. We looked at AmI, a method proposed by a NeurIPS 2018 spotlight paper that uses model explanations to detect adversarial examples. Our study shows that while AmI is a promising idea, its performance is too dependent on specific settings (e.g., hyperparameter) and external factors such as the operating system and the deep learning framework used, and such drawbacks limit AmI's practical usage. Our findings highlight the need for more robust defense mechanisms that are effective under various conditions. In addition, we advocate for a comprehensive evaluation framework for defense techniques.

Title: Unveiling Hidden Pivotal Players with GoalNet: A GNN-Based Soccer Player Evaluation System

Authors: Jacky Hao Jiang, Jerry Cai, Anastasios Kyrillidis
Subjects: cs.LG, cs.AI, math.OC
Abstract URL: https://arxiv.org/abs/2503.09737
Pdf URL: https://arxiv.org/pdf/2503.09737
Copy Paste: [[2503.09737]] Unveiling Hidden Pivotal Players with GoalNet: A GNN-Based Soccer Player Evaluation System(https://arxiv.org/abs/2503.09737)
Keywords: attack, robust, fair, transformer
Abstract: Soccer analysis tools emphasize metrics such as expected goals, leading to an overrepresentation of attacking players' contributions and overlooking players who facilitate ball control and link attacks. Examples include Rodri from Manchester City and Palhinha who just transferred to Bayern Munich. To address this bias, we aim to identify players with pivotal roles in a soccer team, incorporating both spatial and temporal features. In this work, we introduce a GNN-based framework that assigns individual credit for changes in expected threat (xT), thus capturing overlooked yet vital contributions in soccer. Our pipeline encodes both spatial and temporal features in event-centric graphs, enabling fair attribution of non-scoring actions such as defensive or transitional plays. We incorporate centrality measures into the learned player embeddings, ensuring that ball-retaining defenders and defensive midfielders receive due recognition for their overall impact. Furthermore, we explore diverse GNN variants-including Graph Attention Networks and Transformer-based models-to handle long-range dependencies and evolving match contexts, discussing their relative performance and computational complexity. Experiments on real match data confirm the robustness of our approach in highlighting pivotal roles that traditional attacking metrics typically miss, underscoring the model's utility for more comprehensive soccer analytics.

Title: Review GIDE -- Restaurant Review Gastrointestinal Illness Detection and Extraction with Large Language Models

Authors: Timothy Laurence, Joshua Harris, Leo Loman, Amy Douglas, Yung-Wai Chan, Luke Hounsome, Lesley Larkin, Michael Borowitz
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.09743
Pdf URL: https://arxiv.org/pdf/2503.09743
Copy Paste: [[2503.09743]] Review GIDE -- Restaurant Review Gastrointestinal Illness Detection and Extraction with Large Language Models(https://arxiv.org/abs/2503.09743)
Keywords: robust, extraction, large language model
Abstract: Foodborne gastrointestinal (GI) illness is a common cause of ill health in the UK. However, many cases do not interact with the healthcare system, posing significant challenges for traditional surveillance methods. The growth of publicly available online restaurant reviews and advancements in large language models (LLMs) present potential opportunities to extend disease surveillance by identifying public reports of GI illness. In this study, we introduce a novel annotation schema, developed with experts in GI illness, applied to the Yelp Open Dataset of reviews. Our annotations extend beyond binary disease detection, to include detailed extraction of information on symptoms and foods. We evaluate the performance of open-weight LLMs across these three tasks: GI illness detection, symptom extraction, and food extraction. We compare this performance to RoBERTa-based classification models fine-tuned specifically for these tasks. Our results show that using prompt-based approaches, LLMs achieve micro-F1 scores of over 90% for all three of our tasks. Using prompting alone, we achieve micro-F1 scores that exceed those of smaller fine-tuned models. We further demonstrate the robustness of LLMs in GI illness detection across three bias-focused experiments. Our results suggest that publicly available review text and LLMs offer substantial potential for public health surveillance of GI illness by enabling highly effective extraction of key information. While LLMs appear to exhibit minimal bias in processing, the inherent limitations of restaurant review data highlight the need for cautious interpretation of results.

Title: Solving Bayesian inverse problems with diffusion priors and off-policy RL

Authors: Luca Scimeca, Siddarth Venkatraman, Moksh Jain, Minsu Kim, Marcin Sendera, Mohsin Hasan, Luke Rowe, Sarthak Mittal, Pablo Lemos, Emmanuel Bengio, Alexandre Adam, Jarrid Rector-Brooks, Yashar Hezaveh, Laurence Perreault-Levasseur, Yoshua Bengio, Glen Berseth, Nikolay Malkin
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2503.09746
Pdf URL: https://arxiv.org/pdf/2503.09746
Copy Paste: [[2503.09746]] Solving Bayesian inverse problems with diffusion priors and off-policy RL(https://arxiv.org/abs/2503.09746)
Keywords: diffusion
Abstract: This paper presents a practical application of Relative Trajectory Balance (RTB), a recently introduced off-policy reinforcement learning (RL) objective that can asymptotically solve Bayesian inverse problems optimally. We extend the original work by using RTB to train conditional diffusion model posteriors from pretrained unconditional priors for challenging linear and non-linear inverse problems in vision, and science. We use the objective alongside techniques such as off-policy backtracking exploration to improve training. Importantly, our results show that existing training-free diffusion posterior methods struggle to perform effective posterior inference in latent space due to inherent biases.

Title: SASNet: Spatially-Adaptive Sinusoidal Neural Networks

Authors: Haoan Feng, Diana Aldana, Tiago Novello, Leila De Floriani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09750
Pdf URL: https://arxiv.org/pdf/2503.09750
Copy Paste: [[2503.09750]] SASNet: Spatially-Adaptive Sinusoidal Neural Networks(https://arxiv.org/abs/2503.09750)
Keywords: robust
Abstract: Sinusoidal neural networks (SNNs) have emerged as powerful implicit neural representations (INRs) for low-dimensional signals in computer vision and graphics. They enable high-frequency signal reconstruction and smooth manifold modeling; however, they often suffer from spectral bias, training instability, and overfitting. To address these challenges, we propose SASNet, Spatially-Adaptive SNNs that robustly enhance the capacity of compact INRs to fit detailed signals. SASNet integrates a frequency embedding layer to control frequency components and mitigate spectral bias, along with jointly optimized, spatially-adaptive masks that localize neuron influence, reducing network redundancy and improving convergence stability. Robust to hyperparameter selection, SASNet faithfully reconstructs high-frequency signals without overfitting low-frequency regions. Our experiments show that SASNet outperforms state-of-the-art INRs, achieving strong fitting accuracy, super-resolution capability, and noise suppression, without sacrificing model compactness.

Title: Distributionally Robust Multi-Agent Reinforcement Learning for Dynamic Chute Mapping

Authors: Guangyi Liu, Suzan Iloglu, Michael Caldara, Joseph W. Durham, Michael M. Zavlanos
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2503.09755
Pdf URL: https://arxiv.org/pdf/2503.09755
Copy Paste: [[2503.09755]] Distributionally Robust Multi-Agent Reinforcement Learning for Dynamic Chute Mapping(https://arxiv.org/abs/2503.09755)
Keywords: robust
Abstract: In Amazon robotic warehouses, the destination-to-chute mapping problem is crucial for efficient package sorting. Often, however, this problem is complicated by uncertain and dynamic package induction rates, which can lead to increased package recirculation. To tackle this challenge, we introduce a Distributionally Robust Multi-Agent Reinforcement Learning (DRMARL) framework that learns a destination-to-chute mapping policy that is resilient to adversarial variations in induction rates. Specifically, DRMARL relies on group distributionally robust optimization (DRO) to learn a policy that performs well not only on average but also on each individual subpopulation of induction rates within the group that capture, for example, different seasonality or operation modes of the system. This approach is then combined with a novel contextual bandit-based predictor of the worst-case induction distribution for each state-action pair, significantly reducing the cost of exploration and thereby increasing the learning efficiency and scalability of our framework. Extensive simulations demonstrate that DRMARL achieves robust chute mapping in the presence of varying induction distributions, reducing package recirculation by an average of 80\% in the simulation scenario.

Title: BiasConnect: Investigating Bias Interactions in Text-to-Image Models

Authors: Pushkar Shukla, Aditya Chinchure, Emily Diana, Alexander Tolbert, Kartik Hosanagar, Vineeth N. Balasubramanian, Leonid Sigal, Matthew A. Turk
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.09763
Pdf URL: https://arxiv.org/pdf/2503.09763
Copy Paste: [[2503.09763]] BiasConnect: Investigating Bias Interactions in Text-to-Image Models(https://arxiv.org/abs/2503.09763)
Keywords: fair, generative
Abstract: The biases exhibited by Text-to-Image (TTI) models are often treated as if they are independent, but in reality, they may be deeply interrelated. Addressing bias along one dimension, such as ethnicity or age, can inadvertently influence another dimension, like gender, either mitigating or exacerbating existing disparities. Understanding these interdependencies is crucial for designing fairer generative models, yet measuring such effects quantitatively remains a challenge. In this paper, we aim to address these questions by introducing BiasConnect, a novel tool designed to analyze and quantify bias interactions in TTI models. Our approach leverages a counterfactual-based framework to generate pairwise causal graphs that reveals the underlying structure of bias interactions for the given text prompt. Additionally, our method provides empirical estimates that indicate how other bias dimensions shift toward or away from an ideal distribution when a given bias is modified. Our estimates have a strong correlation (+0.69) with the interdependency observations post bias mitigation. We demonstrate the utility of BiasConnect for selecting optimal bias mitigation axes, comparing different TTI models on the dependencies they learn, and understanding the amplification of intersectional societal biases in TTI models.

Title: Designing Graph Convolutional Neural Networks for Discrete Choice with Network Effects

Authors: Daniel F. Villarraga, Ricardo A. Daziano
Subjects: cs.LG, econ.EM, stat.ML
Abstract URL: https://arxiv.org/abs/2503.09786
Pdf URL: https://arxiv.org/pdf/2503.09786
Copy Paste: [[2503.09786]] Designing Graph Convolutional Neural Networks for Discrete Choice with Network Effects(https://arxiv.org/abs/2503.09786)
Keywords: interpretability
Abstract: We introduce a novel model architecture that incorporates network effects into discrete choice problems, achieving higher predictive performance than standard discrete choice models while offering greater interpretability than general-purpose flexible model classes. Econometric discrete choice models aid in studying individual decision-making, where agents select the option with the highest reward from a discrete set of alternatives. Intuitively, the utility an individual derives from a particular choice depends on their personal preferences and characteristics, the attributes of the alternative, and the value their peers assign to that alternative or their previous choices. However, most applications ignore peer influence, and models that do consider peer or network effects often lack the flexibility and predictive performance of recently developed approaches to discrete choice, such as deep learning. We propose a novel graph convolutional neural network architecture to model network effects in discrete choices, achieving higher predictive performance than standard discrete choice models while retaining the interpretability necessary for inference--a quality often lacking in general-purpose deep learning architectures. We evaluate our architecture using revealed commuting choice data, extended with travel times and trip costs for each travel mode for work-related trips in New York City, as well as 2016 U.S. election data aggregated by county, to test its performance on datasets with highly imbalanced classes. Given the interpretability of our models, we can estimate relevant economic metrics, such as the value of travel time savings in New York City. Finally, we compare the predictive performance and behavioral insights from our architecture to those derived from traditional discrete choice and general-purpose deep learning models.

Title: Constrained Language Generation with Discrete Diffusion Models

Authors: Michael Cardei, Jacob K Christopher, Thomas Hartvigsen, Brian R. Bartoldson, Bhavya Kailkhura, Ferdinando Fioretto
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.09790
Pdf URL: https://arxiv.org/pdf/2503.09790
Copy Paste: [[2503.09790]] Constrained Language Generation with Discrete Diffusion Models(https://arxiv.org/abs/2503.09790)
Keywords: diffusion
Abstract: Constraints are critical in text generation as LLM outputs are often unreliable when it comes to ensuring generated outputs adhere to user defined instruction or general safety guidelines. To address this gap, we present Constrained Discrete Diffusion (CDD), a novel method for enforcing constraints on natural language by integrating discrete diffusion models with differentiable optimization. Unlike conventional text generators, which often rely on post-hoc filtering or model retraining for controllable generation, we propose imposing constraints directly into the discrete diffusion sampling process. We illustrate how this technique can be applied to satisfy a variety of natural language constraints, including (i) toxicity mitigation by preventing harmful content from emerging, (ii) character and sequence level lexical constraints, and (iii) novel molecule sequence generation with specific property adherence. Experimental results show that our constraint-aware procedure achieves high fidelity in meeting these requirements while preserving fluency and semantic coherence, outperforming auto-regressive and existing discrete diffusion approaches.

Title: Minimal Time Series Transformer

Authors: Joni-Kristian Kämäräinen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.09791
Pdf URL: https://arxiv.org/pdf/2503.09791
Copy Paste: [[2503.09791]] Minimal Time Series Transformer(https://arxiv.org/abs/2503.09791)
Keywords: transformer
Abstract: Transformer is the state-of-the-art model for many natural language processing, computer vision, and audio analysis problems. Transformer effectively combines information from the past input and output samples in auto-regressive manner so that each sample becomes aware of all inputs and outputs. In sequence-to-sequence (Seq2Seq) modeling, the transformer processed samples become effective in predicting the next output. Time series forecasting is a Seq2Seq problem. The original architecture is defined for discrete input and output sequence tokens, but to adopt it for time series, the model must be adapted for continuous data. This work introduces minimal adaptations to make the original transformer architecture suitable for continuous value time series data.

Title: SeqSAM: Autoregressive Multiple Hypothesis Prediction for Medical Image Segmentation using SAM

Authors: Benjamin Towle, Xin Chen, Ke Zhou
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.09797
Pdf URL: https://arxiv.org/pdf/2503.09797
Copy Paste: [[2503.09797]] SeqSAM: Autoregressive Multiple Hypothesis Prediction for Medical Image Segmentation using SAM(https://arxiv.org/abs/2503.09797)
Keywords: segmentation
Abstract: Pre-trained segmentation models are a powerful and flexible tool for segmenting images. Recently, this trend has extended to medical imaging. Yet, often these methods only produce a single prediction for a given image, neglecting inherent uncertainty in medical images, due to unclear object boundaries and errors caused by the annotation tool. Multiple Choice Learning is a technique for generating multiple masks, through multiple learned prediction heads. However, this cannot readily be extended to producing more outputs than its initial pre-training hyperparameters, as the sparse, winner-takes-all loss function makes it easy for one prediction head to become overly dominant, thus not guaranteeing the clinical relevancy of each mask produced. We introduce SeqSAM, a sequential, RNN-inspired approach to generating multiple masks, which uses a bipartite matching loss for ensuring the clinical relevancy of each mask, and can produce an arbitrary number of masks. We show notable improvements in quality of each mask produced across two publicly available datasets. Our code is available at this https URL.

Title: Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo

Authors: Zachary Charles, Gabriel Teston, Lucio Dery, Keith Rush, Nova Fallen, Zachary Garrett, Arthur Szlam, Arthur Douillard
Subjects: cs.LG, cs.CL, cs.DC
Abstract URL: https://arxiv.org/abs/2503.09799
Pdf URL: https://arxiv.org/pdf/2503.09799
Copy Paste: [[2503.09799]] Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo(https://arxiv.org/abs/2503.09799)
Keywords: robust
Abstract: As we scale to more massive machine learning models, the frequent synchronization demands inherent in data-parallel approaches create significant slowdowns, posing a critical challenge to further scaling. Recent work develops an approach (DiLoCo) that relaxes synchronization demands without compromising model quality. However, these works do not carefully analyze how DiLoCo's behavior changes with model size. In this work, we study the scaling law behavior of DiLoCo when training LLMs under a fixed compute budget. We focus on how algorithmic factors, including number of model replicas, hyperparameters, and token budget affect training in ways that can be accurately predicted via scaling laws. We find that DiLoCo scales both predictably and robustly with model size. When well-tuned, DiLoCo scales better than data-parallel training with model size, and can outperform data-parallel training even at small model sizes. Our results showcase a more general set of benefits of DiLoCo than previously documented, including increased optimal batch sizes, improved downstream generalization with scale, and improved evaluation loss for a fixed token budget.

Title: Evaluating the Impact of Synthetic Data on Object Detection Tasks in Autonomous Driving

Authors: Enes Özeren, Arka Bhowmick
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09803
Pdf URL: https://arxiv.org/pdf/2503.09803
Copy Paste: [[2503.09803]] Evaluating the Impact of Synthetic Data on Object Detection Tasks in Autonomous Driving(https://arxiv.org/abs/2503.09803)
Keywords: robust
Abstract: The increasing applications of autonomous driving systems necessitates large-scale, high-quality datasets to ensure robust performance across diverse scenarios. Synthetic data has emerged as a viable solution to augment real-world datasets due to its cost-effectiveness, availability of precise ground-truth labels, and the ability to model specific edge cases. However, synthetic data may introduce distributional differences and biases that could impact model performance in real-world settings. To evaluate the utility and limitations of synthetic data, we conducted controlled experiments using multiple real-world datasets and a synthetic dataset generated by BIT Technology Solutions GmbH. Our study spans two sensor modalities, camera and LiDAR, and investigates both 2D and 3D object detection tasks. We compare models trained on real, synthetic, and mixed datasets, analyzing their robustness and generalization capabilities. Our findings demonstrate that the use of a combination of real and synthetic data improves the robustness and generalization of object detection models, underscoring the potential of synthetic data in advancing autonomous driving technologies.

Title: Temporal Difference Flows

Authors: Jesse Farebrother, Matteo Pirotta, Andrea Tirinzoni, Rémi Munos, Alessandro Lazaric, Ahmed Touati
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2503.09817
Pdf URL: https://arxiv.org/pdf/2503.09817
Copy Paste: [[2503.09817]] Temporal Difference Flows(https://arxiv.org/abs/2503.09817)
Keywords: diffusion, generative
Abstract: Predictive models of the future are fundamental for an agent's ability to reason and plan. A common strategy learns a world model and unrolls it step-by-step at inference, where small errors can rapidly compound. Geometric Horizon Models (GHMs) offer a compelling alternative by directly making predictions of future states, avoiding cumulative inference errors. While GHMs can be conveniently learned by a generative analog to temporal difference (TD) learning, existing methods are negatively affected by bootstrapping predictions at train time and struggle to generate high-quality predictions at long horizons. This paper introduces Temporal Difference Flows (TD-Flow), which leverages the structure of a novel Bellman equation on probability paths alongside flow-matching techniques to learn accurate GHMs at over 5x the horizon length of prior methods. Theoretically, we establish a new convergence result and primarily attribute TD-Flow's efficacy to reduced gradient variance during training. We further show that similar arguments can be extended to diffusion-based methods. Empirically, we validate TD-Flow across a diverse set of domains on both generative metrics and downstream tasks including policy evaluation. Moreover, integrating TD-Flow with recent behavior foundation models for planning over pre-trained policies demonstrates substantial performance gains, underscoring its promise for long-horizon decision-making.

Title: Attention Reveals More Than Tokens: Training-Free Long-Context Reasoning with Attention-guided Retrieval

Authors: Yuwei Zhang, Jayanth Srinivasa, Gaowen Liu, Jingbo Shang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.09819
Pdf URL: https://arxiv.org/pdf/2503.09819
Copy Paste: [[2503.09819]] Attention Reveals More Than Tokens: Training-Free Long-Context Reasoning with Attention-guided Retrieval(https://arxiv.org/abs/2503.09819)
Keywords: large language model
Abstract: Large Language Models (LLMs) often exhibit substantially shorter effective context lengths than their claimed capacities, especially when handling complex reasoning tasks that require integrating information from multiple parts of a long context and performing multi-step reasoning. Although Chain-of-Thought (CoT) prompting has shown promise in reducing task complexity, our empirical analysis reveals that it does not fully resolve this limitation. Through controlled experiments, we identify poor recall of implicit facts as the primary cause of failure, which significantly hampers reasoning performance. Interestingly, we observe that the internal attention weights from the generated CoT tokens can effectively ground implicit facts, even when these facts are not explicitly recalled. Building on this insight, we propose a novel training-free algorithm, Attrieval, which leverages attention weights to retrieve relevant facts from the long context and incorporates them into the reasoning process. Additionally, we find that selecting context tokens from CoT tokens further improves performance. Our results demonstrate that Attrieval enhances long-context reasoning capability notably on both synthetic and real-world QA datasets with various models.

Title: Generative AI for Named Entity Recognition in Low-Resource Language Nepali

Authors: Sameer Neupane (University of Memphis), Jeevan Chapagain (University of Memphis), Nobal B. Niraula (Nowa Lab), Diwa Koirala (Nowa Lab)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.09822
Pdf URL: https://arxiv.org/pdf/2503.09822
Copy Paste: [[2503.09822]] Generative AI for Named Entity Recognition in Low-Resource Language Nepali(https://arxiv.org/abs/2503.09822)
Keywords: generative, large language model
Abstract: Generative Artificial Intelligence (GenAI), particularly Large Language Models (LLMs), has significantly advanced Natural Language Processing (NLP) tasks, such as Named Entity Recognition (NER), which involves identifying entities like person, location, and organization names in text. LLMs are especially promising for low-resource languages due to their ability to learn from limited data. However, the performance of GenAI models for Nepali, a low-resource language, has not been thoroughly evaluated. This paper investigates the application of state-of-the-art LLMs for Nepali NER, conducting experiments with various prompting techniques to assess their effectiveness. Our results provide insights into the challenges and opportunities of using LLMs for NER in low-resource settings and offer valuable contributions to the advancement of NLP research in languages like Nepali.

Title: Data Traceability for Privacy Alignment

Authors: Kevin Liao, Shreya Thipireddy, Daniel Weitzner
Subjects: cs.CR, cs.CY
Abstract URL: https://arxiv.org/abs/2503.09823
Pdf URL: https://arxiv.org/pdf/2503.09823
Copy Paste: [[2503.09823]] Data Traceability for Privacy Alignment(https://arxiv.org/abs/2503.09823)
Keywords: security, privacy
Abstract: This paper offers a new privacy approach for the growing ecosystem of services--ranging from open banking to healthcare--dependent on sensitive personal data sharing between individuals and third-parties. While these services offer significant benefits, individuals want control over their data, transparency regarding how their data is used, and accountability from third-parties for misuse. However, existing legal and technical mechanisms are inadequate for supporting these needs. A comprehensive approach to the modern privacy challenges of accountable third-party data sharing requires a closer alignment of technical system architecture and legal institutional design. In order to achieve this privacy alignment, we extend traditional security threat modeling and analysis to encompass a broader range of privacy notions than has been typically considered. In particular, we introduce the concept of covert-accountability, which addresses adversaries that may act dishonestly but face potential identification and legal consequences. As a concrete instance of this design approach, we present the OTrace protocol, designed to provide traceable, accountable, consumer-control in third-party data sharing ecosystems. OTrace empowers consumers with the knowledge of where their data is, who has it, what it is being used for, and whom it is being shared with. By applying our alignment framework to OTrace, we demonstrate that OTrace's technical affordances can provide more confident, scalable regulatory oversight when combined with complementary legal mechanisms.

Title: Isolated Channel Vision Transformers: From Single-Channel Pretraining to Multi-Channel Finetuning

Authors: Wenyi Lian, Joakim Lindblad, Patrick Micke, Nataša Sladoje
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09826
Pdf URL: https://arxiv.org/pdf/2503.09826
Copy Paste: [[2503.09826]] Isolated Channel Vision Transformers: From Single-Channel Pretraining to Multi-Channel Finetuning(https://arxiv.org/abs/2503.09826)
Keywords: robust, transformer
Abstract: Vision Transformers (ViTs) have achieved remarkable success in standard RGB image processing tasks. However, applying ViTs to multi-channel imaging (MCI) data, e.g., for medical and remote sensing applications, remains a challenge. In particular, MCI data often consist of layers acquired from different modalities. Directly training ViTs on such data can obscure complementary information and impair the performance. In this paper, we introduce a simple yet effective pretraining framework for large-scale MCI datasets. Our method, named Isolated Channel ViT (IC-ViT), patchifies image channels individually and thereby enables pretraining for multimodal multi-channel tasks. We show that this channel-wise patchifying is a key technique for MCI processing. More importantly, one can pretrain the IC-ViT on single channels and finetune it on downstream multi-channel datasets. This pretraining framework captures dependencies between patches as well as channels and produces robust feature representation. Experiments on various tasks and benchmarks, including JUMP-CP and CHAMMI for cell microscopy imaging, and So2Sat-LCZ42 for satellite imaging, show that the proposed IC-ViT delivers 4-14 percentage points of performance improvement over existing channel-adaptive approaches. Further, its efficient training makes it a suitable candidate for large-scale pretraining of foundation models on heterogeneous data.

Title: Resolution Invariant Autoencoder

Authors: Ashay Patel, Michela Antonelli, Sebastien Ourselin, M. Jorge Cardoso
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.09828
Pdf URL: https://arxiv.org/pdf/2503.09828
Copy Paste: [[2503.09828]] Resolution Invariant Autoencoder(https://arxiv.org/abs/2503.09828)
Keywords: generative
Abstract: Deep learning has significantly advanced medical imaging analysis, yet variations in image resolution remain an overlooked challenge. Most methods address this by resampling images, leading to either information loss or computational inefficiencies. While solutions exist for specific tasks, no unified approach has been proposed. We introduce a resolution-invariant autoencoder that adapts spatial resizing at each layer in the network via a learned variable resizing process, replacing fixed spatial down/upsampling at the traditional factor of 2. This ensures a consistent latent space resolution, regardless of input or output resolution. Our model enables various downstream tasks to be performed on an image latent whilst maintaining performance across different resolutions, overcoming the shortfalls of traditional methods. We demonstrate its effectiveness in uncertainty-aware super-resolution, classification, and generative modelling tasks and show how our method outperforms conventional baselines with minimal performance loss across resolutions.

Title: Exploring Position Encoding in Diffusion U-Net for Training-free High-resolution Image Generation

Authors: Feng Zhou, Pu Cao, Yiyang Ma, Lu Yang, Jianqin Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09830
Pdf URL: https://arxiv.org/pdf/2503.09830
Copy Paste: [[2503.09830]] Exploring Position Encoding in Diffusion U-Net for Training-free High-resolution Image Generation(https://arxiv.org/abs/2503.09830)
Keywords: diffusion, generative
Abstract: Denoising higher-resolution latents via a pre-trained U-Net leads to repetitive and disordered image patterns. Although recent studies make efforts to improve generative quality by aligning denoising process across original and higher resolutions, the root cause of suboptimal generation is still lacking exploration. Through comprehensive analysis of position encoding in U-Net, we attribute it to inconsistent position encoding, sourced by the inadequate propagation of position information from zero-padding to latent features in convolution layers as resolution increases. To address this issue, we propose a novel training-free approach, introducing a Progressive Boundary Complement (PBC) method. This method creates dynamic virtual image boundaries inside the feature map to enhance position information propagation, enabling high-quality and rich-content high-resolution image synthesis. Extensive experiments demonstrate the superiority of our method.

Title: A Comprehensive Review on Understanding the Decentralized and Collaborative Approach in Machine Learning

Authors: Sarwar Saif, Md Jahirul Islam, Md. Zihad Bin Jahangir, Parag Biswas, Abdur Rashid, MD Abdullah Al Nasim, Kishor Datta Gupta
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.09833
Pdf URL: https://arxiv.org/pdf/2503.09833
Copy Paste: [[2503.09833]] A Comprehensive Review on Understanding the Decentralized and Collaborative Approach in Machine Learning(https://arxiv.org/abs/2503.09833)
Keywords: security, protect, federate, fair
Abstract: The arrival of Machine Learning (ML) completely changed how we can unlock valuable information from data. Traditional methods, where everything was stored in one place, had big problems with keeping information private, handling large amounts of data, and avoiding unfair advantages. Machine Learning has become a powerful tool that uses Artificial Intelligence (AI) to overcome these challenges. We started by learning the basics of Machine Learning, including the different types like supervised, unsupervised, and reinforcement learning. We also explored the important steps involved, such as preparing the data, choosing the right model, training it, and then checking its performance. Next, we examined some key challenges in Machine Learning, such as models learning too much from specific examples (overfitting), not learning enough (underfitting), and reflecting biases in the data used. Moving beyond centralized systems, we looked at decentralized Machine Learning and its benefits, like keeping data private, getting answers faster, and using a wider variety of data sources. We then focused on a specific type called federated learning, where models are trained without directly sharing sensitive information. Real-world examples from healthcare and finance were used to show how collaborative Machine Learning can solve important problems while still protecting information security. Finally, we discussed challenges like communication efficiency, dealing with different types of data, and security. We also explored using a Zero Trust framework, which provides an extra layer of protection for collaborative Machine Learning systems. This approach is paving the way for a bright future for this groundbreaking technology.

Title: Who Are You Behind the Screen? Implicit MBTI and Gender Detection Using Artificial Intelligence

Authors: Kourosh Shahnazari, Seyed Moein Ayyoubzadeh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.09853
Pdf URL: https://arxiv.org/pdf/2503.09853
Copy Paste: [[2503.09853]] Who Are You Behind the Screen? Implicit MBTI and Gender Detection Using Artificial Intelligence(https://arxiv.org/abs/2503.09853)
Keywords: transformer
Abstract: In personalized technology and psychological research, precisely detecting demographic features and personality traits from digital interactions becomes ever more important. This work investigates implicit categorization, inferring personality and gender variables directly from linguistic patterns in Telegram conversation data, while conventional personality prediction techniques mostly depend on explicitly self-reported labels. We refine a Transformer-based language model (RoBERTa) to capture complex linguistic cues indicative of personality traits and gender differences using a dataset comprising 138,866 messages from 1,602 users annotated with MBTI types and 195,016 messages from 2,598 users annotated with gender. Confidence levels help to greatly raise model accuracy to 86.16\%, hence proving RoBERTa's capacity to consistently identify implicit personality types from conversational text data. Our results highlight the usefulness of Transformer topologies for implicit personality and gender classification, hence stressing their efficiency and stressing important trade-offs between accuracy and coverage in realistic conversational environments. With regard to gender classification, the model obtained an accuracy of 74.4\%, therefore capturing gender-specific language patterns. Personality dimension analysis showed that people with introverted and intuitive preferences are especially more active in text-based interactions. This study emphasizes practical issues in balancing accuracy and data coverage as Transformer-based models show their efficiency in implicit personality and gender prediction tasks from conversational texts.

Title: Foundation X: Integrating Classification, Localization, and Segmentation through Lock-Release Pretraining Strategy for Chest X-ray Analysis

Authors: Nahid Ul Islam, DongAo Ma, Jiaxuan Pang, Shivasakthi Senthil Velan, Michael Gotway, Jianming Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09860
Pdf URL: https://arxiv.org/pdf/2503.09860
Copy Paste: [[2503.09860]] Foundation X: Integrating Classification, Localization, and Segmentation through Lock-Release Pretraining Strategy for Chest X-ray Analysis(https://arxiv.org/abs/2503.09860)
Keywords: robust, segmentation
Abstract: Developing robust and versatile deep-learning models is essential for enhancing diagnostic accuracy and guiding clinical interventions in medical imaging, but it requires a large amount of annotated data. The advancement of deep learning has facilitated the creation of numerous medical datasets with diverse expert-level annotations. Aggregating these datasets can maximize data utilization and address the inadequacy of labeled data. However, the heterogeneity of expert-level annotations across tasks such as classification, localization, and segmentation presents a significant challenge for learning from these datasets. To this end, we introduce nFoundation X, an end-to-end framework that utilizes diverse expert-level annotations from numerous public datasets to train a foundation model capable of multiple tasks including classification, localization, and segmentation. To address the challenges of annotation and task heterogeneity, we propose a Lock-Release pretraining strategy to enhance the cyclic learning from multiple datasets, combined with the student-teacher learning paradigm, ensuring the model retains general knowledge for all tasks while preventing overfitting to any single task. To demonstrate the effectiveness of Foundation X, we trained a model using 11 chest X-ray datasets, covering annotations for classification, localization, and segmentation tasks. Our experimental results show that Foundation X achieves notable performance gains through extensive annotation utilization, excels in cross-dataset and cross-task learning, and further enhances performance in organ localization and segmentation tasks. All code and pretrained models are publicly accessible at this https URL.

Title: EquiPy: Sequential Fairness using Optimal Transport in Python

Authors: Agathe Fernandes Machado, Suzie Grondin, Philipp Ratz, Arthur Charpentier, François Hu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.09866
Pdf URL: https://arxiv.org/pdf/2503.09866
Copy Paste: [[2503.09866]] EquiPy: Sequential Fairness using Optimal Transport in Python(https://arxiv.org/abs/2503.09866)
Keywords: fair
Abstract: Algorithmic fairness has received considerable attention due to the failures of various predictive AI systems that have been found to be unfairly biased against subgroups of the population. Many approaches have been proposed to mitigate such biases in predictive systems, however, they often struggle to provide accurate estimates and transparent correction mechanisms in the case where multiple sensitive variables, such as a combination of gender and race, are involved. This paper introduces a new open source Python package, EquiPy, which provides a easy-to-use and model agnostic toolbox for efficiently achieving fairness across multiple sensitive variables. It also offers comprehensive graphic utilities to enable the user to interpret the influence of each sensitive variable within a global context. EquiPy makes use of theoretical results that allow the complexity arising from the use of multiple variables to be broken down into easier-to-solve sub-problems. We demonstrate the ease of use for both mitigation and interpretation on publicly available data derived from the US Census and provide sample code for its use.

Title: LuciBot: Automated Robot Policy Learning from Generated Videos

Authors: Xiaowen Qiu, Yian Wang, Jiting Cai, Zhehuan Chen, Chunru Lin, Tsun-Hsuan Wang, Chuang Gan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09871
Pdf URL: https://arxiv.org/pdf/2503.09871
Copy Paste: [[2503.09871]] LuciBot: Automated Robot Policy Learning from Generated Videos(https://arxiv.org/abs/2503.09871)
Keywords: large language model, segmentation
Abstract: Automatically generating training supervision for embodied tasks is crucial, as manual designing is tedious and not scalable. While prior works use large language models (LLMs) or vision-language models (VLMs) to generate rewards, these approaches are largely limited to simple tasks with well-defined rewards, such as pick-and-place. This limitation arises because LLMs struggle to interpret complex scenes compressed into text or code due to their restricted input modality, while VLM-based rewards, though better at visual perception, remain limited by their less expressive output modality. To address these challenges, we leverage the imagination capability of general-purpose video generation models. Given an initial simulation frame and a textual task description, the video generation model produces a video demonstrating task completion with correct semantics. We then extract rich supervisory signals from the generated video, including 6D object pose sequences, 2D segmentations, and estimated depth, to facilitate task learning in simulation. Our approach significantly improves supervision quality for complex embodied tasks, enabling large-scale training in simulators.

Title: FDCT: Frequency-Aware Decomposition and Cross-Modal Token-Alignment for Multi-Sensor Target Classification

Authors: Shoaib Meraj Sami, Md Mahedi Hasan, Nasser M. Nasrabadi, Raghuveer Rao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09873
Pdf URL: https://arxiv.org/pdf/2503.09873
Copy Paste: [[2503.09873]] FDCT: Frequency-Aware Decomposition and Cross-Modal Token-Alignment for Multi-Sensor Target Classification(https://arxiv.org/abs/2503.09873)
Keywords: robust
Abstract: In automatic target recognition (ATR) systems, sensors may fail to capture discriminative, fine-grained detail features due to environmental conditions, noise created by CMOS chips, occlusion, parallaxes, and sensor misalignment. Therefore, multi-sensor image fusion is an effective choice to overcome these constraints. However, multi-modal image sensors are heterogeneous and have domain and granularity gaps. In addition, the multi-sensor images can be misaligned due to intricate background clutters, fluctuating illumination conditions, and uncontrolled sensor settings. In this paper, to overcome these issues, we decompose, align, and fuse multiple image sensor data for target classification. We extract the domain-specific and domain-invariant features from each sensor data. We propose to develop a shared unified discrete token (UDT) space between sensors to reduce the domain and granularity gaps. Additionally, we develop an alignment module to overcome the misalignment between multi-sensors and emphasize the discriminative representation of the UDT space. In the alignment module, we introduce sparsity constraints to provide a better cross-modal representation of the UDT space and robustness against various sensor settings. We achieve superior classification performance compared to single-modality classifiers and several state-of-the-art multi-modal fusion algorithms on four multi-sensor ATR datasets.

Title: CleverDistiller: Simple and Spatially Consistent Cross-modal Distillation

Authors: Hariprasath Govindarajan, Maciej K. Wozniak, Marvin Klingner, Camille Maurice, B Ravi Kiran, Senthil Yogamani
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2503.09878
Pdf URL: https://arxiv.org/pdf/2503.09878
Copy Paste: [[2503.09878]] CleverDistiller: Simple and Spatially Consistent Cross-modal Distillation(https://arxiv.org/abs/2503.09878)
Keywords: segmentation
Abstract: Vision foundation models (VFMs) such as DINO have led to a paradigm shift in 2D camera-based perception towards extracting generalized features to support many downstream tasks. Recent works introduce self-supervised cross-modal knowledge distillation (KD) as a way to transfer these powerful generalization capabilities into 3D LiDAR-based models. However, they either rely on highly complex distillation losses, pseudo-semantic maps, or limit KD to features useful for semantic segmentation only. In this work, we propose CleverDistiller, a self-supervised, cross-modal 2D-to-3D KD framework introducing a set of simple yet effective design choices: Unlike contrastive approaches relying on complex loss design choices, our method employs a direct feature similarity loss in combination with a multi layer perceptron (MLP) projection head to allow the 3D network to learn complex semantic dependencies throughout the projection. Crucially, our approach does not depend on pseudo-semantic maps, allowing for direct knowledge transfer from a VFM without explicit semantic supervision. Additionally, we introduce the auxiliary self-supervised spatial task of occupancy prediction to enhance the semantic knowledge, obtained from a VFM through KD, with 3D spatial reasoning capabilities. Experiments on standard autonomous driving benchmarks for 2D-to-3D KD demonstrate that CleverDistiller achieves state-of-the-art performance in both semantic segmentation and 3D object detection (3DOD) by up to 10% mIoU, especially when fine tuning on really low data amounts, showing the effectiveness of our simple yet powerful KD strategy

Title: Tracking the Best Expert Privately

Authors: Aadirupa Saha, Vinod Raman, Hilal Asi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.09889
Pdf URL: https://arxiv.org/pdf/2503.09889
Copy Paste: [[2503.09889]] Tracking the Best Expert Privately(https://arxiv.org/abs/2503.09889)
Keywords: privacy
Abstract: We design differentially private algorithms for the problem of prediction with expert advice under dynamic regret, also known as tracking the best expert. Our work addresses three natural types of adversaries, stochastic with shifting distributions, oblivious, and adaptive, and designs algorithms with sub-linear regret for all three cases. In particular, under a shifting stochastic adversary where the distribution may shift $S$ times, we provide an $\epsilon$-differentially private algorithm whose expected dynamic regret is at most $O\left( \sqrt{S T \log (NT)} + \frac{S \log (NT)}{\epsilon}\right)$, where $T$ and $N$ are the epsilon horizon and number of experts, respectively. For oblivious adversaries, we give a reduction from dynamic regret minimization to static regret minimization, resulting in an upper bound of $O\left(\sqrt{S T \log(NT)} + \frac{S T^{1/3}\log(T/\delta) \log(NT)}{\epsilon^{2/3}}\right)$ on the expected dynamic regret, where $S$ now denotes the allowable number of switches of the best expert. Finally, similar to static regret, we establish a fundamental separation between oblivious and adaptive adversaries for the dynamic setting: while our algorithms show that sub-linear regret is achievable for oblivious adversaries in the high-privacy regime $\epsilon \le \sqrt{S/T}$, we show that any $(\epsilon, \delta)$-differentially private algorithm must suffer linear dynamic regret under adaptive adversaries for $\epsilon \le \sqrt{S/T}$. Finally, to complement this lower bound, we give an $\epsilon$-differentially private algorithm that attains sub-linear dynamic regret under adaptive adversaries whenever $\epsilon \gg \sqrt{S/T}$.

Title: What's In Your Field? Mapping Scientific Research with Knowledge Graphs and Large Language Models

Authors: Abhipsha Das, Nicholas Lourie, Siavash Golkar, Mariel Pettee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.09894
Pdf URL: https://arxiv.org/pdf/2503.09894
Copy Paste: [[2503.09894]] What's In Your Field? Mapping Scientific Research with Knowledge Graphs and Large Language Models(https://arxiv.org/abs/2503.09894)
Keywords: large language model
Abstract: The scientific literature's exponential growth makes it increasingly challenging to navigate and synthesize knowledge across disciplines. Large language models (LLMs) are powerful tools for understanding scientific text, but they fail to capture detailed relationships across large bodies of work. Unstructured approaches, like retrieval augmented generation, can sift through such corpora to recall relevant facts; however, when millions of facts influence the answer, unstructured approaches become cost prohibitive. Structured representations offer a natural complement -- enabling systematic analysis across the whole corpus. Recent work enhances LLMs with unstructured or semistructured representations of scientific concepts; to complement this, we try extracting structured representations using LLMs. By combining LLMs' semantic understanding with a schema of scientific concepts, we prototype a system that answers precise questions about the literature as a whole. Our schema applies across scientific fields and we extract concepts from it using only 20 manually annotated abstracts. To demonstrate the system, we extract concepts from 30,000 papers on arXiv spanning astrophysics, fluid dynamics, and evolutionary biology. The resulting database highlights emerging trends and, by visualizing the knowledge graph, offers new ways to explore the ever-growing landscape of scientific knowledge. Demo: abby101/surveyor-0 on HF Spaces. Code: this https URL.

Title: A Semantic-Loss Function Modeling Framework With Task-Oriented Machine Learning Perspectives

Authors: Ti Ti Nguyen, Thanh-Dung Le, Vu Nguyen Ha, Hong-fu Chou, Geoffrey Eappen, Duc-Dung Tran, Hung Nguyen-Kha, Prabhu Thiruvasagam, Luis M. Garces-Socarras, Jorge L. Gonzalez-Rios, Juan C. Merlano-Duncan, Symeon Chatzinotas
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2503.09903
Pdf URL: https://arxiv.org/pdf/2503.09903
Copy Paste: [[2503.09903]] A Semantic-Loss Function Modeling Framework With Task-Oriented Machine Learning Perspectives(https://arxiv.org/abs/2503.09903)
Keywords: extraction
Abstract: The integration of machine learning (ML) has significantly enhanced the capabilities of Earth Observation (EO) systems by enabling the extraction of actionable insights from complex datasets. However, the performance of data-driven EO applications is heavily influenced by the data collection and transmission processes, where limited satellite bandwidth and latency constraints can hinder the full transmission of original data to the receivers. To address this issue, adopting the concepts of Semantic Communication (SC) offers a promising solution by prioritizing the transmission of essential data semantics over raw information. Implementing SC for EO systems requires a thorough understanding of the impact of data processing and communication channel conditions on semantic loss at the processing center. This work proposes a novel data-fitting framework to empirically model the semantic loss using real-world EO datasets and domain-specific insights. The framework quantifies two primary types of semantic loss: (1) source coding loss, assessed via a data quality indicator measuring the impact of processing on raw source data, and (2) transmission loss, evaluated by comparing practical transmission performance against the Shannon limit. Semantic losses are estimated by evaluating the accuracy of EO applications using four task-oriented ML models, EfficientViT, MobileViT, ResNet50-DINO, and ResNet8-KD, on lossy image datasets under varying channel conditions and compression ratios. These results underpin a framework for efficient semantic-loss modeling in bandwidth-constrained EO scenarios, enabling more reliable and effective operations.

Title: eXpLogic: Explaining Logic Types and Patterns in DiffLogic Networks

Authors: Stephen Wormald, David Koblah, Matheus Kunzler Maldaner, Domenic Forte, Damon L. Woodard
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.09910
Pdf URL: https://arxiv.org/pdf/2503.09910
Copy Paste: [[2503.09910]] eXpLogic: Explaining Logic Types and Patterns in DiffLogic Networks(https://arxiv.org/abs/2503.09910)
Keywords: defense, explainability
Abstract: Constraining deep neural networks (DNNs) to learn individual logic types per node, as performed using the DiffLogic network architecture, opens the door to model-specific explanation techniques that quell the complexity inherent to DNNs. Inspired by principles of circuit analysis from computer engineering, this work presents an algorithm (eXpLogic) for producing saliency maps which explain input patterns that activate certain functions. The eXpLogic explanations: (1) show the exact set of inputs responsible for a decision, which helps interpret false negative and false positive predictions, (2) highlight common input patterns that activate certain outputs, and (3) help reduce the network size to improve class-specific inference. To evaluate the eXpLogic saliency map, we introduce a metric that quantifies how much an input changes before switching a model's class prediction (the SwitchDist) and use this metric to compare eXpLogic against the Vanilla Gradients (VG) and Integrated Gradient (IG) methods. Generally, we show that eXpLogic saliency maps are better at predicting which inputs will change the class score. These maps help reduce the network size and inference times by 87\% and 8\%, respectively, while having a limited impact (-3.8\%) on class-specific predictions. The broader value of this work to machine learning is in demonstrating how certain DNN architectures promote explainability, which is relevant to healthcare, defense, and law.

Title: Inter-environmental world modeling for continuous and compositional dynamics

Authors: Kohei Hayashi, Masanori Koyama, Julian Jorge Andrade Guerreiro
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.09911
Pdf URL: https://arxiv.org/pdf/2503.09911
Copy Paste: [[2503.09911]] Inter-environmental world modeling for continuous and compositional dynamics(https://arxiv.org/abs/2503.09911)
Keywords: generative
Abstract: Various world model frameworks are being developed today based on autoregressive frameworks that rely on discrete representations of actions and observations, and these frameworks are succeeding in constructing interactive generative models for the target environment of interest. Meanwhile, humans demonstrate remarkable generalization abilities to combine experiences in multiple environments to mentally simulate and learn to control agents in diverse environments. Inspired by this human capability, we introduce World modeling through Lie Action (WLA), an unsupervised framework that learns continuous latent action representations to simulate across environments. WLA learns a control interface with high controllability and predictive ability by simultaneously modeling the dynamics of multiple environments using Lie group theory and object-centric autoencoder. On synthetic benchmark and real-world datasets, we demonstrate that WLA can be trained using only video frames and, with minimal or no action labels, can quickly adapt to new environments with novel action sets.

Title: PluralLLM: Pluralistic Alignment in LLMs via Federated Learning

Authors: Mahmoud Srewa, Tianyu Zhao, Salma Elmalaki
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.09925
Pdf URL: https://arxiv.org/pdf/2503.09925
Copy Paste: [[2503.09925]] PluralLLM: Pluralistic Alignment in LLMs via Federated Learning(https://arxiv.org/abs/2503.09925)
Keywords: privacy, federate, fair, transformer, large language model
Abstract: Ensuring Large Language Models (LLMs) align with diverse human preferences while preserving privacy and fairness remains a challenge. Existing methods, such as Reinforcement Learning from Human Feedback (RLHF), rely on centralized data collection, making them computationally expensive and privacy-invasive. We introduce PluralLLM a federated learning-based approach that enables multiple user groups to collaboratively train a transformer-based preference predictor without sharing sensitive data, which can also serve as a reward model for aligning LLMs. Our method leverages Federated Averaging (FedAvg) to aggregate preference updates efficiently, achieving 46% faster convergence, a 4% improvement in alignment scores, and nearly the same group fairness measure as in centralized training. Evaluated on a Q/A preference alignment task, PluralLLM demonstrates that federated preference learning offers a scalable and privacy-preserving alternative for aligning LLMs with diverse human values.

Title: VideoMerge: Towards Training-free Long Video Generation

Authors: Siyang Zhang, Harry Yang, Ser-Nam Lim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09926
Pdf URL: https://arxiv.org/pdf/2503.09926
Copy Paste: [[2503.09926]] VideoMerge: Towards Training-free Long Video Generation(https://arxiv.org/abs/2503.09926)
Keywords: diffusion
Abstract: Long video generation remains a challenging and compelling topic in computer vision. Diffusion based models, among the various approaches to video generation, have achieved state of the art quality with their iterative denoising procedures. However, the intrinsic complexity of the video domain renders the training of such diffusion models exceedingly expensive in terms of both data curation and computational resources. Moreover, these models typically operate on a fixed noise tensor that represents the video, resulting in predetermined spatial and temporal dimensions. Although several high quality open-source pretrained video diffusion models, jointly trained on images and videos of varying lengths and resolutions, are available, it is generally not recommended to specify a video length at inference that was not included in the training set. Consequently, these models are not readily adaptable to the direct generation of longer videos by merely increasing the specified video length. In addition to feasibility challenges, long-video generation also encounters quality issues. The domain of long videos is inherently more complex than that of short videos: extended durations introduce greater variability and necessitate long-range temporal consistency, thereby increasing the overall difficulty of the task. We propose VideoMerge, a training-free method that can be seamlessly adapted to merge short videos generated by pretrained text-to-video diffusion model. Our approach preserves the model's original expressiveness and consistency while allowing for extended duration and dynamic variation as specified by the user. By leveraging the strengths of pretrained models, our method addresses challenges related to smoothness, consistency, and dynamic content through orthogonal strategies that operate collaboratively to achieve superior quality.

Title: Emotion Recognition with CLIP and Sequential Learning

Authors: Weiwei Zhou, Chenkun Ling, Zefeng Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09929
Pdf URL: https://arxiv.org/pdf/2503.09929
Copy Paste: [[2503.09929]] Emotion Recognition with CLIP and Sequential Learning(https://arxiv.org/abs/2503.09929)
Keywords: robust, transformer
Abstract: Human emotion recognition plays a crucial role in facilitating seamless interactions between humans and computers. In this paper, we present our innovative methodology for tackling the Valence-Arousal (VA) Estimation Challenge, the Expression Recognition Challenge, and the Action Unit (AU) Detection Challenge, all within the framework of the 8th Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW). Our approach introduces a novel framework aimed at enhancing continuous emotion recognition. This is achieved by fine-tuning the CLIP model with the aff-wild2 dataset, which provides annotated expression labels. The result is a fine-tuned model that serves as an efficient visual feature extractor, significantly improving its robustness. To further boost the performance of continuous emotion recognition, we incorporate Temporal Convolutional Network (TCN) modules alongside Transformer Encoder modules into our system architecture. The integration of these advanced components allows our model to outperform baseline performance, demonstrating its ability to recognize human emotions with greater accuracy and efficiency.

Title: PanoGen++: Domain-Adapted Text-Guided Panoramic Environment Generation for Vision-and-Language Navigation

Authors: Sen Wang, Dongliang Zhou, Liang Xie, Chao Xu, Ye Yan, Erwei Yin
Subjects: cs.CV, cs.MM, cs.RO
Abstract URL: https://arxiv.org/abs/2503.09938
Pdf URL: https://arxiv.org/pdf/2503.09938
Copy Paste: [[2503.09938]] PanoGen++: Domain-Adapted Text-Guided Panoramic Environment Generation for Vision-and-Language Navigation(https://arxiv.org/abs/2503.09938)
Keywords: diffusion
Abstract: Vision-and-language navigation (VLN) tasks require agents to navigate three-dimensional environments guided by natural language instructions, offering substantial potential for diverse applications. However, the scarcity of training data impedes progress in this field. This paper introduces PanoGen++, a novel framework that addresses this limitation by generating varied and pertinent panoramic environments for VLN tasks. PanoGen++ incorporates pre-trained diffusion models with domain-specific fine-tuning, employing parameter-efficient techniques such as low-rank adaptation to minimize computational costs. We investigate two settings for environment generation: masked image inpainting and recursive image outpainting. The former maximizes novel environment creation by inpainting masked regions based on textual descriptions, while the latter facilitates agents' learning of spatial relationships within panoramas. Empirical evaluations on room-to-room (R2R), room-for-room (R4R), and cooperative vision-and-dialog navigation (CVDN) datasets reveal significant performance enhancements: a 2.44% increase in success rate on the R2R test leaderboard, a 0.63% improvement on the R4R validation unseen set, and a 0.75-meter enhancement in goal progress on the CVDN validation unseen set. PanoGen++ augments the diversity and relevance of training environments, resulting in improved generalization and efficacy in VLN tasks.

Title: A Chaotic Image Encryption Scheme Using Novel Geometric Block Permutation and Dynamic Substitution

Authors: Muhammad Ali, Jawad Ahmad, Muhammad Abdullah Hussain Khan, Safee Ullah, Mujeeb Ur Rehman, Syed Aziz Shah, Muhammad Shahbaz Khan
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.09939
Pdf URL: https://arxiv.org/pdf/2503.09939
Copy Paste: [[2503.09939]] A Chaotic Image Encryption Scheme Using Novel Geometric Block Permutation and Dynamic Substitution(https://arxiv.org/abs/2503.09939)
Keywords: security, protect, robust, extraction, diffusion
Abstract: In this digital era, ensuring the security of digital data during transmission and storage is crucial. Digital data, particularly image data, needs to be protected against unauthorized access. To address this, this paper presents a novel image encryption scheme based on a confusion diffusion architecture. The diffusion module introduces a novel geometric block permutation technique, which effectively scrambles the pixels based on geometric shape extraction of pixels. The image is converted into four blocks, and pixels are extracted from these blocks using L-shape, U-shape, square-shape, and inverted U-shape patterns for each block, respectively. This robust extraction and permutation effectively disrupts the correlation within the image. Furthermore, the confusion module utilises bit-XOR and dynamic substitution techniques. For the bit-XOR operation, 2D Henon map has been utilised to generate a chaotic seed matrix, which is bit-XORed with the scrambled image. The resultant image then undergoes the dynamic substitution process to complete confusion phase. A statistical security analysis demonstrates the superior security of the proposed scheme, with being high uncertainty and unpredictability, achieving an entropy of 7.9974 and a correlation coefficient of 0.0014. These results validate the proposed scheme's effectiveness in securing digital images.

Title: TGP: Two-modal occupancy prediction with 3D Gaussian and sparse points for 3D Environment Awareness

Authors: Mu Chen, Wenyu Chen, Mingchuan Yang, Yuan Zhang, Tao Han, Xinchi Li, Yunlong Li, Huaici Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.09941
Pdf URL: https://arxiv.org/pdf/2503.09941
Copy Paste: [[2503.09941]] TGP: Two-modal occupancy prediction with 3D Gaussian and sparse points for 3D Environment Awareness(https://arxiv.org/abs/2503.09941)
Keywords: robust, transformer
Abstract: 3D semantic occupancy has rapidly become a research focus in the fields of robotics and autonomous driving environment perception due to its ability to provide more realistic geometric perception and its closer integration with downstream tasks. By performing occupancy prediction of the 3D space in the environment, the ability and robustness of scene understanding can be effectively improved. However, existing occupancy prediction tasks are primarily modeled using voxel or point cloud-based approaches: voxel-based network structures often suffer from the loss of spatial information due to the voxelization process, while point cloud-based methods, although better at retaining spatial location information, face limitations in representing volumetric structural details. To address this issue, we propose a dual-modal prediction method based on 3D Gaussian sets and sparse points, which balances both spatial location and volumetric structural information, achieving higher accuracy in semantic occupancy prediction. Specifically, our method adopts a Transformer-based architecture, taking 3D Gaussian sets, sparse points, and queries as inputs. Through the multi-layer structure of the Transformer, the enhanced queries and 3D Gaussian sets jointly contribute to the semantic occupancy prediction, and an adaptive fusion mechanism integrates the semantic outputs of both modalities to generate the final prediction results. Additionally, to further improve accuracy, we dynamically refine the point cloud at each layer, allowing for more precise location information during occupancy prediction. We conducted experiments on the Occ3DnuScenes dataset, and the experimental results demonstrate superior performance of the proposed method on IoU based metrics.

Title: Cosh-DiT: Co-Speech Gesture Video Synthesis via Hybrid Audio-Visual Diffusion Transformers

Authors: Yasheng Sun, Zhiliang Xu, Hang Zhou, Jiazhi Guan, Quanwei Yang, Kaisiyuan Wang, Borong Liang, Yingying Li, Haocheng Feng, Jingdong Wang, Ziwei Liu, Koike Hideki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09942
Pdf URL: https://arxiv.org/pdf/2503.09942
Copy Paste: [[2503.09942]] Cosh-DiT: Co-Speech Gesture Video Synthesis via Hybrid Audio-Visual Diffusion Transformers(https://arxiv.org/abs/2503.09942)
Keywords: diffusion, transformer
Abstract: Co-speech gesture video synthesis is a challenging task that requires both probabilistic modeling of human gestures and the synthesis of realistic images that align with the rhythmic nuances of speech. To address these challenges, we propose Cosh-DiT, a Co-speech gesture video system with hybrid Diffusion Transformers that perform audio-to-motion and motion-to-video synthesis using discrete and continuous diffusion modeling, respectively. First, we introduce an audio Diffusion Transformer (Cosh-DiT-A) to synthesize expressive gesture dynamics synchronized with speech rhythms. To capture upper body, facial, and hand movement priors, we employ vector-quantized variational autoencoders (VQ-VAEs) to jointly learn their dependencies within a discrete latent space. Then, for realistic video synthesis conditioned on the generated speech-driven motion, we design a visual Diffusion Transformer (Cosh-DiT-V) that effectively integrates spatial and temporal contexts. Extensive experiments demonstrate that our framework consistently generates lifelike videos with expressive facial expressions and natural, smooth gestures that align seamlessly with speech.

Title: Identifying Trustworthiness Challenges in Deep Learning Models for Continental-Scale Water Quality Prediction

Authors: Xiaobo Xia, Xiaofeng Liu, Jiale Liu, Kuai Fang, Lu Lu, Samet Oymak, William S. Currie, Tongliang Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.09947
Pdf URL: https://arxiv.org/pdf/2503.09947
Copy Paste: [[2503.09947]] Identifying Trustworthiness Challenges in Deep Learning Models for Continental-Scale Water Quality Prediction(https://arxiv.org/abs/2503.09947)
Keywords: robust, fair, interpretability
Abstract: Water quality is foundational to environmental sustainability, ecosystem resilience, and public health. Deep learning models, particularly Long Short-Term Memory (LSTM) networks, offer transformative potential for large-scale water quality prediction and scientific insights generation. However, their widespread adoption in high-stakes decision-making, such as pollution mitigation and equitable resource allocation, is prevented by unresolved trustworthiness challenges including fairness, uncertainty, interpretability, robustness, generalizability, and reproducibility. In this work, we present the first comprehensive evaluation of trustworthiness in a continental-scale multi-task LSTM model predicting 20 water quality variables (encompassing physical/chemical processes, geochemical weathering, and nutrient cycling) across 482 U.S. basins. Our investigation uncovers systematic patterns of model performance disparities linked to basin characteristics, the inherent complexity of biogeochemical processes, and variable predictability, emphasizing critical performance fairness concerns. We further propose methodological frameworks for quantitatively evaluating critical aspects of trustworthiness, including uncertainty, interpretability, and robustness, identifying key limitations that could challenge reliable real-world deployment. This work serves as a timely call to action for advancing trustworthy data-driven methods for water resources management and provides a pathway to offering critical insights for researchers, decision-makers, and practitioners seeking to leverage artificial intelligence (AI) responsibly in environmental management.

Title: UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?

Authors: Yuanxin Liu, Rui Zhu, Shuhuai Ren, Jiacong Wang, Haoyuan Guo, Xu Sun, Lu Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09949
Pdf URL: https://arxiv.org/pdf/2503.09949
Copy Paste: [[2503.09949]] UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?(https://arxiv.org/abs/2503.09949)
Keywords: generative, large language model
Abstract: With the rapid growth of video generative models (VGMs), it is essential to develop reliable and comprehensive automatic metrics for AI-generated videos (AIGVs). Existing methods either use off-the-shelf models optimized for other tasks or rely on human assessment data to train specialized evaluators. These approaches are constrained to specific evaluation aspects and are difficult to scale with the increasing demands for finer-grained and more comprehensive evaluations. To address this issue, this work investigates the feasibility of using multimodal large language models (MLLMs) as a unified evaluator for AIGVs, leveraging their strong visual perception and language understanding capabilities. To evaluate the performance of automatic metrics in unified AIGV evaluation, we introduce a benchmark called UVE-Bench. UVE-Bench collects videos generated by state-of-the-art VGMs and provides pairwise human preference annotations across 15 evaluation aspects. Using UVE-Bench, we extensively evaluate 16 MLLMs. Our empirical results suggest that while advanced MLLMs (e.g., Qwen2VL-72B and InternVL2.5-78B) still lag behind human evaluators, they demonstrate promising ability in unified AIGV evaluation, significantly surpassing existing specialized evaluation methods. Additionally, we conduct an in-depth analysis of key design choices that impact the performance of MLLM-driven evaluators, offering valuable insights for future research on AIGV evaluation. The code is available at this https URL.

Title: Target-aware Bidirectional Fusion Transformer for Aerial Object Tracking

Authors: Xinglong Sun, Haijiang Sun, Shan Jiang, Jiacheng Wang, Jiasong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09951
Pdf URL: https://arxiv.org/pdf/2503.09951
Copy Paste: [[2503.09951]] Target-aware Bidirectional Fusion Transformer for Aerial Object Tracking(https://arxiv.org/abs/2503.09951)
Keywords: robust, transformer
Abstract: The trackers based on lightweight neural networks have achieved great success in the field of aerial remote sensing, most of which aggregate multi-stage deep features to lift the tracking quality. However, existing algorithms usually only generate single-stage fusion features for state decision, which ignore that diverse kinds of features are required for identifying and locating the object, limiting the robustness and precision of tracking. In this paper, we propose a novel target-aware Bidirectional Fusion transformer (BFTrans) for UAV tracking. Specifically, we first present a two-stream fusion network based on linear self and cross attentions, which can combine the shallow and the deep features from both forward and backward directions, providing the adjusted local details for location and global semantics for recognition. Besides, a target-aware positional encoding strategy is designed for the above fusion model, which is helpful to perceive the object-related attributes during the fusion phase. Finally, the proposed method is evaluated on several popular UAV benchmarks, including UAV-123, UAV20L and UAVTrack112. Massive experimental results demonstrate that our approach can exceed other state-of-the-art trackers and run with an average speed of 30.5 FPS on embedded platform, which is appropriate for practical drone deployments.

Title: X-Cross: Image Encryption Featuring Novel Dual-Layer Block Permutation and Dynamic Substitution Techniques

Authors: Hansa Ahsan, Safee Ullah, Jawad Ahmad, Aizaz Ahmad Khattak, Muhammad Ali, Muhammad Shahbaz Khan
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.09953
Pdf URL: https://arxiv.org/pdf/2503.09953
Copy Paste: [[2503.09953]] X-Cross: Image Encryption Featuring Novel Dual-Layer Block Permutation and Dynamic Substitution Techniques(https://arxiv.org/abs/2503.09953)
Keywords: security, attack, extraction, diffusion
Abstract: In this digital age, ensuring the security of digital data, especially the image data is critically important. Image encryption plays an important role in securing the online transmission/storage of images from unauthorized access. In this regard, this paper presents a novel diffusion-confusion-based image encryption algorithm named as X-CROSS. The diffusion phase involves a dual-layer block permutation. It involves a bit-level permutation termed Inter-Bit Transference (IBT) using a Bit-Extraction key, and pixel permutation with a unique X-crosspermutation algorithm to effectively scramble the pixels within an image. The proposed algorithm utilizes a resilient 2D chaotic map with non-linear dynamical behavior, assisting in generating complex Extraction Keys. After the permutation phase, the confusion phase proceeds with a dynamic substitution technique on the permuted images, establishing the final encryption layer. This combination of novel permutation and confusion results in the removal of the image's inherent patterns and increases its resistance to cyber-attacks. The close to ideal statistical security results for information entropy, correlation, homogeneity, contrast, and energy validate the proposed scheme's effectiveness in hiding the information within the image.

Title: Exploring Mutual Empowerment Between Wireless Networks and RL-based LLMs: A Survey

Authors: Yu Qiao, Phuong-Nam Tran, Ji Su Yoon, Loc X. Nguyen, Choong Seon Hong
Subjects: cs.LG, cs.AI, cs.CV, cs.ET
Abstract URL: https://arxiv.org/abs/2503.09956
Pdf URL: https://arxiv.org/pdf/2503.09956
Copy Paste: [[2503.09956]] Exploring Mutual Empowerment Between Wireless Networks and RL-based LLMs: A Survey(https://arxiv.org/abs/2503.09956)
Keywords: large language model
Abstract: Reinforcement learning (RL)-based large language models (LLMs), such as ChatGPT, DeepSeek, and Grok-3, have gained significant attention for their exceptional capabilities in natural language processing and multimodal data understanding. Meanwhile, the rapid expansion of information services has driven the growing need for intelligence, efficient, and adaptable wireless networks. Wireless networks require the empowerment of RL-based LLMs while these models also benefit from wireless networks to broaden their application scenarios. Specifically, RL-based LLMs can enhance wireless communication systems through intelligent resource allocation, adaptive network optimization, and real-time decision-making. Conversely, wireless networks provide a vital infrastructure for the efficient training, deployment, and distributed inference of RL-based LLMs, especially in decentralized and edge computing environments. This mutual empowerment highlights the need for a deeper exploration of the interplay between these two domains. We first review recent advancements in wireless communications, highlighting the associated challenges and potential solutions. We then discuss the progress of RL-based LLMs, focusing on key technologies for LLM training, challenges, and potential solutions. Subsequently, we explore the mutual empowerment between these two fields, highlighting key motivations, open challenges, and potential solutions. Finally, we provide insights into future directions, applications, and their societal impact to further explore this intersection, paving the way for next-generation intelligent communication systems. Overall, this survey provides a comprehensive overview of the relationship between RL-based LLMs and wireless networks, offering a vision where these domains empower each other to drive innovations.

Title: Take Off the Training Wheels Progressive In-Context Learning for Effective Alignment

Authors: Zhenyu Liu, Dongfang Li, Xinshuo Hu, Xinping Zhao, Yibin Chen, Baotian Hu, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.09958
Pdf URL: https://arxiv.org/pdf/2503.09958
Copy Paste: [[2503.09958]] Take Off the Training Wheels Progressive In-Context Learning for Effective Alignment(https://arxiv.org/abs/2503.09958)
Keywords: transformer
Abstract: Recent studies have explored the working mechanisms of In-Context Learning (ICL). However, they mainly focus on classification and simple generation tasks, limiting their broader application to more complex generation tasks in practice. To address this gap, we investigate the impact of demonstrations on token representations within the practical alignment tasks. We find that the transformer embeds the task function learned from demonstrations into the separator token representation, which plays an important role in the generation of prior response tokens. Once the prior response tokens are determined, the demonstrations become this http URL by this finding, we propose an efficient Progressive In-Context Alignment (PICA) method consisting of two stages. In the first few-shot stage, the model generates several prior response tokens via standard ICL while concurrently extracting the ICL vector that stores the task function from the separator token representation. In the following zero-shot stage, this ICL vector guides the model to generate responses without further this http URL experiments demonstrate that our PICA not only surpasses vanilla ICL but also achieves comparable performance to other alignment tuning methods. The proposed training-free method reduces the time cost (e.g., 5.45+) with improved alignment performance (e.g., 6.57+). Consequently, our work highlights the application of ICL for alignment and calls for a deeper understanding of ICL for complex generations. The code will be available at this https URL.

Title: Modeling Thousands of Human Annotators for Generalizable Text-to-Image Person Re-identification

Authors: Jiayu Jiang, Changxing Ding, Wentao Tan, Junhong Wang, Jin Tao, Xiangmin Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09962
Pdf URL: https://arxiv.org/pdf/2503.09962
Copy Paste: [[2503.09962]] Modeling Thousands of Human Annotators for Generalizable Text-to-Image Person Re-identification(https://arxiv.org/abs/2503.09962)
Keywords: large language model
Abstract: Text-to-image person re-identification (ReID) aims to retrieve the images of an interested person based on textual descriptions. One main challenge for this task is the high cost in manually annotating large-scale databases, which affects the generalization ability of ReID models. Recent works handle this problem by leveraging Multi-modal Large Language Models (MLLMs) to describe pedestrian images automatically. However, the captions produced by MLLMs lack diversity in description styles. To address this issue, we propose a Human Annotator Modeling (HAM) approach to enable MLLMs to mimic the description styles of thousands of human annotators. Specifically, we first extract style features from human textual descriptions and perform clustering on them. This allows us to group textual descriptions with similar styles into the same cluster. Then, we employ a prompt to represent each of these clusters and apply prompt learning to mimic the description styles of different human annotators. Furthermore, we define a style feature space and perform uniform sampling in this space to obtain more diverse clustering prototypes, which further enriches the diversity of the MLLM-generated captions. Finally, we adopt HAM to automatically annotate a massive-scale database for text-to-image ReID. Extensive experiments on this database demonstrate that it significantly improves the generalization ability of ReID models.

Title: ExtremeAIGC: Benchmarking LMM Vulnerability to AI-Generated Extremist Content

Authors: Bhavik Chandna, Mariam Aboujenane, Usman Naseem
Subjects: cs.CR, cs.CL
Abstract URL: https://arxiv.org/abs/2503.09964
Pdf URL: https://arxiv.org/pdf/2503.09964
Copy Paste: [[2503.09964]] ExtremeAIGC: Benchmarking LMM Vulnerability to AI-Generated Extremist Content(https://arxiv.org/abs/2503.09964)
Keywords: defense, attack, robust
Abstract: Large Multimodal Models (LMMs) are increasingly vulnerable to AI-generated extremist content, including photorealistic images and text, which can be used to bypass safety mechanisms and generate harmful outputs. However, existing datasets for evaluating LMM robustness offer limited exploration of extremist content, often lacking AI-generated images, diverse image generation models, and comprehensive coverage of historical events, which hinders a complete assessment of model vulnerabilities. To fill this gap, we introduce ExtremeAIGC, a benchmark dataset and evaluation framework designed to assess LMM vulnerabilities against such content. ExtremeAIGC simulates real-world events and malicious use cases by curating diverse text- and image-based examples crafted using state-of-the-art image generation techniques. Our study reveals alarming weaknesses in LMMs, demonstrating that even cutting-edge safety measures fail to prevent the generation of extremist material. We systematically quantify the success rates of various attack strategies, exposing critical gaps in current defenses and emphasizing the need for more robust mitigation strategies.

Title: Detecting Dataset Bias in Medical AI: A Generalized and Modality-Agnostic Auditing Framework

Authors: Nathan Drenkow, Mitchell Pavlak, Keith Harrigian, Ayah Zirikly, Adarsh Subbaswamy, Mathias Unberath
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.09969
Pdf URL: https://arxiv.org/pdf/2503.09969
Copy Paste: [[2503.09969]] Detecting Dataset Bias in Medical AI: A Generalized and Modality-Agnostic Auditing Framework(https://arxiv.org/abs/2503.09969)
Keywords: protect
Abstract: Data-driven AI is establishing itself at the center of evidence-based medicine. However, reports of shortcomings and unexpected behavior are growing due to AI's reliance on association-based learning. A major reason for this behavior: latent bias in machine learning datasets can be amplified during training and/or hidden during testing. We present a data modality-agnostic auditing framework for generating targeted hypotheses about sources of bias which we refer to as Generalized Attribute Utility and Detectability-Induced bias Testing (G-AUDIT) for datasets. Our method examines the relationship between task-level annotations and data properties including protected attributes (e.g., race, age, sex) and environment and acquisition characteristics (e.g., clinical site, imaging protocols). G-AUDIT automatically quantifies the extent to which the observed data attributes may enable shortcut learning, or in the case of testing data, hide predictions made based on spurious associations. We demonstrate the broad applicability and value of our method by analyzing large-scale medical datasets for three distinct modalities and learning tasks: skin lesion classification in images, stigmatizing language classification in Electronic Health Records (EHR), and mortality prediction for ICU tabular data. In each setting, G-AUDIT successfully identifies subtle biases commonly overlooked by traditional qualitative methods that focus primarily on social and ethical objectives, underscoring its practical value in exposing dataset-level risks and supporting the downstream development of reliable AI systems. Our method paves the way for achieving deeper understanding of machine learning datasets throughout the AI development life-cycle from initial prototyping all the way to regulation, and creates opportunities to reduce model bias, enabling safer and more trustworthy AI systems.

Title: Uncertainty-aware Long-tailed Weights Model the Utility of Pseudo-labels for Semi-supervised Learning

Authors: Jiaqi Wu, Junbiao Pang, Qingming Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.09974
Pdf URL: https://arxiv.org/pdf/2503.09974
Copy Paste: [[2503.09974]] Uncertainty-aware Long-tailed Weights Model the Utility of Pseudo-labels for Semi-supervised Learning(https://arxiv.org/abs/2503.09974)
Keywords: robust
Abstract: Current Semi-supervised Learning (SSL) adopts the pseudo-labeling strategy and further filters pseudo-labels based on confidence thresholds. However, this mechanism has notable drawbacks: 1) setting the reasonable threshold is an open problem which significantly influences the selection of the high-quality pseudo-labels; and 2) deep models often exhibit the over-confidence phenomenon which makes the confidence value an unreliable indicator for assessing the quality of pseudo-labels due to the scarcity of labeled data. In this paper, we propose an Uncertainty-aware Ensemble Structure (UES) to assess the utility of pseudo-labels for unlabeled samples. We further model the utility of pseudo-labels as long-tailed weights to avoid the open problem of setting the threshold. Concretely, the advantage of the long-tailed weights ensures that even unreliable pseudo-labels still contribute to enhancing the model's robustness. Besides, UES is lightweight and architecture-agnostic, easily extending to various computer vision tasks, including classification and regression. Experimental results demonstrate that combining the proposed method with DualPose leads to a 3.47% improvement in Percentage of Correct Keypoints (PCK) on the Sniffing dataset with 100 data points (30 labeled), a 7.29\% improvement in PCK on the FLIC dataset with 100 data points (50 labeled), and a 3.91% improvement in PCK on the LSP dataset with 200 data points (100 labeled). Furthermore, when combined with FixMatch, the proposed method achieves a 0.2% accuracy improvement on the CIFAR-10 dataset with 40 labeled data points and a 0.26% accuracy improvement on the CIFAR-100 dataset with 400 labeled data points.

Title: From Equations to Insights: Unraveling Symbolic Structures in PDEs with LLMs

Authors: Rohan Bhatnagar, Ling Liang, Krish Patel, Haizhao Yang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.09986
Pdf URL: https://arxiv.org/pdf/2503.09986
Copy Paste: [[2503.09986]] From Equations to Insights: Unraveling Symbolic Structures in PDEs with LLMs(https://arxiv.org/abs/2503.09986)
Keywords: large language model
Abstract: Motivated by the remarkable success of artificial intelligence (AI) across diverse fields, the application of AI to solve scientific problems-often formulated as partial differential equations (PDEs)-has garnered increasing attention. While most existing research concentrates on theoretical properties (such as well-posedness, regularity, and continuity) of the solutions, alongside direct AI-driven methods for solving PDEs, the challenge of uncovering symbolic relationships within these equations remains largely unexplored. In this paper, we propose leveraging large language models (LLMs) to learn such symbolic relationships. Our results demonstrate that LLMs can effectively predict the operators involved in PDE solutions by utilizing the symbolic information in the PDEs. Furthermore, we show that discovering these symbolic relationships can substantially improve both the efficiency and accuracy of the finite expression method for finding analytical approximation of PDE solutions, delivering a fully interpretable solution pipeline. This work opens new avenues for understanding the symbolic structure of scientific problems and advancing their solution processes.

Title: Channel-wise Noise Scheduled Diffusion for Inverse Rendering in Indoor Scenes

Authors: JunYong Choi, Min-Cheol Sagong, SeokYeong Lee, Seung-Won Jung, Ig-Jae Kim, Junghyun Cho
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09993
Pdf URL: https://arxiv.org/pdf/2503.09993
Copy Paste: [[2503.09993]] Channel-wise Noise Scheduled Diffusion for Inverse Rendering in Indoor Scenes(https://arxiv.org/abs/2503.09993)
Keywords: diffusion, generative
Abstract: We propose a diffusion-based inverse rendering framework that decomposes a single RGB image into geometry, material, and lighting. Inverse rendering is inherently ill-posed, making it difficult to predict a single accurate solution. To address this challenge, recent generative model-based methods aim to present a range of possible solutions. However, finding a single accurate solution and generating diverse solutions can be conflicting. In this paper, we propose a channel-wise noise scheduling approach that allows a single diffusion model architecture to achieve two conflicting objectives. The resulting two diffusion models, trained with different channel-wise noise schedules, can predict a single highly accurate solution and present multiple possible solutions. The experimental results demonstrate the superiority of our two models in terms of both diversity and accuracy, which translates to enhanced performance in downstream applications such as object insertion and material editing.

Title: TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs

Authors: Yunxiao Wang, Meng Liu, Rui Shao, Haoyu Zhang, Bin Wen, Fan Yang, Tingting Gao, Di Zhang, Liqiang Nie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09994
Pdf URL: https://arxiv.org/pdf/2503.09994
Copy Paste: [[2503.09994]] TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs(https://arxiv.org/abs/2503.09994)
Keywords: large language model
Abstract: Video large language models have achieved remarkable performance in tasks such as video question answering, however, their temporal understanding remains suboptimal. To address this limitation, we curate a dedicated instruction fine-tuning dataset that focuses on enhancing temporal comprehension across five key dimensions. In order to reduce reliance on costly temporal annotations, we introduce a multi-task prompt fine-tuning approach that seamlessly integrates temporal-sensitive tasks into existing instruction datasets without requiring additional annotations. Furthermore, we develop a novel benchmark for temporal-sensitive video understanding that not only fills the gaps in dimension coverage left by existing benchmarks but also rigorously filters out potential shortcuts, ensuring a more accurate evaluation. Extensive experimental results demonstrate that our approach significantly enhances the temporal understanding of video-LLMs while avoiding reliance on shortcuts.

Title: MetricGrids: Arbitrary Nonlinear Approximation with Elementary Metric Grids based Implicit Neural Representation

Authors: Shu Wang, Yanbo Gao, Shuai Li, Chong Lv, Xun Cai, Chuankun Li, Hui Yuan, Jinglin Zhang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.10000
Pdf URL: https://arxiv.org/pdf/2503.10000
Copy Paste: [[2503.10000]] MetricGrids: Arbitrary Nonlinear Approximation with Elementary Metric Grids based Implicit Neural Representation(https://arxiv.org/abs/2503.10000)
Keywords: robust
Abstract: This paper presents MetricGrids, a novel grid-based neural representation that combines elementary metric grids in various metric spaces to approximate complex nonlinear signals. While grid-based representations are widely adopted for their efficiency and scalability, the existing feature grids with linear indexing for continuous-space points can only provide degenerate linear latent space representations, and such representations cannot be adequately compensated to represent complex nonlinear signals by the following compact decoder. To address this problem while keeping the simplicity of a regular grid structure, our approach builds upon the standard grid-based paradigm by constructing multiple elementary metric grids as high-order terms to approximate complex nonlinearities, following the Taylor expansion principle. Furthermore, we enhance model compactness with hash encoding based on different sparsities of the grids to prevent detrimental hash collisions, and a high-order extrapolation decoder to reduce explicit grid storage requirements. experimental results on both 2D and 3D reconstructions demonstrate the superior fitting and rendering accuracy of the proposed method across diverse signal types, validating its robustness and generalizability. Code is available at this https URL}{this https URL.

Title: One-Shot Federated Unsupervised Domain Adaptation with Scaled Entropy Attention and Multi-Source Smoothed Pseudo Labeling

Authors: Ali Abedi, Q. M. Jonathan Wu, Ning Zhang, Farhad Pourpanah
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10020
Pdf URL: https://arxiv.org/pdf/2503.10020
Copy Paste: [[2503.10020]] One-Shot Federated Unsupervised Domain Adaptation with Scaled Entropy Attention and Multi-Source Smoothed Pseudo Labeling(https://arxiv.org/abs/2503.10020)
Keywords: privacy, federate
Abstract: Federated Learning (FL) is a promising approach for privacy-preserving collaborative learning. However, it faces significant challenges when dealing with domain shifts, especially when each client has access only to its source data and cannot share it during target domain adaptation. Moreover, FL methods often require high communication overhead due to multiple rounds of model updates between clients and the server. We propose a one-shot Federated Unsupervised Domain Adaptation (FUDA) method to address these limitations. Specifically, we introduce Scaled Entropy Attention (SEA) for model aggregation and Multi-Source Pseudo Labeling (MSPL) for target domain adaptation. SEA uses scaled prediction entropy on target domain to assign higher attention to reliable models. This improves the global model quality and ensures balanced weighting of contributions. MSPL distills knowledge from multiple source models to generate pseudo labels and manage noisy labels using smoothed soft-label cross-entropy (SSCE). Our approach outperforms state-of-the-art methods across four standard benchmarks while reducing communication and computation costs, making it highly suitable for real-world applications. The implementation code will be made publicly available upon publication.

Title: Using Context to Improve Word Segmentation

Authors: Stephanie Hu, Xiaolu Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.10023
Pdf URL: https://arxiv.org/pdf/2503.10023
Copy Paste: [[2503.10023]] Using Context to Improve Word Segmentation(https://arxiv.org/abs/2503.10023)
Keywords: segmentation
Abstract: An important step in understanding how children acquire languages is studying how infants learn word segmentation. It has been established in previous research that infants may use statistical regularities in speech to learn word segmentation. The research of Goldwater et al., demonstrated that incorporating context in models improves their ability to learn word segmentation. We implemented two of their models, a unigram and bigram model, to examine how context can improve statistical word segmentation. The results are consistent with our hypothesis that the bigram model outperforms the unigram model at predicting word segmentation. Extending the work of Goldwater et al., we also explored basic ways to model how young children might use previously learned words to segment new utterances.

Title: Investigating and Improving Counter-Stereotypical Action Relation in Text-to-Image Diffusion Models

Authors: Sina Malakouti, Adriana Kovashka
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10037
Pdf URL: https://arxiv.org/pdf/2503.10037
Copy Paste: [[2503.10037]] Investigating and Improving Counter-Stereotypical Action Relation in Text-to-Image Diffusion Models(https://arxiv.org/abs/2503.10037)
Keywords: diffusion
Abstract: Text-to-image diffusion models consistently fail at generating counter-stereotypical action relationships (e.g., "mouse chasing cat"), defaulting to frequent stereotypes even when explicitly prompted otherwise. Through systematic investigation, we discover this limitation stems from distributional biases rather than inherent model constraints. Our key insight reveals that while models fail on rare compositions when their inversions are common, they can successfully generate similar intermediate compositions (e.g., "mouse chasing boy"). To test this hypothesis, we develop a Role-Bridging Decomposition framework that leverages these intermediates to gradually teach rare relationships without architectural modifications. We introduce ActionBench, a comprehensive benchmark specifically designed to evaluate action-based relationship generation across stereotypical and counter-stereotypical configurations. Our experiments validate that intermediate compositions indeed facilitate counter-stereotypical generation, with both automatic metrics and human evaluations showing significant improvements over existing approaches. This work not only identifies fundamental biases in current text-to-image systems but demonstrates a promising direction for addressing them through compositional reasoning.

Title: How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game

Authors: Ziyue Wang, Yurui Dong, Fuwen Luo, Minyuan Ruan, Zhili Cheng, Chi Chen, Peng Li, Yang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10042
Pdf URL: https://arxiv.org/pdf/2503.10042
Copy Paste: [[2503.10042]] How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game(https://arxiv.org/abs/2503.10042)
Keywords: large language model
Abstract: The rapid advancing of Multimodal Large Language Models (MLLMs) has spurred interest in complex multimodal reasoning tasks in the real-world and virtual environment, which require coordinating multiple abilities, including visual perception, visual reasoning, spatial awareness, and target deduction. However, existing evaluations primarily assess the final task completion, often degrading assessments to isolated abilities such as visual grounding and visual question answering. Less attention is given to comprehensively and quantitatively analyzing reasoning process in multimodal environments, which is crucial for understanding model behaviors and underlying reasoning mechanisms beyond merely task success. To address this, we introduce MM-Escape, an extensible benchmark for investigating multimodal reasoning, inspired by real-world escape games. MM-Escape emphasizes intermediate model behaviors alongside final task completion. To achieve this, we develop EscapeCraft, a customizable and open environment that enables models to engage in free-form exploration for assessing multimodal reasoning. Extensive experiments show that MLLMs, regardless of scale, can successfully complete the simplest room escape tasks, with some exhibiting human-like exploration strategies. Yet, performance dramatically drops as task difficulty increases. Moreover, we observe that performance bottlenecks vary across models, revealing distinct failure modes and limitations in their multimodal reasoning abilities, such as repetitive trajectories without adaptive exploration, getting stuck in corners due to poor visual spatial awareness, and ineffective use of acquired props, such as the key. We hope our work sheds light on new challenges in multimodal reasoning, and uncovers potential improvements in MLLMs capabilities.

Title: FourierSR: A Fourier Token-based Plugin for Efficient Image Super-Resolution

Authors: Wenjie Li, Heng Guo, Yuefeng Hou, Zhanyu Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10043
Pdf URL: https://arxiv.org/pdf/2503.10043
Copy Paste: [[2503.10043]] FourierSR: A Fourier Token-based Plugin for Efficient Image Super-Resolution(https://arxiv.org/abs/2503.10043)
Keywords: transformer
Abstract: Image super-resolution (SR) aims to recover low-resolution images to high-resolution images, where improving SR efficiency is a high-profile challenge. However, commonly used units in SR, like convolutions and window-based Transformers, have limited receptive fields, making it challenging to apply them to improve SR under extremely limited computational cost. To address this issue, inspired by modeling convolution theorem through token mix, we propose a Fourier token-based plugin called FourierSR to improve SR uniformly, which avoids the instability or inefficiency of existing token mix technologies when applied as plug-ins. Furthermore, compared to convolutions and windows-based Transformers, our FourierSR only utilizes Fourier transform and multiplication operations, greatly reducing complexity while having global receptive fields. Experimental results show that our FourierSR as a plug-and-play unit brings an average PSNR gain of 0.34dB for existing efficient SR methods on Manga109 test set at the scale of x4, while the average increase in the number of Params and FLOPs is only 0.6% and 1.5% of original sizes. We will release our codes upon acceptance.

Title: Enhancing Multi-Agent Systems via Reinforcement Learning with LLM-based Planner and Graph-based Policy

Authors: Ziqi Jia, Junjie Li, Xiaoyang Qu, Jianzong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10049
Pdf URL: https://arxiv.org/pdf/2503.10049
Copy Paste: [[2503.10049]] Enhancing Multi-Agent Systems via Reinforcement Learning with LLM-based Planner and Graph-based Policy(https://arxiv.org/abs/2503.10049)
Keywords: large language model
Abstract: Multi-agent systems (MAS) have shown great potential in executing complex tasks, but coordination and safety remain significant challenges. Multi-Agent Reinforcement Learning (MARL) offers a promising framework for agent collaboration, but it faces difficulties in handling complex tasks and designing reward functions. The introduction of Large Language Models (LLMs) has brought stronger reasoning and cognitive abilities to MAS, but existing LLM-based systems struggle to respond quickly and accurately in dynamic environments. To address these challenges, we propose LLM-based Graph Collaboration MARL (LGC-MARL), a framework that efficiently combines LLMs and MARL. This framework decomposes complex tasks into executable subtasks and achieves efficient collaboration among multiple agents through graph-based coordination. Specifically, LGC-MARL consists of two main components: an LLM planner and a graph-based collaboration meta policy. The LLM planner transforms complex task instructions into a series of executable subtasks, evaluates the rationality of these subtasks using a critic model, and generates an action dependency graph. The graph-based collaboration meta policy facilitates communication and collaboration among agents based on the action dependency graph, and adapts to new task environments through meta-learning. Experimental results on the AI2-THOR simulation platform demonstrate the superior performance and scalability of LGC-MARL in completing various complex tasks.

Title: Deep Learning Approaches for Anti-Money Laundering on Mobile Transactions: Review, Framework, and Directions

Authors: Jiani Fan, Lwin Khin Shar, Ruichen Zhang, Ziyao Liu, Wenzhuo Yang, Dusit Niyato, Bomin Mao, Kwok-Yan Lam
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2503.10058
Pdf URL: https://arxiv.org/pdf/2503.10058
Copy Paste: [[2503.10058]] Deep Learning Approaches for Anti-Money Laundering on Mobile Transactions: Review, Framework, and Directions(https://arxiv.org/abs/2503.10058)
Keywords: privacy
Abstract: Money laundering is a financial crime that obscures the origin of illicit funds, necessitating the development and enforcement of anti-money laundering (AML) policies by governments and organizations. The proliferation of mobile payment platforms and smart IoT devices has significantly complicated AML investigations. As payment networks become more interconnected, there is an increasing need for efficient real-time detection to process large volumes of transaction data on heterogeneous payment systems by different operators such as digital currencies, cryptocurrencies and account-based payments. Most of these mobile payment networks are supported by connected devices, many of which are considered loT devices in the FinTech space that constantly generate data. Furthermore, the growing complexity and unpredictability of transaction patterns across these networks contribute to a higher incidence of false positives. While machine learning solutions have the potential to enhance detection efficiency, their application in AML faces unique challenges, such as addressing privacy concerns tied to sensitive financial data and managing the real-world constraint of limited data availability due to data regulations. Existing surveys in the AML literature broadly review machine learning approaches for money laundering detection, but they often lack an in-depth exploration of advanced deep learning techniques - an emerging field with significant potential. To address this gap, this paper conducts a comprehensive review of deep learning solutions and the challenges associated with their use in AML. Additionally, we propose a novel framework that applies the least-privilege principle by integrating machine learning techniques, codifying AML red flags, and employing account profiling to provide context for predictions and enable effective fraud detection under limited data availability....

Title: Provably Secure Covert Messaging Using Image-based Diffusion Processes

Authors: Luke A. Bauer, Wenxuan Bao, Vincent Bindschaedler
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.10063
Pdf URL: https://arxiv.org/pdf/2503.10063
Copy Paste: [[2503.10063]] Provably Secure Covert Messaging Using Image-based Diffusion Processes(https://arxiv.org/abs/2503.10063)
Keywords: secure, robust, diffusion
Abstract: We consider the problem of securely and robustly embedding covert messages into an image-based diffusion model's output. The sender and receiver want to exchange the maximum amount of information possible per diffusion sampled image while remaining undetected. The adversary wants to detect that such communication is taking place by identifying those diffusion samples that contain covert messages. To maximize robustness to transformations of the diffusion sample, a strategy is for the sender and the receiver to embed the message in the initial latents. We first show that prior work that attempted this is easily broken because their embedding technique alters the latents' distribution. We then propose a straightforward method to embed covert messages in the initial latent {\em without} altering the distribution. We prove that our construction achieves indistinguishability to any probabilistic polynomial time adversary. Finally, we discuss and analyze empirically the tradeoffs between embedding capacity, message recovery rates, and robustness. We find that optimizing the inversion method for error correction is crucial for reliability.

Title: Demoting Security via Exploitation of Cache Demote Operation in Intel's Latest ISA Extension

Authors: Taehun Kim, Hyerean Jang, Youngjoo Shin
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.10074
Pdf URL: https://arxiv.org/pdf/2503.10074
Copy Paste: [[2503.10074]] Demoting Security via Exploitation of Cache Demote Operation in Intel's Latest ISA Extension(https://arxiv.org/abs/2503.10074)
Keywords: security, attack
Abstract: ISA extensions are increasingly adopted to boost the performance of specialized workloads without requiring an entire architectural redesign. However, these enhancements can inadvertently expose new attack surfaces in the microarchitecture. In this paper, we investigate Intel's recently introduced cldemote extension, which promotes efficient data sharing by transferring cache lines from upper-level caches to the Last Level Cache (LLC). Despite its performance benefits, we uncover critical properties-unprivileged access, inter-cache state transition, and fault suppression-that render cldemote exploitable for microarchitectural attacks. We propose two new attack primitives, Flush+Demote and Demote+Time, built on our analysis. Flush+Demote constructs a covert channel with a bandwidth of 2.84 Mbps and a bit error rate of 0.018%, while Demote+Time derandomizes the kernel base address in 2.49 ms on Linux. Furthermore, we show that leveraging cldemote accelerates eviction set construction in non-inclusive LLC designs by obviating the need for helper threads or extensive cache conflicts, thereby reducing construction time by 36% yet retaining comparable success rates. Finally, we examine how ISA extensions contribute to broader microarchitectural attacks, identifying five key exploitable characteristics and categorizing four distinct attack types. We also discuss potential countermeasures, highlighting the far-reaching security implications of emerging ISA extensions.

Title: Image Quality Assessment: From Human to Machine Preference

Authors: Chunyi Li, Yuan Tian, Xiaoyue Ling, Zicheng Zhang, Haodong Duan, Haoning Wu, Ziheng Jia, Xiaohong Liu, Xiongkuo Min, Guo Lu, Weisi Lin, Guangtao Zhai
Subjects: cs.CV, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2503.10078
Pdf URL: https://arxiv.org/pdf/2503.10078
Copy Paste: [[2503.10078]] Image Quality Assessment: From Human to Machine Preference(https://arxiv.org/abs/2503.10078)
Keywords: segmentation
Abstract: Image Quality Assessment (IQA) based on human subjective preferences has undergone extensive research in the past decades. However, with the development of communication protocols, the visual data consumption volume of machines has gradually surpassed that of humans. For machines, the preference depends on downstream tasks such as segmentation and detection, rather than visual appeal. Considering the huge gap between human and machine visual systems, this paper proposes the topic: Image Quality Assessment for Machine Vision for the first time. Specifically, we (1) defined the subjective preferences of machines, including downstream tasks, test models, and evaluation metrics; (2) established the Machine Preference Database (MPD), which contains 2.25M fine-grained annotations and 30k reference/distorted image pair instances; (3) verified the performance of mainstream IQA algorithms on MPD. Experiments show that current IQA metrics are human-centric and cannot accurately characterize machine preferences. We sincerely hope that MPD can promote the evolution of IQA from human to machine preferences. Project page is on: this https URL.

Title: Information Density Principle for MLLM Benchmarks

Authors: Chunyi Li, Xiaozhe Li, Zicheng Zhang, Yuan Tian, Ziheng Jia, Xiaohong Liu, Xiongkuo Min, Jia Wang, Haodong Duan, Kai Chen, Guangtao Zhai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.10079
Pdf URL: https://arxiv.org/pdf/2503.10079
Copy Paste: [[2503.10079]] Information Density Principle for MLLM Benchmarks(https://arxiv.org/abs/2503.10079)
Keywords: large language model
Abstract: With the emergence of Multimodal Large Language Models (MLLMs), hundreds of benchmarks have been developed to ensure the reliability of MLLMs in downstream tasks. However, the evaluation mechanism itself may not be reliable. For developers of MLLMs, questions remain about which benchmark to use and whether the test results meet their requirements. Therefore, we propose a critical principle of Information Density, which examines how much insight a benchmark can provide for the development of MLLMs. We characterize it from four key dimensions: (1) Fallacy, (2) Difficulty, (3) Redundancy, (4) Diversity. Through a comprehensive analysis of more than 10,000 samples, we measured the information density of 19 MLLM benchmarks. Experiments show that using the latest benchmarks in testing can provide more insight compared to previous ones, but there is still room for improvement in their information density. We hope this principle can promote the development and application of future MLLM benchmarks. Project page: this https URL

Title: AdvPaint: Protecting Images from Inpainting Manipulation via Adversarial Attention Disruption

Authors: Joonsung Jeon, Woo Jae Kim, Suhyeon Ha, Sooel Son, Sung-eui Yoon
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2503.10081
Pdf URL: https://arxiv.org/pdf/2503.10081
Copy Paste: [[2503.10081]] AdvPaint: Protecting Images from Inpainting Manipulation via Adversarial Attention Disruption(https://arxiv.org/abs/2503.10081)
Keywords: protect, robust, diffusion, generative
Abstract: The outstanding capability of diffusion models in generating high-quality images poses significant threats when misused by adversaries. In particular, we assume malicious adversaries exploiting diffusion models for inpainting tasks, such as replacing a specific region with a celebrity. While existing methods for protecting images from manipulation in diffusion-based generative models have primarily focused on image-to-image and text-to-image tasks, the challenge of preventing unauthorized inpainting has been rarely addressed, often resulting in suboptimal protection performance. To mitigate inpainting abuses, we propose ADVPAINT, a novel defensive framework that generates adversarial perturbations that effectively disrupt the adversary's inpainting tasks. ADVPAINT targets the self- and cross-attention blocks in a target diffusion inpainting model to distract semantic understanding and prompt interactions during image generation. ADVPAINT also employs a two-stage perturbation strategy, dividing the perturbation region based on an enlarged bounding box around the object, enhancing robustness across diverse masks of varying shapes and sizes. Our experimental results demonstrate that ADVPAINT's perturbations are highly effective in disrupting the adversary's inpainting tasks, outperforming existing methods; ADVPAINT attains over a 100-point increase in FID and substantial decreases in precision.

Title: Why Does Your CoT Prompt (Not) Work? Theoretical Analysis of Prompt Space Complexity, its Interaction with Answer Space During CoT Reasoning with LLMs: A Recurrent Perspective

Authors: Xiang Zhang, Juntai Cao, Jiaqi Wei, Chenyu You, Dujian Ding
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.10084
Pdf URL: https://arxiv.org/pdf/2503.10084
Copy Paste: [[2503.10084]] Why Does Your CoT Prompt (Not) Work? Theoretical Analysis of Prompt Space Complexity, its Interaction with Answer Space During CoT Reasoning with LLMs: A Recurrent Perspective(https://arxiv.org/abs/2503.10084)
Keywords: transformer, large language model
Abstract: Despite the remarkable successes of Large Language Models (LLMs), their fundamental Transformer architecture possesses inherent theoretical limitations that restrict their capability to handle reasoning tasks with increasing computational complexity. Chain-of-Thought (CoT) prompting has emerged as a practical solution, supported by several theoretical studies. However, current CoT-based methods (including ToT, GoT, etc.) generally adopt a "one-prompt-fits-all" strategy, using fixed templates (e.g., "think step by step") across diverse reasoning tasks. This method forces models to navigate an extremely complex prompt space to identify effective reasoning paths. The current prompt designing research are also heavily relying on trial-and-error rather than theoretically informed guidance. In this paper, we provide a rigorous theoretical analysis of the complexity and interplay between two crucial spaces: the prompt space (the space of potential prompt structures) and the answer space (the space of reasoning solutions generated by LLMs) in CoT reasoning. We demonstrate how reliance on a single universal prompt (e.g. think step by step) can negatively impact the theoretical computability of LLMs, illustrating that prompt complexity directly influences the structure and effectiveness of the navigation in answer space. Our analysis highlights that sometimes human supervision is critical for efficiently navigating the prompt space. We theoretically and empirically show that task-specific prompting significantly outperforms unsupervised prompt generation, emphasizing the necessity of thoughtful human guidance in CoT prompting.

Title: Enhanced Route Planning with Calibrated Uncertainty Set

Authors: Lingxuan Tang, Rui Luo, Zhixin Zhou, Nicolo Colombo
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.10088
Pdf URL: https://arxiv.org/pdf/2503.10088
Copy Paste: [[2503.10088]] Enhanced Route Planning with Calibrated Uncertainty Set(https://arxiv.org/abs/2503.10088)
Keywords: robust
Abstract: This paper investigates the application of probabilistic prediction methodologies in route planning within a road network context. Specifically, we introduce the Conformalized Quantile Regression for Graph Autoencoders (CQR-GAE), which leverages the conformal prediction technique to offer a coverage guarantee, thus improving the reliability and robustness of our predictions. By incorporating uncertainty sets derived from CQR-GAE, we substantially improve the decision-making process in route planning under a robust optimization framework. We demonstrate the effectiveness of our approach by applying the CQR-GAE model to a real-world traffic scenario. The results indicate that our model significantly outperforms baseline methods, offering a promising avenue for advancing intelligent transportation systems.

Title: Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model

Authors: Qiyuan Deng, Xuefeng Bai, Kehai Chen, Yaowei Wang, Liqiang Nie, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.10093
Pdf URL: https://arxiv.org/pdf/2503.10093
Copy Paste: [[2503.10093]] Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model(https://arxiv.org/abs/2503.10093)
Keywords: large language model
Abstract: Reinforcement Learning (RL) algorithms for safety alignment of Large Language Models (LLMs), such as Direct Preference Optimization (DPO), encounter the challenge of distribution shift. Current approaches typically address this issue through online sampling from the target policy, which requires significant computational resources. In this paper, we hypothesize that during off-policy training, while the ranking order of output generated by policy changes, their overall distribution remains relatively stable. This stability allows the transformation of the sampling process from the target policy into a re-ranking of preference data. Building on this hypothesis, We propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals, which are then used to calculate label confidence for preferences reordering. Extensive experimental results and theoretical analysis demonstrate that the proposed method effectively addresses the distribution shift issue, remarkably enhancing the safety performance while reducing about 300x computational overheads.

Title: Cognitive-Mental-LLM: Leveraging Reasoning in Large Language Models for Mental Health Prediction via Online Text

Authors: Avinash Patil, Amardeep Kour Gedhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10095
Pdf URL: https://arxiv.org/pdf/2503.10095
Copy Paste: [[2503.10095]] Cognitive-Mental-LLM: Leveraging Reasoning in Large Language Models for Mental Health Prediction via Online Text(https://arxiv.org/abs/2503.10095)
Keywords: robust, interpretability, transformer, large language model
Abstract: Large Language Models (LLMs) have demonstrated potential in predicting mental health outcomes from online text, yet traditional classification methods often lack interpretability and robustness. This study evaluates structured reasoning techniques-Chain-of-Thought (CoT), Self-Consistency (SC-CoT), and Tree-of-Thought (ToT)-to improve classification accuracy across multiple mental health datasets sourced from Reddit. We analyze reasoning-driven prompting strategies, including Zero-shot CoT and Few-shot CoT, using key performance metrics such as Balanced Accuracy, F1 score, and Sensitivity/Specificity. Our findings indicate that reasoning-enhanced techniques improve classification performance over direct prediction, particularly in complex cases. Compared to baselines such as Zero Shot non-CoT Prompting, and fine-tuned pre-trained transformers such as BERT and Mental-RoBerta, and fine-tuned Open Source LLMs such as Mental Alpaca and Mental-Flan-T5, reasoning-driven LLMs yield notable gains on datasets like Dreaddit (+0.52\% over M-LLM, +0.82\% over BERT) and SDCNL (+4.67\% over M-LLM, +2.17\% over BERT). However, performance declines in Depression Severity, and CSSRS predictions suggest dataset-specific limitations, likely due to our using a more extensive test set. Among prompting strategies, Few-shot CoT consistently outperforms others, reinforcing the effectiveness of reasoning-driven LLMs. Nonetheless, dataset variability highlights challenges in model reliability and interpretability. This study provides a comprehensive benchmark of reasoning-based LLM techniques for mental health text classification. It offers insights into their potential for scalable clinical applications while identifying key challenges for future improvements.

Title: Semantic Latent Motion for Portrait Video Generation

Authors: Qiyuan Zhang, Chenyu Wu, Wenzhang Sun, Huaize Liu, Donglin Di, Wei Chen, Changqing Zou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10096
Pdf URL: https://arxiv.org/pdf/2503.10096
Copy Paste: [[2503.10096]] Semantic Latent Motion for Portrait Video Generation(https://arxiv.org/abs/2503.10096)
Keywords: generative
Abstract: Recent advancements in portrait video generation have been noteworthy. However, existing methods rely heavily on human priors and pre-trained generation models, which may introduce unrealistic motion and lead to inefficient inference. To address these challenges, we propose Semantic Latent Motion (SeMo), a compact and expressive motion representation. Leveraging this representation, our approach achieve both high-quality visual results and efficient inference. SeMo follows an effective three-step framework: Abstraction, Reasoning, and Generation. First, in the Abstraction step, we use a carefully designed Mask Motion Encoder to compress the subject's motion state into a compact and abstract latent motion (1D token). Second, in the Reasoning step, long-term modeling and efficient reasoning are performed in this latent space to generate motion sequences. Finally, in the Generation step, the motion dynamics serve as conditional information to guide the generation model in synthesizing realistic transitions from reference frames to target frames. Thanks to the compact and descriptive nature of Semantic Latent Motion, our method enables real-time video generation with highly realistic motion. User studies demonstrate that our approach surpasses state-of-the-art models with an 81% win rate in realism. Extensive experiments further highlight its strong compression capability, reconstruction quality, and generative potential. Moreover, its fully self-supervised nature suggests promising applications in broader video generation tasks.

Title: SOLA-GCL: Subgraph-Oriented Learnable Augmentation Method for Graph Contrastive Learning

Authors: Tianhao Peng, Xuhong Li, Haitao Yuan, Yuchen Li, Haoyi Xiong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.10100
Pdf URL: https://arxiv.org/pdf/2503.10100
Copy Paste: [[2503.10100]] SOLA-GCL: Subgraph-Oriented Learnable Augmentation Method for Graph Contrastive Learning(https://arxiv.org/abs/2503.10100)
Keywords: robust
Abstract: Graph contrastive learning has emerged as a powerful technique for learning graph representations that are robust and discriminative. However, traditional approaches often neglect the critical role of subgraph structures, particularly the intra-subgraph characteristics and inter-subgraph relationships, which are crucial for generating informative and diverse contrastive pairs. These subgraph features are crucial as they vary significantly across different graph types, such as social networks where they represent communities, and biochemical networks where they symbolize molecular interactions. To address this issue, our work proposes a novel subgraph-oriented learnable augmentation method for graph contrastive learning, termed SOLA-GCL, that centers around subgraphs, taking full advantage of the subgraph information for data augmentation. Specifically, SOLA-GCL initially partitions a graph into multiple densely connected subgraphs based on their intrinsic properties. To preserve and enhance the unique characteristics inherent to subgraphs, a graph view generator optimizes augmentation strategies for each subgraph, thereby generating tailored views for graph contrastive learning. This generator uses a combination of intra-subgraph and inter-subgraph augmentation strategies, including node dropping, feature masking, intra-edge perturbation, inter-edge perturbation, and subgraph swapping. Extensive experiments have been conducted on various graph learning applications, ranging from social networks to molecules, under semi-supervised learning, unsupervised learning, and transfer learning settings to demonstrate the superiority of our proposed approach over the state-of-the-art in GCL.

Title: Improving Diffusion-based Inverse Algorithms under Few-Step Constraint via Learnable Linear Extrapolation

Authors: Jiawei Zhang, Ziyuan Liu, Leon Yan, Gen Li, Yuantao Gu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10103
Pdf URL: https://arxiv.org/pdf/2503.10103
Copy Paste: [[2503.10103]] Improving Diffusion-based Inverse Algorithms under Few-Step Constraint via Learnable Linear Extrapolation(https://arxiv.org/abs/2503.10103)
Keywords: diffusion
Abstract: Diffusion models have demonstrated remarkable performance in modeling complex data priors, catalyzing their widespread adoption in solving various inverse problems. However, the inherently iterative nature of diffusion-based inverse algorithms often requires hundreds to thousands of steps, with performance degradation occurring under fewer steps which limits their practical applicability. While high-order diffusion ODE solvers have been extensively explored for efficient diffusion sampling without observations, their application to inverse problems remains underexplored due to the diverse forms of inverse algorithms and their need for repeated trajectory correction based on observations. To address this gap, we first introduce a canonical form that decomposes existing diffusion-based inverse algorithms into three modules to unify their analysis. Inspired by the linear subspace search strategy in the design of high-order diffusion ODE solvers, we propose the Learnable Linear Extrapolation (LLE) method, a lightweight approach that universally enhances the performance of any diffusion-based inverse algorithm that fits the proposed canonical form. Extensive experiments demonstrate consistent improvements of the proposed LLE method across multiple algorithms and tasks, indicating its potential for more efficient solutions and boosted performance of diffusion-based inverse algorithms with limited steps. Codes for reproducing our experiments are available at \href{this https URL}{this https URL\_inverse\_problem}.

Title: Mamba-VA: A Mamba-based Approach for Continuous Emotion Recognition in Valence-Arousal Space

Authors: Yuheng Liang, Zheyu Wang, Feng Liu, Mingzhou Liu, Yu Yao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10104
Pdf URL: https://arxiv.org/pdf/2503.10104
Copy Paste: [[2503.10104]] Mamba-VA: A Mamba-based Approach for Continuous Emotion Recognition in Valence-Arousal Space(https://arxiv.org/abs/2503.10104)
Keywords: robust
Abstract: Continuous Emotion Recognition (CER) plays a crucial role in intelligent human-computer interaction, mental health monitoring, and autonomous driving. Emotion modeling based on the Valence-Arousal (VA) space enables a more nuanced representation of emotional states. However, existing methods still face challenges in handling long-term dependencies and capturing complex temporal dynamics. To address these issues, this paper proposes a novel emotion recognition model, Mamba-VA, which leverages the Mamba architecture to efficiently model sequential emotional variations in video frames. First, the model employs a Masked Autoencoder (MAE) to extract deep visual features from video frames, enhancing the robustness of temporal information. Then, a Temporal Convolutional Network (TCN) is utilized for temporal modeling to capture local temporal dependencies. Subsequently, Mamba is applied for long-sequence modeling, enabling the learning of global emotional trends. Finally, a fully connected (FC) layer performs regression analysis to predict continuous valence and arousal values. Experimental results on the Valence-Arousal (VA) Estimation task of the 8th competition on Affective Behavior Analysis in-the-wild (ABAW) demonstrate that the proposed model achieves valence and arousal scores of 0.5362 (0.5036) and 0.4310 (0.4119) on the validation (test) set, respectively, outperforming the baseline. The source code is available on GitHub:this https URL.

Title: MoEdit: On Learning Quantity Perception for Multi-object Image Editing

Authors: Yanfeng Li, Kahou Chan, Yue Sun, Chantong Lam, Tong Tong, Zitong Yu, Keren Fu, Xiaohong Liu, Tao Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10112
Pdf URL: https://arxiv.org/pdf/2503.10112
Copy Paste: [[2503.10112]] MoEdit: On Learning Quantity Perception for Multi-object Image Editing(https://arxiv.org/abs/2503.10112)
Keywords: diffusion
Abstract: Multi-object images are prevalent in various real-world scenarios, including augmented reality, advertisement design, and medical imaging. Efficient and precise editing of these images is critical for these applications. With the advent of Stable Diffusion (SD), high-quality image generation and editing have entered a new era. However, existing methods often struggle to consider each object both individually and part of the whole image editing, both of which are crucial for ensuring consistent quantity perception, resulting in suboptimal perceptual performance. To address these challenges, we propose MoEdit, an auxiliary-free multi-object image editing framework. MoEdit facilitates high-quality multi-object image editing in terms of style transfer, object reinvention, and background regeneration, while ensuring consistent quantity perception between inputs and outputs, even with a large number of objects. To achieve this, we introduce the Feature Compensation (FeCom) module, which ensures the distinction and separability of each object attribute by minimizing the in-between interlacing. Additionally, we present the Quantity Attention (QTTN) module, which perceives and preserves quantity consistency by effective control in editing, without relying on auxiliary tools. By leveraging the SD model, MoEdit enables customized preservation and modification of specific concepts in inputs with high quality. Experimental results demonstrate that our MoEdit achieves State-Of-The-Art (SOTA) performance in multi-object image editing. Data and codes will be available at this https URL.

Title: Hybrid Agents for Image Restoration

Authors: Bingchen Li, Xin Li, Yiting Lu, Zhibo Chen
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.10120
Pdf URL: https://arxiv.org/pdf/2503.10120
Copy Paste: [[2503.10120]] Hybrid Agents for Image Restoration(https://arxiv.org/abs/2503.10120)
Keywords: large language model
Abstract: Existing Image Restoration (IR) studies typically focus on task-specific or universal modes individually, relying on the mode selection of users and lacking the cooperation between multiple task-specific/universal restoration modes. This leads to insufficient interaction for unprofessional users and limits their restoration capability for complicated real-world applications. In this work, we present HybridAgent, intending to incorporate multiple restoration modes into a unified image restoration model and achieve intelligent and efficient user interaction through our proposed hybrid agents. Concretely, we propose the hybrid rule of fast, slow, and feedback restoration agents. Here, the slow restoration agent optimizes the powerful multimodal large language model (MLLM) with our proposed instruction-tuning dataset to identify degradations within images with ambiguous user prompts and invokes proper restoration tools accordingly. The fast restoration agent is designed based on a lightweight large language model (LLM) via in-context learning to understand the user prompts with simple and clear requirements, which can obviate the unnecessary time/resource costs of MLLM. Moreover, we introduce the mixed distortion removal mode for our HybridAgents, which is crucial but not concerned in previous agent-based works. It can effectively prevent the error propagation of step-by-step image restoration and largely improve the efficiency of the agent system. We validate the effectiveness of HybridAgent with both synthetic and real-world IR tasks.

Title: Proxy-Tuning: Tailoring Multimodal Autoregressive Models for Subject-Driven Image Generation

Authors: Yi Wu, Lingting Zhu, Lei Liu, Wandi Qiao, Ziqiang Li, Lequan Yu, Bin Li
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2503.10125
Pdf URL: https://arxiv.org/pdf/2503.10125
Copy Paste: [[2503.10125]] Proxy-Tuning: Tailoring Multimodal Autoregressive Models for Subject-Driven Image Generation(https://arxiv.org/abs/2503.10125)
Keywords: diffusion, transformer
Abstract: Multimodal autoregressive (AR) models, based on next-token prediction and transformer architecture, have demonstrated remarkable capabilities in various multimodal tasks including text-to-image (T2I) generation. Despite their strong performance in general T2I tasks, our research reveals that these models initially struggle with subject-driven image generation compared to dominant diffusion models. To address this limitation, we introduce Proxy-Tuning, leveraging diffusion models to enhance AR models' capabilities in subject-specific image generation. Our method reveals a striking weak-to-strong phenomenon: fine-tuned AR models consistently outperform their diffusion model supervisors in both subject fidelity and prompt adherence. We analyze this performance shift and identify scenarios where AR models excel, particularly in multi-subject compositions and contextual understanding. This work not only demonstrates impressive results in subject-driven AR image generation, but also unveils the potential of weak-to-strong generalization in the image generation domain, contributing to a deeper understanding of different architectures' strengths and limitations.

Title: PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models

Authors: Runze He, Bo Cheng, Yuhang Ma, Qingxiang Jia, Shanyuan Liu, Ao Ma, Xiaoyu Wu, Liebucha Wu, Dawei Leng, Yuhui Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10127
Pdf URL: https://arxiv.org/pdf/2503.10127
Copy Paste: [[2503.10127]] PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models(https://arxiv.org/abs/2503.10127)
Keywords: diffusion, transformer
Abstract: In this paper, we propose a unified layout planning and image generation model, PlanGen, which can pre-plan spatial layout conditions before generating images. Unlike previous diffusion-based models that treat layout planning and layout-to-image as two separate models, PlanGen jointly models the two tasks into one autoregressive transformer using only next-token prediction. PlanGen integrates layout conditions into the model as context without requiring specialized encoding of local captions and bounding box coordinates, which provides significant advantages over the previous embed-and-pool operations on layout conditions, particularly when dealing with complex layouts. Unified prompting allows PlanGen to perform multitasking training related to layout, including layout planning, layout-to-image generation, image layout understanding, etc. In addition, PlanGen can be seamlessly expanded to layout-guided image manipulation thanks to the well-designed modeling, with teacher-forcing content manipulation policy and negative layout guidance. Extensive experiments verify the effectiveness of our PlanGen in multiple layoutrelated tasks, showing its great potential. Code is available at: this https URL.

Title: Deep Learning-Based Direct Leaf Area Estimation using Two RGBD Datasets for Model Development

Authors: Namal Jayasuriya, Yi Guo, Wen Hu, Oula Ghannoum
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10129
Pdf URL: https://arxiv.org/pdf/2503.10129
Copy Paste: [[2503.10129]] Deep Learning-Based Direct Leaf Area Estimation using Two RGBD Datasets for Model Development(https://arxiv.org/abs/2503.10129)
Keywords: segmentation
Abstract: Estimation of a single leaf area can be a measure of crop growth and a phenotypic trait to breed new varieties. It has also been used to measure leaf area index and total leaf area. Some studies have used hand-held cameras, image processing 3D reconstruction and unsupervised learning-based methods to estimate the leaf area in plant images. Deep learning works well for object detection and segmentation tasks; however, direct area estimation of objects has not been explored. This work investigates deep learning-based leaf area estimation, for RGBD images taken using a mobile camera setup in real-world scenarios. A dataset for attached leaves captured with a top angle view and a dataset for detached single leaves were collected for model development and testing. First, image processing-based area estimation was tested on manually segmented leaves. Then a Mask R-CNN-based model was investigated, and modified to accept RGBD images and to estimate the leaf area. The detached-leaf data set was then mixed with the attached-leaf plant data set to estimate the single leaf area for plant images, and another network design with two backbones was proposed: one for segmentation and the other for area estimation. Instead of trying all possibilities or random values, an agile approach was used in hyperparameter tuning. The final model was cross-validated with 5-folds and tested with two unseen datasets: detached and attached leaves. The F1 score with 90% IoA for segmentation result on unseen detached-leaf data was 1.0, while R-squared of area estimation was 0.81. For unseen plant data segmentation, the F1 score with 90% IoA was 0.59, while the R-squared score was 0.57. The research suggests using attached leaves with ground truth area to improve the results.

Title: Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding

Authors: Jinze Li, Yixing Xu, Haiduo Huang, Xuanwu Yin, Dong Li, Edith C.H. Ngai, Emad Barsoum
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10135
Pdf URL: https://arxiv.org/pdf/2503.10135
Copy Paste: [[2503.10135]] Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding(https://arxiv.org/abs/2503.10135)
Keywords: transformer, large language model
Abstract: Speculative decoding (SPD) aims to accelerate the auto-regressive token generation process of a target Large Language Model (LLM). Some approaches employ a draft model with multiple heads to predict a sequence of future tokens, where each head handles a token in the sequence. The target LLM verifies the predicted sequence and accepts aligned tokens, enabling efficient multi-token generation. However, existing methods assume that all tokens within a sequence are equally important, employing identical head structures and relying on a single-generation paradigm, either serial or parallel. To this end, we theoretically demonstrate that initial tokens in the draft sequence are more important than later ones. Building on this insight, we propose Gumiho, a hybrid model combining serial and parallel heads. Specifically, given the critical importance of early tokens, we employ a sophisticated Transformer architecture for the early draft heads in a serial configuration to improve accuracy. For later tokens, we utilize multiple lightweight MLP heads operating in parallel to enhance efficiency. By allocating more advanced model structures and longer running times to the early heads, Gumiho achieves improved overall performance. The experimental results demonstrate that our method outperforms existing approaches, fully validating its effectiveness.

Title: Unlocking Generalization Power in LiDAR Point Cloud Registration

Authors: Zhenxuan Zeng, Qiao Wu, Xiyu Zhang, Lin Yuanbo Wu, Pei An, Jiaqi Yang, Ji Wang, Peng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10149
Pdf URL: https://arxiv.org/pdf/2503.10149
Copy Paste: [[2503.10149]] Unlocking Generalization Power in LiDAR Point Cloud Registration(https://arxiv.org/abs/2503.10149)
Keywords: robust, extraction
Abstract: In real-world environments, a LiDAR point cloud registration method with robust generalization capabilities (across varying distances and datasets) is crucial for ensuring safety in autonomous driving and other LiDAR-based applications. However, current methods fall short in achieving this level of generalization. To address these limitations, we propose UGP, a pruned framework designed to enhance generalization power for LiDAR point cloud registration. The core insight in UGP is the elimination of cross-attention mechanisms to improve generalization, allowing the network to concentrate on intra-frame feature extraction. Additionally, we introduce a progressive self-attention module to reduce ambiguity in large-scale scenes and integrate Bird's Eye View (BEV) features to incorporate semantic information about scene elements. Together, these enhancements significantly boost the network's generalization performance. We validated our approach through various generalization experiments in multiple outdoor scenes. In cross-distance generalization experiments on KITTI and nuScenes, UGP achieved state-of-the-art mean Registration Recall rates of 94.5% and 91.4%, respectively. In cross-dataset generalization from nuScenes to KITTI, UGP achieved a state-of-the-art mean Registration Recall of 90.9%. Code will be available at this https URL.

Title: Retrieval-Augmented Generation with Hierarchical Knowledge

Authors: Haoyu Huang, Yongfeng Huang, Junjie Yang, Zhenyu Pan, Yongqiang Chen, Kaili Ma, Hongzhi Chen, James Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10150
Pdf URL: https://arxiv.org/pdf/2503.10150
Copy Paste: [[2503.10150]] Retrieval-Augmented Generation with Hierarchical Knowledge(https://arxiv.org/abs/2503.10150)
Keywords: large language model
Abstract: Graph-based Retrieval-Augmented Generation (RAG) methods have significantly enhanced the performance of large language models (LLMs) in domain-specific tasks. However, existing RAG methods do not adequately utilize the naturally inherent hierarchical knowledge in human cognition, which limits the capabilities of RAG systems. In this paper, we introduce a new RAG approach, called HiRAG, which utilizes hierarchical knowledge to enhance the semantic understanding and structure capturing capabilities of RAG systems in the indexing and retrieval processes. Our extensive experiments demonstrate that HiRAG achieves significant performance improvements over the state-of-the-art baseline methods. The code of our proposed method is available at \href{this https URL}{this https URL}.

Title: "Well, Keep Thinking": Enhancing LLM Reasoning with Adaptive Injection Decoding

Authors: Hyunbin Jin, Je Won Yeom, Seunghyun Bae, Taesup Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.10167
Pdf URL: https://arxiv.org/pdf/2503.10167
Copy Paste: [[2503.10167]] "Well, Keep Thinking": Enhancing LLM Reasoning with Adaptive Injection Decoding(https://arxiv.org/abs/2503.10167)
Keywords: large language model
Abstract: Large language models (LLMs) exhibit strong reasoning abilities, often attributed to few-shot or zero-shot chain-of-thought (CoT) prompting. While effective, these methods require labor-intensive prompt engineering, raising the question of whether reasoning can be induced without reliance on explicit prompts. In this work, we unlock the reasoning capabilities of LLMs without explicit prompting. Inspired by zero-shot CoT and CoT-decoding, we propose a novel decoding strategy that systematically nudges LLMs to continue reasoning, thereby preventing immature reasoning processes. Specifically, we monitor the model's generation and inject a designated phrase whenever it is likely to conclude its response prematurely, before completing the reasoning process. Our experimental evaluations on diverse reasoning benchmarks demonstrate that our proposed strategy substantially improves LLM reasoning capabilities, highlighting the potential of decoding-based interventions as an alternative to traditional prompting techniques.

Title: Verifiable, Efficient and Confidentiality-Preserving Graph Search with Transparency

Authors: Qiuhao Wang, Xu Yang, Yiwei Liu, Saiyu Qi, Hongguang Zhao, Ke Li, Yong Qi
Subjects: cs.CR, cs.DB
Abstract URL: https://arxiv.org/abs/2503.10171
Pdf URL: https://arxiv.org/pdf/2503.10171
Copy Paste: [[2503.10171]] Verifiable, Efficient and Confidentiality-Preserving Graph Search with Transparency(https://arxiv.org/abs/2503.10171)
Keywords: secure, privacy, protect
Abstract: Graph databases have garnered extensive attention and research due to their ability to manage relationships between entities efficiently. Today, many graph search services have been outsourced to a third-party server to facilitate storage and computational support. Nevertheless, the outsourcing paradigm may invade the privacy of graphs. PeGraph is the latest scheme achieving encrypted search over social graphs to address the privacy leakage, which maintains two data structures XSet and TSet motivated by the OXT technology to support encrypted conjunctive search. However, PeGraph still exhibits limitations inherent to the underlying OXT. It does not provide transparent search capabilities, suffers from expensive computation and result pattern leakages, and it fails to support search over dynamic encrypted graph database and results verification. In this paper, we propose SecGraph to address the first two limitations, which adopts a novel system architecture that leverages an SGX-enabled cloud server to provide users with secure and transparent search services since the secret key protection and computational overhead have been offloaded to the cloud server. Besides, we design an LDCF-encoded XSet based on the Logarithmic Dynamic Cuckoo Filter to facilitate efficient plaintext computation in trusted memory, effectively mitigating the risks of result pattern leakage and performance degradation due to exceeding the limited trusted memory capacity. Finally, we design a new dynamic version of TSet named Twin-TSet to enable conjunctive search over dynamic encrypted graph database. In order to support verifiable search, we further propose VSecGraph, which utilizes a procedure-oriented verification method to verify all data structures loaded into the trusted memory, thus bypassing the computational overhead associated with the client's local verification.

Title: PRISM: Preference Refinement via Implicit Scene Modeling for 3D Vision-Language Preference-Based Reinforcement Learning

Authors: Yirong Sun, Yanjun Chen
Subjects: cs.CL, cs.RO
Abstract URL: https://arxiv.org/abs/2503.10177
Pdf URL: https://arxiv.org/pdf/2503.10177
Copy Paste: [[2503.10177]] PRISM: Preference Refinement via Implicit Scene Modeling for 3D Vision-Language Preference-Based Reinforcement Learning(https://arxiv.org/abs/2503.10177)
Keywords: robust
Abstract: We propose PRISM, a novel framework designed to overcome the limitations of 2D-based Preference-Based Reinforcement Learning (PBRL) by unifying 3D point cloud modeling and future-aware preference refinement. At its core, PRISM adopts a 3D Point Cloud-Language Model (3D-PC-LLM) to mitigate occlusion and viewpoint biases, ensuring more stable and spatially consistent preference signals. Additionally, PRISM leverages Chain-of-Thought (CoT) reasoning to incorporate long-horizon considerations, thereby preventing the short-sighted feedback often seen in static preference comparisons. In contrast to conventional PBRL techniques, this integration of 3D perception and future-oriented reasoning leads to significant gains in preference agreement rates, faster policy convergence, and robust generalization across unseen robotic environments. Our empirical results, spanning tasks such as robotic manipulation and autonomous navigation, highlight PRISM's potential for real-world applications where precise spatial understanding and reliable long-term decision-making are critical. By bridging 3D geometric awareness with CoT-driven preference modeling, PRISM establishes a comprehensive foundation for scalable, human-aligned reinforcement learning.

Title: Robustness Tokens: Towards Adversarial Robustness of Transformers

Authors: Brian Pulfer, Yury Belousov, Slava Voloshynovskiy
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.10191
Pdf URL: https://arxiv.org/pdf/2503.10191
Copy Paste: [[2503.10191]] Robustness Tokens: Towards Adversarial Robustness of Transformers(https://arxiv.org/abs/2503.10191)
Keywords: attack, robust, transformer
Abstract: Recently, large pre-trained foundation models have become widely adopted by machine learning practitioners for a multitude of tasks. Given that such models are publicly available, relying on their use as backbone models for downstream tasks might result in high vulnerability to adversarial attacks crafted with the same public model. In this work, we propose Robustness Tokens, a novel approach specific to the transformer architecture that fine-tunes a few additional private tokens with low computational requirements instead of tuning model parameters as done in traditional adversarial training. We show that Robustness Tokens make Vision Transformer models significantly more robust to white-box adversarial attacks while also retaining the original downstream performances.

Title: ST-FlowNet: An Efficient Spiking Neural Network for Event-Based Optical Flow Estimation

Authors: Hongze Sun, Jun Wang, Wuque Cai, Duo Chen, Qianqian Liao, Jiayi He, Yan Cui, Dezhong Yao, Daqing Guo
Subjects: cs.CV, cs.NE, q-bio.NC
Abstract URL: https://arxiv.org/abs/2503.10195
Pdf URL: https://arxiv.org/pdf/2503.10195
Copy Paste: [[2503.10195]] ST-FlowNet: An Efficient Spiking Neural Network for Event-Based Optical Flow Estimation(https://arxiv.org/abs/2503.10195)
Keywords: robust
Abstract: Spiking Neural Networks (SNNs) have emerged as a promising tool for event-based optical flow estimation tasks due to their ability to leverage spatio-temporal information and low-power capabilities. However, the performance of SNN models is often constrained, limiting their application in real-world scenarios. In this work, we address this gap by proposing a novel neural network architecture, ST-FlowNet, specifically tailored for optical flow estimation from event-based data. The ST-FlowNet architecture integrates ConvGRU modules to facilitate cross-modal feature augmentation and temporal alignment of the predicted optical flow, improving the network's ability to capture complex motion dynamics. Additionally, to overcome the challenges associated with training SNNs, we introduce a novel approach to derive SNN models from pre-trained artificial neural networks (ANNs) through ANN-to-SNN conversion or our proposed BISNN method. Notably, the BISNN method alleviates the complexities involved in biological parameter selection, further enhancing the robustness of SNNs in optical flow estimation tasks. Extensive evaluations on three benchmark event-based datasets demonstrate that the SNN-based ST-FlowNet model outperforms state-of-the-art methods, delivering superior performance in accurate optical flow estimation across a diverse range of dynamic visual scenes. Furthermore, the inherent energy efficiency of SNN models is highlighted, establishing a compelling advantage for their practical deployment. Overall, our work presents a novel framework for optical flow estimation using SNNs and event-based data, contributing to the advancement of neuromorphic vision applications.

Title: Deep Learning for Time Series Forecasting: A Survey

Authors: Xiangjie Kong, Zhenghao Chen, Weiyao Liu, Kaili Ning, Lechao Zhang, Syauqie Muhammad Marier, Yichen Liu, Yuhao Chen, Feng Xia
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10198
Pdf URL: https://arxiv.org/pdf/2503.10198
Copy Paste: [[2503.10198]] Deep Learning for Time Series Forecasting: A Survey(https://arxiv.org/abs/2503.10198)
Keywords: extraction
Abstract: Time series forecasting (TSF) has long been a crucial task in both industry and daily life. Most classical statistical models may have certain limitations when applied to practical scenarios in fields such as energy, healthcare, traffic, meteorology, and economics, especially when high accuracy is required. With the continuous development of deep learning, numerous new models have emerged in the field of time series forecasting in recent years. However, existing surveys have not provided a unified summary of the wide range of model architectures in this field, nor have they given detailed summaries of works in feature extraction and datasets. To address this gap, in this review, we comprehensively study the previous works and summarize the general paradigms of Deep Time Series Forecasting (DTSF) in terms of model architectures. Besides, we take an innovative approach by focusing on the composition of time series and systematically explain important feature extraction methods. Additionally, we provide an overall compilation of datasets from various domains in existing works. Finally, we systematically emphasize the significant challenges faced and future research directions in this field.

Title: LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents

Authors: Boyu Chen, Zhengrong Yue, Siran Chen, Zikang Wang, Yang Liu, Peng Li, Yali Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10200
Pdf URL: https://arxiv.org/pdf/2503.10200
Copy Paste: [[2503.10200]] LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents(https://arxiv.org/abs/2503.10200)
Keywords: large language model
Abstract: Existing Multimodal Large Language Models (MLLMs) encounter significant challenges in modeling the temporal context within long videos. Currently, mainstream Agent-based methods use external tools (e.g., search engine, memory banks, OCR, retrieval models) to assist a single MLLM in answering long video questions. Despite such tool-based support, a solitary MLLM still offers only a partial understanding of long videos, resulting in limited performance. In order to better address long video tasks, we introduce LVAgent, the first framework enabling multi-round dynamic collaboration of MLLM agents in long video understanding. Our methodology consists of four key steps: 1. Selection: We pre-select appropriate agents from the model library to form optimal agent teams based on different tasks. 2. Perception: We design an effective retrieval scheme for long videos, improving the coverage of critical temporal segments while maintaining computational efficiency. 3. Action: Agents answer long video-related questions and exchange reasons. 4. Reflection: We evaluate the performance of each agent in each round of discussion and optimize the agent team for dynamic collaboration. The agents iteratively refine their answers by multi-round dynamical collaboration of MLLM agents. LVAgent is the first agent system method that outperforms all closed-source models (including GPT-4o) and open-source models (including InternVL-2.5 and Qwen2-VL) in the long video understanding tasks. Our LVAgent achieves an accuracy of 80% on four mainstream long video understanding tasks. Notably, on the LongVideoBench dataset, LVAgent improves accuracy by up to 14.3% compared with SOTA.

Title: Efficient Implementation of CRYSTALS-KYBER Key Encapsulation Mechanism on ESP32

Authors: Fabian Segatz, Muhammad Ihsan Al Hafiz
Subjects: cs.CR, cs.AR
Abstract URL: https://arxiv.org/abs/2503.10207
Pdf URL: https://arxiv.org/pdf/2503.10207
Copy Paste: [[2503.10207]] Efficient Implementation of CRYSTALS-KYBER Key Encapsulation Mechanism on ESP32(https://arxiv.org/abs/2503.10207)
Keywords: secure
Abstract: Kyber, an IND-CCA2-secure lattice-based post-quantum key-encapsulation mechanism, is the winner of the first post-quantum cryptography standardization process of the US National Institute of Standards and Technology. In this work, we provide an efficient implementation of Kyber on ESP32, a very popular microcontroller for Internet of Things applications. We hand-partition the Kyber algorithm to enable utilization of the ESP32 dual-core architecture, which allows us to speed up its execution by 1.21x (keygen), 1.22x (encaps) and 1.20x (decaps). We also explore the possibility of gaining further improvement by utilizing the ESP32 SHA and AES coprocessor and achieve a culminated speed-up of 1.72x (keygen), 1.84x (encaps) and 1.69x (decaps).

Title: Adaptive Inner Speech-Text Alignment for LLM-based Speech Translation

Authors: Henglyu Liu, Andong Chen, Kehai Chen, Xuefeng Bai, Meizhi Zhong, Yuan Qiu, Min Zhang
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2503.10211
Pdf URL: https://arxiv.org/pdf/2503.10211
Copy Paste: [[2503.10211]] Adaptive Inner Speech-Text Alignment for LLM-based Speech Translation(https://arxiv.org/abs/2503.10211)
Keywords: large language model
Abstract: Recent advancement of large language models (LLMs) has led to significant breakthroughs across various tasks, laying the foundation for the development of LLM-based speech translation systems. Existing methods primarily focus on aligning inputs and outputs across modalities while overlooking deeper semantic alignment within model representations. To address this limitation, we propose an Adaptive Inner Speech-Text Alignment (AI-STA) method to bridge the modality gap by explicitly aligning speech and text representations at selected layers within LLMs. To achieve this, we leverage the optimal transport (OT) theory to quantify fine-grained representation discrepancies between speech and text. Furthermore, we utilize the cross-modal retrieval technique to identify the layers that are best suited for alignment and perform joint training on these layers. Experimental results on speech translation (ST) tasks demonstrate that AI-STA significantly improves the translation performance of large speech-text models (LSMs), outperforming previous state-of-the-art approaches. Our findings highlight the importance of inner-layer speech-text alignment in LLMs and provide new insights into enhancing cross-modal learning.

Title: MouseGPT: A Large-scale Vision-Language Model for Mouse Behavior Analysis

Authors: Teng Xu, Taotao Zhou, Youjia Wang, Peng Yang, Simin Tang, Kuixiang Shao, Zifeng Tang, Yifei Liu, Xinyuan Chen, Hongshuang Wang, Xiaohui Wang, Huoqing Luo, Jingya Wang, Ji Hu, Jingyi Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10212
Pdf URL: https://arxiv.org/pdf/2503.10212
Copy Paste: [[2503.10212]] MouseGPT: A Large-scale Vision-Language Model for Mouse Behavior Analysis(https://arxiv.org/abs/2503.10212)
Keywords: interpretability
Abstract: Analyzing animal behavior is crucial in advancing neuroscience, yet quantifying and deciphering its intricate dynamics remains a significant challenge. Traditional machine vision approaches, despite their ability to detect spontaneous behaviors, fall short due to limited interpretability and reliance on manual labeling, which restricts the exploration of the full behavioral spectrum. Here, we introduce MouseGPT, a Vision-Language Model (VLM) that integrates visual cues with natural language to revolutionize mouse behavior analysis. Built upon our first-of-its-kind dataset - incorporating pose dynamics and open-vocabulary behavioral annotations across over 42 million frames of diverse psychiatric conditions - MouseGPT provides a novel, context-rich method for comprehensive behavior interpretation. Our holistic analysis framework enables detailed behavior profiling, clustering, and novel behavior discovery, offering deep insights without the need for labor - intensive manual annotation. Evaluations reveal that MouseGPT surpasses existing models in precision, adaptability, and descriptive richness, positioning it as a transformative tool for ethology and for unraveling complex behavioral dynamics in animal models.

Title: CoStoDet-DDPM: Collaborative Training of Stochastic and Deterministic Models Improves Surgical Workflow Anticipation and Recognition

Authors: Kaixiang Yang, Xin Li, Qiang Li, Zhiwei Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10216
Pdf URL: https://arxiv.org/pdf/2503.10216
Copy Paste: [[2503.10216]] CoStoDet-DDPM: Collaborative Training of Stochastic and Deterministic Models Improves Surgical Workflow Anticipation and Recognition(https://arxiv.org/abs/2503.10216)
Keywords: robust, diffusion
Abstract: Anticipating and recognizing surgical workflows are critical for intelligent surgical assistance systems. However, existing methods rely on deterministic decision-making, struggling to generalize across the large anatomical and procedural variations inherent in real-world this http URL this paper, we introduce an innovative framework that incorporates stochastic modeling through a denoising diffusion probabilistic model (DDPM) into conventional deterministic learning for surgical workflow analysis. At the heart of our approach is a collaborative co-training paradigm: the DDPM branch captures procedural uncertainties to enrich feature representations, while the task branch focuses on predicting surgical phases and instrument this http URL, we demonstrate that this mutual refinement mechanism benefits both branches: the DDPM reduces prediction errors in uncertain scenarios, and the task branch directs the DDPM toward clinically meaningful representations. Notably, the DDPM branch is discarded during inference, enabling real-time predictions without sacrificing this http URL on the Cholec80 dataset show that for the anticipation task, our method achieves a 16% reduction in eMAE compared to state-of-the-art approaches, and for phase recognition, it improves the Jaccard score by 1.0%. Additionally, on the AutoLaparo dataset, our method achieves a 1.5% improvement in the Jaccard score for phase recognition, while also exhibiting robust generalization to patient-specific variations. Our code and weight are available at this https URL.

Title: Efficient Federated Fine-Tuning of Large Language Models with Layer Dropout

Authors: Shilong Wang, Jianchun Liu, Hongli Xu, Jiaming Yan, Xianjun Gao
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2503.10217
Pdf URL: https://arxiv.org/pdf/2503.10217
Copy Paste: [[2503.10217]] Efficient Federated Fine-Tuning of Large Language Models with Layer Dropout(https://arxiv.org/abs/2503.10217)
Keywords: privacy, federate, transformer, large language model
Abstract: Fine-tuning plays a crucial role in enabling pre-trained LLMs to evolve from general language comprehension to task-specific expertise. To preserve user data privacy, federated fine-tuning is often employed and has emerged as the de facto paradigm. However, federated fine-tuning is prohibitively inefficient due to the tension between LLM complexity and the resource constraint of end devices, incurring unaffordable fine-tuning overhead. Existing literature primarily utilizes parameter-efficient fine-tuning techniques to mitigate communication costs, yet computational and memory burdens continue to pose significant challenges for developers. This work proposes DropPEFT, an innovative federated PEFT framework that employs a novel stochastic transformer layer dropout method, enabling devices to deactivate a considerable fraction of LLMs layers during training, thereby eliminating the associated computational load and memory footprint. In DropPEFT, a key challenge is the proper configuration of dropout ratios for layers, as overhead and training performance are highly sensitive to this setting. To address this challenge, we adaptively assign optimal dropout-ratio configurations to devices through an exploration-exploitation strategy, achieving efficient and effective fine-tuning. Extensive experiments show that DropPEFT can achieve a 1.3-6.3\times speedup in model convergence and a 40%-67% reduction in memory footprint compared to state-of-the-art methods.

Title: Moss: Proxy Model-based Full-Weight Aggregation in Federated Learning with Heterogeneous Models

Authors: Yifeng Cai, Ziqi Zhang, Ding Li, Yao Guo, Xiangqun Chen
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2503.10218
Pdf URL: https://arxiv.org/pdf/2503.10218
Copy Paste: [[2503.10218]] Moss: Proxy Model-based Full-Weight Aggregation in Federated Learning with Heterogeneous Models(https://arxiv.org/abs/2503.10218)
Keywords: federate
Abstract: Modern Federated Learning (FL) has become increasingly essential for handling highly heterogeneous mobile devices. Current approaches adopt a partial model aggregation paradigm that leads to sub-optimal model accuracy and higher training overhead. In this paper, we challenge the prevailing notion of partial-model aggregation and propose a novel "full-weight aggregation" method named Moss, which aggregates all weights within heterogeneous models to preserve comprehensive knowledge. Evaluation across various applications demonstrates that Moss significantly accelerates training, reduces on-device training time and energy consumption, enhances accuracy, and minimizes network bandwidth utilization when compared to state-of-the-art baselines.

Title: Probability-Flow ODE in Infinite-Dimensional Function Spaces

Authors: Kunwoo Na, Junghyun Lee, Se-Young Yun, Sungbin Lim
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.10219
Pdf URL: https://arxiv.org/pdf/2503.10219
Copy Paste: [[2503.10219]] Probability-Flow ODE in Infinite-Dimensional Function Spaces(https://arxiv.org/abs/2503.10219)
Keywords: diffusion
Abstract: Recent advances in infinite-dimensional diffusion models have demonstrated their effectiveness and scalability in function generation tasks where the underlying structure is inherently infinite-dimensional. To accelerate inference in such models, we derive, for the first time, an analog of the probability-flow ODE (PF-ODE) in infinite-dimensional function spaces. Leveraging this newly formulated PF-ODE, we reduce the number of function evaluations while maintaining sample quality in function generation tasks, including applications to PDEs.

Title: Unveiling the Invisible: Reasoning Complex Occlusions Amodally with AURA

Authors: Zhixuan Li, Hyunse Yoon, Sanghoon Lee, Weisi Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10225
Pdf URL: https://arxiv.org/pdf/2503.10225
Copy Paste: [[2503.10225]] Unveiling the Invisible: Reasoning Complex Occlusions Amodally with AURA(https://arxiv.org/abs/2503.10225)
Keywords: large language model, segmentation
Abstract: Amodal segmentation aims to infer the complete shape of occluded objects, even when the occluded region's appearance is unavailable. However, current amodal segmentation methods lack the capability to interact with users through text input and struggle to understand or reason about implicit and complex purposes. While methods like LISA integrate multi-modal large language models (LLMs) with segmentation for reasoning tasks, they are limited to predicting only visible object regions and face challenges in handling complex occlusion scenarios. To address these limitations, we propose a novel task named amodal reasoning segmentation, aiming to predict the complete amodal shape of occluded objects while providing answers with elaborations based on user text input. We develop a generalizable dataset generation pipeline and introduce a new dataset focusing on daily life scenarios, encompassing diverse real-world occlusions. Furthermore, we present AURA (Amodal Understanding and Reasoning Assistant), a novel model with advanced global and spatial-level designs specifically tailored to handle complex occlusions. Extensive experiments validate AURA's effectiveness on the proposed dataset. The code, model, and dataset will be publicly released.

Title: Policy Teaching via Data Poisoning in Learning from Human Preferences

Authors: Andi Nika, Jonathan Nöther, Debmalya Mandal, Parameswaran Kamalaruban, Adish Singla, Goran Radanović
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.10228
Pdf URL: https://arxiv.org/pdf/2503.10228
Copy Paste: [[2503.10228]] Policy Teaching via Data Poisoning in Learning from Human Preferences(https://arxiv.org/abs/2503.10228)
Keywords: attack
Abstract: We study data poisoning attacks in learning from human preferences. More specifically, we consider the problem of teaching/enforcing a target policy $\pi^\dagger$ by synthesizing preference data. We seek to understand the susceptibility of different preference-based learning paradigms to poisoned preference data by analyzing the number of samples required by the attacker to enforce $\pi^\dagger$. We first propose a general data poisoning formulation in learning from human preferences and then study it for two popular paradigms, namely: (a) reinforcement learning from human feedback (RLHF) that operates by learning a reward model using preferences; (b) direct preference optimization (DPO) that directly optimizes policy using preferences. We conduct a theoretical analysis of the effectiveness of data poisoning in a setting where the attacker is allowed to augment a pre-existing dataset and also study its special case where the attacker can synthesize the entire preference dataset from scratch. As our main results, we provide lower/upper bounds on the number of samples required to enforce $\pi^\dagger$. Finally, we discuss the implications of our results in terms of the susceptibility of these learning paradigms under such data poisoning attacks.

Title: R.U.Psycho? Robust Unified Psychometric Testing of Language Models

Authors: Julian Schelb, Orr Borin, David Garcia, Andreas Spitz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.10229
Pdf URL: https://arxiv.org/pdf/2503.10229
Copy Paste: [[2503.10229]] R.U.Psycho? Robust Unified Psychometric Testing of Language Models(https://arxiv.org/abs/2503.10229)
Keywords: robust, generative
Abstract: Generative language models are increasingly being subjected to psychometric questionnaires intended for human testing, in efforts to establish their traits, as benchmarks for alignment, or to simulate participants in social science experiments. While this growing body of work sheds light on the likeness of model responses to those of humans, concerns are warranted regarding the rigour and reproducibility with which these experiments may be conducted. Instabilities in model outputs, sensitivity to prompt design, parameter settings, and a large number of available model versions increase documentation requirements. Consequently, generalization of findings is often complex and reproducibility is far from guaranteed. In this paper, we present this http URL, a framework for designing and running robust and reproducible psychometric experiments on generative language models that requires limited coding expertise. We demonstrate the capability of our framework on a variety of psychometric questionnaires, which lend support to prior findings in the literature. this http URL is available as a Python package at this https URL.

Title: Post Quantum Migration of Tor

Authors: Denis Berger, Mouad Lemoudden, William J Buchanan
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.10238
Pdf URL: https://arxiv.org/pdf/2503.10238
Copy Paste: [[2503.10238]] Post Quantum Migration of Tor(https://arxiv.org/abs/2503.10238)
Keywords: secure, privacy, protect, attack
Abstract: Shor's and Grover's algorithms' efficiency and the advancement of quantum computers imply that the cryptography used until now to protect one's privacy is potentially vulnerable to retrospective decryption, also known as \emph{harvest now, decrypt later} attack in the near future. This dissertation proposes an overview of the cryptographic schemes used by Tor, highlighting the non-quantum-resistant ones and introducing theoretical performance assessment methods of a local Tor network. The measurement is divided into three phases. We will start with benchmarking a local Tor network simulation on constrained devices to isolate the time taken by classical cryptography processes. Secondly, the analysis incorporates existing benchmarks of quantum-secure algorithms and compares these performances on the devices. Lastly, the estimation of overhead is calculated by replacing the measured times of traditional cryptography with the times recorded for Post Quantum Cryptography (PQC) execution within the specified Tor environment. By focusing on the replaceable cryptographic components, using theoretical estimations, and leveraging existing benchmarks, valuable insights into the potential impact of PQC can be obtained without needing to implement it fully.

Title: I Can Tell Your Secrets: Inferring Privacy Attributes from Mini-app Interaction History in Super-apps

Authors: Yifeng Cai, Ziqi Zhang, Mengyu Yao, Junlin Liu, Xiaoke Zhao, Xinyi Fu, Ruoyu Li, Zhe Li, Xiangqun Chen, Yao Guo, Ding Li
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.10239
Pdf URL: https://arxiv.org/pdf/2503.10239
Copy Paste: [[2503.10239]] I Can Tell Your Secrets: Inferring Privacy Attributes from Mini-app Interaction History in Super-apps(https://arxiv.org/abs/2503.10239)
Keywords: privacy, protect, attack
Abstract: Super-apps have emerged as comprehensive platforms integrating various mini-apps to provide diverse services. While super-apps offer convenience and enriched functionality, they can introduce new privacy risks. This paper reveals a new privacy leakage source in super-apps: mini-app interaction history, including mini-app usage history (Mini-H) and operation history (Op-H). Mini-H refers to the history of mini-apps accessed by users, such as their frequency and categories. Op-H captures user interactions within mini-apps, including button clicks, bar drags, and image views. Super-apps can naturally collect these data without instrumentation due to the web-based feature of mini-apps. We identify these data types as novel and unexplored privacy risks through a literature review of 30 papers and an empirical analysis of 31 super-apps. We design a mini-app interaction history-oriented inference attack (THEFT), to exploit this new vulnerability. Using THEFT, the insider threats within the low-privilege business department of the super-app vendor acting as the adversary can achieve more than 95.5% accuracy in inferring privacy attributes of over 16.1% of users. THEFT only requires a small training dataset of 200 users from public breached databases on the Internet. We also engage with super-app vendors and a standards association to increase industry awareness and commitment to protect this data. Our contributions are significant in identifying overlooked privacy risks, demonstrating the effectiveness of a new attack, and influencing industry practices toward better privacy protection in the super-app ecosystem.

Title: MinorBench: A hand-built benchmark for content-based risks for children

Authors: Shaun Khoo, Gabriel Chua, Rachel Shong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10242
Pdf URL: https://arxiv.org/pdf/2503.10242
Copy Paste: [[2503.10242]] MinorBench: A hand-built benchmark for content-based risks for children(https://arxiv.org/abs/2503.10242)
Keywords: robust, large language model
Abstract: Large Language Models (LLMs) are rapidly entering children's lives - through parent-driven adoption, schools, and peer networks - yet current AI ethics and safety research do not adequately address content-related risks specific to minors. In this paper, we highlight these gaps with a real-world case study of an LLM-based chatbot deployed in a middle school setting, revealing how students used and sometimes misused the system. Building on these findings, we propose a new taxonomy of content-based risks for minors and introduce MinorBench, an open-source benchmark designed to evaluate LLMs on their ability to refuse unsafe or inappropriate queries from children. We evaluate six prominent LLMs under different system prompts, demonstrating substantial variability in their child-safety compliance. Our results inform practical steps for more robust, child-focused safety mechanisms and underscore the urgency of tailoring AI systems to safeguard young users.

Title: Interpretable Image Classification via Non-parametric Part Prototype Learning

Authors: Zhijie Zhu, Lei Fan, Maurice Pagnucco, Yang Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10247
Pdf URL: https://arxiv.org/pdf/2503.10247
Copy Paste: [[2503.10247]] Interpretable Image Classification via Non-parametric Part Prototype Learning(https://arxiv.org/abs/2503.10247)
Keywords: robust, interpretability
Abstract: Classifying images with an interpretable decision-making process is a long-standing problem in computer vision. In recent years, Prototypical Part Networks has gained traction as an approach for self-explainable neural networks, due to their ability to mimic human visual reasoning by providing explanations based on prototypical object parts. However, the quality of the explanations generated by these methods leaves room for improvement, as the prototypes usually focus on repetitive and redundant concepts. Leveraging recent advances in prototype learning, we present a framework for part-based interpretable image classification that learns a set of semantically distinctive object parts for each class, and provides diverse and comprehensive explanations. The core of our method is to learn the part-prototypes in a non-parametric fashion, through clustering deep features extracted from foundation vision models that encode robust semantic information. To quantitatively evaluate the quality of explanations provided by ProtoPNets, we introduce Distinctiveness Score and Comprehensiveness Score. Through evaluation on CUB-200-2011, Stanford Cars and Stanford Dogs datasets, we show that our framework compares favourably against existing ProtoPNets while achieving better interpretability. Code is available at: this https URL.

Title: SVIP: Semantically Contextualized Visual Patches for Zero-Shot Learning

Authors: Zhi Chen, Zecheng Zhao, Jingcai Guo, Jingjing Li, Zi Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10252
Pdf URL: https://arxiv.org/pdf/2503.10252
Copy Paste: [[2503.10252]] SVIP: Semantically Contextualized Visual Patches for Zero-Shot Learning(https://arxiv.org/abs/2503.10252)
Keywords: extraction, transformer
Abstract: Zero-shot learning (ZSL) aims to recognize unseen classes without labeled training examples by leveraging class-level semantic descriptors such as attributes. A fundamental challenge in ZSL is semantic misalignment, where semantic-unrelated information involved in visual features introduce ambiguity to visual-semantic interaction. Unlike existing methods that suppress semantic-unrelated information post hoc either in the feature space or the model space, we propose addressing this issue at the input stage, preventing semantic-unrelated patches from propagating through the network. To this end, we introduce Semantically contextualized VIsual Patches (SVIP) for ZSL, a transformer-based framework designed to enhance visual-semantic alignment. Specifically, we propose a self-supervised patch selection mechanism that preemptively learns to identify semantic-unrelated patches in the input space. This is trained with the supervision from aggregated attention scores across all transformer layers, which estimate each patch's semantic score. As removing semantic-unrelated patches from the input sequence may disrupt object structure, we replace them with learnable patch embeddings. With initialization from word embeddings, we can ensure they remain semantically meaningful throughout feature extraction. Extensive experiments on ZSL benchmarks demonstrate that SVIP achieves state-of-the-art performance results while providing more interpretable and semantically rich feature representations.

Title: PIMRL: Physics-Informed Multi-Scale Recurrent Learning for Spatiotemporal Prediction

Authors: Han Wan, Qi Wang, Hao Sun
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10253
Pdf URL: https://arxiv.org/pdf/2503.10253
Copy Paste: [[2503.10253]] PIMRL: Physics-Informed Multi-Scale Recurrent Learning for Spatiotemporal Prediction(https://arxiv.org/abs/2503.10253)
Keywords: robust
Abstract: Simulation of spatiotemporal systems governed by partial differential equations is widely applied in fields such as biology, chemistry, aerospace dynamics, and meteorology. Traditional numerical methods incur high computational costs due to the requirement of small time steps for accurate predictions. While machine learning has reduced these costs, long-term predictions remain challenged by error accumulation, particularly in scenarios with insufficient data or varying time scales, where stability and accuracy are compromised. Existing methods often neglect the effective utilization of multi-scale data, leading to suboptimal robustness in predictions. To address these issues, we propose a novel multi-scale learning framework, namely, the Physics-Informed Multi-Scale Recurrent Learning (PIMRL), to effectively leverage multi-scale data for spatiotemporal dynamics prediction. The PIMRL framework comprises two modules: the micro-scale module embeds physical knowledge into neural networks via pretraining, and the macro-scale module adopts a data-driven approach to learn the temporal evolution of physics in the latent space. Experimental results demonstrate that the PIMRL framework consistently achieves state-of-the-art performance across five benchmark datasets ranging from one to three dimensions, showing average improvements of over 9\% in both RMSE and MAE evaluation metrics, with maximum enhancements reaching up to 80%.

Title: An Open-RAN Testbed for Detecting and Mitigating Radio-Access Anomalies

Authors: Hanna Bogucka, Marcin Hoffmann, Paweł Kryszkiewicz, Łukasz Kułacz
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.10255
Pdf URL: https://arxiv.org/pdf/2503.10255
Copy Paste: [[2503.10255]] An Open-RAN Testbed for Detecting and Mitigating Radio-Access Anomalies(https://arxiv.org/abs/2503.10255)
Keywords: secure, attack
Abstract: This paper presents the Open Radio Access Net-work (O-RAN) testbed for secure radio access. We discuss radio-originating attack detection and mitigation methods based on anomaly detection and how they can be implemented as specialized applications (xApps) in this testbed. We also pre-sent illustrating results of the methods applied in real-world scenarios and implementations.

Title: ROODI: Reconstructing Occluded Objects with Denoising Inpainters

Authors: Yeonjin Chang, Erqun Dong, Seunghyeon Seo, Nojun Kwak, Kwang Moo Yi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10256
Pdf URL: https://arxiv.org/pdf/2503.10256
Copy Paste: [[2503.10256]] ROODI: Reconstructing Occluded Objects with Denoising Inpainters(https://arxiv.org/abs/2503.10256)
Keywords: extraction, diffusion, generative
Abstract: While the quality of novel-view images has improved dramatically with 3D Gaussian Splatting, extracting specific objects from scenes remains challenging. Isolating individual 3D Gaussian primitives for each object and handling occlusions in scenes remain far from being solved. We propose a novel object extraction method based on two key principles: (1) being object-centric by pruning irrelevant primitives; and (2) leveraging generative inpainting to compensate for missing observations caused by occlusions. For pruning, we analyze the local structure of primitives using K-nearest neighbors, and retain only relevant ones. For inpainting, we employ an off-the-shelf diffusion-based inpainter combined with occlusion reasoning, utilizing the 3D representation of the entire scene. Our findings highlight the crucial synergy between pruning and inpainting, both of which significantly enhance extraction performance. We evaluate our method on a standard real-world dataset and introduce a synthetic dataset for quantitative analysis. Our approach outperforms the state-of-the-art, demonstrating its effectiveness in object extraction from complex scenes.

Title: AMR-Transformer: Enabling Efficient Long-range Interaction for Complex Neural Fluid Simulation

Authors: Zeyi Xu, Jinfan Liu, Kuangxu Chen, Ye Chen, Zhangli Hu, Bingbing Ni
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.10257
Pdf URL: https://arxiv.org/pdf/2503.10257
Copy Paste: [[2503.10257]] AMR-Transformer: Enabling Efficient Long-range Interaction for Complex Neural Fluid Simulation(https://arxiv.org/abs/2503.10257)
Keywords: extraction, transformer
Abstract: Accurately and efficiently simulating complex fluid dynamics is a challenging task that has traditionally relied on computationally intensive methods. Neural network-based approaches, such as convolutional and graph neural networks, have partially alleviated this burden by enabling efficient local feature extraction. However, they struggle to capture long-range dependencies due to limited receptive fields, and Transformer-based models, while providing global context, incur prohibitive computational costs. To tackle these challenges, we propose AMR-Transformer, an efficient and accurate neural CFD-solving pipeline that integrates a novel adaptive mesh refinement scheme with a Navier-Stokes constraint-aware fast pruning module. This design encourages long-range interactions between simulation cells and facilitates the modeling of global fluid wave patterns, such as turbulence and shockwaves. Experiments show that our approach achieves significant gains in efficiency while preserving critical details, making it suitable for high-resolution physical simulations with long-range dependencies. On CFDBench, PDEBench and a new shockwave dataset, our pipeline demonstrates up to an order-of-magnitude improvement in accuracy over baseline models. Additionally, compared to ViT, our approach achieves a reduction in FLOPs of up to 60 times.

Title: A Multi-Modal Federated Learning Framework for Remote Sensing Image Classification

Authors: Barış Büyüktaş, Gencer Sumbul, Begüm Demir
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10262
Pdf URL: https://arxiv.org/pdf/2503.10262
Copy Paste: [[2503.10262]] A Multi-Modal Federated Learning Framework for Remote Sensing Image Classification(https://arxiv.org/abs/2503.10262)
Keywords: federate
Abstract: Federated learning (FL) enables the collaborative training of deep neural networks across decentralized data archives (i.e., clients) without sharing the local data of the clients. Most of the existing FL methods assume that the data distributed across all clients is associated with the same data modality. However, remote sensing (RS) images present in different clients can be associated with diverse data modalities. The joint use of the multi-modal RS data can significantly enhance classification performance. To effectively exploit decentralized and unshared multi-modal RS data, our paper introduces a novel multi-modal FL framework for RS image classification problems. The proposed framework comprises three modules: 1) multi-modal fusion (MF); 2) feature whitening (FW); and 3) mutual information maximization (MIM). The MF module employs iterative model averaging to facilitate learning without accessing multi-modal training data on clients. The FW module aims to address the limitations of training data heterogeneity by aligning data distributions across clients. The MIM module aims to model mutual information by maximizing the similarity between images from different modalities. For the experimental analyses, we focus our attention on multi-label classification and pixel-based classification tasks in RS. The results obtained using two benchmark archives show the effectiveness of the proposed framework when compared to state-of-the-art algorithms in the literature. The code of the proposed framework will be available at this https URL.

Title: An Expanded Massive Multilingual Dataset for High-Performance Language Technologies

Authors: Laurie Burchell, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, Marta Bañón, and Pinzhen Chen, Mariia Fedorova, Liane Guillou, Barry Haddow, Jan Hajič, and Jindřich Helcl, Erik Henriksson, Mateusz Klimaszewski, Ville Komulainen, and Andrey Kutuzov, Joona Kytöniemi, Veronika Laippala, Petter Mæhlum, and Bhavitvya Malik, Farrokh Mehryary, Vladislav Mikhailov, Nikita Moghe, and Amanda Myntti, Dayyán O'Brien, Stephan Oepen, Proyag Pal, Jousia Piha, and Sampo Pyysalo, Gema Ramírez-Sánchez, David Samuel, Pavel Stepachev, and Jörg Tiedemann, Dušan Variš, Tereza Vojtěchová, Jaume Zaragoza-Bernabeu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.10267
Pdf URL: https://arxiv.org/pdf/2503.10267
Copy Paste: [[2503.10267]] An Expanded Massive Multilingual Dataset for High-Performance Language Technologies(https://arxiv.org/abs/2503.10267)
Keywords: large language model
Abstract: Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.

Title: Targeted Data Poisoning for Black-Box Audio Datasets Ownership Verification

Authors: Wassim Bouaziz, El-Mahdi El-Mhamdi, Nicolas Usunier
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10269
Pdf URL: https://arxiv.org/pdf/2503.10269
Copy Paste: [[2503.10269]] Targeted Data Poisoning for Black-Box Audio Datasets Ownership Verification(https://arxiv.org/abs/2503.10269)
Keywords: protect, robust, watermark, transformer
Abstract: Protecting the use of audio datasets is a major concern for data owners, particularly with the recent rise of audio deep learning models. While watermarks can be used to protect the data itself, they do not allow to identify a deep learning model trained on a protected dataset. In this paper, we adapt to audio data the recently introduced data taggants approach. Data taggants is a method to verify if a neural network was trained on a protected image dataset with top-$k$ predictions access to the model only. This method relies on a targeted data poisoning scheme by discreetly altering a small fraction (1%) of the dataset as to induce a harmless behavior on out-of-distribution data called keys. We evaluate our method on the Speechcommands and the ESC50 datasets and state of the art transformer models, and show that we can detect the use of the dataset with high confidence without loss of performance. We also show the robustness of our method against common data augmentation techniques, making it a practical method to protect audio datasets.

Title: HyperArm Bandit Optimization: A Novel approach to Hyperparameter Optimization and an Analysis of Bandit Algorithms in Stochastic and Adversarial Settings

Authors: Samih Karroum, Saad Mazhar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.10282
Pdf URL: https://arxiv.org/pdf/2503.10282
Copy Paste: [[2503.10282]] HyperArm Bandit Optimization: A Novel approach to Hyperparameter Optimization and an Analysis of Bandit Algorithms in Stochastic and Adversarial Settings(https://arxiv.org/abs/2503.10282)
Keywords: robust
Abstract: This paper explores the application of bandit algorithms in both stochastic and adversarial settings, with a focus on theoretical analysis and practical applications. The study begins by introducing bandit problems, distinguishing between stochastic and adversarial variants, and examining key algorithms such as Explore-Then-Commit (ETC), Upper Confidence Bound (UCB), and Exponential-Weight Algorithm for Exploration and Exploitation (EXP3). Theoretical regret bounds are analyzed to compare the performance of these algorithms. The paper then introduces a novel framework, HyperArm Bandit Optimization (HABO), which applies EXP3 to hyperparameter tuning in machine learning models. Unlike traditional methods that treat entire configurations as arms, HABO treats individual hyperparameters as super-arms, and its potential configurations as sub-arms, enabling dynamic resource allocation and efficient exploration. Experimental results demonstrate HABO's effectiveness in classification and regression tasks, outperforming Bayesian Optimization in terms of computational efficiency and accuracy. The paper concludes with insights into the convergence guarantees of HABO and its potential for scalable and robust hyperparameter optimization.

Title: VicaSplat: A Single Run is All You Need for 3D Gaussian Splatting and Camera Estimation from Unposed Video Frames

Authors: Zhiqi Li, Chengrui Dong, Yiming Chen, Zhangchi Huang, Peidong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10286
Pdf URL: https://arxiv.org/pdf/2503.10286
Copy Paste: [[2503.10286]] VicaSplat: A Single Run is All You Need for 3D Gaussian Splatting and Camera Estimation from Unposed Video Frames(https://arxiv.org/abs/2503.10286)
Keywords: transformer
Abstract: We present VicaSplat, a novel framework for joint 3D Gaussians reconstruction and camera pose estimation from a sequence of unposed video frames, which is a critical yet underexplored task in real-world 3D applications. The core of our method lies in a novel transformer-based network architecture. In particular, our model starts with an image encoder that maps each image to a list of visual tokens. All visual tokens are concatenated with additional inserted learnable camera tokens. The obtained tokens then fully communicate with each other within a tailored transformer decoder. The camera tokens causally aggregate features from visual tokens of different views, and further modulate them frame-wisely to inject view-dependent features. 3D Gaussian splats and camera pose parameters can then be estimated via different prediction heads. Experiments show that VicaSplat surpasses baseline methods for multi-view inputs, and achieves comparable performance to prior two-view approaches. Remarkably, VicaSplat also demonstrates exceptional cross-dataset generalization capability on the ScanNet benchmark, achieving superior performance without any fine-tuning. Project page: this https URL.

Title: MaterialMVP: Illumination-Invariant Material Generation via Multi-view PBR Diffusion

Authors: Zebin He, Mingxin Yang, Shuhui Yang, Yixuan Tang, Tao Wang, Kaihao Zhang, Guanying Chen, Yuhong Liu, Jie Jiang, Chunchao Guo, Wenhan Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10289
Pdf URL: https://arxiv.org/pdf/2503.10289
Copy Paste: [[2503.10289]] MaterialMVP: Illumination-Invariant Material Generation via Multi-view PBR Diffusion(https://arxiv.org/abs/2503.10289)
Keywords: diffusion
Abstract: Physically-based rendering (PBR) has become a cornerstone in modern computer graphics, enabling realistic material representation and lighting interactions in 3D scenes. In this paper, we present MaterialMVP, a novel end-to-end model for generating PBR textures from 3D meshes and image prompts, addressing key challenges in multi-view material synthesis. Our approach leverages Reference Attention to extract and encode informative latent from the input reference images, enabling intuitive and controllable texture generation. We also introduce a Consistency-Regularized Training strategy to enforce stability across varying viewpoints and illumination conditions, ensuring illumination-invariant and geometrically consistent results. Additionally, we propose Dual-Channel Material Generation, which separately optimizes albedo and metallic-roughness (MR) textures while maintaining precise spatial alignment with the input images through Multi-Channel Aligned Attention. Learnable material embeddings are further integrated to capture the distinct properties of albedo and MR. Experimental results demonstrate that our model generates PBR textures with realistic behavior across diverse lighting scenarios, outperforming existing methods in both consistency and quality for scalable 3D asset creation.

Title: VisualPRM: An Effective Process Reward Model for Multimodal Reasoning

Authors: Weiyun Wang, Zhangwei Gao, Lianjie Chen, Zhe Chen, Jinguo Zhu, Xiangyu Zhao, Yangzhou Liu, Yue Cao, Shenglong Ye, Xizhou Zhu, Lewei Lu, Haodong Duan, Yu Qiao, Jifeng Dai, Wenhai Wang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2503.10291
Pdf URL: https://arxiv.org/pdf/2503.10291
Copy Paste: [[2503.10291]] VisualPRM: An Effective Process Reward Model for Multimodal Reasoning(https://arxiv.org/abs/2503.10291)
Keywords: large language model
Abstract: We introduce VisualPRM, an advanced multimodal Process Reward Model (PRM) with 8B parameters, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) across different model scales and families with Best-of-N (BoN) evaluation strategies. Specifically, our model improves the reasoning performance of three types of MLLMs and four different model scales. Even when applied to the highly capable InternVL2.5-78B, it achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. Experimental results show that our model exhibits superior performance compared to Outcome Reward Models and Self-Consistency during BoN evaluation. To facilitate the training of multimodal PRMs, we construct a multimodal process supervision dataset VisualPRM400K using an automated data pipeline. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels, to measure the abilities of PRMs to detect erroneous steps in multimodal reasoning tasks. We hope that our work can inspire more future research and contribute to the development of MLLMs. Our model, data, and benchmark are released in this https URL.

Title: Proceedings of the ISCA/ITG Workshop on Diversity in Large Speech and Language Models

Authors: Sebastian Möller, Pia Knoeferle, Britta Schulte, Nils Feldhus
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.10298
Pdf URL: https://arxiv.org/pdf/2503.10298
Copy Paste: [[2503.10298]] Proceedings of the ISCA/ITG Workshop on Diversity in Large Speech and Language Models(https://arxiv.org/abs/2503.10298)
Keywords: extraction, large language model
Abstract: Machine learning techniques have conquered many different tasks in speech and natural language processing, such as speech recognition, information extraction, text and speech generation, and human machine interaction using natural language or speech (chatbots). Modern techniques typically rely on large models for representing general knowledge of one or several languages (Large Language Models, LLMs), or for representing speech and general audio characteristics. These models have been trained with large amounts of speech and language data, typically including web content. When humans interact with such technologies, the effectiveness of the interaction will be influenced by how far humans make use of the same type of language the models have been trained on or, in other words, if the models are able to generalize to the language used by humans when interacting with the technology. This may lead to some gradual forms of adaptation in human speech and language production, and users who do not adapt may be excluded from efficient use of such technologies. On top of this, as commercial model development follows market needs, under-represented languages and dialects/sociolects may decrease in terms of priorities. Furthermore, for many lesser spoken languages the necessary data is not available, which will worsen a digital divide in speech and language technology usage. The workshop sets out to discuss this problem based on scientific contributions from the perspective of computer science and linguistics (including computational linguistics and NLP).

Title: Eye on the Target: Eye Tracking Meets Rodent Tracking

Authors: Emil Mededovic, Yuli Wu, Henning Konermann, Marcin Kopaczka, Mareike Schulz, Rene Tolba, Johannes Stegmaier
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10305
Pdf URL: https://arxiv.org/pdf/2503.10305
Copy Paste: [[2503.10305]] Eye on the Target: Eye Tracking Meets Rodent Tracking(https://arxiv.org/abs/2503.10305)
Keywords: segmentation
Abstract: Analyzing animal behavior from video recordings is crucial for scientific research, yet manual annotation remains labor-intensive and prone to subjectivity. Efficient segmentation methods are needed to automate this process while maintaining high accuracy. In this work, we propose a novel pipeline that utilizes eye-tracking data from Aria glasses to generate prompt points, which are then used to produce segmentation masks via a fast zero-shot segmentation model. Additionally, we apply post-processing to refine the prompts, leading to improved segmentation quality. Through our approach, we demonstrate that combining eye-tracking-based annotation with smart prompt refinement can enhance segmentation accuracy, achieving an improvement of 70.6% from 38.8 to 66.2 in the Jaccard Index for segmentation results in the rats dataset.

Title: IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification

Authors: Yuhao Wang, Yongfeng Lv, Pingping Zhang, Huchuan Lu
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2503.10324
Pdf URL: https://arxiv.org/pdf/2503.10324
Copy Paste: [[2503.10324]] IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification(https://arxiv.org/abs/2503.10324)
Keywords: robust, large language model
Abstract: Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary information from various modalities. However, existing methods focus on fusing heterogeneous visual features, neglecting the potential benefits of text-based semantic information. To address this issue, we first construct three text-enhanced multi-modal object ReID benchmarks. To be specific, we propose a standardized multi-modal caption generation pipeline for structured and concise text annotations with Multi-modal Large Language Models (MLLMs). Besides, current methods often directly aggregate multi-modal information without selecting representative local features, leading to redundancy and high complexity. To address the above issues, we introduce IDEA, a novel feature learning framework comprising the Inverted Multi-modal Feature Extractor (IMFE) and Cooperative Deformable Aggregation (CDA). The IMFE utilizes Modal Prefixes and an InverseNet to integrate multi-modal information with semantic guidance from inverted text. The CDA adaptively generates sampling positions, enabling the model to focus on the interplay between global features and discriminative local features. With the constructed benchmarks and the proposed modules, our framework can generate more robust multi-modal features under complex scenarios. Extensive experiments on three multi-modal object ReID benchmarks demonstrate the effectiveness of our proposed method.

Title: OSMa-Bench: Evaluating Open Semantic Mapping Under Varying Lighting Conditions

Authors: Maxim Popov, Regina Kurkova, Mikhail Iumanov, Jaafar Mahmoud, Sergey Kolyubin
Subjects: cs.CV, cs.AI, cs.CL, cs.RO
Abstract URL: https://arxiv.org/abs/2503.10331
Pdf URL: https://arxiv.org/pdf/2503.10331
Copy Paste: [[2503.10331]] OSMa-Bench: Evaluating Open Semantic Mapping Under Varying Lighting Conditions(https://arxiv.org/abs/2503.10331)
Keywords: robust, segmentation
Abstract: Open Semantic Mapping (OSM) is a key technology in robotic perception, combining semantic segmentation and SLAM techniques. This paper introduces a dynamically configurable and highly automated LLM/LVLM-powered pipeline for evaluating OSM solutions called OSMa-Bench (Open Semantic Mapping Benchmark). The study focuses on evaluating state-of-the-art semantic mapping algorithms under varying indoor lighting conditions, a critical challenge in indoor environments. We introduce a novel dataset with simulated RGB-D sequences and ground truth 3D reconstructions, facilitating the rigorous analysis of mapping performance across different lighting conditions. Through experiments on leading models such as ConceptGraphs, BBQ and OpenScene, we evaluate the semantic fidelity of object recognition and segmentation. Additionally, we introduce a Scene Graph evaluation method to analyze the ability of models to interpret semantic structure. The results provide insights into the robustness of these models, forming future research directions for developing resilient and adaptable robotic systems. Our code is available at this https URL.

Title: Generative Binary Memory: Pseudo-Replay Class-Incremental Learning on Binarized Embeddings

Authors: Yanis Basso-Bert, Anca Molnos, Romain Lemaire, William Guicquero, Antoine Dupret
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.10333
Pdf URL: https://arxiv.org/pdf/2503.10333
Copy Paste: [[2503.10333]] Generative Binary Memory: Pseudo-Replay Class-Incremental Learning on Binarized Embeddings(https://arxiv.org/abs/2503.10333)
Keywords: generative
Abstract: In dynamic environments where new concepts continuously emerge, Deep Neural Networks (DNNs) must adapt by learning new classes while retaining previously acquired ones. This challenge is addressed by Class-Incremental Learning (CIL). This paper introduces Generative Binary Memory (GBM), a novel CIL pseudo-replay approach which generates synthetic binary pseudo-exemplars. Relying on Bernoulli Mixture Models (BMMs), GBM effectively models the multi-modal characteristics of class distributions, in a latent, binary space. With a specifically-designed feature binarizer, our approach applies to any conventional DNN. GBM also natively supports Binary Neural Networks (BNNs) for highly-constrained model sizes in embedded systems. The experimental results demonstrate that GBM achieves higher than state-of-the-art average accuracy on CIFAR100 (+2.9%) and TinyImageNet (+1.5%) for a ResNet-18 equipped with our binarizer. GBM also outperforms emerging CIL methods for BNNs, with +3.1% in final accuracy and x4.7 memory reduction, on CORE50.

Title: KV-Distill: Nearly Lossless Learnable Context Compression for LLMs

Authors: Vivek Chari, Guanghui Qin, Benjamin Van Durme
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10337
Pdf URL: https://arxiv.org/pdf/2503.10337
Copy Paste: [[2503.10337]] KV-Distill: Nearly Lossless Learnable Context Compression for LLMs(https://arxiv.org/abs/2503.10337)
Keywords: transformer
Abstract: Sequence-to-sequence tasks often benefit from long contexts, but the quadratic complexity of self-attention in standard Transformers renders this non-trivial. During generation, temporary representations -stored in the so-called KV cache-account for a large portion of GPU memory usage and scale linearly with context length. We introduce KV-Distill, a Transformer compression framework that distills long context KV caches into significantly shorter representations in a question-independent fashion. KV-Distill can be trained as a parameter-efficient adaptor for pretrained models, and enables the compression of arbitrary spans of a context while preserving pre-trained model capabilities. We treat a compressed-uncompressed cache as a student-teacher pairing and apply a KL-type divergence to match the generated outputs. KV-Distill outperforms other compression techniques in worst-case extractive tasks and approaches uncompressed performance in long context question answering and summarization, and it can be fine-tuned on domain-specific contexts to reduce lengths by up to 99% while preserving downstream performance. We demonstrate the generalizability of KV-Distill across various model sizes and architectures.

Title: DreamInsert: Zero-Shot Image-to-Video Object Insertion from A Single Image

Authors: Qi Zhao, Zhan Ma, Pan Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10342
Pdf URL: https://arxiv.org/pdf/2503.10342
Copy Paste: [[2503.10342]] DreamInsert: Zero-Shot Image-to-Video Object Insertion from A Single Image(https://arxiv.org/abs/2503.10342)
Keywords: diffusion, generative
Abstract: Recent developments in generative diffusion models have turned many dreams into realities. For video object insertion, existing methods typically require additional information, such as a reference video or a 3D asset of the object, to generate the synthetic motion. However, inserting an object from a single reference photo into a target background video remains an uncharted area due to the lack of unseen motion information. We propose DreamInsert, which achieves Image-to-Video Object Insertion in a training-free manner for the first time. By incorporating the trajectory of the object into consideration, DreamInsert can predict the unseen object movement, fuse it harmoniously with the background video, and generate the desired video seamlessly. More significantly, DreamInsert is both simple and effective, achieving zero-shot insertion without end-to-end training or additional fine-tuning on well-designed image-video data pairs. We demonstrated the effectiveness of DreamInsert through a variety of experiments. Leveraging this capability, we present the first results for Image-to-Video object insertion in a training-free manner, paving exciting new directions for future content creation and synthesis. The code will be released soon.

Title: Enhancing Facial Privacy Protection via Weakening Diffusion Purification

Authors: Ali Salar, Qing Liu, Yingli Tian, Guoying Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10350
Pdf URL: https://arxiv.org/pdf/2503.10350
Copy Paste: [[2503.10350]] Enhancing Facial Privacy Protection via Weakening Diffusion Purification(https://arxiv.org/abs/2503.10350)
Keywords: privacy, protect, diffusion
Abstract: The rapid growth of social media has led to the widespread sharing of individual portrait images, which pose serious privacy risks due to the capabilities of automatic face recognition (AFR) systems for mass surveillance. Hence, protecting facial privacy against unauthorized AFR systems is essential. Inspired by the generation capability of the emerging diffusion models, recent methods employ diffusion models to generate adversarial face images for privacy protection. However, they suffer from the diffusion purification effect, leading to a low protection success rate (PSR). In this paper, we first propose learning unconditional embeddings to increase the learning capacity for adversarial modifications and then use them to guide the modification of the adversarial latent code to weaken the diffusion purification effect. Moreover, we integrate an identity-preserving structure to maintain structural consistency between the original and generated images, allowing human observers to recognize the generated image as having the same identity as the original. Extensive experiments conducted on two public datasets, i.e., CelebA-HQ and LADN, demonstrate the superiority of our approach. The protected faces generated by our method outperform those produced by existing facial privacy protection approaches in terms of transferability and natural appearance.

Title: New Trends for Modern Machine Translation with Large Reasoning Models

Authors: Sinuo Liu, Chenyang Lyu, Minghao Wu, Longyue Wang, Weihua Luo, Kaifu Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.10351
Pdf URL: https://arxiv.org/pdf/2503.10351
Copy Paste: [[2503.10351]] New Trends for Modern Machine Translation with Large Reasoning Models(https://arxiv.org/abs/2503.10351)
Keywords: robust
Abstract: Recent advances in Large Reasoning Models (LRMs), particularly those leveraging Chain-of-Thought reasoning (CoT), have opened brand new possibility for Machine Translation (MT). This position paper argues that LRMs substantially transformed traditional neural MT as well as LLMs-based MT paradigms by reframing translation as a dynamic reasoning task that requires contextual, cultural, and linguistic understanding and reasoning. We identify three foundational shifts: 1) contextual coherence, where LRMs resolve ambiguities and preserve discourse structure through explicit reasoning over cross-sentence and complex context or even lack of context; 2) cultural intentionality, enabling models to adapt outputs by inferring speaker intent, audience expectations, and socio-linguistic norms; 3) self-reflection, LRMs can perform self-reflection during the inference time to correct the potential errors in translation especially extremely noisy cases, showing better robustness compared to simply mapping X->Y translation. We explore various scenarios in translation including stylized translation, document-level translation and multimodal translation by showcasing empirical examples that demonstrate the superiority of LRMs in translation. We also identify several interesting phenomenons for LRMs for MT including auto-pivot translation as well as the critical challenges such as over-localisation in translation and inference efficiency. In conclusion, we think that LRMs redefine translation systems not merely as text converters but as multilingual cognitive agents capable of reasoning about meaning beyond the text. This paradigm shift reminds us to think of problems in translation beyond traditional translation scenarios in a much broader context with LRMs - what we can achieve on top of it.

Title: A Hybrid Architecture with Efficient Fine Tuning for Abstractive Patent Document Summarization

Authors: Nevidu Jayatilleke, Ruvan Weerasinghe
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.10354
Pdf URL: https://arxiv.org/pdf/2503.10354
Copy Paste: [[2503.10354]] A Hybrid Architecture with Efficient Fine Tuning for Abstractive Patent Document Summarization(https://arxiv.org/abs/2503.10354)
Keywords: transformer
Abstract: Automatic patent summarization approaches that help in the patent analysis and comprehension procedure are in high demand due to the colossal growth of innovations. The development of natural language processing (NLP), text mining, and deep learning has notably amplified the efficacy of text summarization models for abundant types of documents. Summarizing patent text remains a pertinent challenge due to the labyrinthine writing style of these documents, which includes technical and legal intricacies. Additionally, these patent document contents are considerably lengthier than archetypal documents, which intricates the process of extracting pertinent information for summarization. Embodying extractive and abstractive text summarization methodologies into a hybrid framework, this study proposes a system for efficiently creating abstractive summaries of patent records. The procedure involves leveraging the LexRank graph-based algorithm to retrieve the important sentences from input parent texts, then utilizing a Bidirectional Auto-Regressive Transformer (BART) model that has been fine-tuned using Low-Ranking Adaptation (LoRA) for producing text summaries. This is accompanied by methodical testing and evaluation strategies. Furthermore, the author employed certain meta-learning techniques to achieve Domain Generalization (DG) of the abstractive component across multiple patent fields.

Title: ConceptGuard: Continual Personalized Text-to-Image Generation with Forgetting and Confusion Mitigation

Authors: Zirun Guo, Tao Jin
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10358
Pdf URL: https://arxiv.org/pdf/2503.10358
Copy Paste: [[2503.10358]] ConceptGuard: Continual Personalized Text-to-Image Generation with Forgetting and Confusion Mitigation(https://arxiv.org/abs/2503.10358)
Keywords: diffusion
Abstract: Diffusion customization methods have achieved impressive results with only a minimal number of user-provided images. However, existing approaches customize concepts collectively, whereas real-world applications often require sequential concept integration. This sequential nature can lead to catastrophic forgetting, where previously learned concepts are lost. In this paper, we investigate concept forgetting and concept confusion in the continual customization. To tackle these challenges, we present ConceptGuard, a comprehensive approach that combines shift embedding, concept-binding prompts and memory preservation regularization, supplemented by a priority queue which can adaptively update the importance and occurrence order of different concepts. These strategies can dynamically update, unbind and learn the relationship of the previous concepts, thus alleviating concept forgetting and confusion. Through comprehensive experiments, we show that our approach outperforms all the baseline methods consistently and significantly in both quantitative and qualitative analyses.

Title: Piece it Together: Part-Based Concepting with IP-Priors

Authors: Elad Richardson, Kfir Goldberg, Yuval Alaluf, Daniel Cohen-Or
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10365
Pdf URL: https://arxiv.org/pdf/2503.10365
Copy Paste: [[2503.10365]] Piece it Together: Part-Based Concepting with IP-Priors(https://arxiv.org/abs/2503.10365)
Keywords: generative
Abstract: Advanced generative models excel at synthesizing images but often rely on text-based conditioning. Visual designers, however, often work beyond language, directly drawing inspiration from existing visual elements. In many cases, these elements represent only fragments of a potential concept-such as an uniquely structured wing, or a specific hairstyle-serving as inspiration for the artist to explore how they can come together creatively into a coherent whole. Recognizing this need, we introduce a generative framework that seamlessly integrates a partial set of user-provided visual components into a coherent composition while simultaneously sampling the missing parts needed to generate a plausible and complete concept. Our approach builds on a strong and underexplored representation space, extracted from IP-Adapter+, on which we train IP-Prior, a lightweight flow-matching model that synthesizes coherent compositions based on domain-specific priors, enabling diverse and context-aware generations. Additionally, we present a LoRA-based fine-tuning strategy that significantly improves prompt adherence in IP-Adapter+ for a given task, addressing its common trade-off between reconstruction quality and prompt adherence.

Title: G-Boost: Boosting Private SLMs with General LLMs

Authors: Yijiang Fan, Yuren Mao, Longbin Lai, Ying Zhang, Zhengping Qian, Yunjun Gao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10367
Pdf URL: https://arxiv.org/pdf/2503.10367
Copy Paste: [[2503.10367]] G-Boost: Boosting Private SLMs with General LLMs(https://arxiv.org/abs/2503.10367)
Keywords: large language model
Abstract: Due to the limited computational resources, most Large Language Models (LLMs) developers can only fine-tune Small Language Models (SLMs) on their own data. These private SLMs typically have limited effectiveness. To boost the performance of private SLMs, this paper proposes to ask general LLMs for help. The general LLMs can be APIs or larger LLMs whose inference cost the developers can afford. Specifically, we propose the G-Boost framework where a private SLM adaptively performs collaborative inference with a general LLM under the guide of process reward. Experiments demonstrate that our framework can significantly boost the performance of private SLMs.

Title: Probabilistic Forecasting via Autoregressive Flow Matching

Authors: Ahmed El-Gazzar, Marcel van Gerven
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.10375
Pdf URL: https://arxiv.org/pdf/2503.10375
Copy Paste: [[2503.10375]] Probabilistic Forecasting via Autoregressive Flow Matching(https://arxiv.org/abs/2503.10375)
Keywords: generative
Abstract: In this work, we propose FlowTime, a generative model for probabilistic forecasting of multivariate timeseries data. Given historical measurements and optional future covariates, we formulate forecasting as sampling from a learned conditional distribution over future trajectories. Specifically, we decompose the joint distribution of future observations into a sequence of conditional densities, each modeled via a shared flow that transforms a simple base distribution into the next observation distribution, conditioned on observed covariates. To achieve this, we leverage the flow matching (FM) framework, enabling scalable and simulation-free learning of these transformations. By combining this factorization with the FM objective, FlowTime retains the benefits of autoregressive models -- including strong extrapolation performance, compact model size, and well-calibrated uncertainty estimates -- while also capturing complex multi-modal conditional distributions, as seen in modern transport-based generative models. We demonstrate the effectiveness of FlowTime on multiple dynamical systems and real-world forecasting tasks.

Title: CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance

Authors: Yufan Deng, Xun Guo, Yizhi Wang, Jacob Zhiyuan Fang, Angtian Wang, Shenghai Yuan, Yiding Yang, Bo Liu, Haibin Huang, Chongyang Ma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10391
Pdf URL: https://arxiv.org/pdf/2503.10391
Copy Paste: [[2503.10391]] CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance(https://arxiv.org/abs/2503.10391)
Keywords: diffusion, generative, large language model
Abstract: Video generation has witnessed remarkable progress with the advent of deep generative models, particularly diffusion models. While existing methods excel in generating high-quality videos from text prompts or single images, personalized multi-subject video generation remains a largely unexplored challenge. This task involves synthesizing videos that incorporate multiple distinct subjects, each defined by separate reference images, while ensuring temporal and spatial consistency. Current approaches primarily rely on mapping subject images to keywords in text prompts, which introduces ambiguity and limits their ability to model subject relationships effectively. In this paper, we propose CINEMA, a novel framework for coherent multi-subject video generation by leveraging Multimodal Large Language Model (MLLM). Our approach eliminates the need for explicit correspondences between subject images and text entities, mitigating ambiguity and reducing annotation effort. By leveraging MLLM to interpret subject relationships, our method facilitates scalability, enabling the use of large and diverse datasets for training. Furthermore, our framework can be conditioned on varying numbers of subjects, offering greater flexibility in personalized content creation. Through extensive evaluations, we demonstrate that our approach significantly improves subject consistency, and overall video coherence, paving the way for advanced applications in storytelling, interactive media, and personalized video generation.

Title: RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing

Authors: Fengxiang Wang, Hongzhen Wang, Yulin Wang, Di Wang, Mingshuo Chen, Haiyan Zhao, Yangang Sun, Shuo Wang, Long Lan, Wenjing Yang, Jing Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10392
Pdf URL: https://arxiv.org/pdf/2503.10392
Copy Paste: [[2503.10392]] RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing(https://arxiv.org/abs/2503.10392)
Keywords: transformer, segmentation
Abstract: Recent advances in self-supervised learning for Vision Transformers (ViTs) have fueled breakthroughs in remote sensing (RS) foundation models. However, the quadratic complexity of self-attention poses a significant barrier to scalability, particularly for large models and high-resolution images. While the linear-complexity Mamba architecture offers a promising alternative, existing RS applications of Mamba remain limited to supervised tasks on small, domain-specific datasets. To address these challenges, we propose RoMA, a framework that enables scalable self-supervised pretraining of Mamba-based RS foundation models using large-scale, diverse, unlabeled data. RoMA enhances scalability for high-resolution images through a tailored auto-regressive learning strategy, incorporating two key innovations: 1) a rotation-aware pretraining mechanism combining adaptive cropping with angular embeddings to handle sparsely distributed objects with arbitrary orientations, and 2) multi-scale token prediction objectives that address the extreme variations in object scales inherent to RS imagery. Systematic empirical studies validate that Mamba adheres to RS data and parameter scaling laws, with performance scaling reliably as model and data size increase. Furthermore, experiments across scene classification, object detection, and semantic segmentation tasks demonstrate that RoMA-pretrained Mamba models consistently outperform ViT-based counterparts in both accuracy and computational efficiency. The source code and pretrained models will be released at this https URL.

Title: Hyper3D: Efficient 3D Representation via Hybrid Triplane and Octree Feature for Enhanced 3D Shape Variational Auto-Encoders

Authors: Jingyu Guo, Sensen Gao, Jia-Wang Bian, Wanhu Sun, Heliang Zheng, Rongfei Jia, Mingming Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10403
Pdf URL: https://arxiv.org/pdf/2503.10403
Copy Paste: [[2503.10403]] Hyper3D: Efficient 3D Representation via Hybrid Triplane and Octree Feature for Enhanced 3D Shape Variational Auto-Encoders(https://arxiv.org/abs/2503.10403)
Keywords: diffusion
Abstract: Recent 3D content generation pipelines often leverage Variational Autoencoders (VAEs) to encode shapes into compact latent representations, facilitating diffusion-based generation. Efficiently compressing 3D shapes while preserving intricate geometric details remains a key challenge. Existing 3D shape VAEs often employ uniform point sampling and 1D/2D latent representations, such as vector sets or triplanes, leading to significant geometric detail loss due to inadequate surface coverage and the absence of explicit 3D representations in the latent space. Although recent work explores 3D latent representations, their large scale hinders high-resolution encoding and efficient training. Given these challenges, we introduce Hyper3D, which enhances VAE reconstruction through efficient 3D representation that integrates hybrid triplane and octree features. First, we adopt an octree-based feature representation to embed mesh information into the network, mitigating the limitations of uniform point sampling in capturing geometric distributions along the mesh surface. Furthermore, we propose a hybrid latent space representation that integrates a high-resolution triplane with a low-resolution 3D grid. This design not only compensates for the lack of explicit 3D representations but also leverages a triplane to preserve high-resolution details. Experimental results demonstrate that Hyper3D outperforms traditional representations by reconstructing 3D shapes with higher fidelity and finer details, making it well-suited for 3D generation pipelines.

Title: RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models

Authors: Yijing Lin, Mengqi Huang, Shuhan Zhuang, Zhendong Mao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10406
Pdf URL: https://arxiv.org/pdf/2503.10406
Copy Paste: [[2503.10406]] RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models(https://arxiv.org/abs/2503.10406)
Keywords: large language model
Abstract: Unifying diverse image generation tasks within a single framework remains a fundamental challenge in visual generation. While large language models (LLMs) achieve unification through task-agnostic data and generation, existing visual generation models fail to meet these principles. Current approaches either rely on per-task datasets and large-scale training or adapt pre-trained image models with task-specific modifications, limiting their generalizability. In this work, we explore video models as a foundation for unified image generation, leveraging their inherent ability to model temporal correlations. We introduce RealGeneral, a novel framework that reformulates image generation as a conditional frame prediction task, analogous to in-context learning in LLMs. To bridge the gap between video models and condition-image pairs, we propose (1) a Unified Conditional Embedding module for multi-modal alignment and (2) a Unified Stream DiT Block with decoupled adaptive LayerNorm and attention mask to mitigate cross-modal interference. RealGeneral demonstrates effectiveness in multiple important visual generation tasks, e.g., it achieves a 14.5% improvement in subject similarity for customized generation and a 10% enhancement in image quality for canny-to-image task. Project page: this https URL

Title: Understanding the Logical Capabilities of Large Language Models via Out-of-Context Representation Learning

Authors: Jonathan Shaki, Emanuele La Malfa, Michael Wooldridge, Sarit Kraus
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.10408
Pdf URL: https://arxiv.org/pdf/2503.10408
Copy Paste: [[2503.10408]] Understanding the Logical Capabilities of Large Language Models via Out-of-Context Representation Learning(https://arxiv.org/abs/2503.10408)
Keywords: large language model
Abstract: We study the capabilities of Large Language Models (LLM) on binary relations, a ubiquitous concept in math employed in most reasoning, math and logic benchmarks. This work focuses on equality, inequality, and inclusion, along with the properties they satisfy, such as ir/reflexivity, a/symmetry, transitivity, and logical complexity (e.g., number of reasoning ``hops''). We propose an alternative to in-context learning that trains only the representations of newly introduced tokens, namely out-of-context representation learning. This method mitigates linguistic biases already present in a model and, differently from in-context learning, does not rely on external information or illustrations. We argue out-of-context representation learning as a better alternative to in-context learning and fine-tuning to evaluate the capabilities of LLMs on logic tasks that are the building blocks of more complex reasoning benchmarks.

Title: Public Channel-Based Fair Exchange Protocols with Advertising

Authors: Pierpaolo Della Monica, Ivan Visconti, Andrea Vitaletti, Marco Zecchini
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.10411
Pdf URL: https://arxiv.org/pdf/2503.10411
Copy Paste: [[2503.10411]] Public Channel-Based Fair Exchange Protocols with Advertising(https://arxiv.org/abs/2503.10411)
Keywords: fair
Abstract: Before a fair exchange takes place, there is typically an advertisement phase with the goal of increasing the appeal of possessing a digital asset while keeping it sufficiently hidden. In this work, we give a definition that explicitly combines a fair-exchange protocol with a prior advertising phase. Then, we construct such a fair exchange protocol with aids using zk-SNARKs and relying on mainstream decentralized platforms (i.e., a blockchain with smart contracts like Ethereum and a decentralized storage system like IPFS). Experimental results confirm the practical relevance of our decentralized approach, paving the road towards building decentralized marketplaces where users can, even anonymously, and without direct off-chain communications, effectively advertise and exchange their digital assets as part of a system of enhanced NFTs.

Title: dFLMoE: Decentralized Federated Learning via Mixture of Experts for Medical Data Analysis

Authors: Luyuan Xie, Tianyu Luan, Wenyuan Cai, Guochen Yan, Zhaoyu Chen, Nan Xi, Yuejian Fang, Qingni Shen, Zhonghai Wu, Junsong Yuan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10412
Pdf URL: https://arxiv.org/pdf/2503.10412
Copy Paste: [[2503.10412]] dFLMoE: Decentralized Federated Learning via Mixture of Experts for Medical Data Analysis(https://arxiv.org/abs/2503.10412)
Keywords: privacy, protect, robust, federate
Abstract: Federated learning has wide applications in the medical field. It enables knowledge sharing among different healthcare institutes while protecting patients' privacy. However, existing federated learning systems are typically centralized, requiring clients to upload client-specific knowledge to a central server for aggregation. This centralized approach would integrate the knowledge from each client into a centralized server, and the knowledge would be already undermined during the centralized integration before it reaches back to each client. Besides, the centralized approach also creates a dependency on the central server, which may affect training stability if the server malfunctions or connections are unstable. To address these issues, we propose a decentralized federated learning framework named dFLMoE. In our framework, clients directly exchange lightweight head models with each other. After exchanging, each client treats both local and received head models as individual experts, and utilizes a client-specific Mixture of Experts (MoE) approach to make collective decisions. This design not only reduces the knowledge damage with client-specific aggregations but also removes the dependency on the central server to enhance the robustness of the framework. We validate our framework on multiple medical tasks, demonstrating that our method evidently outperforms state-of-the-art approaches under both model homogeneity and heterogeneity settings.

Title: Category Prompt Mamba Network for Nuclei Segmentation and Classification

Authors: Ye Zhang, Zijie Fang, Yifeng Wang, Lingbo Zhang, Xianchao Guan, Yongbing Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10422
Pdf URL: https://arxiv.org/pdf/2503.10422
Copy Paste: [[2503.10422]] Category Prompt Mamba Network for Nuclei Segmentation and Classification(https://arxiv.org/abs/2503.10422)
Keywords: segmentation
Abstract: Nuclei segmentation and classification provide an essential basis for tumor immune microenvironment analysis. The previous nuclei segmentation and classification models require splitting large images into smaller patches for training, leading to two significant issues. First, nuclei at the borders of adjacent patches often misalign during inference. Second, this patch-based approach significantly increases the model's training and inference time. Recently, Mamba has garnered attention for its ability to model large-scale images with linear time complexity and low memory consumption. It offers a promising solution for training nuclei segmentation and classification models on full-sized images. However, the Mamba orientation-based scanning method lacks account for category-specific features, resulting in sub-optimal performance in scenarios with imbalanced class distributions. To address these challenges, this paper introduces a novel scanning strategy based on category probability sorting, which independently ranks and scans features for each category according to confidence from high to low. This approach enhances the feature representation of uncertain samples and mitigates the issues caused by imbalanced distributions. Extensive experiments conducted on four public datasets demonstrate that our method outperforms state-of-the-art approaches, delivering superior performance in nuclei segmentation and classification tasks.

Title: Improving Medical Waste Classification with Hybrid Capsule Networks

Authors: Bennet van den Broek, Javad Pourmostafa Roshan Sharami
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10426
Pdf URL: https://arxiv.org/pdf/2503.10426
Copy Paste: [[2503.10426]] Improving Medical Waste Classification with Hybrid Capsule Networks(https://arxiv.org/abs/2503.10426)
Keywords: robust
Abstract: The improper disposal and mismanagement of medical waste pose severe environmental and public health risks, contributing to greenhouse gas emissions and the spread of infectious diseases. Efficient and accurate medical waste classification is crucial for mitigating these risks. We explore the integration of capsule networks with a pretrained DenseNet model to improve medical waste classification. To the best of our knowledge, capsule networks have not yet been applied to this task, making this study the first to assess their effectiveness. A diverse dataset of medical waste images collected from multiple public sources, is used to evaluate three model configurations: (1) a pretrained DenseNet model as a baseline, (2) a pretrained DenseNet with frozen layers combined with a capsule network, and (3) a pretrained DenseNet with unfrozen layers combined with a capsule network. Experimental results demonstrate that incorporating capsule networks improves classification performance, with F1 scores increasing from 0.89 (baseline) to 0.92 (hybrid model with unfrozen layers). This highlights the potential of capsule networks to address the spatial limitations of traditional convolutional models and improve classification robustness. While the capsule-enhanced model demonstrated improved classification performance, direct comparisons with prior studies were challenging due to differences in dataset size and diversity. Previous studies relied on smaller, domain-specific datasets, which inherently yielded higher accuracy. In contrast, our study employs a significantly larger and more diverse dataset, leading to better generalization but introducing additional classification challenges. This highlights the trade-off between dataset complexity and model performance.

Title: BeamLLM: Vision-Empowered mmWave Beam Prediction with Large Language Models

Authors: Can Zheng, Jiguang He, Guofa Cai, Zitong Yu, Chung G. Kang
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.10432
Pdf URL: https://arxiv.org/pdf/2503.10432
Copy Paste: [[2503.10432]] BeamLLM: Vision-Empowered mmWave Beam Prediction with Large Language Models(https://arxiv.org/abs/2503.10432)
Keywords: large language model
Abstract: In this paper, we propose BeamLLM, a vision-aided millimeter-wave (mmWave) beam prediction framework leveraging large language models (LLMs) to address the challenges of high training overhead and latency in mmWave communication systems. By combining computer vision (CV) with LLMs' cross-modal reasoning capabilities, the framework extracts user equipment (UE) positional features from RGB images and aligns visual-temporal features with LLMs' semantic space through reprogramming techniques. Evaluated on a realistic vehicle-to-infrastructure (V2I) scenario, the proposed method achieves 61.01% top-1 accuracy and 97.39% top-3 accuracy in standard prediction tasks, significantly outperforming traditional deep learning models. In few-shot prediction scenarios, the performance degradation is limited to 12.56% (top-1) and 5.55% (top-3) from time sample 1 to 10, demonstrating superior prediction capability.

Title: 4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models

Authors: Wanhua Li, Renping Zhou, Jiawei Zhou, Yingwei Song, Johannes Herter, Minghan Qin, Gao Huang, Hanspeter Pfister
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10437
Pdf URL: https://arxiv.org/pdf/2503.10437
Copy Paste: [[2503.10437]] 4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models(https://arxiv.org/abs/2503.10437)
Keywords: large language model
Abstract: Learning 4D language fields to enable time-sensitive, open-ended language queries in dynamic scenes is essential for many real-world applications. While LangSplat successfully grounds CLIP features into 3D Gaussian representations, achieving precision and efficiency in 3D static scenes, it lacks the ability to handle dynamic 4D fields as CLIP, designed for static image-text tasks, cannot capture temporal dynamics in videos. Real-world environments are inherently dynamic, with object semantics evolving over time. Building a precise 4D language field necessitates obtaining pixel-aligned, object-wise video features, which current vision models struggle to achieve. To address these challenges, we propose 4D LangSplat, which learns 4D language fields to handle time-agnostic or time-sensitive open-vocabulary queries in dynamic scenes efficiently. 4D LangSplat bypasses learning the language field from vision features and instead learns directly from text generated from object-wise video captions via Multimodal Large Language Models (MLLMs). Specifically, we propose a multimodal object-wise video prompting method, consisting of visual and text prompts that guide MLLMs to generate detailed, temporally consistent, high-quality captions for objects throughout a video. These captions are encoded using a Large Language Model into high-quality sentence embeddings, which then serve as pixel-aligned, object-specific feature supervision, facilitating open-vocabulary text queries through shared embedding spaces. Recognizing that objects in 4D scenes exhibit smooth transitions across states, we further propose a status deformable network to model these continuous changes over time effectively. Our results across multiple benchmarks demonstrate that 4D LangSplat attains precise and efficient results for both time-sensitive and time-agnostic open-vocabulary queries.

Title: DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation

Authors: Wenhao Hu, Jinhao Duan, Chunchen Wei, Li Zhang, Yue Zhang, Kaidi Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10452
Pdf URL: https://arxiv.org/pdf/2503.10452
Copy Paste: [[2503.10452]] DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation(https://arxiv.org/abs/2503.10452)
Keywords: large language model
Abstract: The rapid advancement of large language models (LLMs) has significantly improved their performance in code generation tasks. However, existing code benchmarks remain static, consisting of fixed datasets with predefined problems. This makes them vulnerable to memorization during training, where LLMs recall specific test cases instead of generalizing to new problems, leading to data contamination and unreliable evaluation results. To address these issues, we introduce DynaCode, a dynamic, complexity-aware benchmark that overcomes the limitations of static datasets. DynaCode evaluates LLMs systematically using a complexity-aware metric, incorporating both code complexity and call-graph structures. DynaCode achieves large-scale diversity, generating up to 189 million unique nested code problems across four distinct levels of code complexity, referred to as units, and 16 types of call graphs. Results on 12 latest LLMs show an average performance drop of 16.8% to 45.7% compared to MBPP+, a static code generation benchmark, with performance progressively decreasing as complexity increases. This demonstrates DynaCode's ability to effectively differentiate LLMs. Additionally, by leveraging call graphs, we gain insights into LLM behavior, particularly their preference for handling subfunction interactions within nested code.

Title: Sentiment Analysis in SemEval: A Review of Sentiment Identification Approaches

Authors: Bousselham El Haddaoui, Raddouane Chiheb, Rdouan Faizi, Abdellatif El Afia
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.10457
Pdf URL: https://arxiv.org/pdf/2503.10457
Copy Paste: [[2503.10457]] Sentiment Analysis in SemEval: A Review of Sentiment Identification Approaches(https://arxiv.org/abs/2503.10457)
Keywords: transformer
Abstract: Social media platforms are becoming the foundations of social interactions including messaging and opinion expression. In this regard, Sentiment Analysis techniques focus on providing solutions to ensure the retrieval and analysis of generated data including sentiments, emotions, and discussed topics. International competitions such as the International Workshop on Semantic Evaluation (SemEval) have attracted many researchers and practitioners with a special research interest in building sentiment analysis systems. In our work, we study top-ranking systems for each SemEval edition during the 2013-2021 period, a total of 658 teams participated in these editions with increasing interest over years. We analyze the proposed systems marking the evolution of research trends with a focus on the main components of sentiment analysis systems including data acquisition, preprocessing, and classification. Our study shows an active use of preprocessing techniques, an evolution of features engineering and word representation from lexicon-based approaches to word embeddings, and the dominance of neural networks and transformers over the classification phase fostering the use of ready-to-use models. Moreover, we provide researchers with insights based on experimented systems which will allow rapid prototyping of new systems and help practitioners build for future SemEval editions.

Title: LLMs in Disease Diagnosis: A Comparative Study of DeepSeek-R1 and O3 Mini Across Chronic Health Conditions

Authors: Gaurav Kumar Gupta, Pranal Pande
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10486
Pdf URL: https://arxiv.org/pdf/2503.10486
Copy Paste: [[2503.10486]] LLMs in Disease Diagnosis: A Comparative Study of DeepSeek-R1 and O3 Mini Across Chronic Health Conditions(https://arxiv.org/abs/2503.10486)
Keywords: privacy, interpretability, large language model
Abstract: Large Language Models (LLMs) are revolutionizing medical diagnostics by enhancing both disease classification and clinical decision-making. In this study, we evaluate the performance of two LLM- based diagnostic tools, DeepSeek R1 and O3 Mini, using a structured dataset of symptoms and diagnoses. We assessed their predictive accuracy at both the disease and category levels, as well as the reliability of their confidence scores. DeepSeek R1 achieved a disease-level accuracy of 76% and an overall accuracy of 82%, outperforming O3 Mini, which attained 72% and 75% respectively. Notably, DeepSeek R1 demonstrated exceptional performance in Mental Health, Neurological Disorders, and Oncology, where it reached 100% accuracy, while O3 Mini excelled in Autoimmune Disease classification with 100% accuracy. Both models, however, struggled with Respiratory Disease classification, recording accuracies of only 40% for DeepSeek R1 and 20% for O3 Mini. Additionally, the analysis of confidence scores revealed that DeepSeek R1 provided high-confidence predictions in 92% of cases, compared to 68% for O3 Mini. Ethical considerations regarding bias, model interpretability, and data privacy are also discussed to ensure the responsible integration of LLMs into clinical practice. Overall, our findings offer valuable insights into the strengths and limitations of LLM-based diagnostic systems and provide a roadmap for future enhancements in AI-driven healthcare.

Title: Streaming Generation of Co-Speech Gestures via Accelerated Rolling Diffusion

Authors: Evgeniia Vu, Andrei Boiarov, Dmitry Vetrov
Subjects: cs.LG, cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2503.10488
Pdf URL: https://arxiv.org/pdf/2503.10488
Copy Paste: [[2503.10488]] Streaming Generation of Co-Speech Gestures via Accelerated Rolling Diffusion(https://arxiv.org/abs/2503.10488)
Keywords: diffusion
Abstract: Generating co-speech gestures in real time requires both temporal coherence and efficient sampling. We introduce Accelerated Rolling Diffusion, a novel framework for streaming gesture generation that extends rolling diffusion models with structured progressive noise scheduling, enabling seamless long-sequence motion synthesis while preserving realism and diversity. We further propose Rolling Diffusion Ladder Acceleration (RDLA), a new approach that restructures the noise schedule into a stepwise ladder, allowing multiple frames to be denoised simultaneously. This significantly improves sampling efficiency while maintaining motion consistency, achieving up to a 2x speedup with high visual fidelity and temporal coherence. We evaluate our approach on ZEGGS and BEAT, strong benchmarks for real-world applicability. Our framework is universally applicable to any diffusion-based gesture generation model, transforming it into a streaming approach. Applied to three state-of-the-art methods, it consistently outperforms them, demonstrating its effectiveness as a generalizable and efficient solution for real-time, high-fidelity co-speech gesture synthesis.

Title: Source-primed Multi-turn Conversation Helps Large Language Models Translate Documents

Authors: Hanxu Hu, Jannis Vamvas, Rico Sennrich
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.10494
Pdf URL: https://arxiv.org/pdf/2503.10494
Copy Paste: [[2503.10494]] Source-primed Multi-turn Conversation Helps Large Language Models Translate Documents(https://arxiv.org/abs/2503.10494)
Keywords: large language model
Abstract: LLMs have paved the way for truly simple document-level machine translation, but challenges such as omission errors remain. In this paper, we study a simple method for handling document-level machine translation, by leveraging previous contexts in a multi-turn conversational manner. Specifically, by decomposing documents into segments and iteratively translating them while maintaining previous turns, this method ensures coherent translations without additional training, and can fully re-use the KV cache of previous turns thus minimizing computational overhead. We further propose a `source-primed' method that first provides the whole source document before multi-turn translation. We empirically show this multi-turn method outperforms both translating entire documents in a single turn and translating each segment independently according to multiple automatic metrics in representative LLMs, establishing a strong baseline for document-level translation using LLMs.

Title: MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation

Authors: Weihao Xuan, Rui Yang, Heli Qi, Qingcheng Zeng, Yunze Xiao, Yun Xing, Junjue Wang, Huitao Li, Xin Li, Kunyu Yu, Nan Liu, Qingyu Chen, Douglas Teodoro, Edison Marrese-Taylor, Shijian Lu, Yusuke Iwasawa, Yutaka Matsuo, Irene Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.10497
Pdf URL: https://arxiv.org/pdf/2503.10497
Copy Paste: [[2503.10497]] MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation(https://arxiv.org/abs/2503.10497)
Keywords: large language model
Abstract: Traditional benchmarks struggle to evaluate increasingly sophisticated language models in multilingual and culturally diverse contexts. To address this gap, we introduce MMLU-ProX, a comprehensive multilingual benchmark covering 13 typologically diverse languages with approximately 11,829 questions per language. Building on the challenging reasoning-focused design of MMLU-Pro, our framework employs a semi-automatic translation process: translations generated by state-of-the-art large language models (LLMs) are rigorously evaluated by expert annotators to ensure conceptual accuracy, terminological consistency, and cultural relevance. We comprehensively evaluate 25 state-of-the-art LLMs using 5-shot chain-of-thought (CoT) and zero-shot prompting strategies, analyzing their performance across linguistic and cultural boundaries. Our experiments reveal consistent performance degradation from high-resource languages to lower-resource ones, with the best models achieving over 70% accuracy on English but dropping to around 40% for languages like Swahili, highlighting persistent gaps in multilingual capabilities despite recent advances. MMLU-ProX is an ongoing project; we are expanding our benchmark by incorporating additional languages and evaluating more language models to provide a more comprehensive assessment of multilingual capabilities.

Title: OmniSTVG: Toward Spatio-Temporal Omni-Object Video Grounding

Authors: Jiali Yao, Xinran Deng, Xin Gu, Mengrui Dai, Bing Fan, Zhipeng Zhang, Yan Huang, Heng Fan, Libo Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10500
Pdf URL: https://arxiv.org/pdf/2503.10500
Copy Paste: [[2503.10500]] OmniSTVG: Toward Spatio-Temporal Omni-Object Video Grounding(https://arxiv.org/abs/2503.10500)
Keywords: transformer
Abstract: In this paper, we propose spatio-temporal omni-object video grounding, dubbed OmniSTVG, a new STVG task that aims at localizing spatially and temporally all targets mentioned in the textual query from videos. Compared to classic STVG locating only a single target, OmniSTVG enables localization of not only an arbitrary number of text-referred targets but also their interacting counterparts in the query from the video, making it more flexible and practical in real scenarios for comprehensive understanding. In order to facilitate exploration of OmniSTVG, we introduce BOSTVG, a large-scale benchmark dedicated to OmniSTVG. Specifically, our BOSTVG consists of 10,018 videos with 10.2M frames and covers a wide selection of 287 classes from diverse scenarios. Each sequence in BOSTVG, paired with a free-form textual query, encompasses a varying number of targets ranging from 1 to 10. To ensure high quality, each video is manually annotated with meticulous inspection and refinement. To our best knowledge, BOSTVG is to date the first and the largest benchmark for OmniSTVG. To encourage future research, we introduce a simple yet effective approach, named OmniTube, which, drawing inspiration from Transformer-based STVG methods, is specially designed for OmniSTVG and demonstrates promising results. By releasing BOSTVG, we hope to go beyond classic STVG by locating every object appearing in the query for more comprehensive understanding, opening up a new direction for STVG. Our benchmark, model, and results will be released at this https URL.

Title: TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models

Authors: Xudong Tan, Peng Ye, Chongjun Tu, Jianjian Cao, Yaoxin Yang, Lin Zhang, Dongzhan Zhou, Tao Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10501
Pdf URL: https://arxiv.org/pdf/2503.10501
Copy Paste: [[2503.10501]] TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models(https://arxiv.org/abs/2503.10501)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) are becoming increasingly popular, while the high computational cost associated with multimodal data input, particularly from visual tokens, poses a significant challenge. Existing training-based token compression methods improve inference efficiency but require costly retraining, while training-free methods struggle to maintain performance when aggressively reducing token counts. In this study, we reveal that the performance degradation of MLLM closely correlates with the accelerated loss of information in the attention output matrix. This insight introduces a novel information-preserving perspective, making it possible to maintain performance even under extreme token compression. Based on this finding, we propose TokenCarve, a training-free, plug-and-play, two-stage token compression framework. The first stage employs an Information-Preservation-Guided Selection (IPGS) strategy to prune low-information tokens, while the second stage further leverages IPGS to guide token merging, minimizing information loss. Extensive experiments on 11 datasets and 2 model variants demonstrate the effectiveness of TokenCarve. It can even reduce the number of visual tokens to 22.2% of the original count, achieving a 1.23x speedup in inference, a 64% reduction in KV cache storage, and only a 1.54% drop in accuracy. Our code is available at this https URL.

Title: Hoi2Anomaly: An Explainable Anomaly Detection Approach Guided by Human-Object Interaction

Authors: Yuhan Wang, Cheng Liu, Daou Zhang, Weichao Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10508
Pdf URL: https://arxiv.org/pdf/2503.10508
Copy Paste: [[2503.10508]] Hoi2Anomaly: An Explainable Anomaly Detection Approach Guided by Human-Object Interaction(https://arxiv.org/abs/2503.10508)
Keywords: explainability, generative
Abstract: In the domain of Image Anomaly Detection (IAD), Existing methods frequently exhibit a paucity of fine-grained, interpretable semantic information, resulting in the detection of anomalous entities or activities that are susceptible to machine illusions. This deficiency often leads to the detection of anomalous entities or actions that are susceptible to machine illusions and lack sufficient explanation. In this thesis, we propose a novel approach to anomaly detection, termed Hoi2Anomaly, which aims to achieve precise discrimination and localization of anomalies. The proposed methodology involves the construction of a multi-modal instruction tuning dataset comprising human-object interaction (HOI) pairs in anomalous scenarios. Second, we have trained an HOI extractor in threat scenarios to localize and match anomalous actions and entities. Finally, explanatory content is generated for the detected anomalous HOI by fine-tuning the visual language pretraining (VLP) framework. The experimental results demonstrate that Hoi2Anomaly surpasses existing generative approaches in terms of precision and explainability. We will release Hoi2Anomaly for the advancement of the field of anomaly detection.

Title: SySLLM: Generating Synthesized Policy Summaries for Reinforcement Learning Agents Using Large Language Models

Authors: Sahar Admoni, Omer Ben-Porat, Ofra Amir
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.10509
Pdf URL: https://arxiv.org/pdf/2503.10509
Copy Paste: [[2503.10509]] SySLLM: Generating Synthesized Policy Summaries for Reinforcement Learning Agents Using Large Language Models(https://arxiv.org/abs/2503.10509)
Keywords: large language model
Abstract: Policies generated by Reinforcement Learning (RL) algorithms can be difficult to describe to users, as they result from the interplay between complex reward structures and neural network-based representations. This combination often leads to unpredictable behaviors, making policies challenging to analyze and posing significant obstacles to fostering human trust in real-world applications. Global policy summarization methods aim to describe agent behavior through a demonstration of actions in a subset of world-states. However, users can only watch a limited number of demonstrations, restricting their understanding of policies. Moreover, those methods overly rely on user interpretation, as they do not synthesize observations into coherent patterns. In this work, we present SySLLM (Synthesized Summary using LLMs), a novel method that employs synthesis summarization, utilizing large language models' (LLMs) extensive world knowledge and ability to capture patterns, to generate textual summaries of policies. Specifically, an expert evaluation demonstrates that the proposed approach generates summaries that capture the main insights generated by experts while not resulting in significant hallucinations. Additionally, a user study shows that SySLLM summaries are preferred over demonstration-based policy summaries and match or surpass their performance in objective agent identification tasks.

Title: Conformal Prediction Sets for Deep Generative Models via Reduction to Conformal Regression

Authors: Hooman Shahrokhi, Devjeet Raj Roy, Yan Yan, Venera Arnaoudova, Janaradhan Rao Doppa
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10512
Pdf URL: https://arxiv.org/pdf/2503.10512
Copy Paste: [[2503.10512]] Conformal Prediction Sets for Deep Generative Models via Reduction to Conformal Regression(https://arxiv.org/abs/2503.10512)
Keywords: generative, large language model
Abstract: We consider the problem of generating valid and small prediction sets by sampling outputs (e.g., software code and natural language text) from a black-box deep generative model for a given input (e.g., textual prompt). The validity of a prediction set is determined by a user-defined binary admissibility function depending on the target application. For example, requiring at least one program in the set to pass all test cases in code generation application. To address this problem, we develop a simple and effective conformal inference algorithm referred to as Generative Prediction Sets (GPS). Given a set of calibration examples and black-box access to a deep generative model, GPS can generate prediction sets with provable guarantees. The key insight behind GPS is to exploit the inherent structure within the distribution over the minimum number of samples needed to obtain an admissible output to develop a simple conformal regression approach over the minimum number of samples. Experiments on multiple datasets for code and math word problems using different large language models demonstrate the efficacy of GPS over state-of-the-art methods.

Title: Probing LLMs for Multilingual Discourse Generalization Through a Unified Label Set

Authors: Florian Eichin, Yang Janet Liu, Barbara Plank, Michael A. Hedderich
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.10515
Pdf URL: https://arxiv.org/pdf/2503.10515
Copy Paste: [[2503.10515]] Probing LLMs for Multilingual Discourse Generalization Through a Unified Label Set(https://arxiv.org/abs/2503.10515)
Keywords: large language model
Abstract: Discourse understanding is essential for many NLP tasks, yet most existing work remains constrained by framework-dependent discourse representations. This work investigates whether large language models (LLMs) capture discourse knowledge that generalizes across languages and frameworks. We address this question along two dimensions: (1) developing a unified discourse relation label set to facilitate cross-lingual and cross-framework discourse analysis, and (2) probing LLMs to assess whether they encode generalizable discourse abstractions. Using multilingual discourse relation classification as a testbed, we examine a comprehensive set of 23 LLMs of varying sizes and multilingual capabilities. Our results show that LLMs, especially those with multilingual training corpora, can generalize discourse information across languages and frameworks. Further layer-wise analyses reveal that language generalization at the discourse level is most salient in the intermediate layers. Lastly, our error analysis provides an account of challenging relation classes.

Title: CountPath: Automating Fragment Counting in Digital Pathology

Authors: Ana Beatriz Vieira, Maria Valente, Diana Montezuma, Tomé Albuquerque, Liliana Ribeiro, Domingos Oliveira, João Monteiro, Sofia Gonçalves, Isabel M. Pinto, Jaime S. Cardoso, Arlindo L. Oliveira
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10520
Pdf URL: https://arxiv.org/pdf/2503.10520
Copy Paste: [[2503.10520]] CountPath: Automating Fragment Counting in Digital Pathology(https://arxiv.org/abs/2503.10520)
Keywords: transformer
Abstract: Quality control of medical images is a critical component of digital pathology, ensuring that diagnostic images meet required standards. A pre-analytical task within this process is the verification of the number of specimen fragments, a process that ensures that the number of fragments on a slide matches the number documented in the macroscopic report. This step is important to ensure that the slides contain the appropriate diagnostic material from the grossing process, thereby guaranteeing the accuracy of subsequent microscopic examination and diagnosis. Traditionally, this assessment is performed manually, requiring significant time and effort while being subject to significant variability due to its subjective nature. To address these challenges, this study explores an automated approach to fragment counting using the YOLOv9 and Vision Transformer models. Our results demonstrate that the automated system achieves a level of performance comparable to expert assessments, offering a reliable and efficient alternative to manual counting. Additionally, we present findings on interobserver variability, showing that the automated approach achieves an accuracy of 86%, which falls within the range of variation observed among experts (82-88%), further supporting its potential for integration into routine pathology workflows.

Title: NeighborRetr: Balancing Hub Centrality in Cross-Modal Retrieval

Authors: Zengrong Lin, Zheng Wang, Tianwen Qian, Pan Mu, Sixian Chan, Cong Bai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10526
Pdf URL: https://arxiv.org/pdf/2503.10526
Copy Paste: [[2503.10526]] NeighborRetr: Balancing Hub Centrality in Cross-Modal Retrieval(https://arxiv.org/abs/2503.10526)
Keywords: robust
Abstract: Cross-modal retrieval aims to bridge the semantic gap between different modalities, such as visual and textual data, enabling accurate retrieval across them. Despite significant advancements with models like CLIP that align cross-modal representations, a persistent challenge remains: the hubness problem, where a small subset of samples (hubs) dominate as nearest neighbors, leading to biased representations and degraded retrieval accuracy. Existing methods often mitigate hubness through post-hoc normalization techniques, relying on prior data distributions that may not be practical in real-world scenarios. In this paper, we directly mitigate hubness during training and introduce NeighborRetr, a novel method that effectively balances the learning of hubs and adaptively adjusts the relations of various kinds of neighbors. Our approach not only mitigates the hubness problem but also enhances retrieval performance, achieving state-of-the-art results on multiple cross-modal retrieval benchmarks. Furthermore, NeighborRetr demonstrates robust generalization to new domains with substantial distribution shifts, highlighting its effectiveness in real-world applications. We make our code publicly available at: this https URL .

Title: PiSA: A Self-Augmented Data Engine and Training Strategy for 3D Understanding with Large Models

Authors: Zilu Guo, Hongbin Lin, Zhihao Yuan, Chaoda Zheng, Pengshuo Qiu, Dongzhi Jiang, Renrui Zhang, Chun-Mei Feng, Zhen Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10529
Pdf URL: https://arxiv.org/pdf/2503.10529
Copy Paste: [[2503.10529]] PiSA: A Self-Augmented Data Engine and Training Strategy for 3D Understanding with Large Models(https://arxiv.org/abs/2503.10529)
Keywords: generative, large language model
Abstract: 3D Multimodal Large Language Models (MLLMs) have recently made substantial advancements. However, their potential remains untapped, primarily due to the limited quantity and suboptimal quality of 3D datasets. Current approaches attempt to transfer knowledge from 2D MLLMs to expand 3D instruction data, but still face modality and domain gaps. To this end, we introduce PiSA-Engine (Point-Self-Augmented-Engine), a new framework for generating instruction point-language datasets enriched with 3D spatial semantics. We observe that existing 3D MLLMs offer a comprehensive understanding of point clouds for annotation, while 2D MLLMs excel at cross-validation by providing complementary information. By integrating holistic 2D and 3D insights from off-the-shelf MLLMs, PiSA-Engine enables a continuous cycle of high-quality data generation. We select PointLLM as the baseline and adopt this co-evolution training framework to develop an enhanced 3D MLLM, termed PointLLM-PiSA. Additionally, we identify limitations in previous 3D benchmarks, which often feature coarse language captions and insufficient category diversity, resulting in inaccurate evaluations. To address this gap, we further introduce PiSA-Bench, a comprehensive 3D benchmark covering six key aspects with detailed and diverse labels. Experimental results demonstrate PointLLM-PiSA's state-of-the-art performance in zero-shot 3D object captioning and generative classification on our PiSA-Bench, achieving significant improvements of 46.45% (+8.33%) and 63.75% (+16.25%), respectively. We will release the code, datasets, and benchmark.

Title: Lightweight Models for Emotional Analysis in Video

Authors: Quoc-Tien Nguyen, Hong-Hai Nguyen, Van-Thong Huynh
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10530
Pdf URL: https://arxiv.org/pdf/2503.10530
Copy Paste: [[2503.10530]] Lightweight Models for Emotional Analysis in Video(https://arxiv.org/abs/2503.10530)
Keywords: extraction
Abstract: In this study, we present an approach for efficient spatiotemporal feature extraction using MobileNetV4 and a multi-scale 3D MLP-Mixer-based temporal aggregation module. MobileNetV4, with its Universal Inverted Bottleneck (UIB) blocks, serves as the backbone for extracting hierarchical feature representations from input image sequences, ensuring both computational efficiency and rich semantic encoding. To capture temporal dependencies, we introduce a three-level MLP-Mixer module, which processes spatial features at multiple resolutions while maintaining structural integrity. Experimental results on the ABAW 8th competition demonstrate the effectiveness of our approach, showing promising performance in affective behavior analysis. By integrating an efficient vision backbone with a structured temporal modeling mechanism, the proposed framework achieves a balance between computational efficiency and predictive accuracy, making it well-suited for real-time applications in mobile and embedded computing environments.

Title: The Impact of Item-Writing Flaws on Difficulty and Discrimination in Item Response Theory

Authors: Robin Schmucker, Steven Moore
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2503.10533
Pdf URL: https://arxiv.org/pdf/2503.10533
Copy Paste: [[2503.10533]] The Impact of Item-Writing Flaws on Difficulty and Discrimination in Item Response Theory(https://arxiv.org/abs/2503.10533)
Keywords: robust
Abstract: High-quality test items are essential for educational assessments, particularly within Item Response Theory (IRT). Traditional validation methods rely on resource-intensive pilot testing to estimate item difficulty and discrimination. More recently, Item-Writing Flaw (IWF) rubrics emerged as a domain-general approach for evaluating test items based on textual features. However, their relationship to IRT parameters remains underexplored. To address this gap, we conducted a study involving over 7,000 multiple-choice questions across various STEM subjects (e.g., math and biology). Using an automated approach, we annotated each question with a 19-criteria IWF rubric and studied relationships to data-driven IRT parameters. Our analysis revealed statistically significant links between the number of IWFs and IRT difficulty and discrimination parameters, particularly in life and physical science domains. We further observed how specific IWF criteria can impact item quality more and less severely (e.g., negative wording vs. implausible distractors). Overall, while IWFs are useful for predicting IRT parameters--particularly for screening low-difficulty MCQs--they cannot replace traditional data-driven validation methods. Our findings highlight the need for further research on domain-general evaluation rubrics and algorithms that understand domain-specific content for robust item validation.

Title: DP-GPL: Differentially Private Graph Prompt Learning

Authors: Jing Xu, Franziska Boenisch, Iyiola Emmanuel Olatunji, Adam Dziedzic
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.10544
Pdf URL: https://arxiv.org/pdf/2503.10544
Copy Paste: [[2503.10544]] DP-GPL: Differentially Private Graph Prompt Learning(https://arxiv.org/abs/2503.10544)
Keywords: privacy, attack, membership infer
Abstract: Graph Neural Networks (GNNs) have shown remarkable performance in various applications. Recently, graph prompt learning has emerged as a powerful GNN training paradigm, inspired by advances in language and vision foundation models. Here, a GNN is pre-trained on public data and then adapted to sensitive tasks using lightweight graph prompts. However, using prompts from sensitive data poses privacy risks. In this work, we are the first to investigate these practical risks in graph prompts by instantiating a membership inference attack that reveals significant privacy leakage. We also find that the standard privacy method, DP-SGD, fails to provide practical privacy-utility trade-offs in graph prompt learning, likely due to the small number of sensitive data points used to learn the prompts. As a solution, we propose DP-GPL for differentially private graph prompt learning based on the PATE framework, that generates a graph prompt with differential privacy guarantees. Our evaluation across various graph prompt learning methods, GNN architectures, and pre-training strategies demonstrates that our algorithm achieves high utility at strong privacy, effectively mitigating privacy concerns while preserving the powerful capabilities of prompted GNNs as powerful foundation models in the graph domain.

Title: MASQUE: A Text-Guided Diffusion-Based Framework for Localized and Customized Adversarial Makeup

Authors: Youngjin Kwon, Xiao Zhang
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2503.10549
Pdf URL: https://arxiv.org/pdf/2503.10549
Copy Paste: [[2503.10549]] MASQUE: A Text-Guided Diffusion-Based Framework for Localized and Customized Adversarial Makeup(https://arxiv.org/abs/2503.10549)
Keywords: privacy, protect, robust, diffusion, generative
Abstract: As facial recognition is increasingly adopted for government and commercial services, its potential misuse has raised serious concerns about privacy and civil rights. To counteract, various anti-facial recognition techniques have been proposed for privacy protection by adversarially perturbing face images, among which generative makeup-based approaches are the most popular. However, these methods, designed primarily to impersonate specific target identities, can only achieve weak dodging success rates while increasing the risk of targeted abuse. In addition, they often introduce global visual artifacts or a lack of adaptability to accommodate diverse makeup prompts, compromising user satisfaction. To address the above limitations, we develop MASQUE, a novel diffusion-based framework that generates localized adversarial makeups guided by user-defined text prompts. Built upon precise null-text inversion, customized cross-attention fusion with masking, and a pairwise adversarial guidance mechanism using images of the same individual, MASQUE achieves robust dodging performance without requiring any external identity. Comprehensive evaluations on open-source facial recognition models and commercial APIs demonstrate that MASQUE significantly improves dodging success rates over all baselines, along with higher perceptual fidelity and stronger adaptability to various text makeup prompts.

Title: ASIDE: Architectural Separation of Instructions and Data in Language Models

Authors: Egor Zverev, Evgenii Kortukov, Alexander Panfilov, Soroush Tabesh, Alexandra Volkova, Sebastian Lapuschkin, Wojciech Samek, Christoph H. Lampert
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.10566
Pdf URL: https://arxiv.org/pdf/2503.10566
Copy Paste: [[2503.10566]] ASIDE: Architectural Separation of Instructions and Data in Language Models(https://arxiv.org/abs/2503.10566)
Keywords: attack, large language model
Abstract: Despite their remarkable performance, large language models lack elementary safety features, and this makes them susceptible to numerous malicious attacks. In particular, previous work has identified the absence of an intrinsic separation between instructions and data as a root cause for the success of prompt injection attacks. In this work, we propose an architectural change, ASIDE, that allows the model to clearly separate between instructions and data by using separate embeddings for them. Instead of training the embeddings from scratch, we propose a method to convert an existing model to ASIDE form by using two copies of the original model's embeddings layer, and applying an orthogonal rotation to one of them. We demonstrate the effectiveness of our method by showing (1) highly increased instruction-data separation scores without a loss in model capabilities and (2) competitive results on prompt injection benchmarks, even without dedicated safety training. Additionally, we study the working mechanism behind our method through an analysis of model representations.

Title: FedPCA: Noise-Robust Fair Federated Learning via Performance-Capacity Analysis

Authors: Nannan Wu, Zengqiang Yan, Nong Sang, Li Yu, Chang Wen Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.10567
Pdf URL: https://arxiv.org/pdf/2503.10567
Copy Paste: [[2503.10567]] FedPCA: Noise-Robust Fair Federated Learning via Performance-Capacity Analysis(https://arxiv.org/abs/2503.10567)
Keywords: robust, federate, fair
Abstract: Training a model that effectively handles both common and rare data-i.e., achieving performance fairness-is crucial in federated learning (FL). While existing fair FL methods have shown effectiveness, they remain vulnerable to mislabeled data. Ensuring robustness in fair FL is therefore essential. However, fairness and robustness inherently compete, which causes robust strategies to hinder fairness. In this paper, we attribute this competition to the homogeneity in loss patterns exhibited by rare and mislabeled data clients, preventing existing loss-based fair and robust FL methods from effectively distinguishing and handling these two distinct client types. To address this, we propose performance-capacity analysis, which jointly considers model performance on each client and its capacity to handle the dataset, measured by loss and a newly introduced feature dispersion score. This allows mislabeled clients to be identified by their significantly deviated performance relative to capacity while preserving rare data clients. Building on this, we introduce FedPCA, an FL method that robustly achieves fairness. FedPCA first identifies mislabeled clients via a Gaussian Mixture Model on loss-dispersion pairs, then applies fairness and robustness strategies in global aggregation and local training by adjusting client weights and selectively using reliable data. Extensive experiments on three datasets demonstrate FedPCA's effectiveness in tackling this complex challenge. Code will be publicly available upon acceptance.

Title: Radar: Fast Long-Context Decoding for Any Transformer

Authors: Yongchang Hao, Mengyao Zhai, Hossein Hajimirsadeghi, Sepidehsadat Hosseini, Frederick Tung
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.10571
Pdf URL: https://arxiv.org/pdf/2503.10571
Copy Paste: [[2503.10571]] Radar: Fast Long-Context Decoding for Any Transformer(https://arxiv.org/abs/2503.10571)
Keywords: transformer
Abstract: Transformer models have demonstrated exceptional performance across a wide range of applications. Though forming the foundation of Transformer models, the dot-product attention does not scale well to long-context data since its time requirement grows quadratically with context length. In this work, we propose Radar, a training-free approach that accelerates inference by dynamically searching for the most important context tokens. For any pre-trained Transformer, Radar can reduce the decoding time complexity without training or heuristically evicting tokens. Moreover, we provide theoretical justification for our approach, demonstrating that Radar can reliably identify the most important tokens with high probability. We conduct extensive comparisons with the previous methods on a wide range of tasks. The results demonstrate that Radar achieves the state-of-the-art performance across different architectures with reduced time complexity, offering a practical solution for efficient long-context processing of Transformers.

Title: Unveiling the Mathematical Reasoning in DeepSeek Models: A Comparative Study of Large Language Models

Authors: Afrar Jahin, Arif Hassan Zidan, Yu Bao, Shizhe Liang, Tianming Liu, Wei Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.10573
Pdf URL: https://arxiv.org/pdf/2503.10573
Copy Paste: [[2503.10573]] Unveiling the Mathematical Reasoning in DeepSeek Models: A Comparative Study of Large Language Models(https://arxiv.org/abs/2503.10573)
Keywords: large language model
Abstract: With the rapid evolution of Artificial Intelligence (AI), Large Language Models (LLMs) have reshaped the frontiers of various fields, spanning healthcare, public health, engineering, science, agriculture, education, arts, humanities, and mathematical reasoning. Among these advancements, DeepSeek models have emerged as noteworthy contenders, demonstrating promising capabilities that set them apart from their peers. While previous studies have conducted comparative analyses of LLMs, few have delivered a comprehensive evaluation of mathematical reasoning across a broad spectrum of LLMs. In this work, we aim to bridge this gap by conducting an in-depth comparative study, focusing on the strengths and limitations of DeepSeek models in relation to their leading counterparts. In particular, our study systematically evaluates the mathematical reasoning performance of two DeepSeek models alongside five prominent LLMs across three independent benchmark datasets. The findings reveal several key insights: 1). DeepSeek-R1 consistently achieved the highest accuracy on two of the three datasets, demonstrating strong mathematical reasoning capabilities. 2). The distilled variant of LLMs significantly underperformed compared to its peers, highlighting potential drawbacks in using distillation techniques. 3). In terms of response time, Gemini 2.0 Flash demonstrated the fastest processing speed, outperforming other models in efficiency, which is a crucial factor for real-time applications. Beyond these quantitative assessments, we delve into how architecture, training, and optimization impact LLMs' mathematical reasoning. Moreover, our study goes beyond mere performance comparison by identifying key areas for future advancements in LLM-driven mathematical reasoning. This research enhances our understanding of LLMs' mathematical reasoning and lays the groundwork for future advancements

Title: VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search

Authors: Yiming Jia, Jiachen Li, Xiang Yue, Bo Li, Ping Nie, Kai Zou, Wenhu Chen
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.10582
Pdf URL: https://arxiv.org/pdf/2503.10582
Copy Paste: [[2503.10582]] VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search(https://arxiv.org/abs/2503.10582)
Keywords: extraction
Abstract: Vision-Language Models have made significant progress on many perception-focused tasks, however, their progress on reasoning-focused tasks seem to be limited due to the lack of high-quality and diverse training data. In this work, we aim to address the scarcity issue of reasoning-focused multimodal datasets. We propose VisualWebInstruct - a novel approach that leverages search engine to create a diverse, and high-quality dataset spanning multiple disciplines like math, physics, finance, chemistry, etc. Starting with meticulously selected 30,000 seed images, we employ Google Image search to identify websites containing similar images. We collect and process the HTMLs from over 700K unique URL sources. Through a pipeline of content extraction, filtering and synthesis, we build a dataset of approximately 900K question-answer pairs, with 40% being visual QA pairs and the rest as text QA pairs. Models fine-tuned on VisualWebInstruct demonstrate significant performance gains: (1) training from Llava-OV-mid shows 10-20% absolute point gains across benchmarks, (2) training from MAmmoTH-VL shows 5% absoluate gain. Our best model MAmmoTH-VL2 shows state-of-the-art performance within the 10B parameter class on MMMU-Pro-std (40.7%), MathVerse (42.6%), and DynaMath (55.7%). These remarkable results highlight the effectiveness of our dataset in enhancing VLMs' reasoning capabilities for complex multimodal tasks.

Title: Unlock the Power of Unlabeled Data in Language Driving Model

Authors: Chaoqun Wang, Jie Yang, Xiaobin Hong, Ruimao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10586
Pdf URL: https://arxiv.org/pdf/2503.10586
Copy Paste: [[2503.10586]] Unlock the Power of Unlabeled Data in Language Driving Model(https://arxiv.org/abs/2503.10586)
Keywords: large language model
Abstract: Recent Vision-based Large Language Models~(VisionLLMs) for autonomous driving have seen rapid advancements. However, such promotion is extremely dependent on large-scale high-quality annotated data, which is costly and labor-intensive. To address this issue, we propose unlocking the value of abundant yet unlabeled data to improve the language-driving model in a semi-supervised learning manner. Specifically, we first introduce a series of template-based prompts to extract scene information, generating questions that create pseudo-answers for the unlabeled data based on a model trained with limited labeled data. Next, we propose a Self-Consistency Refinement method to improve the quality of these pseudo-annotations, which are later used for further training. By utilizing a pre-trained VisionLLM (e.g., InternVL), we build a strong Language Driving Model (LDM) for driving scene question-answering, outperforming previous state-of-the-art methods. Extensive experiments on the DriveLM benchmark show that our approach performs well with just 5% labeled data, achieving competitive performance against models trained with full datasets. In particular, our LDM achieves 44.85% performance with limited labeled data, increasing to 54.27% when using unlabeled data, while models trained with full datasets reach 60.68% on the DriveLM benchmark.

Title: The Spectral Bias of Shallow Neural Network Learning is Shaped by the Choice of Non-linearity

Authors: Justin Sahs, Ryan Pyle, Fabio Anselmi, Ankit Patel
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10587
Pdf URL: https://arxiv.org/pdf/2503.10587
Copy Paste: [[2503.10587]] The Spectral Bias of Shallow Neural Network Learning is Shaped by the Choice of Non-linearity(https://arxiv.org/abs/2503.10587)
Keywords: robust
Abstract: Despite classical statistical theory predicting severe overfitting, modern massively overparameterized neural networks still generalize well. This unexpected property is attributed to the network's so-called implicit bias, which describes its propensity to converge to solutions that generalize effectively, among the many possible that correctly label the training data. The aim of our research is to explore this bias from a new perspective, focusing on how non-linear activation functions contribute to shaping it. First, we introduce a reparameterization which removes a continuous weight rescaling symmetry. Second, in the kernel regime, we leverage this reparameterization to generalize recent findings that relate shallow Neural Networks to the Radon transform, deriving an explicit formula for the implicit bias induced by a broad class of activation functions. Specifically, by utilizing the connection between the Radon transform and the Fourier transform, we interpret the kernel regime's inductive bias as minimizing a spectral seminorm that penalizes high-frequency components, in a manner dependent on the activation function. Finally, in the adaptive regime, we demonstrate the existence of local dynamical attractors that facilitate the formation of clusters of hyperplanes where the input to a neuron's activation function is zero, yielding alignment between many neurons' response functions. We confirm these theoretical results with simulations. All together, our work provides a deeper understanding of the mechanisms underlying the generalization capabilities of overparameterized neural networks and its relation with the implicit bias, offering potential pathways for designing more efficient and robust models.

Title: Long Context Tuning for Video Generation

Authors: Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, Lu Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10589
Pdf URL: https://arxiv.org/pdf/2503.10589
Copy Paste: [[2503.10589]] Long Context Tuning for Video Generation(https://arxiv.org/abs/2503.10589)
Keywords: diffusion, transformer
Abstract: Recent advances in video generation can produce realistic, minute-long single-shot videos with scalable diffusion transformers. However, real-world narrative videos require multi-shot scenes with visual and dynamic consistency across shots. In this work, we introduce Long Context Tuning (LCT), a training paradigm that expands the context window of pre-trained single-shot video diffusion models to learn scene-level consistency directly from data. Our method expands full attention mechanisms from individual shots to encompass all shots within a scene, incorporating interleaved 3D position embedding and an asynchronous noise strategy, enabling both joint and auto-regressive shot generation without additional parameters. Models with bidirectional attention after LCT can further be fine-tuned with context-causal attention, facilitating auto-regressive generation with efficient KV-cache. Experiments demonstrate single-shot models after LCT can produce coherent multi-shot scenes and exhibit emerging capabilities, including compositional generation and interactive shot extension, paving the way for more practical visual content creation. See this https URL for more details.

Title: CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models

Authors: Hao He, Ceyuan Yang, Shanchuan Lin, Yinghao Xu, Meng Wei, Liangke Gui, Qi Zhao, Gordon Wetzstein, Lu Jiang, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10592
Pdf URL: https://arxiv.org/pdf/2503.10592
Copy Paste: [[2503.10592]] CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models(https://arxiv.org/abs/2503.10592)
Keywords: diffusion, generative
Abstract: This paper introduces CameraCtrl II, a framework that enables large-scale dynamic scene exploration through a camera-controlled video diffusion model. Previous camera-conditioned video generative models suffer from diminished video dynamics and limited range of viewpoints when generating videos with large camera movement. We take an approach that progressively expands the generation of dynamic scenes -- first enhancing dynamic content within individual video clip, then extending this capability to create seamless explorations across broad viewpoint ranges. Specifically, we construct a dataset featuring a large degree of dynamics with camera parameter annotations for training while designing a lightweight camera injection module and training scheme to preserve dynamics of the pretrained models. Building on these improved single-clip techniques, we enable extended scene exploration by allowing users to iteratively specify camera trajectories for generating coherent video sequences. Experiments across diverse scenarios demonstrate that CameraCtrl Ii enables camera-controlled dynamic scene synthesis with substantially wider spatial exploration than previous approaches.

Title: GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding

Authors: Rui Hu, Lianghui Zhu, Yuxuan Zhang, Tianheng Cheng, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10596
Pdf URL: https://arxiv.org/pdf/2503.10596
Copy Paste: [[2503.10596]] GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding(https://arxiv.org/abs/2503.10596)
Keywords: segmentation
Abstract: Pixel grounding, encompassing tasks such as Referring Expression Segmentation (RES), has garnered considerable attention due to its immense potential for bridging the gap between vision and language modalities. However, advancements in this domain are currently constrained by limitations inherent in existing datasets, including limited object categories, insufficient textual diversity, and a scarcity of high-quality annotations. To mitigate these limitations, we introduce GroundingSuite, which comprises: (1) an automated data annotation framework leveraging multiple Vision-Language Model (VLM) agents; (2) a large-scale training dataset encompassing 9.56 million diverse referring expressions and their corresponding segmentations; and (3) a meticulously curated evaluation benchmark consisting of 3,800 images. The GroundingSuite training dataset facilitates substantial performance improvements, enabling models trained on it to achieve state-of-the-art results. Specifically, a cIoU of 68.9 on gRefCOCO and a gIoU of 55.3 on RefCOCOm. Moreover, the GroundingSuite annotation framework demonstrates superior efficiency compared to the current leading data annotation method, i.e., $4.5 \times$ faster than the GLaMM.

Title: TruthPrInt: Mitigating LVLM Object Hallucination Via Latent Truthful-Guided Pre-Intervention

Authors: Jinhao Duan, Fei Kong, Hao Cheng, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Xiaofeng Zhu, Xiaoshuang Shi, Kaidi Xu
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.10602
Pdf URL: https://arxiv.org/pdf/2503.10602
Copy Paste: [[2503.10602]] TruthPrInt: Mitigating LVLM Object Hallucination Via Latent Truthful-Guided Pre-Intervention(https://arxiv.org/abs/2503.10602)
Keywords: large language model
Abstract: Object Hallucination (OH) has been acknowledged as one of the major trustworthy challenges in Large Vision-Language Models (LVLMs). Recent advancements in Large Language Models (LLMs) indicate that internal states, such as hidden states, encode the "overall truthfulness" of generated responses. However, it remains under-explored how internal states in LVLMs function and whether they could serve as "per-token" hallucination indicators, which is essential for mitigating OH. In this paper, we first conduct an in-depth exploration of LVLM internal states in relation to OH issues and discover that (1) LVLM internal states are high-specificity per-token indicators of hallucination behaviors. Moreover, (2) different LVLMs encode universal patterns of hallucinations in common latent subspaces, indicating that there exist "generic truthful directions" shared by various LVLMs. Based on these discoveries, we propose Truthful-Guided Pre-Intervention (TruthPrInt) that first learns the truthful direction of LVLM decoding and then applies truthful-guided inference-time intervention during LVLM decoding. We further propose ComnHallu to enhance both cross-LVLM and cross-data hallucination detection transferability by constructing and aligning hallucination latent subspaces. We evaluate TruthPrInt in extensive experimental settings, including in-domain and out-of-domain scenarios, over popular LVLMs and OH benchmarks. Experimental results indicate that TruthPrInt significantly outperforms state-of-the-art methods. Codes will be available at this https URL.

Title: Dual-Stage Cross-Modal Network with Dynamic Feature Fusion for Emotional Mimicry Intensity Estimation

Authors: Jun Yu, Lingsi Zhu, Yanjun Chi, Yunxiang Zhang, Yang Zheng, Yongqi Wang, Xilong Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10603
Pdf URL: https://arxiv.org/pdf/2503.10603
Copy Paste: [[2503.10603]] Dual-Stage Cross-Modal Network with Dynamic Feature Fusion for Emotional Mimicry Intensity Estimation(https://arxiv.org/abs/2503.10603)
Keywords: robust
Abstract: Emotional Mimicry Intensity (EMI) estimation serves as a critical technology for understanding human social behavior and enhancing human-computer interaction experiences, where the core challenge lies in dynamic correlation modeling and robust fusion of multimodal temporal signals. To address the limitations of existing methods in insufficient exploitation of modal synergistic effects, noise sensitivity, and limited fine-grained alignment capabilities, this paper proposes a dual-stage cross-modal alignment framework. First, we construct vision-text and audio-text contrastive learning networks based on an improved CLIP architecture, achieving preliminary alignment in the feature space through modality-decoupled pre-training. Subsequently, we design a temporal-aware dynamic fusion module that combines Temporal Convolutional Networks (TCN) and gated bidirectional LSTM to respectively capture the macro-evolution patterns of facial expressions and local dynamics of acoustic features. Innovatively, we introduce a quality-guided modality fusion strategy that enables modality compensation under occlusion and noisy scenarios through differentiable weight allocation. Experimental results on the Hume-Vidmimic2 dataset demonstrate that our method achieves an average Pearson correlation coefficient of 0.35 across six emotion dimensions, outperforming the best baseline by 40\%. Ablation studies further validate the effectiveness of the dual-stage training strategy and dynamic fusion mechanism, providing a novel technical pathway for fine-grained emotion analysis in open environments.

Title: MuDG: Taming Multi-modal Diffusion with Gaussian Splatting for Urban Scene Reconstruction

Authors: Yingshuang Zou, Yikang Ding, Chuanrui Zhang, Jiazhe Guo, Bohan Li, Xiaoyang Lyu, Feiyang Tan, Xiaojuan Qi, Haoqian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10604
Pdf URL: https://arxiv.org/pdf/2503.10604
Copy Paste: [[2503.10604]] MuDG: Taming Multi-modal Diffusion with Gaussian Splatting for Urban Scene Reconstruction(https://arxiv.org/abs/2503.10604)
Keywords: robust, diffusion
Abstract: Recent breakthroughs in radiance fields have significantly advanced 3D scene reconstruction and novel view synthesis (NVS) in autonomous driving. Nevertheless, critical limitations persist: reconstruction-based methods exhibit substantial performance deterioration under significant viewpoint deviations from training trajectories, while generation-based techniques struggle with temporal coherence and precise scene controllability. To overcome these challenges, we present MuDG, an innovative framework that integrates Multi-modal Diffusion model with Gaussian Splatting (GS) for Urban Scene Reconstruction. MuDG leverages aggregated LiDAR point clouds with RGB and geometric priors to condition a multi-modal video diffusion model, synthesizing photorealistic RGB, depth, and semantic outputs for novel viewpoints. This synthesis pipeline enables feed-forward NVS without computationally intensive per-scene optimization, providing comprehensive supervision signals to refine 3DGS representations for rendering robustness enhancement under extreme viewpoint changes. Experiments on the Open Waymo Dataset demonstrate that MuDG outperforms existing methods in both reconstruction and synthesis quality.

Title: OCCUQ: Exploring Efficient Uncertainty Quantification for 3D Occupancy Prediction

Authors: Severin Heidrich, Till Beemelmanns, Alexey Nekrasov, Bastian Leibe, Lutz Eckstein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10605
Pdf URL: https://arxiv.org/pdf/2503.10605
Copy Paste: [[2503.10605]] OCCUQ: Exploring Efficient Uncertainty Quantification for 3D Occupancy Prediction(https://arxiv.org/abs/2503.10605)
Keywords: robust
Abstract: Autonomous driving has the potential to significantly enhance productivity and provide numerous societal benefits. Ensuring robustness in these safety-critical systems is essential, particularly when vehicles must navigate adverse weather conditions and sensor corruptions that may not have been encountered during training. Current methods often overlook uncertainties arising from adversarial conditions or distributional shifts, limiting their real-world applicability. We propose an efficient adaptation of an uncertainty estimation technique for 3D occupancy prediction. Our method dynamically calibrates model confidence using epistemic uncertainty estimates. Our evaluation under various camera corruption scenarios, such as fog or missing cameras, demonstrates that our approach effectively quantifies epistemic uncertainty by assigning higher uncertainty values to unseen data. We introduce region-specific corruptions to simulate defects affecting only a single camera and validate our findings through both scene-level and region-level assessments. Our results show superior performance in Out-of-Distribution (OoD) detection and confidence calibration compared to common baselines such as Deep Ensembles and MC-Dropout. Our approach consistently demonstrates reliable uncertainty measures, indicating its potential for enhancing the robustness of autonomous driving systems in real-world scenarios. Code and dataset are available at this https URL .

Title: CoSTA$\ast$: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing

Authors: Advait Gupta, NandaKiran Velaga, Dang Nguyen, Tianyi Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10613
Pdf URL: https://arxiv.org/pdf/2503.10613
Copy Paste: [[2503.10613]] CoSTA$\ast$: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing(https://arxiv.org/abs/2503.10613)
Keywords: diffusion, large language model
Abstract: Text-to-image models like stable diffusion and DALLE-3 still struggle with multi-turn image editing. We decompose such a task as an agentic workflow (path) of tool use that addresses a sequence of subtasks by AI tools of varying costs. Conventional search algorithms require expensive exploration to find tool paths. While large language models (LLMs) possess prior knowledge of subtask planning, they may lack accurate estimations of capabilities and costs of tools to determine which to apply in each subtask. Can we combine the strengths of both LLMs and graph search to find cost-efficient tool paths? We propose a three-stage approach "CoSTA*" that leverages LLMs to create a subtask tree, which helps prune a graph of AI tools for the given task, and then conducts A* search on the small subgraph to find a tool path. To better balance the total cost and quality, CoSTA* combines both metrics of each tool on every subtask to guide the A* search. Each subtask's output is then evaluated by a vision-language model (VLM), where a failure will trigger an update of the tool's cost and quality on the subtask. Hence, the A* search can recover from failures quickly to explore other paths. Moreover, CoSTA* can automatically switch between modalities across subtasks for a better cost-quality trade-off. We build a novel benchmark of challenging multi-turn image editing, on which CoSTA* outperforms state-of-the-art image-editing models or agents in terms of both cost and quality, and performs versatile trade-offs upon user preference.

Title: ConsisLoRA: Enhancing Content and Style Consistency for LoRA-based Style Transfer

Authors: Bolin Chen, Baoquan Zhao, Haoran Xie, Yi Cai, Qing Li, Xudong Mao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10614
Pdf URL: https://arxiv.org/pdf/2503.10614
Copy Paste: [[2503.10614]] ConsisLoRA: Enhancing Content and Style Consistency for LoRA-based Style Transfer(https://arxiv.org/abs/2503.10614)
Keywords: diffusion
Abstract: Style transfer involves transferring the style from a reference image to the content of a target image. Recent advancements in LoRA-based (Low-Rank Adaptation) methods have shown promise in effectively capturing the style of a single image. However, these approaches still face significant challenges such as content inconsistency, style misalignment, and content leakage. In this paper, we comprehensively analyze the limitations of the standard diffusion parameterization, which learns to predict noise, in the context of style transfer. To address these issues, we introduce ConsisLoRA, a LoRA-based method that enhances both content and style consistency by optimizing the LoRA weights to predict the original image rather than noise. We also propose a two-step training strategy that decouples the learning of content and style from the reference image. To effectively capture both the global structure and local details of the content image, we introduce a stepwise loss transition strategy. Additionally, we present an inference guidance method that enables continuous control over content and style strengths during inference. Through both qualitative and quantitative evaluations, our method demonstrates significant improvements in content and style consistency while effectively reducing content leakage.

Title: R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Authors: Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, Bo Zhang, Wei Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10615
Pdf URL: https://arxiv.org/pdf/2503.10615
Copy Paste: [[2503.10615]] R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization(https://arxiv.org/abs/2503.10615)
Keywords: robust, large language model
Abstract: Large Language Models have demonstrated remarkable reasoning capability in complex textual tasks. However, multimodal reasoning, which requires integrating visual and textual information, remains a significant challenge. Existing visual-language models often struggle to effectively analyze and reason visual content, resulting in suboptimal performance on complex reasoning tasks. Moreover, the absence of comprehensive benchmarks hinders the accurate assessment of multimodal reasoning capabilities. In this paper, we introduce R1-Onevision, a multimodal reasoning model designed to bridge the gap between visual perception and deep reasoning. To achieve this, we propose a cross-modal reasoning pipeline that transforms images into formal textural representations, enabling precise language-based reasoning. Leveraging this pipeline, we construct the R1-Onevision dataset which provides detailed, step-by-step multimodal reasoning annotations across diverse domains. We further develop the R1-Onevision model through supervised fine-tuning and reinforcement learning to cultivate advanced reasoning and robust generalization abilities. To comprehensively evaluate multimodal reasoning performance across different grades, we introduce R1-Onevision-Bench, a benchmark aligned with human educational stages, covering exams from junior high school to university and beyond. Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL on multiple challenging multimodal reasoning benchmarks.

Title: OVTR: End-to-End Open-Vocabulary Multiple Object Tracking with Transformer

Authors: Jinyang Li, En Yu, Sijia Chen, Wenbing Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10616
Pdf URL: https://arxiv.org/pdf/2503.10616
Copy Paste: [[2503.10616]] OVTR: End-to-End Open-Vocabulary Multiple Object Tracking with Transformer(https://arxiv.org/abs/2503.10616)
Keywords: protect, transformer
Abstract: Open-vocabulary multiple object tracking aims to generalize trackers to unseen categories during training, enabling their application across a variety of real-world scenarios. However, the existing open-vocabulary tracker is constrained by its framework structure, isolated frame-level perception, and insufficient modal interactions, which hinder its performance in open-vocabulary classification and tracking. In this paper, we propose OVTR (End-to-End Open-Vocabulary Multiple Object Tracking with TRansformer), the first end-to-end open-vocabulary tracker that models motion, appearance, and category simultaneously. To achieve stable classification and continuous tracking, we design the CIP (Category Information Propagation) strategy, which establishes multiple high-level category information priors for subsequent frames. Additionally, we introduce a dual-branch structure for generalization capability and deep multimodal interaction, and incorporate protective strategies in the decoder to enhance performance. Experimental results show that our method surpasses previous trackers on the open-vocabulary MOT benchmark while also achieving faster inference speeds and significantly reducing preprocessing requirements. Moreover, the experiment transferring the model to another dataset demonstrates its strong adaptability. Models and code are released at this https URL.

Title: Compositional Subspace Representation Fine-tuning for Adaptive Large Language Models

Authors: Andy Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10617
Pdf URL: https://arxiv.org/pdf/2503.10617
Copy Paste: [[2503.10617]] Compositional Subspace Representation Fine-tuning for Adaptive Large Language Models(https://arxiv.org/abs/2503.10617)
Keywords: large language model
Abstract: Adapting large language models to multiple tasks can cause cross-skill interference, where improvements for one skill degrade another. While methods such as LoRA impose orthogonality constraints at the weight level, they do not fully address interference in hidden-state representations. We propose Compositional Subspace Representation Fine-tuning (CS-ReFT), a novel representation-based approach that learns multiple orthonormal subspace transformations, each specializing in a distinct skill, and composes them via a lightweight router. By isolating these subspace edits in the hidden state, rather than weight matrices, CS-ReFT prevents cross-task conflicts more effectively. On the AlpacaEval benchmark, applying CS-ReFT to Llama-2-7B achieves a 93.94% win rate, surpassing GPT-3.5 Turbo (86.30%) while requiring only 0.0098% of model parameters. These findings show that specialized representation edits, composed via a simple router, significantly enhance multi-task instruction following with minimal overhead.

Title: DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation

Authors: Chen Chen, Rui Qian, Wenze Hu, Tsu-Jui Fu, Lezhi Li, Bowen Zhang, Alex Schwing, Wei Liu, Yinfei Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10618
Pdf URL: https://arxiv.org/pdf/2503.10618
Copy Paste: [[2503.10618]] DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation(https://arxiv.org/abs/2503.10618)
Keywords: diffusion, transformer
Abstract: In this work, we empirically study Diffusion Transformers (DiTs) for text-to-image generation, focusing on architectural choices, text-conditioning strategies, and training protocols. We evaluate a range of DiT-based architectures--including PixArt-style and MMDiT variants--and compare them with a standard DiT variant which directly processes concatenated text and noise inputs. Surprisingly, our findings reveal that the performance of standard DiT is comparable with those specialized models, while demonstrating superior parameter-efficiency, especially when scaled up. Leveraging the layer-wise parameter sharing strategy, we achieve a further reduction of 66% in model size compared to an MMDiT architecture, with minimal performance impact. Building on an in-depth analysis of critical components such as text encoders and Variational Auto-Encoders (VAEs), we introduce DiT-Air and DiT-Air-Lite. With supervised and reward fine-tuning, DiT-Air achieves state-of-the-art performance on GenEval and T2I CompBench, while DiT-Air-Lite remains highly competitive, surpassing most existing models despite its compact size.

Title: From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM

Authors: Kshitij Ambilduke, Ben Peters, Sonal Sannigrahi, Anil Keshwani, Tsz Kin Lam, Bruno Martins, Marcely Zanon Boito, André F.T. Martins
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.10620
Pdf URL: https://arxiv.org/pdf/2503.10620
Copy Paste: [[2503.10620]] From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM(https://arxiv.org/abs/2503.10620)
Keywords: large language model
Abstract: Large language models (LLMs) have shown remarkable performance and generalization capabilities across multiple languages and tasks, making them very attractive targets for multi-modality integration (e.g., images or speech). In this work, we extend an existing LLM to the speech modality via speech discretization and continued pre-training. In particular, we are interested in multilingual LLMs, such as TOWER, as their pre-training setting allows us to treat discretized speech input as an additional translation language. The resulting open-source model, SPIRE, is able to transcribe and translate English speech input while maintaining TOWER's original performance on translation-related tasks, showcasing that discretized speech input integration as an additional language is feasible during LLM adaptation. We make our code and models available to the community.

Title: DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding

Authors: Ayesha Ishaq, Jean Lahoud, Ketan More, Omkar Thawakar, Ritesh Thawkar, Dinura Dissanayake, Noor Ahsan, Yuhao Li, Fahad Shahbaz Khan, Hisham Cholakkal, Ivan Laptev, Rao Muhammad Anwer, Salman Khan
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.10621
Pdf URL: https://arxiv.org/pdf/2503.10621
Copy Paste: [[2503.10621]] DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding(https://arxiv.org/abs/2503.10621)
Keywords: robust
Abstract: While large multimodal models (LMMs) have demonstrated strong performance across various Visual Question Answering (VQA) tasks, certain challenges require complex multi-step reasoning to reach accurate answers. One particularly challenging task is autonomous driving, which demands thorough cognitive processing before decisions can be made. In this domain, a sequential and interpretive understanding of visual cues is essential for effective perception, prediction, and planning. Nevertheless, common VQA benchmarks often focus on the accuracy of the final answer while overlooking the reasoning process that enables the generation of accurate responses. Moreover, existing methods lack a comprehensive framework for evaluating step-by-step reasoning in realistic driving scenarios. To address this gap, we propose DriveLMM-o1, a new dataset and benchmark specifically designed to advance step-wise visual reasoning for autonomous driving. Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning, each enriched with step-by-step reasoning to ensure logical inference in autonomous driving scenarios. We further introduce a large multimodal model that is fine-tuned on our reasoning dataset, demonstrating robust performance in complex driving scenarios. In addition, we benchmark various open-source and closed-source methods on our proposed dataset, systematically comparing their reasoning capabilities for autonomous driving tasks. Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model. Our framework, dataset, and model are available at this https URL.

Title: Transformers without Normalization

Authors: Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, Zhuang Liu
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2503.10622
Pdf URL: https://arxiv.org/pdf/2503.10622
Copy Paste: [[2503.10622]] Transformers without Normalization(https://arxiv.org/abs/2503.10622)
Keywords: transformer
Abstract: Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation $DyT($x$) = \tanh(\alpha $x$)$, as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization in Transformers often produces tanh-like, $S$-shaped input-output mappings. By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning. We validate the effectiveness of Transformers with DyT across diverse settings, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to language models. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks, and offer new insights into their role in deep networks.

Title: LHM: Large Animatable Human Reconstruction Model from a Single Image in Seconds

Authors: Lingteng Qiu, Xiaodong Gu, Peihao Li, Qi Zuo, Weichao Shen, Junfei Zhang, Kejie Qiu, Weihao Yuan, Guanying Chen, Zilong Dong, Liefeng Bo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10625
Pdf URL: https://arxiv.org/pdf/2503.10625
Copy Paste: [[2503.10625]] LHM: Large Animatable Human Reconstruction Model from a Single Image in Seconds(https://arxiv.org/abs/2503.10625)
Keywords: transformer
Abstract: Animatable 3D human reconstruction from a single image is a challenging problem due to the ambiguity in decoupling geometry, appearance, and deformation. Recent advances in 3D human reconstruction mainly focus on static human modeling, and the reliance of using synthetic 3D scans for training limits their generalization ability. Conversely, optimization-based video methods achieve higher fidelity but demand controlled capture conditions and computationally intensive refinement processes. Motivated by the emergence of large reconstruction models for efficient static reconstruction, we propose LHM (Large Animatable Human Reconstruction Model) to infer high-fidelity avatars represented as 3D Gaussian splatting in a feed-forward pass. Our model leverages a multimodal transformer architecture to effectively encode the human body positional features and image features with attention mechanism, enabling detailed preservation of clothing geometry and texture. To further boost the face identity preservation and fine detail recovery, we propose a head feature pyramid encoding scheme to aggregate multi-scale features of the head regions. Extensive experiments demonstrate that our LHM generates plausible animatable human in seconds without post-processing for face and hands, outperforming existing methods in both reconstruction accuracy and generalization ability.

Title: NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models

Authors: Mert Albaba, Chenhao Li, Markos Diomataris, Omid Taheri, Andreas Krause, Michael Black
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2503.10626
Pdf URL: https://arxiv.org/pdf/2503.10626
Copy Paste: [[2503.10626]] NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models(https://arxiv.org/abs/2503.10626)
Keywords: diffusion, transformer, generative
Abstract: Acquiring physically plausible motor skills across diverse and unconventional morphologies-including humanoid robots, quadrupeds, and animals-is essential for advancing character simulation and robotics. Traditional methods, such as reinforcement learning (RL) are task- and body-specific, require extensive reward function engineering, and do not generalize well. Imitation learning offers an alternative but relies heavily on high-quality expert demonstrations, which are difficult to obtain for non-human morphologies. Video diffusion models, on the other hand, are capable of generating realistic videos of various morphologies, from humans to ants. Leveraging this capability, we propose a data-independent approach for skill acquisition that learns 3D motor skills from 2D-generated videos, with generalization capability to unconventional and non-human forms. Specifically, we guide the imitation learning process by leveraging vision transformers for video-based comparisons by calculating pair-wise distance between video embeddings. Along with video-encoding distance, we also use a computed similarity between segmented video frames as a guidance reward. We validate our method on locomotion tasks involving unique body configurations. In humanoid robot locomotion tasks, we demonstrate that 'No-data Imitation Learning' (NIL) outperforms baselines trained on 3D motion-capture data. Our results highlight the potential of leveraging generative video models for physically plausible skill learning with diverse morphologies, effectively replacing data collection with data generation for imitation learning.

Title: Hierarchical Self-Supervised Adversarial Training for Robust Vision Models in Histopathology

Authors: Hashmat Shadab Malik, Shahina Kunhimon, Muzammal Naseer, Fahad Shahbaz Khan, Salman Khan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10629
Pdf URL: https://arxiv.org/pdf/2503.10629
Copy Paste: [[2503.10629]] Hierarchical Self-Supervised Adversarial Training for Robust Vision Models in Histopathology(https://arxiv.org/abs/2503.10629)
Keywords: attack, robust
Abstract: Adversarial attacks pose significant challenges for vision models in critical fields like healthcare, where reliability is essential. Although adversarial training has been well studied in natural images, its application to biomedical and microscopy data remains limited. Existing self-supervised adversarial training methods overlook the hierarchical structure of histopathology images, where patient-slide-patch relationships provide valuable discriminative signals. To address this, we propose Hierarchical Self-Supervised Adversarial Training (HSAT), which exploits these properties to craft adversarial examples using multi-level contrastive learning and integrate it into adversarial training for enhanced robustness. We evaluate HSAT on multiclass histopathology dataset OpenSRH and the results show that HSAT outperforms existing methods from both biomedical and natural image domains. HSAT enhances robustness, achieving an average gain of 54.31% in the white-box setting and reducing performance drops to 3-4% in the black-box setting, compared to 25-30% for the baseline. These results set a new benchmark for adversarial training in this domain, paving the way for more robust models. Our Code for training and evaluation is available at this https URL.

Title: UniGoal: Towards Universal Zero-shot Goal-oriented Navigation

Authors: Hang Yin, Xiuwei Xu, Lingqing Zhao, Ziwei Wang, Jie Zhou, Jiwen Lu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.10630
Pdf URL: https://arxiv.org/pdf/2503.10630
Copy Paste: [[2503.10630]] UniGoal: Towards Universal Zero-shot Goal-oriented Navigation(https://arxiv.org/abs/2503.10630)
Keywords: robust, large language model
Abstract: In this paper, we propose a general framework for universal zero-shot goal-oriented navigation. Existing zero-shot methods build inference framework upon large language models (LLM) for specific tasks, which differs a lot in overall pipeline and fails to generalize across different types of goal. Towards the aim of universal zero-shot navigation, we propose a uniform graph representation to unify different goals, including object category, instance image and text description. We also convert the observation of agent into an online maintained scene graph. With this consistent scene and goal representation, we preserve most structural information compared with pure text and are able to leverage LLM for explicit graph-based reasoning. Specifically, we conduct graph matching between the scene graph and goal graph at each time instant and propose different strategies to generate long-term goal of exploration according to different matching states. The agent first iteratively searches subgraph of goal when zero-matched. With partial matching, the agent then utilizes coordinate projection and anchor pair alignment to infer the goal location. Finally scene graph correction and goal verification are applied for perfect matching. We also present a blacklist mechanism to enable robust switch between stages. Extensive experiments on several benchmarks show that our UniGoal achieves state-of-the-art zero-shot performance on three studied navigation tasks with a single model, even outperforming task-specific zero-shot methods and supervised universal methods.

Title: HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

Authors: Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, Chengkai Hou, Mengdi Zhao, KC alex Zhou, Pheng-Ann Heng, Shanghang Zhang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.10631
Pdf URL: https://arxiv.org/pdf/2503.10631
Copy Paste: [[2503.10631]] HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model(https://arxiv.org/abs/2503.10631)
Keywords: robust, diffusion, large language model
Abstract: Recent advancements in vision-language models (VLMs) for common-sense reasoning have led to the development of vision-language-action (VLA) models, enabling robots to perform generalized manipulation. Although existing autoregressive VLA methods leverage large-scale pretrained knowledge, they disrupt the continuity of actions. Meanwhile, some VLA methods incorporate an additional diffusion head to predict continuous actions, relying solely on VLM-extracted features, which limits their reasoning capabilities. In this paper, we introduce HybridVLA, a unified framework that seamlessly integrates the strengths of both autoregressive and diffusion policies within a single large language model, rather than simply connecting them. To bridge the generation gap, a collaborative training recipe is proposed that injects the diffusion modeling directly into the next-token prediction. With this recipe, we find that these two forms of action prediction not only reinforce each other but also exhibit varying performance across different tasks. Therefore, we design a collaborative action ensemble mechanism that adaptively fuses these two predictions, leading to more robust control. In experiments, HybridVLA outperforms previous state-of-the-art VLA methods across various simulation and real-world tasks, including both single-arm and dual-arm robots, while demonstrating stable manipulation in previously unseen configurations.

Title: Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers?

Authors: Subhajit Maity, Killian Hitsman, Xin Li, Aritra Dutta
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.10632
Pdf URL: https://arxiv.org/pdf/2503.10632
Copy Paste: [[2503.10632]] Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers?(https://arxiv.org/abs/2503.10632)
Keywords: transformer
Abstract: Kolmogorov-Arnold networks (KANs) are a remarkable innovation consisting of learnable activation functions with the potential to capture more complex relationships from data. Although KANs are useful in finding symbolic representations and continual learning of one-dimensional functions, their effectiveness in diverse machine learning (ML) tasks, such as vision, remains questionable. Presently, KANs are deployed by replacing multilayer perceptrons (MLPs) in deep network architectures, including advanced architectures such as vision Transformers (ViTs). In this paper, we are the first to design a general learnable Kolmogorov-Arnold Attention (KArAt) for vanilla ViTs that can operate on any choice of basis. However, the computing and memory costs of training them motivated us to propose a more modular version, and we designed particular learnable attention, called Fourier-KArAt. Fourier-KArAt and its variants either outperform their ViT counterparts or show comparable performance on CIFAR-10, CIFAR-100, and ImageNet-1K datasets. We dissect these architectures' performance and generalization capacity by analyzing their loss landscapes, weight distributions, optimizer path, attention visualization, and spectral behavior, and contrast them with vanilla ViTs. The goal of this paper is not to produce parameter- and compute-efficient attention, but to encourage the community to explore KANs in conjunction with more advanced architectures that require a careful understanding of learnable activations. Our open-source code and implementation details are available on: this https URL

Title: V2Edit: Versatile Video Diffusion Editor for Videos and 3D Scenes

Authors: Yanming Zhang, Jun-Kun Chen, Jipeng Lyu, Yu-Xiong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10634
Pdf URL: https://arxiv.org/pdf/2503.10634
Copy Paste: [[2503.10634]] V2Edit: Versatile Video Diffusion Editor for Videos and 3D Scenes(https://arxiv.org/abs/2503.10634)
Keywords: robust, diffusion
Abstract: This paper introduces V$^2$Edit, a novel training-free framework for instruction-guided video and 3D scene editing. Addressing the critical challenge of balancing original content preservation with editing task fulfillment, our approach employs a progressive strategy that decomposes complex editing tasks into a sequence of simpler subtasks. Each subtask is controlled through three key synergistic mechanisms: the initial noise, noise added at each denoising step, and cross-attention maps between text prompts and video content. This ensures robust preservation of original video elements while effectively applying the desired edits. Beyond its native video editing capability, we extend V$^2$Edit to 3D scene editing via a "render-edit-reconstruct" process, enabling high-quality, 3D-consistent edits even for tasks involving substantial geometric changes such as object insertion. Extensive experiments demonstrate that our V$^2$Edit achieves high-quality and successful edits across various challenging video editing tasks and complex 3D scene editing tasks, thereby establishing state-of-the-art performance in both domains.

Title: A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1

Authors: Zhaoyi Li, Xiaohan Zhao, Dong-Dong Wu, Jiacheng Cui, Zhiqiang Shen
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10635
Pdf URL: https://arxiv.org/pdf/2503.10635
Copy Paste: [[2503.10635]] A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1(https://arxiv.org/abs/2503.10635)
Keywords: attack
Abstract: Despite promising performance on open-source large vision-language models (LVLMs), transfer-based targeted attacks often fail against black-box commercial LVLMs. Analyzing failed adversarial perturbations reveals that the learned perturbations typically originate from a uniform distribution and lack clear semantic details, resulting in unintended responses. This critical absence of semantic information leads commercial LVLMs to either ignore the perturbation entirely or misinterpret its embedded semantics, thereby causing the attack to fail. To overcome these issues, we notice that identifying core semantic objects is a key objective for models trained with various datasets and methodologies. This insight motivates our approach that refines semantic clarity by encoding explicit semantic details within local regions, thus ensuring interoperability and capturing finer-grained features, and by concentrating modifications on semantically rich areas rather than applying them uniformly. To achieve this, we propose a simple yet highly effective solution: at each optimization step, the adversarial image is cropped randomly by a controlled aspect ratio and scale, resized, and then aligned with the target image in the embedding space. Experimental results confirm our hypothesis. Our adversarial examples crafted with local-aggregated perturbations focused on crucial regions exhibit surprisingly good transferability to commercial LVLMs, including GPT-4.5, GPT-4o, Gemini-2.0-flash, Claude-3.5-sonnet, Claude-3.7-sonnet, and even reasoning models like o1, Claude-3.7-thinking and Gemini-2.0-flash-thinking. Our approach achieves success rates exceeding 90% on GPT-4.5, 4o, and o1, significantly outperforming all prior state-of-the-art attack methods. Our optimized adversarial examples under different configurations and training code are available at this https URL.

Title: Studying Classifier(-Free) Guidance From a Classifier-Centric Perspective

Authors: Xiaoming Zhao, Alexander G. Schwing
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10638
Pdf URL: https://arxiv.org/pdf/2503.10638
Copy Paste: [[2503.10638]] Studying Classifier(-Free) Guidance From a Classifier-Centric Perspective(https://arxiv.org/abs/2503.10638)
Keywords: diffusion
Abstract: Classifier-free guidance has become a staple for conditional generation with denoising diffusion models. However, a comprehensive understanding of classifier-free guidance is still missing. In this work, we carry out an empirical study to provide a fresh perspective on classifier-free guidance. Concretely, instead of solely focusing on classifier-free guidance, we trace back to the root, i.e., classifier guidance, pinpoint the key assumption for the derivation, and conduct a systematic study to understand the role of the classifier. We find that both classifier guidance and classifier-free guidance achieve conditional generation by pushing the denoising diffusion trajectories away from decision boundaries, i.e., areas where conditional information is usually entangled and is hard to learn. Based on this classifier-centric understanding, we propose a generic postprocessing step built upon flow-matching to shrink the gap between the learned distribution for a pre-trained denoising diffusion model and the real data distribution, majorly around the decision boundaries. Experiments on various datasets verify the effectiveness of the proposed approach.

Title: GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

Authors: Rongyao Fang, Chengqi Duan, Kun Wang, Linjiang Huang, Hao Li, Shilin Yan, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Xihui Liu, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10639
Pdf URL: https://arxiv.org/pdf/2503.10639
Copy Paste: [[2503.10639]] GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing(https://arxiv.org/abs/2503.10639)
Keywords: diffusion, large language model
Abstract: Current image generation and editing methods primarily process textual prompts as direct inputs without reasoning about visual composition and explicit operations. We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements. We define the formulation of GoT and construct large-scale GoT datasets containing over 9M samples with detailed reasoning chains capturing semantic-spatial relationships. To leverage the advantages of GoT, we implement a unified framework that integrates Qwen2.5-VL for reasoning chain generation with an end-to-end diffusion model enhanced by our novel Semantic-Spatial Guidance Module. Experiments show our GoT framework achieves excellent performance on both generation and editing tasks, with significant improvements over baselines. Additionally, our approach enables interactive visual generation, allowing users to explicitly modify reasoning steps for precise image adjustments. GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent. To facilitate future research, we make our datasets, code, and pretrained models publicly available at this https URL.